The Design Warrior's Guide to FPGAs
Page Count: 560
LICENSE INFORMATION: This is a single-user copy of this eBook. It may not be copied or distributed. Unauthorized reproduction or distribution of this eBook may result in severe criminal penalties.

The Design Warrior’s Guide to FPGAs

Clive “Max” Maxfield

Newnes is an imprint of Elsevier
200 Wheeler Road, Burlington, MA 01803, USA
Linacre House, Jordan Hill, Oxford OX2 8DP, UK

Copyright © 2004, Mentor Graphics Corporation and Xilinx, Inc. All rights reserved.

Illustrations by Clive “Max” Maxfield

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher.

Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in Oxford, UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333, e-mail: permissions@elsevier.com.uk. You may also complete your request on-line via the Elsevier homepage (http://elsevier.com), by selecting “Customer Support” and then “Obtaining Permissions.”

Recognizing the importance of preserving what has been written, Elsevier prints its books on acid-free paper whenever possible.

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress.

ISBN: 0-7506-7604-3

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.

For information on all Newnes publications visit our Web site at www.newnespress.com

04 05 06 07 08 09 10 9 8 7 6 5 4 3 2 1

Printed in the United States of America

To my wife Gina—the yummy-scrummy caramel, chocolate fudge, and rainbow-colored sprinkles on the ice cream sundae of my life

Also, to my stepson Joseph and my grandchildren Willow, Gaige, Keegan, and Karma, all of whom will be tickled pink to see their names in a real book!
For your delectation and delight, the CD accompanying this book contains a fully searchable copy of The Design Warrior’s Guide to FPGAs in Adobe® Acrobat® (PDF) format. You can copy this PDF to your computer so as to be able to access The Design Warrior’s Guide to FPGAs as required (this is particularly useful if you travel a lot and use a notebook computer).

The CD also contains a set of Microsoft® PowerPoint® files—one for each chapter and appendix—containing copies of the illustrations that are festooned throughout the book. This will be of particular interest for educators at colleges and universities when it comes to giving lectures or creating handouts based on The Design Warrior’s Guide to FPGAs.

Last but not least, the CD contains a smorgasbord of datasheets, technical articles, and useful web links provided by Mentor and Xilinx.

Contents

Preface
Acknowledgments

Chapter 1 Introduction
  What are FPGAs?
  Why are FPGAs of interest?
  What can FPGAs be used for?
  What’s in this book?
  What’s not in this book?
  Who’s this book for?

Chapter 2 Fundamental Concepts
  The key thing about FPGAs
  A simple programmable function
  Fusible link technologies
  Antifuse technologies
  Mask-programmed devices
  PROMs
  EPROM-based technologies
  EEPROM-based technologies
  FLASH-based technologies
  SRAM-based technologies
  Summary

Chapter 3 The Origin of FPGAs
  Related technologies
  Transistors
  Integrated circuits
  SRAMs, DRAMs, and microprocessors
  SPLDs and CPLDs
  ASICs (gate arrays, etc.)
  FPGAs

Chapter 4 Alternative FPGA Architectures
  A word of warning
  A little background information
  Antifuse versus SRAM versus …
  Fine-, medium-, and coarse-grained architectures
  MUX- versus LUT-based logic blocks
  CLBs versus LABs versus slices
  Fast carry chains
  Embedded RAMs
  Embedded multipliers, adders, MACs, etc.
  Embedded processor cores (hard and soft)
  Clock trees and clock managers
  General-purpose I/O
  Gigabit transceivers
  Hard IP, soft IP, and firm IP
  System gates versus real gates
  FPGA years

Chapter 5 Programming (Configuring) an FPGA
  Weasel words
  Configuration files, etc.
  Configuration cells
  Antifuse-based FPGAs
  SRAM-based FPGAs
  Using the configuration port
  Using the JTAG port
  Using an embedded processor

Chapter 6 Who Are All the Players?
  Introduction
  FPGA and FPAA vendors
  FPNA vendors
  Full-line EDA vendors
  FPGA-specialist and independent EDA vendors
  FPGA design consultants with special tools
  Open-source, free, and low-cost design tools

Chapter 7 FPGA Versus ASIC Design Styles
  Introduction
  Coding styles
  Pipelining and levels of logic
  Asynchronous design practices
  Clock considerations
  Register and latch considerations
  Resource sharing (time-division multiplexing)
  State machine encoding
  Test methodologies

Chapter 8 Schematic-Based Design Flows
  In the days of yore
  The early days of EDA
  A simple (early) schematic-driven ASIC flow
  A simple (early) schematic-driven FPGA flow
  Flat versus hierarchical schematics
  Schematic-driven FPGA design flows today

Chapter 9 HDL-Based Design Flows
  Schematic-based flows grind to a halt
  The advent of HDL-based flows
  Graphical design entry lives on
  A positive plethora of HDLs
  Points to ponder

Chapter 10 Silicon Virtual Prototyping for FPGAs
  Just what is an SVP?
  ASIC-based SVP approaches
  FPGA-based SVPs

Chapter 11 C/C++ etc.–Based Design Flows
  Problems with traditional HDL-based flows
  C versus C++ and concurrent versus sequential
  SystemC-based flows
  Augmented C/C++-based flows
  Pure C/C++-based flows
  Different levels of synthesis abstraction
  Mixed-language design and verification environments

Chapter 12 DSP-Based Design Flows
  Introducing DSP
  Alternative DSP implementations
  FPGA-centric design flows for DSPs
  Mixed DSP and VHDL/Verilog etc. environments

Chapter 13 Embedded Processor-Based Design Flows
  Introduction
  Hard versus soft cores
  Partitioning a design into its hardware and software components
  Hardware versus software views of the world
  Using an FPGA as its own development environment
  Improving visibility in the design
  A few coverification alternatives
  A rather cunning design environment

Chapter 14 Modular and Incremental Design
  Handling things as one big chunk
  Partitioning things into smaller chunks
  There’s always another way

Chapter 15 High-Speed Design and Other PCB Considerations
  Before we start
  We were all so much younger then
  The times they are a-changing
  Other things to think about

Chapter 16 Observing Internal Nodes in an FPGA
  Lack of visibility
  Multiplexing as a solution
  Special debugging circuitry
  Virtual logic analyzers
  VirtualWires

Chapter 17 Intellectual Property
  Sources of IP
  Handcrafted IP
  IP core generators
  Miscellaneous stuff

Chapter 18 Migrating ASIC Designs to FPGAs and Vice Versa
  Alternative design scenarios

Chapter 19 Simulation, Synthesis, Verification, etc. Design Tools
  Introduction
  Simulation (cycle-based, event-driven, etc.)
  Synthesis (logic/HDL versus physically aware)
  Timing analysis (static versus dynamic)
  Verification in general
  Formal verification
  Miscellaneous

Chapter 20 Choosing the Right Device
  So many choices
  If only there were a tool
  Technology
  Basic resources and packaging
  General-purpose I/O interfaces
  Embedded multipliers, RAMs, etc.
  Embedded processor cores
  Gigabit I/O capabilities
  IP availability
  Speed grades
  On a happier note

Chapter 21 Gigabit Transceivers
  Introduction
  Differential pairs
  Multiple standards
  8-bit/10-bit encoding, etc.
  Delving into the transceiver blocks
  Ganging multiple transceiver blocks together
  Configurable stuff
  Clock recovery, jitter, and eye diagrams

Chapter 22 Reconfigurable Computing
  Dynamically reconfigurable logic
  Dynamically reconfigurable interconnect
  Reconfigurable computing

Chapter 23 Field-Programmable Node Arrays
  Introduction
  Algorithmic evaluation
  picoChip’s picoArray technology
  QuickSilver’s ACM technology
  It’s silicon, Jim, but not as we know it!

Chapter 24 Independent Design Tools
  Introduction
  ParaCore Architect
  The Confluence system design language
  Do you have a tool?

Chapter 25 Creating an Open-Source-Based Design Flow
  How to start an FPGA design shop for next to nothing
  The development platform: Linux
  The verification environment
  Formal verification
  Access to common IP components
  Synthesis and implementation tools
  FPGA development boards
  Miscellaneous stuff

Chapter 26 Future FPGA Developments
  Be afraid, be very afraid
  Next-generation architectures and technologies
  Don’t forget the design tools
  Expect the unexpected

Appendix A: Signal Integrity 101
  Before we start
  Capacitive and inductive coupling (crosstalk)
  Chip-level effects
  Board-level effects

Appendix B: Deep-Submicron Delay Effects 101
  Introduction
  The evolution of delay specifications
  A potpourri of definitions
  Alternative interconnect models
  DSM delay effects
  Summary

Appendix C: Linear Feedback Shift Registers 101
  The Ouroboras
  Many-to-one implementations
  More taps than you know what to do with
  Seeding an LFSR
  FIFO applications
  Modifying LFSRs to sequence 2^n values
  Accessing the previous value
  Encryption and decryption applications
  Cyclic redundancy check applications
  Data compression applications
  Built-in self-test applications
  Pseudorandom-number-generation applications
  Last but not least

Glossary
About the Author
Index

Preface

This is something of a curious, atypical book for the technical genre (and as the author, I should know). I say this because this tome is intended to be of interest to an unusually broad and diverse readership. The primary audience comprises fully fledged engineers who are currently designing with field programmable gate arrays (FPGAs) or who are planning to do so in the not-so-distant future. Thus, Section 2: Creating FPGA-Based Designs introduces a wide range of different design flows, tools, and concepts with lots of juicy technical details that only an engineer could love.
By comparison, other areas of the book—such as Section 1: Fundamental Concepts—cover a variety of topics at a relatively low technical level.

The reason for this dichotomy is that there is currently a tremendous amount of interest in FPGAs, especially from people who have never used or considered them before. The first FPGA devices were relatively limited in the number of equivalent logic gates they supported and the performance they offered, so any “serious” (large, complex, high-performance) designs were automatically implemented as application-specific integrated circuits (ASICs) or application-specific standard parts (ASSPs). However, designing and building ASICs and ASSPs is an extremely time-consuming and expensive hobby, with the added disadvantage that the final design is “frozen in silicon” and cannot be easily modified without creating a new version of the device.

By comparison, the cost of creating an FPGA design is much lower than that for an ASIC or ASSP. At the same time, implementing design changes is much easier in FPGAs and the time-to-market for such designs is much faster. Of particular interest is the fact that new FPGA architectures containing millions of equivalent logic gates, embedded processors, and ultra-high-speed interfaces have recently become available. These devices allow FPGAs to be used for applications that would—until now—have been the purview only of ASICs and ASSPs.

With regard to those FPGA devices featuring embedded processors, such designs require the collaboration of hardware and software engineers. In many cases, the software engineers may not be particularly familiar with some of the nitty-gritty design considerations associated with the hardware aspects of these devices. Thus, in addition to hardware design engineers, this book is also intended to be of interest to those members of the software fraternity who are tasked with creating embedded applications for these devices.
Further intended audiences are electronics engineering students in colleges and universities; sales, marketing, and other folks working for EDA and FPGA companies; and analysts and magazine editors. Many of these readers will appreciate the lower technical level of the introductory material found in Section 1 and also in the “101-style” appendices.

Last but not least, I tend to write the sort of book that I myself would care to read. (At this moment in time, I would particularly like to read this book—upon which I’m poised to commence work—because then I would have some clue as to what I was going to write … if you see what I mean.) Truth to tell, I rarely read technical books myself anymore because they usually bore my socks off. For this reason, in my own works I prefer to mix complex topics with underlying fundamental concepts (“where did this come from” and “why do we do it this way”) along with interesting nuggets of trivia. This has the added advantage that when my mind starts to wander in my autumn years, I will be able to amaze and entertain myself by rereading my own works (it’s always nice to have something to look forward to).

Clive “Max” Maxfield, June 2003—January 2004

Acknowledgments

I’ve long wanted to write a book on FPGAs, so I was delighted when my publisher—Carol Lewis at Elsevier Science (which I’m informed is the largest English-language publisher in the world)—presented me with the opportunity to do so.

There was one slight problem, however, in that I’ve spent much of the last 10 years of my life slaving away the days at my real job, and then whiling away my evenings and weekends penning books. At some point it struck me that it would be nice to “get a life” and spend some time hanging out with my family and friends. Hence, I was delighted when the folks at Mentor Graphics and Xilinx offered to sponsor the creation of this tome, thereby allowing me to work on it in the days and to keep my evenings and weekends free.
Even better, being an engineer by trade, I hate picking up a book that purports to be technical in nature, but that somehow manages to mutate into a marketing diatribe while I’m not looking. So I was delighted when both sponsors made it clear that this book should not be Mentor-centric or Xilinx-centric, but should instead present any and all information I deemed to be useful without fear or favor.

You really can’t write a book like this one in isolation, and I received tremendous amounts of help and advice from people too numerous to mention. I would, however, like to express my gratitude to all of the folks at Mentor and Xilinx who gave me so much of their time and information. Thanks also to Gary Smith and Daya Nadamuni from Gartner DataQuest and Richard Goering from EETimes, who always make the time to answer my e-mails with the dread subject line “Just one more little question ...”

I would also like to mention the fact that the folks at 0-In, AccelChip, Actel, Aldec, Altera, Altium, Axis, Cadence, Carbon, Celoxica, Elanix, InTime, Magma, picoChip, QuickLogic, QuickSilver, Synopsys, Synplicity, The MathWorks, Hier Design, and Verisity were extremely helpful.[1] It also behooves me to mention that Tom Hawkins from Launchbird Design Systems went above and beyond the call of duty in giving me his sagacious observations into open-source design tools. Similarly, Dr. Eric Bogatin at GigaTest Labs was kind enough to share his insights into signal integrity effects at the circuit board level.

Last, but certainly not least, thanks go once again to my publisher—Carol Lewis at Elsevier Science—for allowing me to abstract the contents of appendix B from my book Designus Maximus Unleashed (ISBN 0-7506-9089-5) and also for allowing me to abstract the contents of appendix C from my book Bebop to the Boolean Boogie (An Unconventional Guide to Electronics), Second Edition (ISBN 0-7506-7543-8).

[1] If I’ve forgotten anyone, I’m really sorry (let me know, and I’ll add you into the book for the next production run).

Chapter 1
Introduction

What are FPGAs?

Field programmable gate arrays (FPGAs) are digital integrated circuits (ICs) that contain configurable (programmable) blocks of logic along with configurable interconnects between these blocks. Design engineers can configure (program) such devices to perform a tremendous variety of tasks.

Depending on the way in which they are implemented, some FPGAs may only be programmed a single time, while others may be reprogrammed over and over again. Not surprisingly, a device that can be programmed only one time is referred to as one-time programmable (OTP).

The “field programmable” portion of the FPGA’s name refers to the fact that its programming takes place “in the field” (as opposed to devices whose internal functionality is hardwired by the manufacturer). This may mean that FPGAs are configured in the laboratory, or it may refer to modifying the function of a device resident in an electronic system that has already been deployed in the outside world. If a device is capable of being programmed while remaining resident in a higher-level system, it is referred to as being in-system programmable (ISP).

FPGA is pronounced by spelling it out as “F-P-G-A.” IC is pronounced by spelling it out as “I-C.” OTP is pronounced by spelling it out as “O-T-P.” ISP is pronounced by spelling it out as “I-S-P.”

Why are FPGAs of interest?

There are many different types of digital ICs, including “jelly-bean logic” (small components containing a few simple, fixed logical functions), memory devices, and microprocessors (µPs). Of particular interest to us here, however, are programmable logic devices (PLDs), application-specific integrated circuits (ASICs), application-specific standard parts (ASSPs), and—of course—FPGAs.

Pronounced “mu” to rhyme with “phew,” the “µ” in “µP” comes from the Greek micros, meaning “small.” PLD is pronounced by spelling it out as “P-L-D.” SPLD is pronounced by spelling it out as “S-P-L-D.” CPLD is pronounced by spelling it out as “C-P-L-D.” ASIC is pronounced “A-SIC.” That is, by spelling out the “A” to rhyme with “hay,” followed by “SIC” to rhyme with “tick.” ASSP is pronounced by spelling it out as “A-S-S-P.”

For the purposes of this portion of our discussion, we shall consider the term PLD to encompass both simple programmable logic devices (SPLDs) and complex programmable logic devices (CPLDs). Various aspects of PLDs, ASICs, and ASSPs will be introduced in more detail in chapters 2 and 3.

For the nonce, we need only be aware that PLDs are devices whose internal architecture is predetermined by the manufacturer, but which are created in such a way that they can be configured (programmed) by engineers in the field to perform a variety of different functions. In comparison to an FPGA, however, these devices contain a relatively limited number of logic gates, and the functions they can be used to implement are much smaller and simpler.

At the other end of the spectrum are ASICs and ASSPs, which can contain hundreds of millions of logic gates and can be used to create incredibly large and complex functions. ASICs and ASSPs are based on the same design processes and manufacturing technologies.
Both are custom-designed to address a specific application, the only difference being that an ASIC is designed and built to order for use by a specific company, while an ASSP is marketed to multiple customers. (When we use the term ASIC henceforth, it may be assumed that we are also referring to ASSPs unless otherwise noted or where such interpretation is inconsistent with the context.)

Although ASICs offer the ultimate in size (number of transistors), complexity, and performance, designing and building one is an extremely time-consuming and expensive process, with the added disadvantage that the final design is “frozen in silicon” and cannot be modified without creating a new version of the device.

Thus, FPGAs occupy a middle ground between PLDs and ASICs because their functionality can be customized in the field like PLDs, but they can contain millions of logic gates[1] and be used to implement extremely large and complex functions that previously could be realized only using ASICs.

The cost of an FPGA design is much lower than that of an ASIC (although the ensuing ASIC components are much cheaper in large production runs). At the same time, implementing design changes is much easier in FPGAs, and the time-to-market for such designs is much faster.

Thus, FPGAs make a lot of small, innovative design companies viable because—in addition to their use by large system design houses—FPGAs facilitate “Fred-in-the-shed”–type operations. This means they allow individual engineers or small groups of engineers to realize their hardware and software concepts on an FPGA-based test platform without having to incur the enormous nonrecurring engineering (NRE) costs or purchase the expensive toolsets associated with ASIC designs. Hence, there were estimated to be only 1,500 to 4,000 ASIC design starts[2] and 5,000 ASSP design starts in 2003 (these numbers are falling dramatically year by year), as opposed to an educated “guesstimate” of around 450,000 FPGA design starts[3] in the same year.

NRE is pronounced by spelling it out as “N-R-E.”

[1] The concept of what actually comprises a “logic gate” becomes a little murky in the context of FPGAs. This topic will be investigated in excruciating detail in chapter 4.

[2] This number is pretty vague because it depends on whom you talk to (not surprisingly, FPGA vendors tend to proclaim the lowest possible estimate, while other sources range all over the place).

[3] Another reason these numbers are a little hard to pin down is that it’s difficult to get everyone to agree what a “design start” actually is. In the case of an ASIC, for example, should we include designs that are canceled in the middle, or should we only consider designs that make it all the way to tape-out? Things become even fluffier when it comes to FPGAs due to their reconfigurability. Perhaps more telling is the fact that, after pointing me toward an FPGA-centric industry analyst’s Web site, a representative from one FPGA vendor added, “But the values given there aren’t very accurate.” When I asked why, he replied with a sly grin, “Mainly because we don’t provide him with very good data!”

What can FPGAs be used for?

When they first arrived on the scene in the mid-1980s, FPGAs were largely used to implement glue logic,[4] medium-complexity state machines, and relatively limited data processing tasks. During the early 1990s, as the size and sophistication of FPGAs started to increase, their big markets at that time were in the telecommunications and networking arenas, both of which involved processing large blocks of data and pushing that data around. Later, toward the end of the 1990s, the use of FPGAs in consumer, automotive, and industrial applications underwent a humongous growth spurt.

FPGAs are often used to prototype ASIC designs or to provide a hardware platform on which to verify the physical implementation of new algorithms. However, their low development cost and short time-to-market mean that they are increasingly finding their way into final products (some of the major FPGA vendors actually have devices that they specifically market as competing directly against ASICs).

By the early 2000s, high-performance FPGAs containing millions of gates had become available. Some of these devices feature embedded microprocessor cores, high-speed input/output (I/O) interfaces, and the like. The end result is that today’s FPGAs can be used to implement just about anything, including communications devices and software-defined radios; radar, image, and other digital signal processing (DSP) applications; all the way up to system-on-chip (SoC)[5] components that contain both hardware and software elements.

I/O is pronounced by spelling it out as “I-O.” SoC is pronounced by spelling it out as “S-O-C.”

[4] The term glue logic refers to the relatively small amounts of simple logic that are used to connect (“glue”)—and interface between—larger logical blocks, functions, or devices.

[5] Although the term system-on-chip (SoC) would tend to imply an entire electronic system on a single device, the current reality is that you invariably require additional components. Thus, more accurate appellations might be subsystem-on-chip (SSoC) or part of a system-on-chip (PoaSoC).

To be just a tad more specific, FPGAs are currently eating into four major market segments: ASIC and custom silicon, DSP, embedded microcontroller applications, and physical layer communication chips. Furthermore, FPGAs have created a new market in their own right: reconfigurable computing (RC).
■ ■ ■ ■ ASIC and custom silicon: As was discussed in the previous section, today’s FPGAs are increasingly being used to implement a variety of designs that could previously have been realized using only ASICs and custom silicon. Digital signal processing: High-speed DSP has traditionally been implemented using specially tailored microprocessors called digital signal processors (DSPs). However, today’s FPGAs can contain embedded multipliers, dedicated arithmetic routing, and large amounts of on-chip RAM, all of which facilitate DSP operations. When these features are coupled with the massive parallelism provided by FPGAs, the result is to outperform the fastest DSP chips by a factor of 500 or more. Embedded microcontrollers: Small control functions have traditionally been handled by special-purpose embedded processors called microcontrollers. These lowcost devices contain on-chip program and instruction memories, timers, and I/O peripherals wrapped around a processor core. FPGA prices are falling, however, and even the smallest devices now have more than enough capability to implement a soft processor core combined with a selection of custom I/O functions. The end result is that FPGAs are becoming increasingly attractive for embedded control applications. Physical layer communications: FPGAs have long been used to implement the glue logic that interfaces between physical layer communication chips and highlevel networking protocol layers. The fact that today’s high-end FPGAs can contain multiple high-speed transceivers means that communications and network- ■ 5 RC is pronounced by spelling it out as “R-C.” DSP is pronounced by spelling it out as “D-S-P.” RAM is pronounced to rhyme with “ham.” 6 ■ The Design Warrior's Guide to FPGAs ■ EDA is pronounced by spelling it out as “E-D-A.” ing functions can be consolidated into a single device. 
Reconfigurable computing: This refers to exploiting the inherent parallelism and reconfigurability provided by FPGAs to “hardware accelerate” software algorithms. Various companies are currently building huge FPGA-based reconfigurable computing engines for tasks ranging from hardware simulation to cryptography analysis to discovering new drugs. What’s in this book? Anyone involved in the electronics design or electronic design automation (EDA) arenas knows that things are becoming evermore complex as the years go by, and FPGAs are no exception to this rule. Life was relatively uncomplicated in the early days—circa the mid-1980s—when FPGAs had only recently leaped onto the stage. The first devices contained only a few thousand simple logic gates (or the equivalent thereof), and the flows used to design these components—predominantly based on the use of schematic capture—were easy to understand and use. By comparison, today’s FPGAs are incredibly complex, and there are more design tools, flows, and techniques than you can swing a stick at. This book commences by introducing fundamental concepts and the various flavors of FPGA architectures and devices that are available. It then explores the myriad of design tools and flows that may be employed depending on what the design engineers are hoping to achieve. Furthermore, in addition to looking “inside the FPGA,” this book also considers the implications associated with integrating the device into the rest of the system in the form of a circuit board, including discussions on the gigabit interfaces that have only recently become available. Last but not least, electronic conversations are jam-packed with TLAs, which is a tongue-in-cheek joke that stands for Introduction “three-letter acronyms.” If you say things the wrong way when talking to someone in the industry, you immediately brand yourself as an outsider (one of “them” as opposed to one of “us”). 
For this reason, whenever we introduce new TLAs (or their larger cousins), we also include a note on how to pronounce them.[6]

TLA is pronounced by spelling it out as “T-L-A.”

[6] In certain cases, the pronunciation for a particular TLA may appear in multiple chapters to help readers who are “cherry-picking” specific topics, rather than slogging their way through the book from cover to cover.

What’s not in this book?

This tome does not focus on particular FPGA vendors or specific FPGA devices, because new features and chip types appear so rapidly that anything written here would be out of date before the book hit the streets (sometimes before the author had completed the relevant sentence).

Similarly, as far as possible (and insofar as it makes sense to do so), this book does not mention individual EDA vendors or reference their tools by name because these vendors are constantly acquiring each other, changing the names of (or otherwise transmogrifying) their companies, or varying the names of their design and analysis tools. Similarly, things evolve so quickly in this industry that there is little point in saying “Tool A has this feature, but Tool B doesn’t,” because in just a few months’ time Tool B will probably have been enhanced, while Tool A may well have been put out to pasture.

For all of these reasons, this book primarily introduces different flavors of FPGA devices and a variety of design tool concepts and flows, but it leaves it up to the reader to research which FPGA vendors support specific architectural constructs and which EDA vendors and tools support specific features (useful Web addresses are presented in chapter 6).

2,400,000 BC: Hominids in Africa

Who’s this book for?
This book is intended for a wide-ranging audience, which includes
■ Small FPGA design consultants
■ Hardware and software design engineers in larger system houses
■ ASIC designers who are migrating into the FPGA arena
■ DSP designers who are starting to use FPGAs
■ Students in colleges and universities
■ Sales, marketing, and other guys and gals working for EDA and FPGA companies
■ Analysts and magazine editors

Chapter 2 Fundamental Concepts

The key thing about FPGAs

The thing that really distinguishes an FPGA from an ASIC is … the crucial aspect that resides at the core of their reason for being is … embodied in their name: field-programmable gate array. All joking aside, the point is that in order to be programmable, we need some mechanism that allows us to configure (program) a prebuilt silicon chip.

A simple programmable function

As a basis for these discussions, let’s start by considering a very simple programmable function with two inputs called a and b and a single output y (Figure 2-1).

Figure 2-1. A simple programmable function.

The inverting (NOT) gates associated with the inputs mean that each input is available in both its true (unmodified) and complemented (inverted) form. Observe the locations of the potential links. In the absence of any of these links, all of the inputs to the AND gate are connected via pull-up resistors to a logic 1 value. In turn, this means that the output y will always be driving a logic 1, which makes this circuit a very boring one in its current state. In order to make our function more interesting, we need some mechanism that allows us to establish one or more of the potential links.

25,000 BC: The first boomerang is used by people in what is now Poland, 13,000 years before the Australians.

Fusible link technologies

One of the first techniques that allowed users to program their own devices was, and still is, known as fusible-link technology.
In this case, the device is manufactured with all of the links in place, where each link is referred to as a fuse (Figure 2-2).

Figure 2-2. Augmenting the device with unprogrammed fusible links.

These fuses are similar in concept to the fuses you find in household products like a television. If anything untoward occurs such that the television starts consuming too much power, its fuse will burn out. This results in an open circuit (a break in the wire), which protects the rest of the unit from harm. Of course, the fuses in a silicon chip are formed using the same processes that are employed to create the transistors and wires on the chip, so they are microscopically small.

When an engineer purchases a programmable device based on fusible links, all of the fuses are initially intact. This means that, in its unprogrammed state, the output from our example function will always be logic 0. (Any 0 presented to the input of an AND gate will cause its output to be 0, so if input a is 0, the output from the AND will be 0. Alternatively, if input a is 1, then the output from its NOT gate, which we shall call !a, will be 0, and once again the output from the AND will be 0. A similar situation occurs in the case of input b.)

The point is that design engineers can selectively remove undesired fuses by applying pulses of relatively high voltage and current to the device’s inputs. For example, consider what happens if we remove fuses Faf and Fbt (Figure 2-3).

Figure 2-3. Programmed fusible links.

Removing these fuses disconnects the complementary version of input a and the true version of input b from the AND gate (the pull-up resistors associated with these signals cause their associated inputs to the AND to be presented with logic 1 values). This leaves the device to perform its new function, which is y = a & !b.
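The fuse behavior described above can be checked with a small software model. The following Python sketch is only an illustration under stated assumptions: the fuse names Fat, Faf, Fbt, and Fbf are taken from Figure 2-2, while the function names (and_plane, blow, truth_table) and the dictionary representation are hypothetical choices made for this example, not part of any real device or tool.

```python
from itertools import product

# True means the fuse is intact; every fuse is intact when the device is purchased.
UNPROGRAMMED = {"Fat": True, "Faf": True, "Fbt": True, "Fbf": True}


def and_plane(a, b, fuses):
    """Evaluate the AND gate of the example function for one input pair.

    An intact fuse passes its signal to the AND gate. A blown fuse leaves
    that AND input held at logic 1 by its pull-up resistor.
    """
    signals = {"Fat": a, "Faf": 1 - a, "Fbt": b, "Fbf": 1 - b}
    y = 1
    for name, value in signals.items():
        y &= value if fuses[name] else 1
    return y


def blow(fuses, *names):
    """Return a new fuse map with the named fuses removed (there is no undo)."""
    programmed = dict(fuses)
    for name in names:
        programmed[name] = False
    return programmed


def truth_table(fuses):
    """List y for (a, b) = (0,0), (0,1), (1,0), (1,1)."""
    return [and_plane(a, b, fuses) for a, b in product((0, 1), repeat=2)]


print(truth_table(UNPROGRAMMED))                      # always logic 0
print(truth_table(blow(UNPROGRAMMED, "Faf", "Fbt")))  # y = a & !b
```

With every fuse intact the output is 0 for all four input combinations; after removing Faf and Fbt the output is 1 only when a is 1 and b is 0.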
(In the equation y = a & !b, the “&” character is used to represent the AND, while the “!” character is used to represent the NOT. This syntax is discussed in a little more detail in chapter 3.) This process of removing fuses is typically referred to as programming the device, but it may also be referred to as blowing the fuses or burning the device.

Devices based on fusible-link technologies are said to be one-time programmable, or OTP, because once a fuse has been blown, it cannot be replaced and there’s no going back.

OTP is pronounced by spelling it out as “O-T-P.”

As fate would have it, although modern FPGAs are based on a wide variety of programming technologies, the fusible-link approach isn’t one of them. The reasons for mentioning it here are that it sets the scene for what is to come, and it’s relevant in the context of the precursor device technologies referenced in chapter 3.

2,500 BC: Soldering is invented in Mesopotamia, to join sheets of gold.

Antifuse technologies

As a diametric alternative to fusible-link technologies, we have their antifuse counterparts, in which each configurable path has an associated link called an antifuse. In its unprogrammed state, an antifuse has such a high resistance that it may be considered an open circuit (a break in the wire), as illustrated in Figure 2-4.

Figure 2-4. Unprogrammed antifuse links.

This is the way the device appears when it is first purchased. However, antifuses can be selectively “grown” (programmed) by applying pulses of relatively high voltage and current to the device’s inputs. For example, if we add the antifuses associated with the complementary version of input a and the true version of input b, our device will now perform the function y = !a & b (Figure 2-5).

Figure 2-5. Programmed antifuse links.

An antifuse commences life as a microscopic column of amorphous (noncrystalline) silicon linking two metal tracks. In its unprogrammed state, the amorphous silicon acts as an insulator with a very high resistance in excess of one billion ohms (Figure 2-6a).

Figure 2-6. Growing an antifuse.

The act of programming this particular element effectively “grows” a link (known as a via) by converting the insulating amorphous silicon into conducting polysilicon (Figure 2-6b).

Not surprisingly, devices based on antifuse technologies are OTP, because once an antifuse has been grown, it cannot be removed, and there’s no changing your mind.

260 BC: Archimedes works out the principle of the lever.

Mask-programmed devices

Before we proceed further, a little background may be advantageous in order to understand the basis for some of the nomenclature we’re about to run into. Electronic systems in general, and computers in particular, make use of two major classes of memory devices: read-only memory (ROM) and random-access memory (RAM).

ROM is pronounced to rhyme with “bomb.” RAM is pronounced to rhyme with “ham.”

The concept of photomasks and the way in which silicon chips are created are described in more detail in Bebop to the Boolean Boogie (An Unconventional Guide to Electronics), ISBN 0-7506-7543-8.

The term bit (meaning “binary digit”) was coined by John Wilder Tukey, the American chemist, turned topologist, turned statistician in the 1940s.

ROMs are said to be nonvolatile because their data remains when power is removed from the system. Other components in the system can read data from ROM devices, but they cannot write new data into them. By comparison, data can be both written into and read out of RAM devices, which are said to be volatile because any data they contain is lost when the system is powered down.
Basic ROMs are also said to be mask-programmed because any data they contain is hard-coded into them during their construction by means of the photo-masks that are used to create the transistors and the metal tracks (referred to as the metallization layers) connecting them together on the silicon chip. For example, consider a transistor-based ROM cell that can hold a single bit of data (Figure 2-7).

Figure 2-7. A transistor-based mask-programmed ROM cell.

The entire ROM consists of a number of row (word) and column (data) lines forming an array. Each column has a single pull-up resistor attempting to hold that column to a weak logic 1 value, and every row-column intersection has an associated transistor and, potentially, a mask-programmed connection.

The majority of the ROM can be preconstructed, and the same underlying architecture can be used for multiple customers. When it comes to customizing the device for use by a particular customer, a single photo-mask is used to define which cells are to include a mask-programmed connection and which cells are to be constructed without such a connection.

Tukey had initially considered using “binit” or “bigit,” but thankfully he settled on “bit,” which is much easier to say and use. The term software is also attributed to Tukey.

Now consider what happens when a row line is placed in its active state, thereby attempting to activate all of the transistors connected to that row. In the case of a cell that includes a mask-programmed connection, activating that cell’s transistor will connect the column line through the transistor to logic 0, so the value appearing on that column as seen from the outside world will be a 0.
By comparison, in the case of a cell that doesn’t have a mask-programmed connection, that cell’s transistor will have no effect, so the pull-up resistor associated with that column will hold the column line at logic 1, which is the value that will be presented to the outside world.

PROMs

The problem with mask-programmed devices is that creating them is a very expensive pastime unless you intend to produce them in extremely large quantities. Furthermore, such components are of little use in a development environment in which you often need to modify their contents. For this reason, the first programmable read-only memory (PROM) devices were developed at Harris Semiconductor in 1970. These devices were created using a nichrome-based fusible-link technology. As a generic example, consider a somewhat simplified representation of a transistor-and-fusible-link-based PROM cell (Figure 2-8).

PROM is pronounced just like the high school dance of the same name.

15 BC: The Chinese invent the belt drive.

Figure 2-8. A transistor-and-fusible-link-based PROM cell.

In its unprogrammed state as provided by the manufacturer, all of the fusible links in the device are present. In this case, placing a row line in its active state will turn on all of the transistors connected to that row, thereby causing all of the column lines to be pulled down to logic 0 via their respective transistors. As we previously discussed, however, design engineers can selectively remove undesired fuses by applying pulses of relatively high voltage and current to the device’s inputs. Wherever a fuse is removed, that cell will appear to contain a logic 1.

It’s important to note that these devices were initially intended for use as memories to store computer programs and constant data values (hence the “ROM” portion of their appellation).
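The read behavior of the fusible-link PROM array described above can be sketched in a few lines of Python. This is a behavioral illustration only, not a circuit model: the class name Prom, its methods, and the array dimensions are hypothetical, and the sketch assumes that programming a word simply means blowing the fuse in every bit position that should read as 1.

```python
class Prom:
    """Behavioral sketch of a fusible-link PROM: rows are words, columns are data bits."""

    def __init__(self, rows, cols):
        # True means the fuse is intact; all fuses are present as manufactured.
        self.fuse = [[True] * cols for _ in range(rows)]

    def read(self, row):
        """Activate one row line and return the value seen on each column.

        An intact fuse lets the cell's transistor pull the column to logic 0.
        A blown fuse leaves the column held at logic 1 by its pull-up resistor.
        """
        return [0 if intact else 1 for intact in self.fuse[row]]

    def program(self, row, word):
        """Blow fuses so that the row reads back as `word` (a list of 0s and 1s)."""
        for col, bit in enumerate(word):
            if bit == 1:
                self.fuse[row][col] = False      # blowing a fuse is permanent
            elif not self.fuse[row][col]:
                raise ValueError("a blown fuse cannot be restored")


prom = Prom(rows=4, cols=8)
print(prom.read(2))                        # unprogrammed word reads as all 0s
prom.program(2, [1, 0, 1, 1, 0, 0, 0, 1])
print(prom.read(2))                        # reads back the programmed word
```

The ValueError branch reflects the one-time-programmable nature of the device: a bit that already reads 1 cannot be returned to 0, although further fuses in the same word can still be blown.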
However, design engineers also found PROMs useful for implementing simple logical functions such as lookup tables and state machines. The fact that PROMs were relatively cheap meant that these devices could be used to fix bugs or test new implementations by simply burning a new device and plugging it into the system.

Over time, a variety of more general-purpose PLDs based on fusible-link and antifuse technologies became available (these devices are introduced in more detail in chapter 3).

EPROM-based technologies

As was previously noted, devices based on fusible-link or antifuse technologies can only be programmed a single time; once you’ve blown (or grown) a fuse, it’s too late to change your mind. (In some cases, it’s possible to incrementally modify devices by blowing, or growing, additional fuses, but the fates have to be smiling in your direction.) For this reason, people started to think that it would be nice if there were some way to create devices that could be programmed, erased, and reprogrammed with new data.

One alternative is a technology known as erasable programmable read-only memory (EPROM), with the first such device (the 1702) being introduced by Intel in 1971. An EPROM transistor has the same basic structure as a standard MOS transistor, but with the addition of a second polysilicon floating gate isolated by layers of oxide (Figure 2-9).

Figure 2-9. Standard MOS versus EPROM transistors: (a) standard MOS transistor; (b) EPROM transistor.

In its unprogrammed state, the floating gate is uncharged and doesn’t affect the normal operation of the control gate.
In order to program the transistor, a relatively high voltage (on the order of 12 V) is applied between the control gate and drain terminals. This causes the transistor to be turned hard on, and energetic electrons force their way through the oxide into the floating gate in a process known as hot (high energy) electron injection. When the programming signal is removed, a negative charge remains on the floating gate. This charge is very stable and will not dissipate for more than a decade under normal operating conditions. The stored charge on the floating gate inhibits the normal operation of the control gate and, thus, distinguishes those cells that have been programmed from those that have not. This means we can use such a transistor to form a memory cell (Figure 2-10).

EPROM is pronounced by spelling out the “E” to rhyme with “bee,” followed by “PROM.”

60 AD: Hero, an Alexandrian Greek, builds a toy powered by steam.

Figure 2-10. An EPROM transistor-based memory cell.

Observe that this cell no longer requires a fusible-link, antifuse, or mask-programmed connection. In its unprogrammed state, as provided by the manufacturer, all of the floating gates in the EPROM transistors are uncharged. In this case, placing a row line in its active state will turn on all of the transistors connected to that row, thereby causing all of the column lines to be pulled down to logic 0 via their respective transistors. In order to program the device, engineers can use the inputs to the device to charge the floating gates associated with selected transistors, thereby disabling those transistors. In these cases, the cells will appear to contain logic 1 values.

As they are an order of magnitude smaller than fusible links, EPROM cells are efficient in terms of silicon real estate.
Their main claim to fame, however, is that they can be erased and reprogrammed. An EPROM cell is erased by discharging the electrons on that cell’s floating gate. The energy required to discharge the electrons is provided by a source of ultraviolet (UV) radiation. An EPROM device is delivered in a ceramic or plastic package with a small quartz window in the top, where this window is usually covered with a piece of opaque sticky tape. In order for the device to be erased, it is first removed from its host circuit board, its quartz window is uncovered, and it is placed in an enclosed container with an intense UV source.

The main problems with EPROM devices are their expensive packages with quartz windows and the time it takes to erase them, which is on the order of 20 minutes. A foreseeable problem with future devices is paradoxically related to improvements in the process technologies that allow transistors to be made increasingly smaller. As the structures on the device become smaller and the density (number of transistors and interconnects) increases, a larger percentage of the surface of the die is covered by metal. This makes it difficult for the EPROM cells to absorb the UV light and increases the required exposure time.

Once again, these devices were initially intended for use as programmable memories (hence the “PROM” portion of their name). However, the same technology was later applied to more general-purpose PLDs, which therefore became known as erasable PLDs (EPLDs).

EEPROM-based technologies

The next rung up the technology ladder appeared in the form of electrically erasable programmable read-only memories (EEPROMs or E2PROMs).
An E2PROM cell is approximately 2.5 times larger than an equivalent EPROM cell because it comprises two transistors and the space between them (Figure 2-11).

UV is pronounced by spelling it out as “U-V.” EPLD is pronounced by spelling it out as “E-P-L-D.”

EEPROM is pronounced by spelling out the “E-E” to rhyme with “bee-bee,” followed by “PROM.” In the case of the alternative E2PROM designation, the “E2” stands for “E to the power of two,” or “E-squared.” Thus, E2PROM is pronounced “E-squared-PROM.”

Figure 2-11. An E2PROM cell (a normal MOS transistor alongside an E2PROM transistor).

The E2PROM transistor is similar to that of an EPROM transistor in that it contains a floating gate, but the insulating oxide layers surrounding this gate are very much thinner. The second transistor can be used to erase the cell electrically.

E2PROMs first saw the light of day as computer memories, but the same technology was subsequently applied to PLDs, which therefore became known as electrically erasable PLDs (EEPLDs or E2PLDs).

EEPLD is pronounced by spelling it out as “E-E-P-L-D.” E2PLD is pronounced “E-squared-P-L-D.”

FLASH-based technologies

A development known as FLASH can trace its ancestry to both the EPROM and E2PROM technologies. The name “FLASH” was originally coined to reflect this technology’s rapid erasure times compared to EPROM. Components based on FLASH can employ a variety of architectures. Some have a single floating gate transistor cell with the same area as an EPROM cell, but with the thinner oxide layers characteristic of an E2PROM component. These devices can be electrically erased, but only by clearing the whole device or large portions thereof. Other architectures feature a two-transistor cell similar to that of an E2PROM cell, thereby allowing them to be erased and reprogrammed on a word-by-word basis.

Initial versions of FLASH could only store a single bit of data per cell.
By 2002, however, technologists were experimenting with a number of different ways of increasing this capacity. One technique involves storing distinct levels of charge in the FLASH transistor’s floating gate to represent two bits per cell. An alternative approach involves creating two discrete storage nodes in a layer below the gate, thereby supporting two bits per cell.

SRAM-based technologies

There are two main versions of semiconductor RAM devices: dynamic RAM (DRAM) and static RAM (SRAM). In the case of DRAMs, each cell is formed from a transistor-capacitor pair that consumes very little silicon real estate. The “dynamic” qualifier is used because the capacitor loses its charge over time, so each cell must be periodically recharged if it is to retain its data. This operation, known as refreshing, is a tad complex and requires a substantial amount of additional circuitry. When the “cost” of this refresh circuitry is amortized over tens of millions of bits in a DRAM memory device, this approach becomes very cost effective. However, DRAM technology is of little interest with regard to programmable logic.

DRAM is pronounced by spelling out the “D” to rhyme with “knee,” followed by “RAM” to rhyme with “spam.” SRAM is pronounced by spelling out the “S” to rhyme with “less,” followed by “RAM” to rhyme with “Pam.”

By comparison, the “static” qualifier associated with SRAM is employed because, once a value has been loaded into an SRAM cell, it will remain unchanged unless it is specifically altered or until power is removed from the system. Consider the symbol for an SRAM-based programmable cell (Figure 2-12).

Figure 2-12. An SRAM-based programmable cell.

The entire cell comprises a multitransistor SRAM storage element whose output drives an additional control transistor. Depending on the contents of the storage element (logic 0 or logic 1), the control transistor will either be OFF (disabled) or ON (enabled).
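The behavior of the SRAM-based programmable cell can be summarized in a short Python sketch. This is an illustrative model under stated assumptions: the class and method names are hypothetical, a stored logic 1 is assumed to turn the control transistor ON (the text lists the two states without fixing the polarity), and the contents after power is removed are modeled as None to mean “unknown.”

```python
class SramCell:
    """Sketch of one SRAM-based programmable cell: storage element plus control transistor."""

    def __init__(self):
        self.value = None  # contents are unknown until the cell has been loaded

    def load(self, bit):
        """Write a configuration bit; it stays put until rewritten or power is removed."""
        if bit not in (0, 1):
            raise ValueError("bit must be 0 or 1")
        self.value = bit

    def remove_power(self):
        """Removing power discards the stored value."""
        self.value = None

    @property
    def transistor_on(self):
        # Assumption for this sketch: logic 1 enables the transistor, anything else disables it.
        return self.value == 1


cell = SramCell()
cell.load(1)
print(cell.transistor_on)   # True: the link controlled by this cell is enabled
cell.load(0)
print(cell.transistor_on)   # False: the cell can be rewritten as often as required
cell.load(1)
cell.remove_power()
print(cell.value)           # None: the configuration must be loaded again after power-up
```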
One disadvantage of having a programmable device based on SRAM cells is that each cell consumes a significant amount of silicon real estate because these cells are formed from four or six transistors configured as a latch. Another disadvantage is that the device’s configuration data (programmed state) will be lost when power is removed from the system. In turn, this means that these devices always have to be reprogrammed when the system is powered on. However, such devices have the corresponding advantage that they can be reprogrammed quickly and repeatedly as required.

The way in which these cells are used in SRAM-based FPGAs is discussed in more detail in the following chapters. For our purposes here, we need only note that such cells could conceptually be used to replace the fusible links in our example circuit shown in Figure 2-2, the antifuse links in Figure 2-4, or the transistor (and associated mask-programmed connection) associated with the ROM cell in Figure 2-7 (of course, this latter case, having an SRAM-based ROM, would be meaningless).

Summary

Table 2-1 shows the devices with which the various programming technologies are predominantly associated.

Table 2-1. Summary of Programming Technologies

Additionally, we shouldn’t forget that new technologies are constantly bobbing to the surface. Some float around for a bit, and then sink without a trace while you aren’t looking; others thrust themselves onto center stage so rapidly that you aren’t quite sure where they came from.

For example, one technology that is currently attracting a great deal of interest for the near-term future is magnetic RAM (MRAM). The seeds of this technology were sown back in 1974, when IBM developed a component called a magnetic tunnel junction (MTJ). This comprises a sandwich of two ferromagnetic layers separated by a thin insulating layer. An MRAM memory cell can be created at the intersection of two tracks, say a row (word) line and a column (data) line, with an MTJ sandwiched between them.

MRAM is pronounced by spelling out the “M” to rhyme with “hem,” followed by “RAM” to rhyme with “clam.” MTJ is pronounced by spelling it out as “M-T-J.”

MRAM cells have the potential to combine the high speed of SRAM, the storage capacity of DRAM, and the nonvolatility of FLASH, all while consuming a minuscule amount of power. MRAM-based memory chips are predicted to become available circa 2005. Once these memory chips do reach the market, other devices, such as MRAM-based FPGAs, will probably start to appear shortly thereafter.

Chapter 3 The Origin of FPGAs

Related technologies

In order to get a good feel for the way in which FPGAs developed and the reasons why they appeared on the scene in the first place, it’s advantageous to consider them in the context of other related technologies (Figure 3-1).

Figure 3-1. Technology timeline (dates are approximate). The timeline runs from 1945 to 2000, with bars for transistors, ICs (general), SRAMs and DRAMs, microprocessors, SPLDs, CPLDs, ASICs, and FPGAs.

The white portions of the timeline bars in this illustration indicate that although early incarnations of these technologies may have been available, for one reason or another they weren’t enthusiastically received by the engineers working in the trenches during this period. For example, although Xilinx introduced the world’s first FPGA as early as 1984, design engineers didn’t really start using these little scamps with gusto and abandon until the early 1990s.
Transistors

On December 23, 1947, physicists William Shockley, Walter Brattain, and John Bardeen, working at Bell Laboratories in the United States, succeeded in creating the first transistor: a point-contact device formed from germanium (chemical symbol Ge).

The year 1950 saw the introduction of a more sophisticated component called a bipolar junction transistor (BJT), which was easier and cheaper to build and had the added advantage of being more reliable. By the late 1950s, transistors were being manufactured out of silicon (chemical symbol Si) rather than germanium. Even though germanium offered certain electrical advantages, silicon was cheaper and more amenable to work with.

BJT is pronounced by spelling it out as “B-J-T.” TTL is pronounced by spelling it out as “T-T-L.” ECL is pronounced by spelling it out as “E-C-L.”

If BJTs are connected together in a certain way, the resulting digital logic gates are classed as transistor-transistor logic (TTL). An alternative method of connecting the same transistors results in emitter-coupled logic (ECL). Logic gates constructed in TTL are fast and have strong drive capability, but they also consume a relatively large amount of power. Logic gates built in ECL are substantially faster than their TTL counterparts, but they consume correspondingly more power.

In 1962, Steven Hofstein and Fredric Heiman at the RCA research laboratory in Princeton, New Jersey, invented a new family of devices called metal-oxide semiconductor field-effect transistors (MOSFETs). These are often just called FETs for short. Although the original FETs were somewhat slower than their bipolar cousins, they were cheaper, smaller, and used substantially less power.

FET is pronounced to rhyme with “bet.” NMOS, PMOS, and CMOS are pronounced by spelling out the “N,” “P,” or “C” to rhyme with “hen,” “pea,” or “sea,” respectively, followed by “MOS” to rhyme with “boss.”
There are two main types of FETs, called NMOS and PMOS. Logic gates formed from NMOS and PMOS transistors connected together in a complementary manner are known as complementary metal-oxide semiconductor (CMOS) gates. Logic gates implemented in CMOS used to be a tad slower than their TTL cousins, but both technologies are pretty much equivalent in this respect these days. However, CMOS logic gates have the advantage that their static (nonswitching) power consumption is extremely low.

Integrated circuits

The first transistors were provided as discrete components that were individually packaged in small metal cans. Over time, people started to think that it would be a good idea to fabricate entire circuits on a single piece of semiconductor. The first public discussion of this idea is credited to a British radar expert, G. W. A. Dummer, in a paper presented in 1952. But it was not until the summer of 1958 that Jack Kilby, working for Texas Instruments (TI), succeeded in fabricating a phase-shift oscillator comprising five components on a single piece of semiconductor.

Around the same time that Kilby was working on his prototype, two of the founders of Fairchild Semiconductor (the Swiss physicist Jean Hoerni and the American physicist Robert Noyce) invented the underlying optical lithographic techniques that are now used to create transistors, insulating layers, and interconnections on modern ICs.

During the mid-1960s, TI introduced a large selection of basic building block ICs called the 54xx (“fifty-four hundred”) series and the 74xx (“seventy-four hundred”) series, which were specified for military and commercial use, respectively. These “jelly bean” devices, which were typically around 3/4" long, 3/8" wide, and had 14 or 16 pins, each contained small amounts of simple logic (for those readers of a pedantic disposition, some were longer, wider, and had more pins).
For example, a 7400 device contained four 2-input NAND gates, a 7402 contained four 2-input NOR gates, and a 7404 contained six NOT (inverter) gates.

TI’s 54xx and 74xx series were implemented in TTL. By comparison, in 1968, RCA introduced a somewhat equivalent CMOS-based library of parts called the 4000 (“four thousand”) series.

IC is pronounced by spelling it out as “I-C.”

SRAMs, DRAMs, and microprocessors

The late 1960s and early 1970s were rampant with new developments in the digital IC arena. In 1970, for example, Intel announced the first 1024-bit DRAM (the 1103) and Fairchild introduced the first 256-bit SRAM (the 4100).

SRAM and DRAM are pronounced by spelling out the “S” or “D” to rhyme with “mess” or “bee,” respectively, followed by “RAM” to rhyme with “spam.”

One year later, in 1971, Intel introduced the world’s first microprocessor (µP), the 4004, which was conceived and created by Marcian “Ted” Hoff, Stan Mazor, and Federico Faggin. Also referred to as a “computer-on-a-chip,” the 4004 contained only around 2,300 transistors and could execute 60,000 operations per second.

Actually, although the 4004 is widely documented as being the first microprocessor, there were other contenders. In February 1968, for example, International Research Corporation developed an architecture for what they referred to as a “computer-on-a-chip.” And in December 1970, a year before the 4004 saw the light of day, one Gilbert Hyatt filed an application for a patent entitled “Single Chip Integrated Circuit Computer Architecture” (wrangling about this patent continues to this day). What typically isn’t disputed, however, is the fact that the 4004 was the first microprocessor to be physically constructed, to be commercially available, and to actually perform some useful tasks.

PLD and SPLD are pronounced by spelling them out as “P-L-D” and “S-P-L-D,” respectively.
The reason SRAM and microprocessor technologies are of interest to us here is that the majority of today’s FPGAs are SRAM-based, and some of today’s high-end devices incorporate embedded microprocessor cores (both of these topics are discussed in more detail in chapter 4).

SPLDs and CPLDs

The first programmable ICs were generically referred to as programmable logic devices (PLDs). The original components, which started arriving on the scene in 1970 in the form of PROMs, were rather simple, but everyone was too polite to mention it. It was only toward the end of the 1970s that significantly more complex versions became available. In order to distinguish them from their less-sophisticated ancestors, which still find use to this day, these new devices were referred to as complex PLDs (CPLDs). Perhaps not surprisingly, it subsequently became common practice to refer to the original, less-pretentious versions as simple PLDs (SPLDs).

Just to make life more confusing, some people understand the terms PLD and SPLD to be synonymous, while others regard PLD as being a superset that encompasses both SPLDs and CPLDs (unless otherwise noted, we shall embrace this latter interpretation).

And life just keeps on getting better and better because engineers love to use the same acronym to mean different things or different acronyms to mean the same thing (listening to a gaggle of engineers regaling each other in conversation can make even the strongest mind start to “throw a wobbly”). In the case of SPLDs, for example, there is a multiplicity of underlying architectures, many of which have acronyms formed from different combinations of the same three or four letters (Figure 3-2).

Figure 3-2. A positive plethora of PLDs. [The figure shows PLDs dividing into SPLDs (PROMs, PLAs, PALs, GALs, etc.) and CPLDs.]
Of course there are also EPLD, E²PLD, and FLASH versions of many of these devices (for example, EPROMs and E²PROMs), but these are omitted from figure 3-2 for purposes of simplicity (these concepts were introduced in chapter 2).

1500: Italy. Leonardo da Vinci sketches details of a rudimentary mechanical calculator.

PROMs

The first of the simple PLDs were PROMs, which appeared on the scene in 1970. One way to visualize how these devices perform their magic is to consider them as consisting of a fixed array of AND functions driving a programmable array of OR functions. For example, consider a 3-input, 3-output PROM (Figure 3-3).

Figure 3-3. Unprogrammed PROM (predefined AND array, programmable OR array). [The figure shows inputs a, b, and c, each in true and complemented form, feeding a predefined AND array of eight rows (Address 0 through Address 7, decoding !a & !b & !c through a & b & c), whose outputs feed a programmable OR array driving outputs w, x, and y.]

PROM is pronounced like the high school dance of the same name.

The programmable links in the OR array can be implemented as fusible links, or as EPROM transistors and E²PROM cells in the case of EPROM and E²PROM devices, respectively. It is important to realize that this illustration is intended only to provide a high-level view of the way in which our example device works; it does not represent an actual circuit diagram. In reality, each AND function in the AND array has three inputs provided by the appropriate true or complemented versions of the a, b, and c device inputs. Similarly, each OR function in the OR array has eight inputs provided by the outputs from the AND array.

As was previously noted, PROMs were originally intended for use as computer memories in which to store program instructions and constant data values.
However, design engineers also used them to implement simple logical functions such as lookup tables and state machines. In fact, a PROM can be used to implement any block of combinational (or combinatorial) logic so long as it doesn’t have too many inputs or outputs. The simple 3-input, 3-output PROM shown in Figure 3-3, for example, can be used to implement any combinatorial function with up to 3 inputs and 3 outputs. In order to understand how this works, consider the small block of logic shown in Figure 3-4 (this circuit has no significance beyond the purposes of this example).

Figure 3-4. A small block of combinational logic. [The figure shows a small gate-level circuit with inputs a, b, and c and outputs w, x, and y, together with the following truth table.]

a b c | w x y
0 0 0 | 0 1 0
0 0 1 | 0 1 1
0 1 0 | 0 1 0
0 1 1 | 0 1 1
1 0 0 | 0 1 0
1 0 1 | 0 1 1
1 1 0 | 1 0 1
1 1 1 | 1 0 0

Some folks prefer to say “combinational logic,” while others favor “combinatorial logic.”

We could replace this block of logic with our 3-input, 3-output PROM. We would only need to program the appropriate links in the OR array (Figure 3-5). With regard to the equations shown in this figure, “&” represents AND, “|” represents OR, “^” represents XOR, and “!” represents NOT. This syntax (or numerous variations thereof) was very common in the early days of PLDs because it allowed logical equations to be easily and concisely represented in text files using standard computer keyboard characters.

The above example is, of course, very simple. Real PROMs can have significantly more inputs and outputs and can, therefore, be used to implement larger blocks of combinational logic.
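To make the lookup-table idea concrete, here is a minimal Python sketch of the 3-input, 3-output PROM. It is a toy model, not a description of any real device: the function names (`target_logic`, `address`, `program_prom`, `read_prom`) are made up for this illustration, the fixed AND array is modeled as an address calculation, and the programmed OR array is modeled as the word stored at each address. The logic being implemented is the example block with w = (a & b), x = !(a & b), and y = (a & b) ^ c.

```python
# Toy model of a 3-input, 3-output PROM (illustrative only; all names are hypothetical).
# The predefined AND array decodes every input combination into one of 8 addresses.
# The programmable OR array is modeled as the (w, x, y) word stored at each address.

def target_logic(a, b, c):
    """The example logic block: w = a & b, x = !(a & b), y = (a & b) ^ c."""
    w = a & b
    x = 1 - w
    y = w ^ c
    return (w, x, y)

def address(a, b, c):
    """Fixed AND array: inputs a, b, c select exactly one of addresses 0..7."""
    return (a << 2) | (b << 1) | c

def program_prom(func):
    """'Program the links': store func's outputs for every possible address."""
    contents = [None] * 8
    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                contents[address(a, b, c)] = func(a, b, c)
    return contents

def read_prom(contents, a, b, c):
    """Using the PROM is just a table lookup; no logic is evaluated at read time."""
    return contents[address(a, b, c)]

prom = program_prom(target_logic)
for addr, (w, x, y) in enumerate(prom):
    print(f"address {addr}: w={w} x={x} y={y}")
```

Because every address is always decoded, the table has 2^n entries for n inputs, which is why this approach only suits logic with a modest number of inputs.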
Figure 3-5. Programmed PROM. [The OR array links are programmed so that the outputs implement the following equations.]

w = (a & b)
x = !(a & b)
y = (a & b) ^ c

The ‘&’ (ampersand) character is commonly referred to as an “amp” or “amper.”

The ‘|’ (vertical line) character is commonly referred to as a “bar,” “or,” or “pipe.”

The ‘^’ (circumflex) character is commonly referred to as a “hat,” “control,” “up-arrow,” or “caret.” More rarely it may be referred to as a “chevron,” “power of” (as in “to the power of”), or “shark-fin.”

The ‘!’ (exclamation mark) character is commonly referred to as a “bang,” “ping,” or “shriek.”

From the mid-1960s until the mid-1980s (or later), combinational logic was commonly implemented by means of jelly bean ICs such as the TI 74xx series devices. The fact that quite a large number of these jelly bean chips could be replaced with a single PROM resulted in circuit boards that were smaller, lighter, cheaper, and less prone to error (each solder joint on a circuit board provides a potential failure mechanism). Furthermore, if any logic errors were subsequently discovered in this portion of the design (if the design engineer had inadvertently used an AND function instead of a NAND, for example), then these slipups could easily be fixed by blowing a new PROM (or erasing and reprogramming an EPROM or E²PROM). This was preferable to the ways in which errors had to be addressed on boards based on jelly bean ICs. These included adding new devices to the board, cutting existing tracks with a scalpel, and adding wires by hand to connect the new devices into the rest of the circuit.
In logical terms, the AND (“&”) operator is known as a logical multiplication or product, while the OR (“|”) operator is known as a logical addition or sum. Furthermore, when we have a logical equation in the form

y = (a & !b & c) | (!a & b & c) | (a & !b & !c) | (a & !b & c)

then the term literal refers to any true or inverted variable (a, !a, b, !b, etc.), and a group of literals linked by “&” operators is referred to as a product term. Thus, the product term (a & !b & c) contains three literals (a, !b, and c), and the above equation is said to be in sum-of-products form.

The point is that, when they are employed to implement combinational logic as illustrated in figures 3-4 and 3-5, PROMs are useful for equations requiring a large number of product terms, but they can support relatively few inputs because every input combination is always decoded and used.

PLAs

In order to address the limitations imposed by the PROM architecture, the next step up the PLD evolutionary ladder was that of programmable logic arrays (PLAs), which first became available circa 1975. These were the most user configurable of the simple PLDs because both the AND and OR arrays were programmable.

PLA is pronounced by spelling it out as “P-L-A.”

First, consider a simple 3-input, 3-output PLA in its unprogrammed state (Figure 3-6). Unlike a PROM, the number of AND functions in the AND array is independent of the number of inputs to the device. Additional ANDs can be formed by simply introducing more rows into the array. Similarly, the number of OR functions in the OR array is independent of both the number of inputs to the device and the number of AND functions in the AND array. Additional ORs can be formed by simply introducing more columns into the array.

1600: John Napier invents a simple multiplication table called Napier’s Bones.
Figure 3-6. Unprogrammed PLA (programmable AND and OR arrays).

Now assume that we wish our example PLA to implement the three equations shown below. We can achieve this by programming the appropriate links as illustrated in Figure 3-7.

w = (a & c) | (!b & !c)
x = (a & b & c) | (!b & !c)
y = (a & b & c)

Figure 3-7. Programmed PLA. [The AND array is programmed with the product terms (a & b & c), (a & c), and (!b & !c), and the OR array combines them into outputs w, x, and y.]

As fate would have it, PLAs never achieved any significant level of market presence, but several vendors experimented with different flavors of these devices for a while. For example, PLAs were not obliged to have AND arrays feeding OR arrays, and some alternative architectures such as AND arrays feeding NOR arrays were occasionally seen strutting their stuff. However, while it would be theoretically possible to field architectures such as OR-AND, NAND-OR, and NAND-NOR, these variations were relatively rare or nonexistent. One reason these devices tended to stick to AND-OR¹ (and AND-NOR) architectures was that the sum-of-products representations most often used to specify logical equations could be directly mapped onto these structures. Other equation formats, like product-of-sums, could be accommodated using standard algebraic techniques (this was typically performed by means of software programs that could perform these techniques with their metaphorical hands tied behind their backs).

PLAs were touted as being particularly useful for large designs whose logical equations featured a lot of common product terms that could be used by multiple outputs; for example, the product term (!b & !c) is used by both the w and x outputs in Figure 3-7.
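As an illustration, the programmed PLA of Figure 3-7 can be modeled in a few lines of Python. This is a hedged sketch rather than a circuit description: the data structures `AND_ARRAY` and `OR_ARRAY` and the function `evaluate_pla` are names invented for this example. Each programmed row of the AND array is a product term, and each output of the OR array lists which rows it combines.

```python
# Toy model of the programmed 3-input, 3-output PLA (names are hypothetical).
# AND array: one dict per programmed row, mapping an input to the value it must have.
AND_ARRAY = [
    {"a": 1, "b": 1, "c": 1},  # row 0: a & b & c
    {"a": 1, "c": 1},          # row 1: a & c
    {"b": 0, "c": 0},          # row 2: !b & !c
]

# OR array: for each output, the AND-array rows whose links are left connected.
OR_ARRAY = {
    "w": [1, 2],  # w = (a & c) | (!b & !c)
    "x": [0, 2],  # x = (a & b & c) | (!b & !c)
    "y": [0],     # y = (a & b & c)
}

def evaluate_pla(inputs):
    """Evaluate every product term once, then OR the selected terms per output."""
    terms = [all(inputs[name] == value for name, value in row.items())
             for row in AND_ARRAY]
    return {out: int(any(terms[i] for i in rows)) for out, rows in OR_ARRAY.items()}

for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            print(a, b, c, evaluate_pla({"a": a, "b": b, "c": c}))
```

Adding a product term means adding a row to `AND_ARRAY`, and adding an output means adding an entry to `OR_ARRAY`, independently of the number of inputs. Note also that row 2, the product term (!b & !c), is computed once and then used by both the w and x outputs.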
This feature may be referred to as product-term sharing.

On the downside, signals take a relatively long time to pass through programmable links as opposed to their predefined counterparts. Thus, the fact that both their AND and OR arrays were programmable meant that PLAs were significantly slower than PROMs.

¹ Actually, one designer I talked to a few moments before penning these words told me that his team created a NOT-NOR-NOR-NOT architecture (this apparently offered a slight speed advantage), but they told their customers it was an AND-OR architecture (which is how it appeared to the outside world) because “that was what they were expecting.” Even today, what device vendors say they build and what they actually build are not necessarily the same thing.

1614: John Napier invents logarithms.

PALs and GALs

In order to address the speed problems posed by PLAs, a new class of device called programmable array logic (PAL) was introduced in the late 1970s. Conceptually, a PAL is almost the exact opposite of a PROM because it has a programmable AND array and a predefined OR array. As an example, consider a simple 3-input, 3-output PAL in its unprogrammed state (Figure 3-8).

Figure 3-8. Unprogrammed PAL (programmable AND array, predefined OR array).

PAL, which is a registered trademark of Monolithic Memories, Inc., is pronounced the same way you’d greet a buddy (“Hiya pal”).

Created by Lattice Semiconductor Corporation in 1983, generic array logic (GAL) devices offered sophisticated CMOS electrically erasable (E²) variations on the PAL concept.

GAL is pronounced the same way a guy thinks of his wife or girlfriend (“What a gal!”).

The advantage of PALs (as compared to PLAs) is that they are faster because only one of their arrays is programmable.
On the downside, PALs are more limited because they only allow a restricted number of product terms to be ORed together (but engineers are cunning people, and we have lots of tricks up our sleeves that, to a large extent, allow us to get around this sort of thing).

Additional programmable options

The PLA and PAL examples shown above were small and rudimentary for the purposes of simplicity. In addition to being a lot larger (having more inputs, outputs, and internal signals), real devices can offer a variety of additional programmable options, such as the ability to invert the outputs or to have tristatable outputs, or both. Furthermore, some devices support registered or latched outputs (with associated programmable multiplexers that allow the user to specify whether to use the registered or nonregistered version of the output on a pin-by-pin basis). And some devices provide the ability to configure certain pins to act as either outputs or additional inputs, and the list of options goes on.

The problem here is that different devices may provide different subsets of the various options, which makes selecting the optimum device for a particular application something of a challenge. Engineers typically work around this by (a) restricting themselves to a limited selection of devices and then tailoring their designs to these devices, or (b) using a software program to help them decide which devices best fit their requirements on an application-by-application basis.

CPLDs

The one truism in electronics is that everyone is always looking for things to get bigger (in terms of functional capability), smaller (in terms of physical size), faster, more powerful, and cheaper. Surely that’s not too much to ask, is it? Thus, the tail end of the 1970s and the early 1980s began to see the emergence of more sophisticated PLD devices that became known as complex PLDs (CPLDs).
Leading the fray were the inventors of the original PAL devices, the guys and gals at Monolithic Memories Inc. (MMI), who introduced a component they called a MegaPAL. This was an 84-pin device that essentially comprised four standard PALs with some interconnect linking them together. Unfortunately, the MegaPAL consumed a disproportionate amount of power, and it was generally perceived to offer little advantage compared to using four individual devices.

CPLD is pronounced by spelling it out as “C-P-L-D.”

The big leap forward occurred in 1984, when newly formed Altera introduced a CPLD based on a combination of CMOS and EPROM technologies. Using CMOS allowed Altera to achieve tremendous functional density and complexity while consuming relatively little power. And basing the programmability of these devices on EPROM cells made them ideal for use in development and prototyping environments.

1621: William Oughtred invents the slide rule (based on John Napier’s logarithms).

Having said this, Altera’s claim to fame wasn’t due only to the combination of CMOS and EPROM. When engineers started to grow SPLD architectures into larger devices like the MegaPAL, it was originally assumed that the central interconnect array (also known as the programmable interconnect matrix) linking the individual SPLD blocks required 100 percent connectivity to the inputs and outputs associated with each block. The problem was that a twofold increase in the size of the SPLD blocks (equating to twice the inputs and twice the outputs) resulted in a fourfold increase in the size of the interconnect array. In turn, this resulted in a huge decrease in speed coupled with higher power dissipation and increased component costs. Altera made the conceptual leap to using a central interconnect array with less than 100 percent connectivity (see the discussions associated with figure 3-10 for a tad more information on this concept).
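The fourfold growth described above can be checked with a little arithmetic. The following Python sketch uses a deliberately simplified counting model that is an assumption made for illustration, not taken from any vendor's architecture: the interconnect array carries one wire per block output (device pins are ignored), a fully connected array needs one programmable switch for every (wire, block input) pair, and a partially connected array lets each block input choose from only a fixed number of wires. The function names and the example numbers are hypothetical.

```python
# Simplified switch-count model for a CPLD interconnect array (illustrative assumptions):
#   - one wire in the array per SPLD block output; device pins are ignored
#   - full connectivity: every block input can be connected to every wire
#   - partial connectivity: every block input can reach only `wires_per_input` wires

def full_crossbar_switches(blocks, inputs_per_block, outputs_per_block):
    wires = blocks * outputs_per_block
    block_inputs = blocks * inputs_per_block
    return wires * block_inputs

def partial_switches(blocks, inputs_per_block, wires_per_input):
    return blocks * inputs_per_block * wires_per_input

small_full = full_crossbar_switches(blocks=4, inputs_per_block=16, outputs_per_block=8)
large_full = full_crossbar_switches(blocks=4, inputs_per_block=32, outputs_per_block=16)
print("full connectivity growth:", large_full / small_full)

small_part = partial_switches(blocks=4, inputs_per_block=16, wires_per_input=8)
large_part = partial_switches(blocks=4, inputs_per_block=32, wires_per_input=8)
print("partial connectivity growth:", large_part / small_part)
```

Under these assumptions, doubling both the inputs and the outputs of each block quadruples the fully connected array, whereas the partially connected array only doubles, at the price of restricting which signals can reach which inputs.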
This reduced connectivity increased the complexity of the software design tools, but it kept the speed, power, and cost of these devices scalable.

Although every CPLD manufacturer fields its own unique architecture, a generic device consists of a number of SPLD blocks (typically PALs) sharing a common programmable interconnection matrix (Figure 3-9). In addition to programming the individual SPLD blocks, the connections between the blocks can be programmed by means of the programmable interconnect matrix.

Figure 3-9. A generic CPLD structure. [Labels: programmable interconnect matrix, SPLD-like blocks, input/output pins.]

1623: Wilhelm Schickard invents the first mechanical calculator.

Of course, figure 3-9 is a high-level representation. In reality, all of these structures are formed on the same piece of silicon, and various additional features are not shown here. For example, the programmable interconnect matrix may contain a lot of wires (say 100), but this is more than can be handled by the individual SPLD blocks, which might only be able to accommodate a limited number of signals (say 30). Thus, the SPLD blocks are interfaced to the interconnect matrix using some form of programmable multiplexer (Figure 3-10).

Figure 3-10. Using programmable multiplexers. [A programmable multiplexer selects 30 wires from the 100 wires of the interconnect matrix.]

Depending on the manufacturer and the device family, the CPLD’s programmable switches may be based on EPROM, E²PROM, FLASH, or SRAM cells. In the case of SRAM-based devices, some variants increase their versatility by allowing the SRAM cells associated with each SPLD block to be used either as programmable switches or as an actual chunk of memory.

ABEL, CUPL, PALASM, JEDEC, etc.

The Dark Ages refers to the period of history between classical antiquity and the Italian Renaissance. (Depending on the source, the starting point for the Dark Ages can vary by several hundred years.)
In many respects, the early days of PLDs were the design engineers’ equivalent of the Dark Ages. The specification for a new device typically commenced life in the form of a schematic (or state machine) diagram. These diagrams were created using pencil and paper because computer-aided electronic design capture tools, in the form we know them today, really didn’t exist at that time.

Once a design had been captured in diagrammatic form, it was converted by hand into a tabular equivalent and subsequently typed into a text file. Among other things, this text file defined which fuses were to be blown or which antifuses were to be grown. In those days of yore, the text file was typed directly into a special box called a device programmer, which was subsequently used to program the chip. As time progressed, however, it became common to create the file on a host computer, which downloaded it into (and controlled) the device programmer as required (Figure 3-11).

Figure 3-11. Programming a physical PLD. [(a) Host computer; (b) device programmer, which turns an unprogrammed device into a programmed device.]

Creating this programming file required the engineer to have an intimate knowledge of the device’s internal links and the file format used by the device programmer. Just to increase the fun, every PLD vendor developed its own file format that typically worked only with its own devices. It was obvious to everyone concerned that this design flow was time-consuming and prone to error, and it certainly didn’t facilitate locating and fixing any mistakes.

In 1980, a committee of the Joint Electron Device Engineering Council (JEDEC), part of the Electronic Industries Association, proposed a standard format for PLD programming text files. It wasn’t long before all of the device programmers were modified to accept this format.

Around the same time, John Birkner, the man who conceived the first PALs and managed their development, created PAL Assembler (PALASM).
PALASM referred to both a rudimentary hardware description language (HDL) and a software application. In its role as an HDL, PALASM allowed design engineers to specify the function of a circuit in the form of a text source file containing Boolean equations in sum-of-products form. In its role as a software application (what we would now refer to as an EDA tool), PALASM (which was written in only six pages of FORTRAN code) read in the text source file and automatically generated a text-based programming file for use with the device programmer.

JEDEC is pronounced “jed-eck”; that is, “jed” to rhyme with “bed” and “eck” to rhyme with “deck.”

PALASM is pronounced “pal-as-em.”

HDL is pronounced by spelling it out as “H-D-L.”

Developed at IBM in the mid 1950s, FORTRAN, which stands for FORmula TRANslation language, was the first computer programming language higher than the assembly level.

In the context of its time, PALASM was a huge leap forward, but the original version only supported PAL devices made by MMI, and it didn’t perform any minimization or optimization. In order to address these issues, Data I/O released its Advanced Boolean Expression Language (ABEL) in 1983. Around the same time, Assisted Technology released its Common Universal tool for Programmable Logic (CUPL). ABEL and CUPL were both HDLs and software applications. In addition to supporting state machine constructs and automatic logic minimization algorithms, they both worked with multiple PLD types and manufacturers.

Although PALASM, ABEL, and CUPL are the best known of the early HDLs, there were many others, such as Automated Map and Zap of Equations (AMAZE) from Signetics. These simple languages and associated tools paved the way for the higher-level HDLs (such as Verilog and VHDL) and tools (such as logic synthesis) that are used for today’s ASIC and FPGA designs.
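To give a feel for the kind of job such an equation-to-programming-file tool performed, here is a minimal Python sketch. It is emphatically not PALASM syntax and does not produce a JEDEC file: the equation notation is the “&”, “|”, “!” style used in this chapter, the output format (one row of “X” and “-” characters per product term) is invented for this illustration, and the function names are hypothetical. It only handles flat sum-of-products equations with one pair of parentheses per product term.

```python
# Toy equation-to-fuse-map translator (illustrative only; the format is made up).
INPUTS = ["a", "b", "c"]

# Column order of the programmable AND array: a !a b !b c !c
COLUMNS = [literal for name in INPUTS for literal in (name, "!" + name)]

def parse_sum_of_products(equation):
    """Turn 'y = (a & !b) | (c)' into ('y', [['a', '!b'], ['c']])."""
    output, expression = (part.strip() for part in equation.split("=", 1))
    terms = []
    for term in expression.split("|"):
        literals = [lit.strip() for lit in term.strip().strip("()").split("&")]
        terms.append(literals)
    return output, terms

def fuse_rows(terms):
    """One text row per product term: 'X' = link left intact, '-' = link removed."""
    return ["".join("X" if column in term else "-" for column in COLUMNS)
            for term in terms]

output, terms = parse_sum_of_products("y = (a & !b & c) | (!a & b & c)")
print("columns:", " ".join(COLUMNS))
for row in fuse_rows(terms):
    print(output, row)
```

A real tool also had to know the link numbering of each specific device and, in the case of ABEL and CUPL, minimize the equations first; none of that is attempted here.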
ABEL is pronounced to rhyme with “fable.”

CUPL is pronounced “koo-pel”; that is, “koo” to rhyme with “loo” and “pel” to rhyme with “bell.”

ASICs (gate arrays, etc.)

ASIC is pronounced by spelling out the “A” to rhyme with “hay,” followed by “SIC” to rhyme with “tick.”

At the time of this writing, four main classes of application-specific integrated circuit (ASIC) deserve mention. In increasing order of complexity, these are gate arrays, structured ASICs, standard cell devices, and full-custom chips (Figure 3-12).

Figure 3-12. Different types of ASIC. [In order of increasing complexity: gate arrays, structured ASICs, standard cell, full custom.]

Although it would be possible to introduce these ASIC types in the order of increasing complexity reflected in this figure, it actually makes more sense to describe them in the sequence in which they appeared on the scene, which was full-custom chips, followed by gate arrays, then standard cell devices, and finally structured ASICs. (Note that it’s arguable whether structured ASICs are more or less complex than traditional gate arrays.)

Full custom

In the early days of digital ICs, there were really only two classes of devices (excluding memory chips). The first were relatively simple building block-type components that were created by companies like TI and Fairchild and sold as standard off-the-shelf parts to anyone who wanted to use them. The second were full-custom ASICs like microprocessors, which were designed and built to order for use by a specific company.

In the case of full-custom devices, design engineers have complete control over every mask layer used to fabricate the silicon chip. The ASIC vendor does not prefabricate any components on the silicon and does not provide any libraries of predefined logic gates and functions.
By means of appropriate tools, the engineers can handcraft the dimensions of individual transistors and then create higher-level functions based on these elements. For example, if the engineers require a slightly faster logic gate, they can alter the dimensions of the transistors used to build that gate. The design tools used for full-custom devices are often created in-house by the engineers themselves.

The design of full-custom devices is highly complex and time-consuming, but the resulting chips contain the maximum amount of logic with minimal waste of silicon real estate.

The Micromatrix and Micromosaic

Some time in the mid-1960s, Fairchild Semiconductor introduced a device called the Micromatrix, which comprised a limited number (around 100) of noninterconnected barebones transistors. In order to make this device perform a useful function, design engineers hand-drew the metallization layers used to connect the transistors on two plastic sheets.

The first sheet, drawn using a green pen, represented the Y-axis (north-south) tracks to be implemented on metal layer 1, while the second sheet, drawn using a red pen, represented the X-axis (east-west) tracks to be implemented on metal layer 2. (Additional sheets were used to draw the vias (conducting columns) linking metal layer 1 to the transistors and the vias linking metal layers 1 and 2 together.)

Capturing a design in this way was painfully time-consuming and prone to error, but at least the hard, expensive, and really time-consuming work (creating the transistors) had already been performed. This meant that the Micromatrix allowed design engineers to create a custom device for a reasonable (though still expensive) cost in a reasonable (though still long) time frame.

1642: Blaise Pascal invents a mechanical calculator called the Arithmetic Machine.
A few years later, in 1967, Fairchild introduced a device called the Micromosaic, which contained a few hundred noninterconnected transistors. These transistors could subsequently be connected together to implement around 150 AND, OR, and NOT gates. The key feature of the Micromosaic was that design engineers could specify the function the device was required to perform by means of a text file containing Boolean (logic) equations, and a computer program then determined the necessary transistor interconnections and constructed the photo-masks required to complete the device. This was revolutionary at the time, and the Micromosaic is now credited as being the forerunner of the modern gate array form of ASIC and also the first real application of computer-aided design (CAD).

CAD is pronounced to rhyme with “bad.”

Gate arrays

The gate array concept originated in companies like IBM and Fujitsu as far back as the late 1960s. However, these early devices were only available for internal consumption, and it wasn’t until the mid-1970s that access to CMOS-based gate array technology became available to anyone willing to pay for it.

Early gate arrays were sometimes known as uncommitted logic arrays (ULAs), but this term has largely fallen into disuse.

Gate arrays are based on the idea of a basic cell consisting of a collection of unconnected transistors and resistors. Each ASIC vendor determines what it considers to be the optimum mix of components provided in its particular basic cell (Figure 3-13).

The ASIC vendor commences by prefabricating silicon chips containing arrays of these basic cells. In the case of channeled gate arrays, the basic cells are typically presented as either single-column or dual-column arrays; the free areas between the arrays are known as the channels (Figure 3-14). By comparison, in the case of channel-less or channel-free devices, the basic cells are presented as a single large array.
The surface of the device is covered in a “sea” of basic cells, and there are no dedicated channels for the interconnections. Thus, these devices are popularly referred to as sea-of-gates or sea-of-cells.

Figure 3-13. Examples of simple gate array basic cells: (a) pure CMOS basic cell; (b) BiCMOS basic cell.

Figure 3-14. Channeled gate array architectures: (a) single-column arrays; (b) dual-column arrays. [Labels: I/O cells/pads, channels, basic cells.]

1671: Baron Gottfried von Leibniz invents a mechanical calculator called the Step Reckoner.

The ASIC vendor defines a set of logic functions such as primitive gates, multiplexers, and registers that can be used by the design engineers. Each of these building block functions is referred to as a cell (not to be confused with a basic cell), and the set of functions supported by the ASIC vendor is known as the cell library.

The means by which ASICs are actually designed is beyond the scope of this book. Suffice it to say that the design engineers eventually end up with a gate-level netlist, which describes the logic gates they wish to use and the connections between them. Special mapping, placement, and routing software tools are used to assign the logic gates to specific basic cells and define how the cells will be connected together. The results are used to generate the photo-masks that are in turn used to create the metallization layers that will link the components inside the basic cells and also link the basic cells to each other and to the device’s inputs and outputs.

Gate arrays offer considerable cost advantages in that the transistors and other components are prefabricated, so only the metallization layers need to be customized. The disadvantage is that most designs leave significant amounts of internal resources unutilized, the placement of gates is constrained, and the routing of internal tracks is less than optimal.
All of these factors negatively impact the performance and power consumption of the design.

Standard cell devices

In order to address the problems associated with gate arrays, standard cell devices became available in the early 1980s. These components bear many similarities to gate arrays. Once again, the ASIC vendor defines the cell library that can be used by the design engineers. The vendor also supplies hard-macro and soft-macro libraries, which include elements such as processors, communication functions, and a selection of RAM and ROM functions. Last but not least, the design engineers may decide to reuse previously designed functions or to purchase blocks of intellectual property (IP).

When a team of electronics engineers is tasked with designing a complex integrated circuit, rather than reinventing the wheel, they may decide to purchase the plans for one or more functional blocks that have already been created by someone else. The plans for these functional blocks are known as intellectual property, or IP. IP is pronounced by spelling it out as “I-P.” IP blocks can range all the way up to sophisticated communications functions and microprocessors. The more complex functions, like microprocessors, may be referred to as “cores.”

Once again, by one means or another (which today involves incredibly sophisticated software tools), the design engineers end up with a gate-level netlist, which describes the logic gates they wish to use and the connections between them.

Unlike gate arrays, standard cell devices do not use the concept of a basic cell, and no components are prefabricated on the chip. Special tools are used to place each logic gate individually in the netlist and to determine the optimum way in which the gates are to be routed (connected together). The results are then used to create custom photo-masks for every layer in the device’s fabrication.
The standard cell concept allows each logic function to be created using the minimum number of transistors with no redundant components, and the functions can be positioned so as to facilitate any connections between them. Standard cell devices, therefore, provide a closer-to-optimal utilization of the silicon than do gate arrays.

Structured ASICs

It’s often said that there’s nothing new under the sun. Ever since the introduction of standard cell devices, industry observers have been predicting the demise of gate arrays, but these little rascals continue to hold on to their market niche and, indeed, have seen something of a resurgence in recent years.

Structured ASICs (although they weren’t called that at the time) spluttered into life around the beginning of the 1990s, slouched around for a while, and then returned to the nether regions from whence they came. A decade later, circa 2001 to 2002, a number of ASIC manufacturers started to investigate innovative ways of reducing ASIC design costs and development times. Not wishing to be associated with traditional gate arrays, everyone was happy when someone came up with the structured ASIC moniker somewhere around the middle of 2003.

As usual, of course, every vendor has its own proprietary architecture, so our discussions here will provide only a generic view of these components. Each device commences with a fundamental element called a module by some and a tile by others. This element may contain a mixture of prefabricated generic logic (implemented as gates, multiplexers, or a lookup table), one or more registers, and possibly a little local RAM (Figure 3-15). An array (sea) of these elements is then prefabricated across the face of the chip.

Figure 3-15. Examples of structured ASIC tiles: (a) gate, mux, and flop-based; (b) LUT and flop-based.

Alternatively, some architectures commence with a base cell (or base tile or base module, or …) containing only generic logic in the form of prefabricated gates, multiplexers, or lookup tables. An array of these base units (say 4 × 4, 8 × 8, or 16 × 16), in conjunction with some special units containing registers, small memory elements, and other logic, then makes up a master cell (or master tile or master module or …). Once again, an array (sea) of these master units is then prefabricated across the face of the chip. Also prefabricated (typically around the edge of the device) are functions like RAM blocks, clock generators, boundary scan logic, and so forth (Figure 3-16).

Figure 3-16. Generic structured ASIC. (Labels: prefabricated I/O, cores, etc.; embedded RAM; sea-of-tiles.)

1746: Holland. The Leyden jar is invented at University of Leyden.

1752: America. Benjamin Franklin performs his notorious kite experiment.

The idea is that the device can be customized using only the metallization layers (just like a standard gate array). The difference is that, due to the greater sophistication of the structured ASIC tile, most of the metallization layers are also predefined. Thus, many structured ASIC architectures require the customization of only two or three metallization layers (in one case, it is necessary to customize only a single via layer). This dramatically reduces the time and costs associated with creating the remaining photo-masks used to complete the device.

Although it’s difficult to assign an exact value, the predefined and prefabricated logic associated with structured ASICs results in an overhead compared to standard cell devices in terms of power consumption, performance, and silicon real estate. Early indications are that structured ASICs require three times the real estate and consume two to three times the power of a standard cell device to perform the same function. In reality, these results will vary architecture-by-architecture, and also different types of designs may well favor different architectures.
Unfortunately, no evaluations based on industry-standard reference designs have been performed across all of the emerging structured ASIC architectures at the time of this writing.

FPGAs

Around the beginning of the 1980s, it became apparent that there was a gap in the digital IC continuum. At one end, there were programmable devices like SPLDs and CPLDs, which were highly configurable and had fast design and modification times, but which couldn’t support large or complex functions. At the other end of the spectrum were ASICs. These could support extremely large and complex functions, but they were painfully expensive and time-consuming to design. Furthermore, once a design had been implemented as an ASIC it was effectively frozen in silicon (Figure 3-17).

Figure 3-17. The gap between PLDs and ASICs. (PLDs: SPLDs and CPLDs. ASICs: gate arrays, structured ASICs*, standard cell, and full custom. *Not available circa early 1980s.)

1775: Italy. Count Alessandro Giuseppe Antonio Anastasio Volta invents a static electricity generator called the Electrophorus.

FPGA is pronounced by spelling it out as “F-P-G-A.” LUT is pronounced to rhyme with “nut.”

In order to address this gap, Xilinx developed a new class of IC called a field-programmable gate array, or FPGA, which they made available to the market in 1984. The various FPGAs available today are discussed in detail in Chapter 4. For the nonce, we need only note that the first FPGAs were based on CMOS and used SRAM cells for configuration purposes. Although these early devices were comparatively simple and contained relatively few gates (or the equivalent thereof) by today’s standards, many aspects of their underlying architecture are still employed to this day.
The early devices were based on the concept of a programmable logic block, which comprised a 3-input lookup table (LUT), a register that could act as a flip-flop or a latch, and a multiplexer, along with a few other elements that are of little interest here. Figure 3-18 shows a very simple programmable logic block (the logic blocks in modern FPGAs can be significantly more complex; see Chapter 4 for more details).

Figure 3-18. The key elements forming a simple programmable logic block. (Blocks: 3-input LUT, mux, flip-flop. Signals: inputs a, b, c, and d; LUT output y; clock; output q.)

Each FPGA contained a large number of these programmable logic blocks, as discussed below. By means of appropriate SRAM programming cells, every logic block in the device could be configured to perform a different function. Each register could be configured to initialize containing a logic 0 or a logic 1 and to act as a flip-flop (as shown in Figure 3-18) or a latch. If the flip-flop option were selected, the register could be configured to be triggered by a positive- or negative-going clock (the clock signal was common to all of the logic blocks). The multiplexer feeding the flip-flop could be configured to accept the output from the LUT or a separate input to the logic block, and the LUT could be configured to represent any 3-input logical function.

For example, assume that a LUT was required to perform the function

y = (a & b) | !c

This could be achieved by loading the LUT with the appropriate output values (Figure 3-19).

Figure 3-19. Configuring a LUT.

1777: Charles Stanhope invents a mechanical calculating machine.

Note that the 8:1-multiplexer-based LUT illustrated in Figure 3-19 is used for purposes of simplicity; a more realistic implementation is shown in Chapter 4. Furthermore, Chapter 5 presents in detail the ways in which FPGAs are actually programmed.
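As a rough illustration of the LUT configuration described above, the following Python sketch computes the eight values that would be loaded into a 3-input LUT for y = (a & b) | !c and then reads them back the way an 8:1 multiplexer would. The function names are hypothetical, and the bit ordering (a as the most significant select bit) is an assumption made for this sketch; it may not match the ordering drawn in Figure 3-19.

```python
from itertools import product

def build_lut(func, n_inputs=3):
    """Return 2**n_inputs output bits, one per input combination.

    Entries are ordered with the first input as the most significant
    select bit (an assumption of this sketch, not taken from the book).
    """
    return [func(*bits) & 1 for bits in product((0, 1), repeat=n_inputs)]

def lut_read(lut, a, b, c):
    """Model the 8:1 multiplexer: a, b, and c select one stored cell."""
    return lut[(a << 2) | (b << 1) | c]

# The example function from the text: y = (a & b) | !c
lut = build_lut(lambda a, b, c: (a & b) | (1 - c))

# Print the truth table as it would be read back from the stored cells.
for a, b, c in product((0, 1), repeat=3):
    print(a, b, c, "->", lut_read(lut, a, b, c))
```

Changing the function performed by the logic block only means loading different values into the eight cells; the multiplexer structure that reads them stays the same.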
The complete FPGA comprised a large number of programmable logic block “islands” surrounded by a “sea” of programmable interconnects (Figure 3-20).

Figure 3-20. Top-down view of simple, generic FPGA architecture. (Labels: programmable interconnect, programmable logic blocks.)

I/O is pronounced by spelling it out as “I-O.”

As usual, this high-level illustration is merely an abstract representation. In reality, all of the transistors and interconnects would be implemented on the same piece of silicon using standard IC creation techniques.

In addition to the local interconnect reflected in Figure 3-20, there would also be global (high-speed) interconnection paths that could transport signals across the chip without having to go through multiple local switching elements.

The device would also include primary I/O pins and pads (not shown here). By means of its own SRAM cells, the interconnect could be programmed such that the primary inputs to the device were connected to the inputs of one or more programmable logic blocks, and the outputs from any logic block could be used to drive the inputs to any other logic block, the primary outputs from the device, or both.

The end result was that FPGAs successfully bridged the gap between PLDs and ASICs. On the one hand, they were highly configurable and had the fast design and modification times associated with PLDs. On the other hand, they could be used to implement large and complex functions that had previously been the domain only of ASICs. (ASICs were still required for the really large, complex, high-performance designs, but as FPGAs increased in sophistication, they started to encroach further and further into ASIC design space.)

Platform FPGAs

The concept of a reference design or platform design has long been used at the circuit board level. This refers to creating a base design configuration from which multiple products can be derived.
In addition to tremendous amounts of programmable logic, today’s high-end FPGAs feature embedded (block) RAMs, embedded processor cores, high-speed I/O blocks, and so forth. Furthermore, designers have access to a wide range of IP. The end result is the concept of the platform FPGA. A company may use a platform FPGA design as a basis for multiple products inside that company, or it may supply an initial design to multiple other companies for them to customize and differentiate.

FPGA-ASIC hybrids

It would not make any sense to embed ASIC material inside an FPGA because designs created using such a device would face all of the classic problems (high NREs, long lead times, etc.) associated with ASIC design flows. However, there are a number of cases in which one or more FPGA cores have been used as part of a standard cell ASIC design.

One reason for embedding FPGA material inside an ASIC is that it facilitates the concept of platform design. The platform in this case would be the ASIC, and the embedded FPGA material could form one of the mechanisms used to customize and differentiate subdesigns.

Late 1700s: Charles Stanhope invents a logic machine called the Stanhope Demonstrator.

1800: Italy. Count Alessandro Giuseppe Antonio Anastasio Volta invents the first battery.

Another reason is that the last few years have seen an increasing incidence of FPGAs being used to augment ASIC designs. In this scenario, a large, complex ASIC has an associated FPGA located in close proximity on the board (Figure 3-21).

Figure 3-21. Using an FPGA to augment an ASIC design. (Labels: ASIC, FPGA, connections to other chips on the board.)

The reason for this scenario is that it’s incredibly time-consuming and expensive to fix a bug in the ASIC or to modify its functionality to accommodate any changes to its original design specification.
If the ASIC is designed in the right way, however, its associated FPGA can be used to implement any downstream modifications and enhancements. One problem with this approach is the time taken for signals to pass back and forth between the ASIC and the FPGA. The solution is to embed the FPGA core inside the ASIC, thereby resulting in an FPGA-ASIC hybrid.

One concern that has faced these hybrids, however, is that ASIC and FPGA design tools and flows have significant underlying differences. For example, ASICs are said to be fine-grained because (ultimately) they are implemented at the level of primitive logic gates. This means that traditional design technologies like logic synthesis and implementation technologies like place-and-route are also geared toward fine-grained architectures.

By comparison, FPGAs are said to be medium-grained (or coarse-grained depending on whom you are talking to) because they are physically realized using higher-level blocks like the programmable logic blocks introduced earlier in this chapter. In this case, the best design results are realized when using FPGA-specific synthesis and place-and-route technologies that view their world in terms of these higher-level blocks.

One area of interest for FPGA-ASIC hybrids is that of structured ASICs because they too may be considered to be block based. This means that, when looking around for design tools, structured ASIC vendors are talking to purveyors of FPGA-centric synthesis and place-and-route technology rather than their traditional ASIC counterparts. In turn, this means that FPGA-ASIC hybrids based on structured ASICs would automatically tend toward a unified tool and design flow because the same block-based synthesis and place-and-route engines could be used for both the ASIC and FPGA portions of the design.
How FPGA vendors design their chips

Last but not least, one question that is commonly asked, but is rarely (if ever) addressed in books on FPGAs, is this: how do FPGA vendors actually go about designing a new generation of devices? To put this another way, do they handcraft each transistor and track using a design flow similar to that of a full-custom ASIC, or do they create an RTL description, synthesize it into a gate-level netlist, and then use place-and-route software along the lines of a classic ASIC (gate array or standard cell) design flow? (The concepts behind these tools are discussed in more detail in Section 2.)

The short answer is yes! The slightly longer answer is that there are some portions of the device, like the programmable logic blocks and the basic routing structure, where the FPGA vendors fight tooth and nail for every square micron and every fraction of a nanosecond. These sections of the design are handcrafted at the transistor and track level using full-custom ASIC techniques. On the bright side, these portions of the design are both relatively small and highly repetitive, so once created they are replicated thousands of times across the face of the chip.

Then there are housekeeping portions of the device, such as the configuration control circuitry, that only occur once per device and are not particularly size or performance critical. These sections of the design are created using standard cell ASIC-style techniques.

1801: France. Joseph-Marie Jacquard invents a loom controlled by punch cards.

1820: France. Andre Ampere investigates the force of an electric current in a magnetic field.

Chapter 4 Alternative FPGA Architectures

A word of warning

In this chapter we introduce a plethora of architectural features. Certain options, such as using antifuse versus SRAM configuration cells, are mutually exclusive.
Some FPGA vendors specialize in one or the other; others may offer multiple device families based on these different technologies. (Unless otherwise noted, the majority of these discussions assume SRAM-based devices.)

In the case of embedded blocks such as multipliers, adders, memory, and microprocessor cores, different vendors offer alternative “flavors” of these blocks with different “recipes” of ingredients. (Much like different brands of chocolate chip cookies featuring larger or smaller chocolate chips, for example, some FPGA families will have bigger/better/badder embedded RAM blocks, while others might feature more multipliers, or support more I/O standards, or …)

The problem is that the features supported by each vendor and each family change on an almost daily basis. This means that once you’ve decided what features you need, you then need to do a little research to see which vendor’s offerings currently come closest to satisfying your requirements.

A little background information

Before hurling ourselves into the body of this chapter, we need to define a couple of concepts to ensure that we’re all marching to the same drumbeat. For example, you’re going to see the term fabric used throughout this book. In the context of a silicon chip, this refers to the underlying structure of the device (sort of like the phrase “the fabric of civilized society”). When you first hear someone using “fabric” in this way, it might sound a little snooty or pretentious (in fact, some engineers regard it as nothing more than yet another marketing term promoted by ASIC and FPGA vendors to make their devices sound more sophisticated than they really are). Truth to tell, however, once you get used to it, this is really quite a useful word.

The word “fabric” comes from the Middle English fabryke, meaning “something constructed.”

When we talk about the geometry of an IC, we are referring to the size of the individual structures constructed on the chip, such as the portion of a field-effect transistor (FET) known as its channel. These structures are incredibly small. In the early to mid-1980s, devices were based on 3 µm geometries, which means that their smallest structures were 3 millionths of a meter in size. (In conversation, we would say, “This IC is based on a three-micron technology.”)

The “µ” symbol stands for “micro” from the Greek micros, meaning “small” (hence the use of “µP” as an abbreviation for “microprocessor”). In the metric system, “µ” stands for “one millionth part of,” so 1 µm represents “one millionth of a meter.”

Each new geometry is referred to as a technology node. By the 1990s, devices based on 1 µm geometries were starting to appear, and feature sizes continued to plummet throughout the course of the decade. As we moved into the twenty-first century, high-performance ICs had geometries as small as 0.18 µm. By 2002, this had shrunk to 0.13 µm, and by 2003, devices at 0.09 µm were starting to appear.

Any geometry smaller than around 0.5 µm is referred to as deep submicron (DSM). At some point that is not well defined (or that has multiple definitions depending on whom one is talking to), we move into the ultradeep submicron (UDSM) realm.

DSM is pronounced by spelling it out as “D-S-M.” UDSM is pronounced by spelling it out as “U-D-S-M.”

Things started to become a little awkward once geometries dropped below 1 µm, not the least because it’s a pain to keep having to say things like “zero point one three microns.” For this reason, when conversing it’s becoming common to talk in terms of nano, where one nano (short for nanometer) equates to a thousandth of a micron; that is, one thousandth of a millionth of a meter.
Thus, instead of mumbling, “point zero nine microns” (0.09 µm), one can simply proclaim, “ninety nano” (90 nano) and have done with it. Of course, these both mean exactly the same thing, but if you feel moved to regale your friends on these topics, it’s best to use the vernacular of the day and present yourself as hip and trendy rather than as an old fuddy-duddy from the last millennium.

Antifuse versus SRAM versus …

SRAM-based devices

The majority of FPGAs are based on the use of SRAM configuration cells, which means that they can be configured over and over again. The main advantages of this technique are that new design ideas can be quickly implemented and tested, while evolving standards and protocols can be accommodated relatively easily. Furthermore, when the system is first powered up, the FPGA can initially be programmed to perform one function such as a self-test or board/system test, and it can then be reprogrammed to perform its main task.

Another big advantage of the SRAM-based approach is that these devices are at the forefront of technology. FPGA vendors can leverage the fact that many other companies specializing in memory devices expend tremendous resources on research and development (R&D) in this area. Furthermore, the SRAM cells are created using exactly the same CMOS technologies as the rest of the device, so no special processing steps are required in order to create these components.

In the past, memory devices were often used to qualify the manufacturing processes associated with a new technology node. More recently, the mixture of size, complexity, and regularity associated with the latest FPGA generations has resulted in these devices being used for this task. One advantage of using FPGAs over memory devices to qualify the manufacturing process is that, if there’s a defect, the structure of FPGAs is such that it’s easier to identify and locate the problem (that is, figure out what and where it is).
For example, when IBM and UMC were rolling out their 0.09 µm (90 nano) processes, FPGAs from Xilinx were the first devices to race out of the starting gate.

R&D is pronounced by spelling it out as “R-and-D.”

Unfortunately, there’s no such thing as a free lunch. One downside of SRAM-based devices is that they have to be reconfigured every time the system is powered up. This either requires the use of a special external memory device (which has an associated cost and consumes real estate on the board) or of an on-board microprocessor (or some variation of these techniques; see also Chapter 5).

Security issues and solutions with SRAM-based devices

Another consideration with regard to SRAM-based devices is that it can be difficult to protect your intellectual property, or IP, in the form of your design. This is because the configuration file used to program the device is stored in some form of external memory.

Currently, there are no commercially available tools that will read the contents of a configuration file and generate a corresponding schematic or netlist representation. Having said this, understanding and extracting the logic from the configuration file, while not a trivial task, would not be beyond the bounds of possibility given the combination of clever folks and computing horsepower available today.

Let’s not forget that there are reverse-engineering companies all over the world specializing in the recovery of “design IP.” And there are also a number of countries whose governments turn a blind eye to IP theft so long as the money keeps rolling in (you know who you are). So if a design is a high-profit item, you can bet that there are folks out there who are ready and eager to replicate it while you’re not looking.
In reality, the real issue here is not related to someone stealing your IP by reverse-engineering the contents of the configuration file, but rather their ability to clone your design, irrespective of whether they understand how it performs its magic. Using readily available technology, it is relatively easy for someone to take a circuit board, put it on a “bed of nails” tester, and quickly extract a complete netlist for the board. This netlist can subsequently be used to reproduce the board. Now the only task remaining for the nefarious scoundrels is to copy your FPGA configuration file from its boot PROM (or EPROM, E2PROM, or whatever), and they have a duplicate of the entire design.

On the bright side, some of today’s SRAM-based FPGAs support the concept of bitstream encryption. In this case, the final configuration data is encrypted before being stored in the external memory device. The encryption key itself is loaded into a special SRAM-based register in the FPGA via its JTAG port (see also Chapter 5). In conjunction with some associated logic, this key allows the incoming encrypted configuration bitstream to be decrypted as it’s being loaded into the device.

The command/process of loading an encrypted bitstream automatically disables the FPGA’s read-back capability. This means that you will typically use unencrypted configuration data during development (where you need to use read-back) and then start to use encrypted data when you move into production. (You can load an unencrypted bitstream at any time, so you can easily load a test configuration and then reload the encrypted version.)

The main downside to this scheme is that you require a battery backup on the circuit board to maintain the contents of the encryption key register in the FPGA when power is removed from the system.
This battery will have a lifetime of years or decades because it need only maintain a single register in the device, but it does add to the size, weight, complexity, and cost of the board.

Antifuse-based devices

Unlike SRAM-based devices, which are programmed while resident in the system, antifuse-based devices are programmed off-line using a special device programmer.

JTAG is pronounced “J-TAG”; that is, by spelling out the “J” followed by “tag” to rhyme with “bag.”

The proponents of antifuse-based FPGAs are proud to point to an assortment of (not-insignificant) advantages. First of all, these devices are nonvolatile (their configuration data remains when the system is powered down), which means that they are immediately available as soon as power is applied to the system. Following from their nonvolatility, these devices don’t require an external memory chip to store their configuration data, which saves the cost of an additional component and also saves real estate on the board.

One noteworthy advantage of antifuse-based FPGAs is the fact that their interconnect structure is naturally “rad hard,” which means they are relatively immune to the effects of radiation. This is of particular interest in the case of military and aerospace applications because the state of a configuration cell in an SRAM-based component can be “flipped” if that cell is hit by radiation (of which there is a lot in space). By comparison, once an antifuse has been programmed, it cannot be altered in this way. Having said this, it should also be noted that any flip-flops in these devices remain sensitive to radiation, so chips intended for radiation-intensive environments must have their flip-flops protected by triple redundancy design. This refers to having three copies of each register and taking a majority vote (ideally all three registers will contain identical values, but if one has been “flipped” such that two registers say 0 and the third says 1, then the 0s have it, or vice versa if two registers say 1 and the third says 0).

Radiation can come in the form of gamma rays (very high-energy photons), beta particles (high-energy electrons), and alpha particles. It should be noted that rad-hard devices are not limited to antifuse technologies. Other components, such as those based on SRAM architectures, are available with special rad-hard packaging and triple redundancy design.

But perhaps the most significant advantage of antifuse-based FPGAs is that their configuration data is buried deep inside them. By default, it is possible for the device programmer to read this data out because this is actually how the programmer works. As each antifuse is being processed, the device programmer keeps on testing it to determine when that element has been fully programmed; then it moves on to the next antifuse. Furthermore, the device programmer can be used to automatically verify that the configuration was performed successfully (this is well worth doing when you’re talking about devices containing 50 million plus programmable elements). In order to do this, the device programmer requires the ability to read the actual states of the antifuses and compare them to the required states defined in the configuration file.

Once the device has been programmed, however, it is possible to set (grow) a special security antifuse that subsequently prevents any programming data (in the form of the presence or absence of antifuses) from being read out of the device. Even if the device is decapped (its top is removed), programmed and unprogrammed antifuses appear to be identical, and the fact that all of the antifuses are buried in the internal metallization layers makes it almost impossible to reverse-engineer the design.
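The majority vote used in triple redundancy design, described earlier in this section, can be sketched in a few lines of Python. This is only a behavioral illustration, with a hypothetical function name; in a real device the vote is implemented in hardware by the vendor or the design tools, and nothing here reflects any particular vendor’s circuit.

```python
def majority(r0, r1, r2):
    """Bitwise two-out-of-three vote across three copies of a register.

    Each output bit takes the value held by at least two of the copies,
    so a single flipped bit in any one copy is outvoted.
    """
    return (r0 & r1) | (r0 & r2) | (r1 & r2)

# Single-bit case from the text: two copies say 0, one says 1.
print(majority(0, 0, 1))

# Multi-bit case: the third copy has one bit flipped by radiation.
good = 0b1010
upset = good ^ 0b0100
print(bin(majority(good, good, upset)))
```

Note that the vote masks a single upset per bit position; if two of the three copies of the same bit are flipped, the wrong value wins.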
Vendors of antifuse-based FPGAs may also tout a couple of other advantages relating to power consumption and speed, but if you aren’t careful this can be a case of the quickness of the hand deceiving the eye. For example, they might tease you with the fact that an antifuse-based device consumes only 20 percent (approximately) of the standby power of an equivalent SRAM-based component, that their operational power consumption is also significantly lower, and that their interconnect-related delays are smaller. Also, they might casually mention that an antifuse is much smaller and thus occupies much less real estate on the chip than an equivalent SRAM cell (although they may neglect to mention that antifuse devices also require extra programming circuitry, including a large, hairy programming transistor for each antifuse). They will follow this by noting that when you have a device containing tens of millions of configuration elements, using antifuses means that the rest of the logic can be much closer together. This serves to reduce the interconnect delays, thereby making these devices faster than their SRAM cousins.

And both of the above points would be true … if one were comparing two devices implemented at the same technology node. But therein lies the rub, because antifuse technology requires the use of around three additional process steps after the main manufacturing process has been qualified. For this (and related) reasons, antifuse devices are always at least one, and usually several, generations (technology nodes) behind SRAM-based components, which effectively wipes out any speed or power consumption advantages that might otherwise be of interest.

It’s worth noting that when the MRAM technologies introduced in Chapter 2 come to fruition, these may well change the FPGA landscape. This is because MRAM fuses would be much smaller than SRAM cells (thereby increasing component density and reducing track delays), and they would also consume much less power. Furthermore, MRAM-based devices could be preprogrammed like antifuse-based devices (great for security) and reprogrammed like SRAM-based components (good for prototyping).

1821: England. Michael Faraday invents the first electric motor.

Of course, the main disadvantage associated with antifuse-based devices is that they are OTP, so once you’ve programmed one of these little scallywags, its function is set in stone. This makes these components a poor choice for use in a development or prototyping environment.

EPROM-based devices

This section is short and sweet because no one currently makes, or has plans to make, EPROM-based FPGAs.

E2PROM/FLASH-based devices

E2PROM- or FLASH-based FPGAs are similar to their SRAM counterparts in that their configuration cells are connected together in a long shift-register-style chain. These devices can be configured off-line using a device programmer. Alternatively, some versions are in-system programmable, or ISP, but their programming time is about three times that of an SRAM-based component.

Once programmed, the data they contain is nonvolatile, so these devices would be “instant on” when power is first applied to the system. With regard to protection, some of these devices use the concept of a multibit key, which can range from around 50 bits to several hundred bits in size. Once you’ve programmed the device, you can load your user-defined key (bit-pattern) to secure its configuration data. After the key has been loaded, the only way to read data out of the device, or to write new data into it, is to load a copy of your key via the JTAG port (this port is discussed later in this chapter and also in Chapter 5). The fact that the JTAG port in today’s devices runs at around 20 MHz means that it would take billions of years to crack the key by exhaustively trying every possible value.
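As a back-of-the-envelope check on the exhaustive-search argument above, the following Python sketch estimates how long it would take to try every key of a given length. It assumes (and this is an assumption of the sketch, not a figure from the text) that each attempt costs one JTAG clock cycle per key bit at 20 MHz and that there is no other overhead; the function name is hypothetical.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def years_to_search(key_bits, jtag_hz=20e6):
    """Worst-case years needed to try every possible key.

    Assumes one key bit is shifted in per JTAG clock cycle, so a single
    attempt takes key_bits cycles (an assumption made for illustration).
    """
    attempts = 2 ** key_bits
    seconds = attempts * key_bits / jtag_hz
    return seconds / SECONDS_PER_YEAR

# Compare key lengths across the range quoted in the text.
for bits in (50, 73, 128, 256):
    print(f"{bits:4d}-bit key: about {years_to_search(bits):.3g} years")
```

Under these simplified assumptions, the search time roughly doubles with every extra key bit: a key near the 50-bit end of the range comes out on the order of decades, the billion-year mark is passed at a little over 70 bits, and keys of several hundred bits are far beyond it. Real devices may add per-attempt overhead or lockout behavior that this sketch does not model.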
Two-transistor E2PROM and FLASH cells are approximately 2.5 times the size of their one-transistor EPROM cousins, but they are still way smaller than their SRAM counterparts. This means that the rest of the logic can be much closer together, thereby reducing interconnect delays.

On the downside, these devices require around five additional process steps on top of standard CMOS technology, which results in their lagging behind SRAM-based devices by one or more generations (technology nodes). In addition, these devices tend to have relatively high static power consumption due to their containing vast numbers of internal pull-up resistors.

Hybrid FLASH-SRAM devices

Last but not least, there’s always someone who wants to add yet one more ingredient into the cooking pot. In the case of FPGAs, some vendors offer esoteric combinations of programming technologies. For example, consider a device where each configuration element is formed from the combination of a FLASH (or E2PROM) cell and an associated SRAM cell.

In this case, the FLASH elements can be preprogrammed. Then, when the system is powered up, the contents of the FLASH cells are copied in a massively parallel fashion into their corresponding SRAM cells. This technique gives you the nonvolatility associated with antifuse devices, which means the device is immediately available when power is first applied to the system. But unlike an antifuse-based component, you can subsequently use the SRAM cells to reconfigure the device while it remains resident in the system. Alternatively, you can reconfigure the device using its FLASH cells either while it remains in the system or off-line by means of a device programmer.

Summary

Table 4-1 briefly summarizes the key points associated with the various programming technologies described above.

Table 4-1. Summary of programming technologies

In reality, the vast majority of the configuration cells in an FPGA are associated with its interconnect (as opposed to its configurable logic blocks). For this reason, engineers joke that FPGA vendors actually sell only the interconnect, and they throw in the rest of the logic for free!

1821: England. Michael Faraday plots the magnetic field around a conductor.

Fine-, medium-, and coarse-grained architectures

It is common to categorize FPGA offerings as being either fine grained or coarse grained. In order to understand what this means, we first need to remind ourselves that the main feature that distinguishes FPGAs from other devices is that their underlying fabric predominantly consists of large numbers of relatively simple programmable logic block “islands” embedded in a “sea” of programmable interconnect (Figure 4-1).

Figure 4-1. Underlying FPGA fabric (programmable logic blocks embedded in programmable interconnect).

In the case of a fine-grained architecture, each logic block can be used to implement only a very simple function. For example, it might be possible to configure the block to act as any 3-input function, such as a primitive logic gate (AND, OR, NAND, etc.) or a storage element (D-type flip-flop, D-type latch, etc.).

1821: England. Sir Charles Wheatstone reproduces sound.

In addition to implementing glue logic and irregular structures like state machines, fine-grained architectures are said to be particularly efficient when executing systolic algorithms (functions that benefit from massively parallel implementations). These architectures are also said to offer some advantages with regard to traditional logic synthesis technology, which is geared toward fine-grained ASIC architectures.

The mid-1990s saw a lot of interest in fine-grained FPGA architectures, but over time the vast majority faded away into the sunset, leaving only their coarse-grained cousins.
In the case of a coarse-grained architecture, each logic block contains a relatively large amount of logic compared to its fine-grained counterparts. For example, a logic block might contain four 4-input LUTs, four multiplexers, four D-type flip-flops, and some fast carry logic (see the following topics in this chapter for more details).

An important consideration with regard to architectural granularity is that fine-grained implementations require a relatively large number of connections into and out of each block compared to the amount of functionality that can be supported by those blocks. As the granularity of the blocks increases to medium-grained and higher, the number of connections into the blocks decreases compared to the amount of functionality they can support. This is important because the programmable interblock interconnect accounts for the vast majority of the delays associated with signals as they propagate through an FPGA.

One slight fly in the soup is that a number of companies have recently started developing really coarse-grained device architectures comprising arrays of nodes, where each node is a highly complex processing element ranging from an algorithmic function such as a fast Fourier transform (FFT) all the way up to a complete general-purpose microprocessor core (see also Chapters 6 and 23). Although these devices aren’t classed as FPGAs, they do serve to muddy the waters. For this reason, LUT-based FPGA architectures are now often classed as medium-grained, thereby leaving the coarse-grained appellation free to be applied to these new node-based devices.

MUX- versus LUT-based logic blocks

MUX is pronounced to rhyme with “flux.” LUT is pronounced to rhyme with “nut.”

There are two fundamental incarnations of the programmable logic blocks used to form the medium-grained architectures referenced in the previous section: MUX (multiplexer) based and LUT (lookup table) based.
MUX-based

As an example of a MUX-based approach, consider one way in which the 3-input function y = (a & b) | c could be implemented using a block containing only multiplexers (Figure 4-2).

Figure 4-2. MUX-based logic block.

The device can be programmed such that each input to the block is presented with a logic 0, a logic 1, or the true or inverse version of a signal (a, b, or c in this case) coming from another block or from a primary input to the device. This allows each block to be configured in myriad ways to implement a plethora of possible functions. (The x shown on the input to the central multiplexer in Figure 4-2 indicates that we don’t care whether this input is connected to a 0 or a 1.)

LUT-based

The underlying concept behind a LUT is relatively simple. A group of input signals is used as an index (pointer) to a lookup table. The contents of this table are arranged such that the cell pointed to by each input combination contains the desired value. For example, let’s assume that we wish to implement the function:

y = (a & b) | c

a b c | y
0 0 0 | 0
0 0 1 | 1
0 1 0 | 0
0 1 1 | 1
1 0 0 | 0
1 0 1 | 1
1 1 0 | 1
1 1 1 | 1

Figure 4-3. Required function and associated truth table.

This can be achieved by loading a 3-input LUT with the appropriate values. For the purposes of the following examples, we shall assume that the LUT is formed from SRAM cells (but it could be formed using antifuses, E2PROM, or FLASH cells, as discussed earlier in this chapter). A commonly used technique is to use the inputs to select the desired SRAM cell using a cascade of transmission gates as shown in Figure 4-4. (Note that the SRAM cells will also be connected together in a chain for configuration purposes, that is, to load them with the required values, but these connections have been omitted from this illustration to keep things simple.)

Figure 4-4. A transmission gate-based LUT (programming chain omitted for purposes of clarity).

If a transmission gate is enabled (active), it passes the signal seen on its input through to its output. If the gate is disabled, its output is electrically disconnected from the wire it is driving.

The transmission gate symbols shown with a small circle (called a “bobble” or a “bubble”) indicate that these gates will be activated by a logic 0 on their control input. By comparison, symbols without bobbles indicate that these gates will be activated by a logic 1. Based on this understanding, it’s easy to see how different input combinations can be used to select the contents of the various SRAM cells.

If you take a group of logic gates several layers deep, then a LUT approach can be very efficient in terms of resource utilization and input-to-output delays. (In this context, “deep” refers to the number of logic gates between the inputs and the outputs. Thus, the function illustrated in Figure 4-3 would be said to be two layers deep.) However, one downside to a LUT-based architecture is that if you only want to implement a small function, such as a 2-input AND gate, somewhere in your design, you’ll end up using an entire LUT to do so. In addition to being wasteful in terms of resources, the resulting delays are high for such a simple function.

By comparison, in the case of MUX-based architectures containing a mixture of muxes and logic gates, it’s often possible to gain access to intermediate values from the signals linking the logic gates and the muxes. In this case, each logic block can be broken down into smaller fragments, each of which can be used to implement a simple function. Thus, these architectures may offer advantages in terms of performance and silicon utilization for designs containing large numbers of independent simple logic functions.

MUX-based versus LUT-based?

Once upon a time, when engineers handcrafted their circuits prior to the advent of today’s sophisticated CAD tools, some folks say that it was possible to achieve the best results using MUX-based architectures. (Sad to relate, they usually don’t explain exactly how these results were better, so this is largely left to our imaginations.) It is also said that MUX-based architectures have an advantage when it comes to implementing control logic along the lines of “if this input is true and this input is false, then make that output true …”¹ However, some of these architectures don’t provide high-speed carry logic chains, in which case their LUT-based counterparts are left as the leaders in anything to do with arithmetic processing.

Throughout much of the 1990s, FPGAs were widely used in the telecommunications and networking markets. Both of these application areas involve pushing lots of data around, in which case LUT-based architectures hold the high ground. Furthermore, as designs (and device capacities) grew larger and synthesis technology increased in sophistication, handcrafting circuits largely became a thing of the past. The end result is that the majority of today’s FPGA architectures are LUT-based, as discussed below.

¹ Some MUX-based architectures, such as those fielded by QuickLogic (www.quicklogic.com), feature logic blocks containing multiple layers of MUXes preceded by primitive logic gates like ANDs. This provides them with a large fan-in capability, which gives them an advantage for address decoding and state machine decoding applications.

1822: England. Charles Babbage starts to build a mechanical calculating machine called the Difference Engine.

3-, 4-, 5-, or 6-input LUTs?

As was noted in Chapter 3, some folks prefer to say “combinational logic,” while others favor “combinatorial logic.”

The great thing about an n-input LUT is that it can implement any possible n-input combinational (or combinatorial) logic function. Adding more inputs allows you to represent more complex functions, but every time you add an input, you double the number of SRAM cells.

The first FPGAs were based on 3-input LUTs. FPGA vendors and university students subsequently researched the relative merits of 3-, 4-, 5-, and even 6-input LUTs into the ground (whatever you do, don’t get trapped in conversation with a bunch of FPGA architects at a party). The current consensus is that 4-input LUTs offer the optimal balance of pros and cons.

In the past, some devices were created using a mixture of different LUT sizes, such as 3-input and 4-input LUTs, because this offered the promise of optimal device utilization. However, one of the main tools in the design engineer’s treasure chest is logic synthesis, and uniformity and regularity are what a synthesis tool likes best. Thus, all of the really successful architectures are currently based only on the use of 4-input LUTs. (This is not to say that mixed-size LUT architectures won’t reemerge in the future as design software continues to increase in sophistication.)

LUT versus distributed RAM versus SR

The fact that the core of a LUT in an SRAM-based device comprises a number of SRAM cells offers a number of interesting possibilities. In addition to its primary role as a lookup table, some vendors allow the cells forming the LUT to be used as a small block of RAM (the 16 cells forming a 4-input LUT, for example, could be cast in the role of a 16 × 1 RAM).
This is referred to as distributed RAM because (a) the LUTs are strewn (distributed) across the surface of the chip, and (b) this differentiates it from the larger chunks of block RAM (introduced later in this chapter).

Yet another possibility devolves from the fact that all of the FPGA’s configuration cells, including those forming the LUT, are effectively strung together in a long chain (Figure 4-5).

Figure 4-5. Configuration cells linked in a chain.

This aspect of the architecture is discussed in more detail in Chapter 5. The point here is that, once the device has been programmed, some vendors allow the SRAM cells forming a LUT to be treated independently of the main body of the chain and to be used in the form of a shift register. Thus, each LUT may be considered to be multifaceted (Figure 4-6).

Figure 4-6. A multifaceted LUT (4-input LUT, 16 × 1 RAM, or 16-bit SR).

CLBs versus LABs versus slices

“Man can not live by LUTs alone,” as the Bard would surely say if he were to be reincarnated accidentally as an FPGA designer. For this reason, in addition to one or more LUTs, a programmable logic block will contain other elements, such as multiplexers and registers. But before we delve into this topic, we first need to wrap our brains around some terminology.

1822: France. Andre Ampere discovers that two wires carrying electric currents attract each other.

1827: England. Sir Charles Wheatstone constructs a microphone.

A Xilinx logic cell

One niggle when it comes to FPGAs is that each vendor has its own names for things. But we have to start somewhere, so let’s kick off by saying that the core building block in a modern FPGA from Xilinx is called a logic cell (LC).
Among other things, an LC comprises a 4-input LUT (which can also act as a 16 × 1 RAM or a 16-bit shift register), a multiplexer, and a register (Figure 4-7).

Figure 4-7. A simplified view of a Xilinx LC (signals shown: a, b, c, d, e, y, q, clock, clock enable, and set/reset).

It must be noted that the illustration presented in Figure 4-7 is a gross simplification, but it serves our purposes here. The register can be configured to act as a flip-flop, as shown in Figure 4-7, or as a latch. The polarity of the clock (rising-edge triggered or falling-edge triggered) can be configured, as can the polarity of the clock enable and set/reset signals (active-high or active-low).

In addition to the LUT, MUX, and register, the LC also contains a smattering of other elements, including some special fast carry logic for use in arithmetic operations (this is discussed in more detail a little later).

1827: Germany. Georg Ohm investigates electrical resistance and defines Ohm’s Law.

An Altera logic element

Just for reference, the equivalent core building block in an FPGA from Altera is called a logic element (LE). There are a number of differences between a Xilinx LC and an Altera LE, but the overall concepts are very similar.

Slicing and dicing

The next step up the hierarchy is what Xilinx calls a slice (Altera and the other vendors doubtless have their own equivalent names). Why “slice”? Well, they had to call it something, and, whichever way you look at it, the term slice is “something.” At the time of this writing, a slice contains two logic cells (Figure 4-8).

Figure 4-8. A slice containing two logic cells.

The reason for the “at the time of this writing” qualifier is that these definitions can (and do) change with the seasons. The internal wires have been omitted from this illustration to keep things simple; it should be noted, however, that although each logic cell’s LUT, MUX, and register have their own data inputs and outputs, the slice has one set of clock, clock enable, and set/reset signals common to both logic cells.

The definition of what forms a CLB varies from year to year. In the early days, a CLB consisted of two 3-input LUTs and one register. Later versions sported two 4-input LUTs and two registers. Now, each CLB can contain two or four slices, where each slice contains two 4-input LUTs and two registers. And as for the morrow … well, it would take a braver man than I even to dream of speculating.

CLBs and LABs

And moving one more level up the hierarchy, we come to what Xilinx calls a configurable logic block (CLB) and what Altera refers to as a logic array block (LAB). (Other FPGA vendors doubtless have their own equivalent names for each of these entities, but these are of interest only if you are actually working with their devices.)

Using CLBs as an example, some Xilinx FPGAs have two slices in each CLB, while others have four. At the time of this writing, a CLB equates to a single logic block in our original visualization of “islands” of programmable logic in a “sea” of programmable interconnect (Figure 4-9).

Figure 4-9. A CLB containing four slices (the number of slices depends on the FPGA family).

The point where a set of data or control signals enters or exits a logic function is commonly referred to as a “port.” In the case of a single-port RAM, data is written in and read out of the function using a common data bus.

There is also some fast programmable interconnect within the CLB. This interconnect (not shown in Figure 4-9 for reasons of clarity) is used to connect neighboring slices.
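As a way of tying together the LUT, RAM, and shift register facets with the LC, slice, and CLB hierarchy, here is a minimal behavioral sketch. It is a toy software model, not a description of how any vendor implements its silicon; the class `Lut` and its method names are invented for this illustration, and the bit ordering (first input is the most significant index bit) is an arbitrary assumption.

```python
from itertools import product


class Lut:
    """Toy n-input LUT: 2**n one-bit cells, indexed by the input bits."""

    def __init__(self, n_inputs, cells=None):
        self.n_inputs = n_inputs
        size = 2 ** n_inputs  # each extra input doubles the cell count
        self.cells = list(cells) if cells is not None else [0] * size
        if len(self.cells) != size:
            raise ValueError(f"a {n_inputs}-input LUT needs {size} cells")

    @classmethod
    def from_function(cls, n_inputs, func):
        """Load the cells with func's truth table (first argument = MSB)."""
        rows = product((0, 1), repeat=n_inputs)
        return cls(n_inputs, [int(bool(func(*row))) for row in rows])

    def lookup(self, *bits):
        """Facet 1, lookup table: the inputs select one cell."""
        index = 0
        for bit in bits:
            index = (index << 1) | (bit & 1)
        return self.cells[index]

    def ram_write(self, address, bit):
        """Facet 2, small RAM: the inputs act as an address."""
        self.cells[address] = bit & 1

    def ram_read(self, address):
        return self.cells[address]

    def shift(self, bit_in):
        """Facet 3, shift register: push a bit in, return the bit pushed out."""
        bit_out = self.cells[-1]
        self.cells = [bit_in & 1] + self.cells[:-1]
        return bit_out


# The 3-input example function used earlier in the chapter.
lut3 = Lut.from_function(3, lambda a, b, c: (a & b) | c)
print("cells for abc = 000..111:", lut3.cells)
print("y when a=1, b=1, c=0:", lut3.lookup(1, 1, 0))

# A 4-input LUT reused as a 16 x 1 RAM.
lut4 = Lut(4)
lut4.ram_write(0b1010, 1)
print("RAM read at address 10:", lut4.ram_read(0b1010))

# A 4-input LUT reused as a 16-bit shift register.
sr = Lut(4)
sr.shift(1)
outputs = [sr.shift(0) for _ in range(16)]  # the 1 emerges on the 16th shift
print("shifted-out bits:", outputs)

# Cell count for a CLB with four slices of two LCs each.
LCS_PER_SLICE = 2
SLICES_PER_CLB = 4
CELLS_PER_CLB = SLICES_PER_CLB * LCS_PER_SLICE * len(lut4.cells)
print("LUT cells per CLB:", CELLS_PER_CLB)
```

The same list of cells serves all three roles; only the way it is addressed and updated changes, which is the sense in which a LUT is multifaceted. The final count shows that a four-slice CLB holds 128 LUT cells in total.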
The reason for having this type of logic-block hierarchy, LC → slice (with two LCs) → CLB (with four slices), is that it is complemented by an equivalent hierarchy in the interconnect. Thus, there is fast interconnect between the LCs in a slice, then slightly slower interconnect between slices in a CLB, followed by the interconnect between CLBs. The idea is to achieve the optimum trade-off between making it easy to connect things together and avoiding excessive interconnect-related delays.

Distributed RAMs and shift registers

We previously noted that each 4-input LUT can be used as a 16 × 1 RAM. And things just keep on getting better and better because, assuming the four-slices-per-CLB configuration illustrated in Figure 4-9, all of the LUTs within a CLB can be configured together to implement the following:

■ Single-port 16 × 8 bit RAM
■ Single-port 32 × 4 bit RAM
■ Single-port 64 × 2 bit RAM
■ Single-port 128 × 1 bit RAM
■ Dual-port 16 × 4 bit RAM
■ Dual-port 32 × 2 bit RAM
■ Dual-port 64 × 1 bit RAM

In the case of a dual-port RAM, data is written into the function using one data bus (port) and read out using a second data bus (port). In fact, the read and write operations each have an associated address bus (used to point to a word of interest inside the RAM). This means that the read and write operations can be performed simultaneously.

Alternatively, each 4-input LUT can be used as a 16-bit shift register. In this case, there are special dedicated connections between the logic cells within a slice and between the slices themselves that allow the last bit of one shift register to be connected to the first bit of another without using the ordinary LUT output (which can be used to view the contents of a selected bit within that 16-bit register). This allows the LUTs within a single CLB to be configured together to implement a shift register containing up to 128 bits as required.

Fast carry chains

A key feature of modern FPGAs is that they include the special logic and interconnect required to implement fast carry chains. In the context of the CLBs introduced in the previous section, each LC contains special carry logic. This is complemented by dedicated interconnect between the two LCs in each slice, between the slices in each CLB, and between the CLBs themselves.

This special carry logic and dedicated routing boosts the performance of logical functions such as counters and arithmetic functions such as adders. The availability of these fast carry chains, in conjunction with features like the shift register incarnations of LUTs (discussed above) and embedded multipliers and the like (introduced below), provided the wherewithal for FPGAs to be used for applications like DSP.

DSP is pronounced by spelling it out as “D-S-P.”

Embedded RAMs

A lot of applications require the use of memory, so FPGAs now include relatively large chunks of embedded RAM called e-RAM or block RAM. Depending on the architecture of the component, these blocks might be positioned around the periphery of the device, scattered across the face of the chip in relative isolation, or organized in columns, as shown in Figure 4-10.

Figure 4-10. Bird’s-eye view of chip with columns of embedded RAM blocks (set among arrays of programmable logic blocks).

Depending on the device, such a RAM might be able to hold anywhere from a few thousand to tens of thousands of bits. Furthermore, a device might contain anywhere from tens to hundreds of these RAM blocks, thereby providing a total storage capacity of a few hundred thousand bits all the way up to several million bits.

Each block of RAM can be used independently, or multiple blocks can be combined together to implement larger blocks.
These blocks can be used for a variety of purposes, such as implementing standard single- or dual-port RAMs, first-in first-out (FIFO) functions, state machines, and so forth.

FIFO is pronounced “fi” to rhyme with “hi,” followed by “fo” to rhyme with “no” (like the “Hi-Ho” song in “Snow White and the Seven Dwarfs”).

Embedded multipliers, adders, MACs, etc.

Some functions, like multipliers, are inherently slow if they are implemented by connecting a large number of programmable logic blocks together. Since these functions are required by a lot of applications, many FPGAs incorporate special hardwired multiplier blocks. These are typically located in close proximity to the embedded RAM blocks introduced in the previous section because these functions are often used in conjunction with each other (Figure 4-11).

Figure 4-11. Bird’s-eye view of chip with columns of embedded multipliers and RAM blocks (set among the logic blocks).

1829: England. Sir Charles Wheatstone invents the concertina.

Similarly, some FPGAs offer dedicated adder blocks. One operation that is very common in DSP-type applications is called a multiply-and-accumulate (MAC) (Figure 4-12). As its name would suggest, this function multiplies two numbers together and adds the result to a running total stored in an accumulator.

Figure 4-12. The functions forming a MAC (a multiplier with inputs A[n:0] and B[n:0], an adder, and an accumulator with output Y[(2n - 1):0]).

If the FPGA you are working with supplies only embedded multipliers, you will have to implement this function by combining the multiplier with an adder formed from a number of programmable logic blocks, while the result is stored in some associated flip-flops, in a block RAM, or in a number of distributed RAMs. Life becomes a little easier if the FPGA also provides embedded adders, and some FPGAs provide entire MACs as embedded functions.
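The behavior of a MAC can be sketched in a few lines of software. This is only a behavioral illustration of the multiplier, adder, and accumulator working together; the class name `Mac`, the unsigned arithmetic, and the default 8-bit operand and 24-bit accumulator widths are assumptions chosen for the example rather than properties of any particular device.

```python
class Mac:
    """Toy unsigned multiply-and-accumulate unit."""

    def __init__(self, operand_bits=8, accumulator_bits=24):
        self.operand_mask = (1 << operand_bits) - 1
        self.accumulator_mask = (1 << accumulator_bits) - 1
        self.accumulator = 0  # the running total

    def clear(self):
        self.accumulator = 0

    def step(self, a, b):
        """One operation: multiply, add to the running total, store."""
        product = (a & self.operand_mask) * (b & self.operand_mask)  # multiplier
        total = self.accumulator + product                           # adder
        self.accumulator = total & self.accumulator_mask             # wraps on overflow
        return self.accumulator


def dot_product(xs, ys, mac=None):
    """Sum of xs[i] * ys[i], computed with one MAC step per pair."""
    if mac is None:
        mac = Mac()
    mac.clear()
    for x, y in zip(xs, ys):
        mac.step(x, y)
    return mac.accumulator


print(dot_product([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6
```

A sum of products like this one is the core of many DSP-type operations, which is why it pays to have the whole function available as a single embedded block. Note that the accumulator has to be wider than a single product if the running total is not to overflow.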
Embedded processor cores (hard and soft)

Almost any portion of an electronic design can be realized in hardware (using logic gates and registers, etc.) or software (as instructions to be executed on a microprocessor). One of the main partitioning criteria is how fast you wish the various functions to perform their tasks:

■ Picosecond and nanosecond logic: This has to run insanely fast, which mandates that it be implemented in hardware (in the FPGA fabric).
■ Microsecond logic: This is reasonably fast and can be implemented either in hardware or software (this type of logic is where you spend the bulk of your time deciding which way to go).
■ Millisecond logic: This is the logic used to implement interfaces such as reading switch positions and flashing light-emitting diodes (LEDs). It’s a pain slowing the hardware down to implement this sort of function (using huge counters to generate delays, for example). Thus, it’s often better to implement these tasks as microprocessor code (because processors give you lousy speed compared to dedicated hardware, but fantastic complexity).

The fact is that the majority of designs make use of microprocessors in one form or another. Until recently, these appeared as discrete devices on the circuit board. Of late, high-end FPGAs have become available that contain one or more embedded microprocessors, which are typically referred to as microprocessor cores. In this case, it often makes sense to move all of the tasks that used to be performed by the external microprocessor into the internal core. This provides a number of advantages, not the least being that it saves the cost of having two devices; it eliminates large numbers of tracks, pads, and pins on the circuit board; and it makes the board smaller and lighter.

Hard microprocessor cores

A hard microprocessor core is implemented as a dedicated, predefined block. There are two main approaches for integrating such a core into the FPGA.
The first is to locate it in a strip (actually called “The Stripe”) to the side of the main FPGA fabric (Figure 4-13).

Figure 4-13. Bird’s-eye view of chip with embedded core outside of the main fabric (the “Stripe” holds the microprocessor core, special RAM, peripherals and I/O, etc.).

1831: England. Michael Faraday creates the first electric dynamo.

MCM is pronounced by spelling it out as “M-C-M.”

In this scenario, all of the components are typically formed on the same silicon chip, although they could also be formed on two chips and packaged as a multichip module (MCM). The main FPGA fabric would also include the embedded RAM blocks, multipliers, and the like introduced earlier, but these have been omitted from this illustration to keep things simple.

One advantage of this implementation is that the main FPGA fabric is identical for devices with and without the embedded microprocessor core, which can help make things easier for the design tools used by the engineers. The other advantage is that the FPGA vendor can bundle a whole load of additional functions in the strip to complement the microprocessor core, such as memory, special peripherals, and so forth.

An alternative is to embed one or more microprocessor cores directly into the main FPGA fabric. One, two, and even four core implementations are currently available as I pen these words (Figure 4-14).

Figure 4-14. Bird’s-eye view of chips with embedded cores inside the main fabric: (a) one embedded core; (b) four embedded cores.

1831: England. Michael Faraday creates the first electrical transformer.

Once again, the main FPGA fabric would also include the embedded RAM blocks, multipliers, and the like introduced earlier, but these have been omitted from this illustration to keep things simple.
In this case, the design tools have to be able to take account of the presence of these blocks in the fabric; any memory used by the core is formed from embedded RAM blocks, and any peripheral functions are formed from groups of general-purpose programmable logic blocks. Proponents of this scheme will argue that there are inherent speed advantages to be gained from having the microprocessor core in intimate proximity to the main FPGA fabric.

Soft microprocessor cores

As opposed to embedding a microprocessor physically into the fabric of the chip, it is possible to configure a group of programmable logic blocks to act as a microprocessor. These are typically called soft cores, but they may be more precisely categorized as either “soft” or “firm” depending on the way in which the microprocessor’s functionality is mapped onto the logic blocks (see also the discussions associated with the hard IP, soft IP, and firm IP topics later in this chapter).

1831: England. Michael Faraday discovers magnetic lines of force.

Soft cores are simpler (more primitive) and slower than their hard-core counterparts.² However, they have the advantage that you only need to implement a core if you need it and also that you can instantiate as many cores as you require until you run out of resources in the form of programmable logic blocks.

Clock trees and clock managers

All of the synchronous elements inside an FPGA (for example, the registers configured to act as flip-flops inside the programmable logic blocks) need to be driven by a clock signal. Such a clock signal typically originates in the outside world, comes into the FPGA via a special clock input pin, and is then routed through the device and connected to the appropriate registers.

Clock trees

Consider a simplified representation that omits the programmable logic blocks and shows only the clock tree and the registers to which it is connected (Figure 4-15).
Figure 4-15. A simple clock tree (a clock signal from the outside world enters through a special clock pin and pad and fans out through the clock tree to the flip-flops).

² A soft core typically runs at 30 to 50 percent of the speed of a hard core.

This is called a “clock tree” because the main clock signal branches again and again (the flip-flops can be considered to be the “leaves” on the end of the branches). This structure is used to ensure that all of the flip-flops see their versions of the clock signal as close together as possible. If the clock were distributed as a single long track driving all of the flip-flops one after another, then the flip-flop closest to the clock pin would see the clock signal much sooner than the one at the end of the chain. This is referred to as skew, and it can cause all sorts of problems (even when using a clock tree, there will be a certain amount of skew between the registers on a branch and also between branches).

The clock tree is implemented using special tracks and is separate from the general-purpose programmable interconnect. The scenario shown above is actually very simplistic. In reality, multiple clock pins are available (unused clock pins can be employed as general-purpose I/O pins), and there are multiple clock domains (clock trees) inside the device.

Clock managers

Instead of configuring a clock pin to connect directly into an internal clock tree, that pin can be used to drive a special hard-wired function (block) called a clock manager that generates a number of daughter clocks (Figure 4-16).

Figure 4-16. A clock manager generates daughter clocks (a clock signal from the outside world enters through a special clock pin and pad; the daughter clocks are used to drive internal clock trees or output pins).
These daughter clocks may be used to drive internal clock trees or external output pins that can be used to provide clocking services to other devices on the host circuit board.

A clock manager as described here is referred to as a digital clock manager (DCM) in the Xilinx world. DCM is pronounced by spelling it out as “D-C-M.”

The term hertz was taken from the name of Heinrich Rudolf Hertz, a professor of physics at Karlsruhe Polytechnic in Germany, who first transmitted and received radio waves in a laboratory environment in 1888. One hertz (Hz) equates to “one cycle per second,” so MHz stands for megahertz or “million hertz.”

Each family of FPGAs has its own type of clock manager (there may be multiple clock manager blocks in a device), where different clock managers may support only a subset of the following features:

Jitter removal: For the purposes of a simple example, assume that the clock signal has a frequency of 1 MHz (in reality, of course, this could be much, much higher). In an ideal environment each clock edge from the outside world would arrive exactly one millionth of a second after its predecessor. In the real world, however, clock edges may arrive a little early or a little late. As one way to visualize this effect (known as jitter), imagine if we were to superimpose multiple edges on top of each other; the result would be a “fuzzy” clock (Figure 4-17).

Figure 4-17. Jitter results in a fuzzy clock.

The FPGA’s clock manager can be used to detect and correct for this jitter and to provide “clean” daughter clock signals for use inside the device (Figure 4-18).

Frequency synthesis: It may be that the frequency of the clock signal being presented to the FPGA from the outside world is not exactly what the design engineers wish for. In this case, the clock manager can be used to generate daughter clocks with frequencies that are derived by multiplying or dividing the original signal.
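For those who like to see the arithmetic, the multiply-and-divide idea can be captured in a few lines of Python. This is only a sketch of the sums involved, not a model of any vendor's clock manager; the function name `daughter_clock_hz` and its `multiply` and `divide` parameters are invented for this illustration, and real devices restrict the permitted values to whatever their data sheets allow.

```python
from fractions import Fraction


def daughter_clock_hz(input_hz, multiply=1, divide=1):
    """Return the frequency of a daughter clock derived from input_hz.

    multiply and divide are hypothetical integer settings; a real clock
    manager only supports the ranges listed in its data sheet.
    """
    if multiply < 1 or divide < 1:
        raise ValueError("multiply and divide must be positive integers")
    # Fraction keeps ratios such as 4/5 exact instead of rounding them.
    return Fraction(input_hz) * multiply / divide


input_hz = 1_000_000  # the 1 MHz example clock used in the text

for label, m, d in [("same as input", 1, 1), ("multiplied by 2", 2, 1), ("divided by 2", 1, 2)]:
    f = daughter_clock_hz(input_hz, m, d)
    period_us = float(1 / f) * 1e6
    print(f"{label}: {float(f):.0f} Hz, period {period_us:.2f} microseconds")
```

Combining both settings gives ratios that are not whole numbers; for example, `daughter_clock_hz(1_000_000, multiply=4, divide=5)` evaluates to 800,000 Hz.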
Figure 4-18. The clock manager can remove jitter.

As a really simple example, consider three daughter clock signals: the first with a frequency equal to that of the original clock, the second multiplied to be twice that of the original clock, and the third divided to be half that of the original clock (Figure 4-19).

Figure 4-19. Using the clock manager to perform frequency synthesis.

Once again, Figure 4-19 reflects very simple examples. In the real world, one can synthesize all sorts of internal clocks, such as an output that is four-fifths the frequency of the original clock.

Phase shifting: Certain designs require the use of clocks that are phase shifted (delayed) with respect to each other. Some clock managers allow you to select from fixed phase shifts of common values such as 120° and 240° (for a three-phase clocking scheme) or 90°, 180°, and 270° (if a four-phase clocking scheme is required). Others allow you to configure the exact amount of phase shift you require for each daughter clock. For example, let’s assume that we are deriving four internal clocks from a master clock, where the first is in phase with the original clock, the second is phase shifted by 90°, the third by 180°, and so forth (Figure 4-20).

1831: England. Michael Faraday discovers the principle of electromagnetic induction.

Figure 4-20. Using the clock manager to phase-shift the daughter clocks.

PLL is pronounced by spelling it out as “P-L-L.” DLL is pronounced by spelling it out as “D-L-L.” At this time, I do not know why digital delay-locked loop is not abbreviated to “DDLL.”

Auto-skew correction: For the sake of simplicity, let’s assume that we’re talking about a daughter clock that has been configured to have the same frequency and phase as the main clock signal coming into the FPGA. By default, however, the clock manager will add some element of delay to the signal as it performs its machinations. Also, more significant delays will be added by the driving gates and interconnect employed in the clock’s distribution. The result is that, if nothing is done to correct it, the daughter clock will lag behind the input clock by some amount. Once again, the difference between the two signals is known as skew.

Depending on how the main clock and the daughter clock are used in the FPGA (and on the rest of the circuit board), this can cause a variety of problems. Thus, the clock manager may provide a special input into which the daughter clock is fed back. In this case, the clock manager will compare the two signals and specifically add additional delay to the daughter clock sufficient to realign it with the main clock (Figure 4-21). To be a tad more specific, only the prime (zero phase-shifted) daughter clock will be treated in this way, and all of the other daughter clocks will be phase aligned to this prime daughter clock.

Figure 4-21. Deskewing with reference to the mother clock.

Some FPGA clock managers are based on phase-locked loops (PLLs), while others are based on digital delay-locked loops (DLLs). PLLs have been used since the 1940s in analog implementations, but recent emphasis on digital methods has made it desirable to match signal phases digitally. PLLs can be implemented using either analog or digital techniques, while DLLs are by definition digital in nature.
The proponents of DLLs say that they offer advantages in terms of precision, stability, power management, noise insensitivity, and jitter performance.

General-purpose I/O

Today’s FPGA packages can have 1,000 or more pins, which are arranged as an array across the base of the package. Similarly, when it comes to the silicon chip inside the package, flip-chip packaging strategies allow the power, ground, clock, and I/O pins to be presented across the surface of the chip. Purely for the purposes of these discussions (and illustrations), however, it makes things simpler if we assume that all of the connections to the chip are presented in a ring around the circumference of the device, as indeed they were for many years.

1831: England. Michael Faraday discovers that a moving magnet induces an electric current.

Configurable I/O standards

Let’s consider for a moment an electronic product from the perspective of the architects and engineers designing the circuit board. Depending on what they are trying to do, the devices they are using, the environment the board will operate in, and so on, these guys and gals will select a particular standard to be used to transfer data signals. (In this context, “standard” refers to electrical aspects of the signals, such as their logic 0 and logic 1 voltage levels.)

The problem is that there is a wide variety of such standards, and it would be painful to have to create special FPGAs to accommodate each variation. For this reason, an FPGA’s general-purpose I/O can be configured to accept and generate signals conforming to whichever standard is required. These general-purpose I/O signals will be split into a number of banks; we’ll assume eight such banks numbered from 0 to 7 (Figure 4-22).

Figure 4-22. Bird’s-eye view of chip showing general-purpose I/O banks.
The interesting point is that each bank can be configured individually to support a particular I/O standard. Thus, in addition to allowing the FPGA to work with devices using multiple I/O standards, this allows the FPGA to actually be used to interface between different I/O standards (and also to translate between different protocols that may be based on particular electrical standards).

Configurable I/O impedances

The signals used to connect devices on today’s circuit boards often have fast edge rates (this refers to the time it takes the signal to switch between one logic value and another). In order to prevent signals reflecting back (bouncing around), it is necessary to apply appropriate terminating resistors to the FPGA’s input or output pins.

In the past, these resistors were applied as discrete components that were attached to the circuit board outside the FPGA. However, this technique became increasingly problematic as the number of pins started to increase and their pitch (the distance between them) shrank. For this reason, today’s FPGAs allow the use of internal terminating resistors whose values can be configured by the user to accommodate different circuit board environments and I/O standards.

Core versus I/O supply voltages

In the days of yore (circa 1965 to 1995), the majority of digital ICs used a ground voltage of 0V and a supply voltage of +5V. Furthermore, their I/O signals also switched between 0V (logic 0) and +5V (logic 1), which made life really simple.

Over time, the geometries of the structures on silicon chips became smaller because smaller transistors have lower costs, higher speed, and lower power consumption. However, these processes demanded lower supply voltages, which have continued to fall over the years (Table 4.2).

The point is that this supply (which is actually provided using large numbers of power and ground pins) is used to power the FPGA’s internal logic.
For this reason, this is known as the core voltage. However, different I/O standards may use signals with voltage levels significantly different from the core voltage, so each bank of general-purpose I/Os can have its own additional supply pins.

1832: England. Charles Babbage conceives the first mechanical computer, the Analytical Engine.

1832: America. Joseph Henry discovers self-induction or inductance.

Year    Supply (core voltage, V)    Technology node (nm)
1998    3.3                         350
1999    2.5                         250
2000    1.8                         180
2001    1.5                         150
2003    1.2                         130

Table 4.2. Supply voltages versus technology nodes.

It’s interesting to note that, from the 350 nm node onwards, the core voltage has scaled fairly linearly with the process technology. However, there are physical reasons not to go much below 1V (these reasons are based on technology aspects such as transistor input switching thresholds and voltage drops), so this “voltage staircase” might start to tail off in the not-so-distant future.

Gigabit transceivers

The traditional way to move large amounts of data between devices is to use a bus, a collection of signals that carry similar data and perform a common function (Figure 4-23).

Figure 4-23: Using a bus to communicate between devices.

Early microprocessor-based systems circa 1975 used 8-bit buses to pass data around. As the need to push more data around and to move it faster grew, buses grew to 16 bits in width, then 32 bits, then 64 bits, and so forth. The problem is that this requires a lot of pins on the device and a lot of tracks connecting the devices together. Routing these tracks so that they all have the same length and impedance becomes increasingly painful as boards grow in complexity. Furthermore, it becomes increasingly difficult to manage signal integrity issues (such as susceptibility to noise) when you are dealing with large numbers of bus-based tracks.
For this reason, today’s high-end FPGAs include special hard-wired gigabit transceiver blocks. These blocks use one pair of differential signals (which means a pair of signals that always carry opposite logical values) to transmit (TX) data and another pair to receive (RX) data (Figure 4-24).

1833: England. Michael Faraday defines the laws of electrolysis.

Figure 4-24: Using high-speed transceivers to communicate between devices.

These transceivers operate at incredibly high speeds, allowing them to transmit and receive billions of bits of data per second. Furthermore, each block actually supports a number (say four) of such transceivers, and an FPGA may contain a number of these transceiver blocks.

Hard IP, soft IP, and firm IP

Each FPGA vendor offers its own selection of hard, firm, and soft IP. Hard IP comes in the form of preimplemented blocks such as microprocessor cores, gigabit interfaces, multipliers, adders, MAC functions, and the like. These blocks are designed to be as efficient as possible in terms of power consumption, silicon real estate, and performance. Each FPGA family will feature different combinations of such blocks, together with various quantities of programmable logic blocks.

IP is pronounced by spelling it out as “I-P.” HDL is pronounced by spelling it out as “H-D-L.” VHDL is pronounced by spelling it out as “V-H-D-L.” RTL is pronounced by spelling it out as “R-T-L.” PCI is pronounced by spelling it out as “P-C-I.”

At the other end of the spectrum, soft IP refers to a source-level library of high-level functions that can be included in the users’ designs. These functions are typically represented using a hardware description language, or HDL, such as Verilog or VHDL at the register transfer level (RTL) of abstraction.
Any soft IP functions the design engineers decide to use are incorporated into the main body of the design (which is also specified in RTL) and subsequently synthesized down into a group of programmable logic blocks (possibly combined with some hard IP blocks like multipliers, etc.).

Holding somewhat of a middle ground is firm IP, which also comes in the form of a library of high-level functions. Unlike their soft IP equivalents, however, these functions have already been optimally mapped, placed, and routed into a group of programmable logic blocks (possibly combined with some hard IP blocks like multipliers, etc.). One or more copies of each predefined firm IP block can be instantiated (called up) into the design as required.

The problem is that it can be hard to draw the line between those functions that are best implemented as hard IP and those that should be implemented as soft or firm IP (using a number of general-purpose programmable logic blocks). In the case of functions like the multipliers, adders, and MACs discussed earlier in this chapter, these are generally useful for a wide range of applications. On the other hand, some FPGAs contain dedicated blocks to handle specific interface protocols like the PCI standard. It can, of course, make your life a lot easier if this happens to be the interface with which you wish to connect your device to the rest of the board. On the other hand, if you decide you need to use some other interface, a dedicated PCI block will serve only to waste space, block traffic, and burn power in your chip.

Generally speaking, once FPGA vendors add a function like this into their device, they’ve essentially placed the component into a niche. Sometimes you have to do this to achieve the desired performance, but this is a classic problem because the next generation of the device is often fast enough to perform this function in its main (programmable) fabric.
System gates versus real gates

One common metric used to measure the size of a device in the ASIC world is that of equivalent gates. The idea is that different vendors provide different functions in their cell libraries, where each implementation of each function requires a different number of transistors. This makes it difficult to compare the relative capacity and complexity of two devices.

The answer is to assign each function an equivalent gate value along the lines of “Function A equates to five equivalent gates; function B equates to three equivalent gates …” The next step is to count all of the instances of each function, convert them into their equivalent gate values, sum all of these values together, and proudly proclaim, “My ASIC contains 10 million equivalent gates, which makes it much bigger than your ASIC!”

Unfortunately, nothing is simple because the definition of what actually constitutes an equivalent gate can vary depending on whom one is talking to. One common convention is for a 2-input NAND function to represent one equivalent gate. Alternatively, some vendors define an equivalent gate as equaling an arbitrary number of transistors. And a more esoteric convention defines an ECL equivalent gate as being “one-eleventh the minimum logic required to implement a single-bit full adder” (who on earth came up with this one?). As usual, the best policy here is to make sure that everyone is talking about the same thing before releasing your grip on your hard-earned money.

And so we come to FPGAs. One of the problems FPGA vendors run into occurs when they are trying to establish a basis for comparison between their devices and ASICs. For example, if someone has an existing ASIC design that contains 500,000 equivalent gates and he wishes to migrate this design into an FPGA implementation, how can he tell if his design will fit into a particular FPGA? The fact that each 4-input LUT can be used to represent anywhere between one and more than twenty 2-input primitive logic gates makes such a comparison rather tricky.

1837: America. Samuel Finley Breese Morse exhibits an electric telegraph.

1837: England. Sir Charles Wheatstone and Sir William Fothergill Cooke patent the five-needle electric telegraph.

In order to address this issue, FPGA vendors started talking about system gates in the early 1990s. Some folks say that this was a noble attempt to use terminology that ASIC designers could relate to, while others say that it was purely a marketing ploy that didn’t do anyone any favors.

Sad to relate, there appears to be no clear definition as to exactly what a system gate is. The situation was difficult enough when FPGAs essentially contained only generic programmable logic in the form of LUTs and registers. Even then, it was hard to state whether or not a particular ASIC design containing x equivalent gates could fit into an FPGA containing y system gates. This is because some ASIC designs may be predominantly combinatorial, while others may make excessively heavy use of registers. Both cases may result in a suboptimal mapping onto the FPGA.

The problem became worse when FPGAs started containing embedded blocks of RAM, because some functions can be implemented much more efficiently in RAM than in general-purpose logic. And the fact that LUTs can act as distributed RAM only serves to muddy the waters; for example, one vendor’s system gate count values now include the qualifier, “Assumes 20 percent to 30 percent of LUTs are used as RAM.” And, of course, the problems are exacerbated when we come to consider FPGAs containing embedded processor cores and similar functions, to the extent that some vendors now say, “System gate values are not meaningful for these devices.”

Is there a rule of thumb that allows you to convert system gates to equivalent gates and vice versa? Sure, there are lots of them!
Some folks say that if you are feeling optimistic, then you should divide the system gate value by three (in which case three million FPGA system gates would equate to one million ASIC equivalent gates, for example). Or if you’re feeling a tad more on the pessimistic side, you could divide the system gates by five (in which case three million system gates would equate to 600,000 equivalent gates).

However, other folks would say that the above is only true if you assume that the system gates value encompasses all of the functions that you can implement using both the general-purpose programmable logic and the block RAMs. These folks would go on to say that if you remove the block RAMs from the equation, then you should divide the system gates value by ten (in which case, three million system gates would equate to only 300,000 equivalent gates), but in this case you still have the block RAMs to play with … arrggghhhh!

Ultimately, this topic spirals down into such a quagmire that even the FPGA vendors are trying desperately not to talk about system gates any more. When FPGAs were new on the scene, people were comfortable with the thought of equivalent gates and not so at ease considering designs in terms of LUTs, slices, and the like; however, the vast number of FPGA designs that have been undertaken over the years means that engineers are now much happier thinking in FPGA terms. For this reason, speaking as someone living in the trenches, I would prefer to see FPGAs specified and compared using only simple counts of:

■ Number of logic cells or logic elements or whatever (which equates to the number of 4-input LUTs and associated flip-flops/latches)
■ Number (and size) of embedded RAM blocks
■ Number (and size) of embedded multipliers
■ Number (and size) of embedded adders
■ Number (and size) of embedded MACs
■ etc.

Why is this so hard?
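Purely to make the competing rules of thumb above concrete, here is a small Python sketch that applies each divisor in turn. The dictionary keys and the function name `equivalent_gates` are labels invented for this illustration; the divisors of three, five, and ten are the folklore values quoted above, not anything you will find in a vendor's data sheet.

```python
# Folklore divisors for turning FPGA "system gates" into ASIC "equivalent gates".
# The labels are made up for this sketch; only the numbers come from the text.
RULES_OF_THUMB = {
    "optimistic": 3,          # system gate value includes block RAM
    "pessimistic": 5,         # system gate value includes block RAM
    "excluding_block_ram": 10,  # programmable logic only
}


def equivalent_gates(system_gates, rule):
    """Estimate ASIC equivalent gates from an FPGA system gate count."""
    return system_gates // RULES_OF_THUMB[rule]


for rule in RULES_OF_THUMB:
    estimate = equivalent_gates(3_000_000, rule)
    print(f"{rule}: 3,000,000 system gates is roughly {estimate:,} equivalent gates")
```

The spread between the three answers for the same device is, of course, the whole problem.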
And it would be really useful to take a diverse suite of real-world ASIC design examples, giving their equivalent gate values, along with details as to their flops/latches, primitive gates, and other more complex functions, then to relate each of these examples to the number of LUTs and flip-flops/latches required in equivalent FPGA implementations, along with the amount of embedded RAM and the number of other embedded functions. Even this would be less than ideal, of course, because one tends to design things differently for FPGA and ASIC targets, but it would be a start.

1842: America. Joseph Henry discovers that an electrical spark between two conductors is able to induce magnetism in needles; this effect is detected at a distance of 30 meters.

1842: Scotland. Alexander Bain demonstrates first electromechanical means to capture, transmit, and reproduce an image.

FPGA years

We’ve all heard it said that each year for a dog is equivalent to seven human years, the idea being that a 10-year-old pooch would be 70 years old in human terms. Thinking like this doesn’t actually do anyone much good. On the other hand, it does provide a useful frame of reference so that when your hound can no longer keep up with you on a long walk, you can say, “Well, it’s only to be expected because the poor old fellow is almost 100 years old” (or whatever).

Similarly, in the case of FPGAs, it may help to think that one of their years equates to approximately 15 human years. Thus, if you’re working with an FPGA that was only introduced to the market within the last year, you should view it as a teenager. On the one hand, if you have high hopes for the future, he or she may end up with a Nobel Peace Prize or as the President of the United States. On the other hand, the object of your affections will typically have a few quirks that you have to get used to and learn to work around.
By the time an FPGA has been on the market for two years (equating to 30 years in human terms), you can start to think of it as reasonably mature and a good all-rounder at the peak of its abilities. After three years (45 years old), an FPGA is becoming somewhat staid and middle-aged, and by four years (60 years old), you should treat it with respect and make sure that you don’t try to work it like a carthorse!

Chapter 5

Programming (Configuring) an FPGA

Weasel words

Before plunging headfirst into this topic, it’s probably appropriate to preface our discussions with a few “weasel words” (always remember the saying, “Eagles may soar, but weasels rarely get sucked into jet engines at 10,000 feet!”).

The point is that each FPGA vendor has its own unique terminology and its own techniques and protocols for doing things. To make life even more exciting, the detailed mechanisms for programming FPGAs can vary on a family-by-family basis. For these reasons, the following discussions are intended to provide only a generic introduction to this subject.

Configuration files, etc.

Section 2 of this book describes a variety of tools and flows that may be used to capture and implement FPGA designs. The end result of all of these techniques is a configuration file (sometimes called a bit file), which contains the information that will be uploaded into the FPGA in order to program it to perform a specific function.

In the case of SRAM-based FPGAs, the configuration file contains a mixture of configuration data (bits that are used to define the state of programmable logic elements directly) and configuration commands (instructions that tell the device what to do with the configuration data). When the configuration file is in the process of being loaded into the device, the information being transferred is referred to as the configuration bitstream.

1843: England. Augusta Ada Lovelace publishes her notes explaining the concept of a computer.
E²-based and FLASH-based devices are programmed in a similar manner to their SRAM-based cousins. By comparison, in the case of antifuse-based FPGAs, the configuration file predominantly contains only a representation of the configuration data that will be used to grow the antifuses.

Configuration cells

The underlying concept associated with programming an FPGA is relatively simple (i.e., load the configuration file into the device). It can, however, be a little tricky to wrap one’s brain around all of the different facets associated with this process, so we’ll start with the basics and work our way up. Initially, let’s assume we have a rudimentary device consisting only of an array of very simple programmable logic blocks surrounded by programmable interconnect (Figure 5-1).

Figure 5-1. Top-down view of simple FPGA architecture.

Any facets of the device that may be programmed are done so by means of special configuration cells. The majority of FPGAs are based on the use of SRAM cells, but some employ FLASH (or E²) cells, while others use antifuses.

Irrespective of the underlying technology, the device’s interconnect has a large number of associated cells that can be used to configure it so as to connect the device’s primary inputs and outputs to the programmable logic blocks and these logic blocks to each other. (In the case of the device’s primary I/Os, which are not shown in Figure 5-1, each has a number of associated cells that can be used to configure them to accommodate specific I/O interface standards and so forth.)

For the purpose of this portion of our discussions, we shall assume that each programmable logic block comprises only a 4-input LUT, a multiplexer, and a register (Figure 5-2). The multiplexer requires an associated configuration cell to specify which input is to be selected.
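To give a feel for what “configuration cells” mean for the LUT and the multiplexer in such a block, here is a minimal Python sketch. It relies on the fact that a 4-input LUT has 2^4 = 16 possible input combinations and therefore needs 16 stored bits; everything else (the names `lut4` and `mux2`, and the choice of input a as the least significant address bit) is an assumption made for the sake of the example rather than a description of any real device.

```python
def lut4(config_cells, a, b, c, d):
    """Look up the output of a 4-input LUT.

    config_cells is a list of 16 bits, one per input combination.
    Treating input a as the least significant address bit is an
    assumption of this sketch, not a vendor convention.
    """
    if len(config_cells) != 16:
        raise ValueError("a 4-input LUT needs exactly 16 configuration bits")
    index = (d << 3) | (c << 2) | (b << 1) | a
    return config_cells[index]


def mux2(select_cell, in0, in1):
    """Two-input multiplexer steered by a single configuration cell."""
    return in1 if select_cell else in0


# Loading different bit patterns turns the same hardware into different functions.
AND4 = [0] * 15 + [1]                                # 1 only when a = b = c = d = 1
XOR4 = [bin(i).count("1") & 1 for i in range(16)]    # 1 when an odd number of inputs are 1

print(lut4(AND4, 1, 1, 1, 1), lut4(AND4, 1, 0, 1, 1))
print(lut4(XOR4, 1, 0, 0, 0), lut4(XOR4, 1, 1, 0, 0))
```

The point to take away is that the “logic” is nothing more than the stored pattern; change the 16 bits and the same block performs a different function.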
The register requires associated cells to specify whether it is to act as an edge-triggered flip-flop (as shown in Figure 5-2) or a level-sensitive latch, whether it is to be triggered by a positive- or negative-going clock edge (in the case of the flip-flop option) or an active-low or active-high enable (if the register is instructed to act as a latch), and whether it is to be initialized with a logic 0 or a logic 1. Meanwhile, the 4-input LUT is itself based on 16 configuration cells.

Figure 5-2. A very simple programmable logic block.

LUT is pronounced to rhyme with “nut.”

Antifuse-based FPGAs

In the case of antifuse-based FPGAs, the antifuse cells can be visualized as scattered across the face of the device at strategic locations. The device is placed in a special device programmer, the configuration (bit) file is uploaded into the device programmer from the host computer, and the device programmer uses this file to guide it in applying pulses of relatively high voltage and current to selected pins to grow each antifuse in turn.

FLASH (and E²)-based devices are typically programmed in a similar manner to their SRAM cousins. Unlike SRAM-based FPGAs, FLASH-based devices are nonvolatile. They retain their configuration when power is removed from the system, and they don’t need to be reprogrammed when power is reapplied to the system (although they can be if required). Also, FLASH-based devices can be programmed in-system (on the circuit board) or outside the system by means of a device programmer.

A very simplified way of thinking about this is that each antifuse has a “virtual” x-y location on the surface of the chip, where these x-y values are specified as integers. Based on this scenario, we can visualize using one group of I/O pins to represent the x value associated with a particular antifuse and another group of pins to represent the y value.
(Things are more complicated in the real world, but this is a nice way to think about things that doesn’t tax our brains too much.)

Once all of the antifuses have been grown, the FPGA is removed from the device programmer and attached to a circuit board. Antifuse-based devices are, of course, one-time programmable (OTP) because once you’ve started the programming process, you’re committed and it’s too late to change your mind.

SRAM-based FPGAs

For the remainder of this chapter we shall consider only SRAM-based FPGAs. Remember that these devices are volatile, which means that they have to be programmed in-system (on the circuit board), and they always need to be reprogrammed when power is first applied to the system.

From the outside world, we can visualize all of the SRAM configuration cells as comprising a single (long) shift register. Consider a simple bird’s-eye view of the surface of the chip showing only the I/O pins/pads and the SRAM configuration cells (Figure 5-3).

Figure 5-3. Visualizing the SRAM cells as a long shift register.

As a starting point, we shall assume that the beginning and end of this register chain are directly accessible from the outside world. However, it’s important to note that this is only the case when using the configuration port programming mechanism in conjunction with the serial load with FPGA as master or serial load with FPGA as slave programming modes, as discussed below.

Also note that the configuration data out pin/signal shown in Figure 5-3 is only used if multiple FPGAs are to be configured by cascading (daisy-chaining) them together or if it is required to be able to read the configuration data back out of the device for any reason.
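The shift-register picture is easy to play with in software. The following Python sketch is a toy model only: the class name `ConfigChain`, the `load` helper, and the four-cell chains are all invented for this illustration, and real devices layer vendor-specific commands and protocols on top of this idea. It simply shows bits entering at configuration data in, marching along the chain, and leaving at configuration data out, where they can feed a second, daisy-chained device.

```python
from collections import deque


class ConfigChain:
    """Toy model of the configuration cells viewed as one long shift register."""

    def __init__(self, length):
        # Index 0 is the cell nearest the "configuration data in" pin.
        self.cells = deque([0] * length, maxlen=length)

    def shift_in(self, bit):
        """Clock one bit in and return the bit pushed out of the far end."""
        data_out = self.cells[-1]
        self.cells.appendleft(bit)  # maxlen discards the cell at the far end
        return data_out


def load(chains, bitstream):
    """Shift a bitstream through a cascade of devices.

    Each device's data out feeds the next device's data in, which is a
    simplification of how real daisy-chained configuration behaves.
    """
    for bit in bitstream:
        for chain in chains:
            bit = chain.shift_in(bit)


first, second = ConfigChain(4), ConfigChain(4)
load([first, second], [1, 0, 1, 1, 0, 0, 1, 0])
print(list(first.cells), list(second.cells))
```

After eight clocks, the first four bits of the stream have passed right through the first device and come to rest in the second, while the last four bits remain in the first.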
The quickness of the hand deceives the eye

It isn’t really necessary to know this bit, so if you’re in a hurry, you can bounce over into the next section, but I found this interesting and thought you might find it to be so also. As Figure 5-3 shows, the easiest way to visualize the internal organization of the SRAM programming cells is as a long shift register. If this were really the case, then each cell would be implemented as a flip-flop, and all of the flip-flops in the chain would be driven by a common clock.

The problem is that an FPGA can contain a humongous number of configuration cells. By 2003, for example, a reasonably high-end device could easily contain 25 million such cells! The core of a flip-flop requires eight transistors, while the core of a latch requires only four transistors. For this reason, the configuration cells in an SRAM-based FPGA are formed from latches. (In our example device with 25 million configuration cells, this results in a saving of 100 million transistors, which is nothing to sneeze at.)

Programming an FPGA can take a significant amount of time. Consider a reasonably high-end device containing 25 million SRAM-based configuration cells. Programming such a device using a serial mode and a 25 MHz clock would take one second. This isn’t too bad when you are first powering up a system, but it means that you really don’t want to keep on reconfiguring the FPGA when the system is in operation.

1843: England. Sir Charles Wheatstone and Sir William Fothergill Cooke patent the 2-needle electrical telegraph.

The problem is that you can’t create a shift register out of latches (well, actually you can, as is discussed a little later in this chapter, but not one that’s 25 million cells long). The way the FPGA vendors get around this is to have a group of flip-flops (say 1,024) sharing a common clock and configured as a classic shift register. This group is called a frame.
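The two back-of-the-envelope figures quoted above (one second to load the device and 100 million transistors saved) are easy to check. The short Python sketch below does nothing more than repeat that arithmetic using the numbers from the text; the function and constant names are invented for this example, and the one-bit-per-clock assumption is the simple serial case described above.

```python
CONFIG_CELLS = 25_000_000      # example high-end device from the text
SERIAL_CLOCK_HZ = 25_000_000   # 25 MHz configuration clock
FLIP_FLOP_TRANSISTORS = 8      # core of a flip-flop
LATCH_TRANSISTORS = 4          # core of a latch


def serial_config_seconds(cells, clock_hz):
    """Time to load every cell when one bit is transferred per clock."""
    return cells / clock_hz


def transistors_saved(cells):
    """Transistors saved by building each cell from a latch, not a flip-flop."""
    return cells * (FLIP_FLOP_TRANSISTORS - LATCH_TRANSISTORS)


print(f"Serial configuration time: {serial_config_seconds(CONFIG_CELLS, SERIAL_CLOCK_HZ):.1f} s")
print(f"Transistors saved: {transistors_saved(CONFIG_CELLS):,}")
```

Halving the clock frequency or doubling the number of cells doubles the load time, which is why reconfiguring a large device while the system is running is not something to be done lightly.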
The 25 million configuration cells/latches in our example device are also divided up into frames, each being the same length as the flip-flop frame. From the viewpoint of the outside world, you simply clock the 25 million bits of configuration data into the device. Inside the device, however, as soon as the first 1,024 bits have been serially loaded into the flip-flop frame, special-purpose internal circuitry automatically parallel copies/loads this data into the first latch frame. When the next 1,024 bits have been loaded into the flip-flop frame, they are automatically parallel copied/loaded into the second latch frame, and so on for the rest of the device. (The process is reversed when data is read out of the device.)

Programming embedded (block) RAMs, distributed RAMs, etc.

In the case of FPGAs containing large blocks of embedded (block) RAM, the cores of these blocks are implemented out of SRAM latches, and each of these latches is a configuration cell that forms a part of our “imaginary” register chain (as discussed in the previous section).

One interesting point is that each 4-input LUT (see Figure 5-2) can be configured to act as a LUT, as a small (16 × 1) chunk of distributed RAM, or as a 16-bit shift register. All of these manifestations employ the same group of 16 SRAM latches, where each of these latches is a configuration cell that forms a part of our imaginary register chain.

“But what about the 16-bit shift register incarnation,” you cry. “Doesn’t this need to be implemented using real flip-flops?” Well, that’s a good question; I’m glad you asked. In fact, a trick circuit is employed using the concept of a capacitive latch that prevents classic race conditions (this is pretty much the same way designers built flip-flops out of discrete transistors, resistors, and capacitors in the early 1960s).

Multiple programming chains

Figure 5-3 shows the configuration cells presented as a single programming chain.
As there can be tens of millions of configuration cells, this chain can be very long indeed. Some FPGAs are architected so that the configuration port actually drives a number of smaller chains. This allows individual portions of the device to be configured and facilitates a variety of concepts such as modular and incremental design (these concepts are discussed in greater detail in Section 2).

Quickly reinitializing the device

As was previously noted, the register in the programmable logic block has an associated configuration cell that specifies whether it is to be initialized with a logic 0 or a logic 1. Each FPGA family typically provides some mechanism such as an initialization pin that, when placed in its active state, causes all of these registers to be returned to their initialization values (this mechanism does not reinitialize any embedded [block] or distributed RAMs).

Using the configuration port

The early FPGAs made use of something called the configuration port. Even today, when more sophisticated techniques are available (like the JTAG interface discussed later in this chapter), the configuration port is still widely used because it’s relatively simple and is well understood by stalwarts in the FPGA fraternity.

1844: America. Morse Telegraph connects Washington and Baltimore.

1845: England. Michael Faraday discovers the rotation of polarised light by magnetism.

We start with a small group of dedicated configuration mode pins that are used to inform the device which configuration mode is going to be used. In the early days, only two pins were employed to provide four modes, as shown in Table 5-1.

Table 5-1. The four original configuration modes

Note that the names of the modes shown in this table, and also the relationship between the codes on the mode pins and the modes themselves, are intended for use only as an example. The actual codes and mode names vary from vendor to vendor.
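To make the idea of mode pins a little more tangible, here is a small Python sketch. Everything in it is an assumption made for illustration: the particular pin codes, the dictionary MODES, and the helper names decode_mode and pins_needed are all hypothetical, and (as just noted) real codes and mode names differ from vendor to vendor.

```python
# Hypothetical pin codes; real devices use whatever encoding their vendor chose.
MODES = {
    (0, 0): "serial load with FPGA as master",
    (0, 1): "serial load with FPGA as slave",
    (1, 0): "parallel load with FPGA as master",
    (1, 1): "parallel load with FPGA as slave",
}


def decode_mode(pin1, pin0):
    """Return the configuration mode selected by two hardwired mode pins."""
    return MODES[(pin1, pin0)]


def pins_needed(num_modes):
    """Smallest number of mode pins that can distinguish num_modes modes."""
    return (num_modes - 1).bit_length()


print(decode_mode(1, 0))
print(pins_needed(4))
```

The second helper captures the only hard fact here: n pins can select between at most 2^n modes, so four modes fit in two pins, and any further mode would call for a third pin.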
The mode pins are typically hardwired to the desired logic 0 and logic 1 values at the circuit board level. (These pins can be driven from some other logic that allows the programming mode to be modified, but this is rarely done in practice.)

In addition to the hardwired mode pins, an additional pin is used to instruct the FPGA to actually commence the configuration, while yet another pin is used by the device to report back when it’s finished (there are also ways to determine if an error occurred during the process). This means that in addition to configuring the FPGA when the system is first powered up, the device may also be reinitialized using the original configuration data, if such an occurrence is deemed necessary.

The configuration port also makes use of additional pins to control the loading of the data and to input the data itself. The number of these pins depends on the configuration mode selected, as discussed below. The important point here is that once the configuration has been completed, most of these pins can subsequently be used as general-purpose I/O pins (we will return to this point a little later).

Serial load with FPGA as master

This is perhaps the simplest programming mode. In the early days, it involved the use of an external PROM. This was subsequently superseded by an EPROM, then an E²PROM, and now (most commonly) a FLASH-based device. This special-purpose memory component has a single data output pin that is connected to a configuration data in pin on the FPGA (Figure 5-4).

[Figure 5-4. Serial load with FPGA as master. Labels: memory device, control, configuration data in (Cdata In), configuration data out (Cdata Out), FPGA.]

The FPGA also uses several signals to control the external memory device, such as a reset signal to inform it when the FPGA is ready to start reading data and a clock signal to clock the data out.
The idea with this mode is that the FPGA doesn’t need to supply the external memory device with a series of addresses. Instead, it simply pulses the reset signal to indicate that it wishes to start reading data from the beginning, and then it sends a series of clock pulses to clock the configuration data out of the memory device.

The configuration data out signal coming from the FPGA need only be connected if it is required to read the configuration data from the device for any reason. One such scenario occurs when there are multiple FPGAs on the circuit board. In this case, each could have its own dedicated external memory device and be configured in isolation, as shown in Figure 5-4. Alternatively, the FPGAs could be cascaded (daisy-chained) together and share a single external memory (Figure 5-5).

[Figure 5-5. Daisy-chaining FPGAs.]

In this scenario, the first FPGA in the chain (the one connected directly to the external memory) would be configured to use the serial master mode, while the others would be serial slaves, as discussed later in this chapter.

1845: England. First use of the electric telegraph to help apprehend a criminal.

Parallel load with FPGA as master

In many respects, this is very similar to the previous mode, except that the data is read in 8-bit chunks from a memory device with eight output pins. Groups of eight bits are very common and are referred to as bytes. In addition to providing control signals, the original FPGAs also supplied the external memory device with an address that was used to point to whichever byte of configuration data was to be loaded next (Figure 5-6).

[Figure 5-6. Parallel load with FPGA as master (original technique). Labels: memory device, control, address, configuration data [7:0] (Cdata In[7:0]), FPGA.]

When electronics and computing first started, defining things was something of a free-for-all. The end result was that different companies had their own definitions for things like bytes, and it was common to see 5-, 6-, 7-, 8-, and even 9-bit bytes. It was quite some time before the consensus settled on 8-bit bytes, at which time everyone was happy (apart from those who weren’t, but they don’t count).

Groups of four bits are also common and are given the special name of nybble (sometimes nibble). The idea is that “two nybbles make a byte,” which is a (little) joke. This goes to show that engineers do have a sense of humor; it’s just not very sophisticated.

The way this worked was that the FPGA had an internal counter that was used to generate the address for the external memory. (The original FPGAs had 24-bit counters, which allowed them to address 16 million bytes of data.) At the beginning of the configuration sequence, this counter would be initialized with zero. After the byte of data pointed to by the counter had been read, the counter would be incremented to point to the next byte of data. This process would continue until all of the configuration data had been loaded.

It’s easy to assume that this parallel-loading technique offers speed advantages, but it didn’t for quite some time. This is because, in early devices, as soon as a byte of data had been read into the device, it was clocked into the internal configuration shift register in a serial manner. Happily, this situation has been rectified in more modern FPGA families. On the other hand, although the eight pins can be used as general-purpose I/O pins once the configuration data has been loaded, in reality this is less than ideal. This is because these pins still have the tracks connecting them to the external memory device, which can cause a variety of signal integrity problems.

The real reason why this technique was so popular in the days of yore is that the special-purpose memory devices used in the serial load with FPGA as master mode were quite expensive.
By comparison, this parallel technique allowed design engineers to use off-the-shelf memory devices, which were much cheaper.

Having said this, special-purpose memory devices created for use with FPGAs are now relatively inexpensive (and being FLASH-based, they are also reusable). Thus, modern FPGAs now use a new variation on this parallel-loading technique. In this case, the external memory is a special-purpose device that doesn’t require an external address, which means that the FPGA no longer requires an internal counter for this purpose (Figure 5-7).

[Figure 5-7. Parallel load with FPGA as the master (modern technique). Labels: memory device, control, configuration data [7:0] (Cdata In[7:0]), FPGA.]

As for the serial mode discussed earlier, the FPGA simply pulses the external memory device’s reset signal to indicate that it wishes to start reading data from the beginning, and then it sends a series of clock pulses to clock the configuration data out of the memory device.

1845: England/France. First telegraph cable is laid across the English Channel.

1846: Germany. Gustav Kirchhoff defines Kirchhoff’s laws of electrical networks.

Parallel load with FPGA as slave

The modes discussed above, in which the FPGA is the master, are attractive because of their inherent simplicity and also because they only require the FPGA itself, along with a single external memory device. However, a large number of circuit boards also include a microprocessor, which is typically already used to perform a wide variety of housekeeping tasks. In this case, the design engineers might decide to use the microprocessor to load the FPGA (Figure 5-8).

[Figure 5-8. Parallel load with FPGA as slave.]

The idea here is that the microprocessor is in control. The microprocessor informs the FPGA when it wishes to commence the configuration process.
It then reads a byte of data from the appropriate memory device (or peripheral, or whatever), writes that data into the FPGA, reads the next byte of data from the memory device, writes that byte into the FPGA, and so on until the configuration is complete.

This scenario conveys a number of advantages, not the least being that the microprocessor might be used to query the environment in which its surrounding system resides and to then select the configuration data to be loaded into the FPGA accordingly.

Serial load with FPGA as slave

This mode is almost identical to its parallel counterpart, except that only a single bit is used to load data into the FPGA (the microprocessor still reads data out of the memory device one byte at a time, but it then converts this data into a series of bits to be written to the FPGA).

The main advantage of this approach is that it uses fewer I/O pins on the FPGA. This means that, following the configuration process, only a single I/O pin has the additional track required to connect it to the microprocessor’s data bus.

Using the JTAG port

Like many other modern devices, today’s FPGAs are equipped with a JTAG port. Standing for the Joint Test Action Group and officially known to engineers by its IEEE 1149.1 specification designator, JTAG was originally designed to implement the boundary scan technique for testing circuit boards and ICs.

JTAG is pronounced by spelling out the “J,” followed by “tag” to rhyme with “bag.”

A detailed description of JTAG and boundary scan is beyond the scope of this book. For our purposes here, it is sufficient to understand that the FPGA has a number of pins that are used as a JTAG port. One of these pins is used to input JTAG data, and another is used to output that data. Each of the FPGA’s remaining I/O pins has an associated JTAG register (a flip-flop), where these registers are daisy-chained together (Figure 5-9).

[Figure 5-9. JTAG boundary scan registers. Labels: JTAG data in, JTAG data out, JTAG flip-flops; input pin from outside world, input pad, to internal logic, from previous JTAG flip-flop; from internal logic, output pad, output pin to outside world, to next JTAG flip-flop.]

1847: England. George Boole publishes his first ideas on symbolic logic.

The idea behind boundary scan is that, by means of the JTAG port, it’s possible to serially clock data into the JTAG registers associated with the input pins, let the device (the FPGA in this case) operate on that data, store the results from this processing in the JTAG registers associated with the output pins, and, ultimately, serially clock this result data back out of the JTAG port.

However, JTAG devices also contain a variety of additional JTAG-related control logic, and, with regard to FPGAs, JTAG can be used for much more than boundary scans. For example, it’s possible to issue special commands that are loaded into a special JTAG command register (not shown in Figure 5-9) by means of the JTAG port’s data-in pin. One such command instructs the FPGA to connect its internal SRAM configuration shift register to the JTAG scan chain. In this case, the JTAG port can be used to program the FPGA.

Thus, today’s FPGAs now support five different programming modes and, therefore, require the use of three mode pins, as shown in Table 5-2 (additional modes may be added in the future).

Table 5-2. Today’s five configuration modes

1850: England. Francis Galton invents the Teletype printer.

Note that the JTAG port is always available, so the device can initially be configured via the traditional configuration port using one of the standard configuration modes, and it can subsequently be reconfigured using the JTAG port as required. Alternatively, the device can be configured using only the JTAG port.

Using an embedded processor

But wait, there’s more!
In chapter 4, we discussed the fact that some FPGAs sport embedded processor cores, and each of these cores will have its own dedicated JTAG boundary scan chain. Consider an FPGA containing just one embedded processor (Figure 5-10).

[Figure 5-10. Embedded processor boundary scan chain. Labels: JTAG data in, JTAG data out, FPGA, primary scan chain, internal (core) scan chain, core.]

The FPGA itself would only have a single external JTAG port. If required, a JTAG command can be loaded via this port that instructs the device to link the processor’s local JTAG chain into the device’s main JTAG chain. (Depending on the vendor, the two chains could be linked by default, in which case a complementary command could be used to disengage the internal chain.)

The idea here is that the JTAG port can be used to initialize the internal microprocessor core (and associated peripherals) to the extent that the main body of the configuration process can then be handed over to the core. In some cases, the core might be used to query the environment in which the FPGA resides and to then select the configuration data to be loaded into the FPGA accordingly.

Chapter 6

Who Are All the Players?

Introduction

As was noted in chapter 1, this tome does not focus on particular FPGA vendors or specific FPGA devices because new features and chip types are constantly becoming available. Insofar as is possible, the book also tries not to mention individual EDA vendors or reference their tools by name because this arena is so volatile that tool names and feature sets can change from one day to the next. Having said this, this chapter offers pointers to some of the key FPGA and EDA vendors associated with FPGAs or related areas.

FPGA and FPAA vendors

The bulk of this book focuses on digital FPGAs. It is interesting to note, however, that field-programmable analog arrays (FPAAs) are also available.
Furthermore, as opposed to supplying FPGA devices, some companies specialize in providing FPGA IP cores to be employed as part of standard cell ASIC or structured ASIC designs.

Company | Web site | Comment
Actel Corp. | www.actel.com | FPGAs
Altera Corp. | www.altera.com | FPGAs
Anadigm Inc. | www.anadigm.com | FPAAs
Atmel Corp. | www.atmel.com | FPGAs
Lattice Semiconductor Corp. | www.latticesemi.com | FPGAs
Leopard Logic Inc. | www.leopardlogic.com | Embedded FPGA cores
QuickLogic Corp. | www.quicklogic.com | FPGAs
Xilinx Inc. | www.xilinx.com | FPGAs

FPGA is pronounced by spelling it out as “F-P-G-A.” FPAA is pronounced by spelling it out as “F-P-A-A.”

FPNA vendors

This is a bit of a tricky category, not the least because the name field programmable nodal arrays (FPNAs) was invented just a few seconds ago by the author as he penned these words (he’s just that sort of a fellow).

FPNA is pronounced by spelling it out as “F-P-N-A.” (These are not to be confused with field programmable neural arrays, which share the FPNA acronym.) FFT is pronounced by spelling it out as “F-F-T.”

The idea here is that each of these devices features a mega-coarse-grained architecture comprising an array of nodes, where each node is a complex processing element ranging from an ALU-type operation, to an algorithmic function such as an FFT, all the way up to a complete general-purpose microprocessor core.

These devices aren’t FPGAs in the classic sense. Yet, the definition of what is and what isn’t an FPGA is a bit fluffy around the edges on a good day, to the extent that it would be fair to say that modern FPGAs with embedded RAMs, embedded processors, and gigabit transceivers aren’t FPGAs in the “classic sense.” In the case of FPNAs, these devices are both digital and field programmable, so they deserve at least some mention here.
At the time of this writing, 30 to 50 companies are seriously experimenting with different flavors of FPNAs; a representative sample of the more interesting ones is as follows (see also Chapter 23):

Company | Web site | Comment
Elixent Ltd. | www.elixent.com | ALU-based nodes
IPflex Inc. | www.ipflex.com | Operation-based nodes
Motorola | www.motorola.com | Processor-based nodes
PACT XPP Technologies AG | www.pactxpp.com | ALU-based nodes
picoChip Designs Ltd. | www.picochip.com | Processor-based nodes
QuickSilver Technology Inc. | www.qstech.com | Algorithmic element nodes

Full-line EDA vendors

EDA is pronounced by spelling it out as “E-D-A.” OEM, which stands for “original equipment manufacturer,” is pronounced by spelling it out as “O-E-M.”

Each FPGA, FPAA, and FPNA vendor supplies a selection of design tools associated with its particular devices. In the case of FPGAs, these tools invariably include the place-and-route engines. The FPGA vendor may also OEM tools (often “lite” versions) from external EDA companies. (In this context, OEM means that the FPGA vendors license this software from a third party and then package it and provide it as part of their own environments.)

First of all, we have the big boys: the full-line EDA vendors who can supply complete solutions from soup to nuts (in certain cases, these solutions may include OEM’d point tools from the specialist EDA vendors discussed in the next section).

Company | Web site | Comment
Altium Ltd. | www.altium.com | System-on-FPGA hardware-software design environment
Cadence Design Systems | www.cadence.com | FPGA design entry and simulation (OEM synthesis)
Mentor Graphics Corp. | www.mentor.com | FPGA design entry, simulation, and synthesis
Synopsys Inc. | www.synopsys.com | FPGA design entry, simulation, and synthesis

Nothing is simple in this life.
For example, it may seem strange to group a relatively small company like Altium with comparative giants like the “big three.” In the context of FPGAs, however, Altium supplies a complete hardware and software codesign environment for system-on-FPGA development. This includes design entry, simulation, synthesis, compilation/assembly, and comprehensive debugging facilities, along with an associated multi-FPGA vendor-capable development board.

FPGA-specialist and independent EDA vendors

As opposed to purchasing an existing solution, some design teams prefer to create their own customized environments using point tools from a number of EDA vendors. In many cases, these tools are cheaper than their counterparts from the full-line vendors, but they may also be less sophisticated and less powerful. At the same time, smaller vendors sometimes come out with incredibly cool and compelling offerings, and they may be more accessible and responsive to their customers. (“You pay your money and you make your choice,” as the old saying goes.)

1850: The paper bag is invented.

RTOS, which stands for “real-time operating system,” is pronounced by spelling out the “R,” followed by “toss” to rhyme with “boss.”

Company | Web site | Comment
0-In Design Automation | www.0-In.com | Assertion-based verification
AccelChip Inc. | www.accelchip.com | FPGA-based DSP design
Aldec Inc. | www.aldec.com | Mixed-language simulation
Celoxica Ltd. | www.celoxica.com | FPGA-based system design and synthesis
Elanix Inc. | www.elanix.com | DSP design and algorithmic verification
Fintronic USA Inc. | www.fintronic.com | Simulation
First Silicon Solutions Inc. | www.fs2.com | On-chip instrumentation and debugging for FPGA logic and embedded processors
Green Hills Software Inc. | www.ghs.com | RTOS and embedded software specialists
Hier Design Inc. | www.hierdesign.com | FPGA-based silicon virtual prototyping (SVP)
Novas Software Inc. | www.novas.com | Verification results analysis
Simucad Inc. | www.simucad.com | Simulation
Synplicity Inc. | www.synplicity.com | FPGA-based synthesis
The MathWorks Inc. | www.mathworks.com | System design and algorithmic verification
TransEDA PLC | www.transeda.com | Verification IP
Verisity Design Inc. | www.verisity.com | Verification languages and environments
Wind River Systems Inc. | www.windriver.com | RTOS and embedded software specialists

FPGA design consultants with special tools

There are a lot of small design houses specializing in FPGA designs. Some of these boast rather cunning internally developed design tools that are well worth taking a look at.

Company | Web site | Comment
Dillon Engineering Inc. | www.dilloneng.com | ParaCore Architect
Launchbird Inc. | www.launchbird.com | Confluence system design language and compiler

Open-source, free, and low-cost design tools

Last but not least, let’s assume that you wish to establish a small FPGA design team or to set yourself up as a small FPGA design consultant, but you are a little short of funds just at the moment (trust me, I can relate to this). In this case, it is possible to use a variety of open-source, free, and low-cost technologies to get a new FPGA design house off the ground without paying more than a few groats for design tools.

You could probably get through the rest of your day without hearing this, but on the off chance you are interested, a groat was an English silver coin (worth four old pennies) that was used between the fourteenth and seventeenth centuries.

Company | Web site | Comment
Altera Corp. | www.altera.com | Synthesis and place-and-route
Gentoo | www.gentoo.com | Linux development platform
Icarus | http://icarus.com/eda/verilog | Verilog simulator
Xilinx Inc. | www.xilinx.com | Synthesis and place-and-route
-- | www.cs.man.ac.uk/apt/tools/gtkwave/ | GTKWave waveform viewer
-- | www.opencores.org | Open-source hardware cores and EDA tools
-- | www.opencollector.org | Database of open-source hardware cores and EDA tools
-- | www.python.org | Python programming language (for custom tooling and DSP programming)
-- | www.veripool.com/dinotrace | Dinotrace waveform viewer
-- | www.veripool.com/verilator.html | Verilator (Verilog to cycle-accurate C translator)

With regard to using Linux as the development platform, the two main FPGA vendors (Xilinx and Altera) are now porting their tools to Linux. Xilinx and Altera also offer free versions of their ISE and Quartus-II FPGA design environments, respectively (and even the full-up versions of these environments are within the budgets of most startups).

1853: Scotland/Ireland. Sir Charles Tilston Bright lays the first deepwater cable between Scotland and Ireland.

Chapter 7

FPGA Versus ASIC Design Styles

Introduction

My mother is incredibly proud of the fact that “I R an electronics engineer.” This comes equipped with an absolute and unshakable faith that I can understand (and fix) any piece of electronic equipment (from any era) on the planet. In reality, of course, the truth is far less glamorous because very few among us are experts at everything.¹

In a similar vein, some design engineers have spent the best years of their young lives developing a seemingly endless series of ASICs, while others have languished in their cubicles learning the arcane secrets that are the province of the FPGA maestro. The problem arises when an engineer steeped in one of these implementation technologies is suddenly thrust into the antipodal domain.
For example, a common scenario these days is for engineers who bask in the knowledge that they know everything there is to know about ASICs to be suddenly tasked with creating a design targeted toward an FPGA implementation.

This is a tricky topic because there are so many facets to it; the best we can hope for here is to provide an overview of some of the more significant differences between ASIC and FPGA design styles.

¹ Only the other day, for example, I ran into an old Wortsel Grinder Mark 4 (with the filigreed flanges and reverberating notchet tattles). I didn’t have a clue what to do with it, so you can only imagine how foolish I felt.

Meaning a direct or diametrical opposite, the word “antipodal” comes to us from the Greek, from the plural of antipous, meaning “with the feet opposite.”

1854: Crimea. Telegraph used in the Crimean War.

Coding styles

When it comes to the language-driven design flows discussed in chapter 9, ASIC designers tend to write very portable code (in VHDL or Verilog) and to make the minimum use of instantiated (specifically named) cells. By comparison, FPGA designers are more likely to instantiate specific low-level cells. For example, FPGA users may not be happy with the way the synthesis tool generates something like a multiplexer, so they may handcraft their own version and then instantiate it from within their code. Furthermore, pure FPGA users tend to use far more technology-specific attributes with regard to their synthesis engine than do their ASIC counterparts.

Pipelining and levels of logic

What is pipelining?

One tends to hear the word pipelining quite a lot, but this term is rarely explained. Of course, engineers know what this means, but as this book is intended for a wide audience, we’ll take a few seconds to make sure that we’re all tap-dancing to the same tune.²

Let’s suppose that we’re building something like a car, and we have all of the parts lying around at hand.
Let’s further assume that the main steps in the process are as follows:

1. Attach the wheels to the chassis.
2. Attach the engine to the chassis.
3. Attach the seats to the chassis.
4. Attach the body to the chassis.
5. Paint everything.

² As a young man, my dad and his brothers used to be tap-dancers in the variety halls of England before WW II (but I bet they never expected to find this fact noted in an electronics book in the 21st Century).

Yes … I know, I know. For all of you engineers out there whom I can hear moaning and groaning (you know who you are), I’m aware that we haven’t got a steering wheel or lights, etc., but this is just an example for goodness’ sake!

Now let’s assume that we require a specialist to perform each of these tasks. One approach would be for everyone to be sitting around playing cards. The first guy (or gal, of course)³ gets up and attaches the wheels to the chassis, and then returns to the game. On his return, the second guy gets up and adds the engine, then he returns to the game. Now the third guy wanders over to attach the seats. Upon the third guy’s return, the fourth guy ambles over to connect the body, and so forth. Once the first car has been completed, they start all over again.

Obviously, this is a very inefficient scenario. If, for example, we assume that each step takes one hour, then the whole process will take five hours. Furthermore, for each of these hours, only one man is working, while the other four are hanging around amusing themselves.

It would be much more efficient to have five cars on the assembly line at any one time. In this case, as soon as the first guy has attached the wheels to the first chassis, the second guy would start to add the engine to that chassis while the first guy would begin to add the wheels to the second chassis. Once the assembly line is in full flow, everyone will be working all of the time and a new car will be created every hour.
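The difference between the two ways of working can be put into numbers with a short Python sketch. This is just an illustration under the assumptions used above (five steps of one hour each); the function names are invented for this example.

```python
def sequential_finish_times(num_cars, num_steps, hours_per_step=1):
    """Card-game approach: a car must pass through every step before the next is started."""
    return [(car + 1) * num_steps * hours_per_step for car in range(num_cars)]


def pipelined_finish_times(num_cars, num_steps, hours_per_step=1):
    """Assembly line: a new chassis enters every step-time, so cars finish one step-time apart."""
    return [(num_steps + car) * hours_per_step for car in range(num_cars)]


# Hours at which each of the first four cars is finished.
print(sequential_finish_times(4, 5))  # [5, 10, 15, 20]
print(pipelined_finish_times(4, 5))   # [5, 6, 7, 8]
```

In both cases the first car takes five hours from start to finish; what changes is how often a finished car appears once the line is full.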
Pipelining in electronic systems

The point is that we can often replicate this scenario in electronic systems. For example, let’s assume that we have a design (or a function forming part of a design) that can be implemented as a series of blocks of combinational (or combinatorial) logic (Figure 7-1).

³ Except where such interpretation is inconsistent with the context, the singular shall be deemed to include the plural, the masculine shall be deemed to include the feminine, and the spelling (and the punctuation) shall be deemed to be correct!

1855: England. James Clerk Maxwell explains Faraday’s lines of force using mathematics.

1858: America. Cunard agents in New York send first commercial telegraph message to report a collision between two steam ships.

[Figure 7-1. A function implemented using only combinatorial logic. Data in feeds a series of combinatorial logic blocks, etc.]

Let’s say that each block takes Y nanoseconds to perform its task and that we have five such blocks (of which only three are shown in Figure 7-1, of course). In this case, it will take 5 × Y nanoseconds for a word of data to propagate through the function, starting with its arrival at the inputs to the first block and ending with its departure from the outputs of the last block.

Generally speaking, we wouldn’t want to present a new word of data to the inputs until we have stored the output results associated with the first word of data.⁴ This means that we end up with the same result as our inefficient car assembly scenario in that it takes a long time to process each word of data, and the majority of the workers (logic blocks) are sitting around twiddling their metaphorical thumbs for most of the time.

The answer is to use a pipelined design technique in which “islands” of combinational logic are sandwiched between blocks of registers (Figure 7-2). All of the register banks are driven by a common clock signal.
On each active clock edge, the registers feeding a block of logic are loaded with the results from the previous stage. These values then propagate through that block of logic until they arrive at its outputs, at which point they are ready to be loaded into the next set of registers on the next clock.

4 There is a technique called wave-pipelining in which we might have multiple “waves” of data passing through the logic at the same time. However, this is beyond the scope of this book (and it would not be applicable to an FPGA implementation in any case).

Figure 7-2. Pipelining the design. (Data In passes through alternating banks of Registers and blocks of Combinatorial Logic, etc.; the register banks share a common Clock.)

In this case, as soon as “the pump has been primed” and the pipeline is fully loaded, a new word of data can be processed every Y nanoseconds.

Levels of logic

All of this boils down to the design engineer’s having to perform a balancing act. Partitioning the combinational logic into smaller blocks and increasing the number of register stages will increase the performance of the design, but it will also consume more resources (and silicon real estate) on the chip and increase the latency of the design.

This is also the point where we start to run into the concept of levels of logic, which refers to the number of gates between the inputs and outputs of a logic block. For example, Figure 7-3 would be said to comprise three levels of logic because the worst-case path involves a signal having to pass through three gates before reaching the output.

Figure 7-3. Levels of logic. (Three levels of logic, formed by AND, OR, and NOR gates, sit between the previous bank of registers and the next bank of registers.)

In the context of an electronic system, the term latency refers to the time (clock cycles) it takes for a specific block of data to work its way through a function, device, or system. One way to think of latency is to return to the concept of an automobile assembly line.
In this case, the throughput of the system might be one car rolling off the end of the line every minute. However, the latency of the system might be a full eight-hour shift since it takes hundreds of steps to finish a car (where each of these steps corresponds to a logic/register stage in a pipelined design).

In the case of an ASIC, a group of gates as shown in Figure 7-3 can be placed close to each other such that their track delays are very small. This means that, depending on the design, ASIC engineers can sometimes be a little sloppy about this sort of thing (it’s not unheard of to have paths with, say, 15 or more levels of logic). By comparison, if this sort of design were implemented on an FPGA with each of the gates implemented in a separate LUT, then it would “fly like a brick” (go incredibly slowly) because the track delays on FPGAs are much more significant, relatively speaking. In reality, of course, a LUT can actually represent several levels of logic (the function shown in Figure 7-3 could be implemented in a single 4-input LUT), so the position isn’t quite as dire as it may seem at first.

Having said this, the bottom line is that in order to bring up (or maintain) performance, FPGA designs tend to be more highly pipelined than their ASIC counterparts. This is facilitated by the fact that every FPGA logic cell tends to comprise both a LUT and a register, which makes registering the output very easy.

Asynchronous design practices

Asynchronous structures

Depending on the task at hand, ASIC engineers may include asynchronous structures in their designs, where these constructs rely on the relative propagation delays of signals in order to function correctly. These techniques do not work in the FPGA world as the routing (and associated delays) can change dramatically with each new run of the place-and-route engines.
Of course, every latch is based on internal feedback—and every flip-flop is essentially formed from two latches—but this feedback is very tightly controlled by the device manufacturer.

Combinational loops

As a somewhat related topic, a combinational loop occurs when the generation of a signal depends on itself feeding back through one or more logic gates. These are a major source of critical race conditions where logic values depend on routing delays. Although the practice is frowned upon in some circles, ASIC engineers can be little rapscallions when it comes to using these structures because they can fix track routing (and therefore the associated propagation delays) very precisely. This is not the case in the FPGA domain, so all such feedback loops should include a register element.

Delay chains

Last but not least, ASIC engineers may use a series of buffer or inverter gates to create a delay chain. These delay chains may be used for a variety of purposes, such as addressing race conditions in asynchronous portions of the design. In addition to the delay from such a chain being hard to predict in the FPGA world, this type of structure increases the design’s sensitivity to operating conditions, decreases its reliability, and can be a source of problems when migrating to another architecture or implementation technology.

Clock considerations

Clock domains

ASIC designs can feature a huge number of clocks (one hears of designs with more than 300 different clock domains). In the case of an FPGA, however, there are a limited number of dedicated global clock resources in any particular device. It is highly recommended that designers budget their clock systems to stay within the dedicated clock resources (as opposed to using general-purpose inputs as user-defined clocks).

Some FPGAs allow their clock trees to be fragmented into clock segments.
If the target technology does support this feature, it should be identified and accounted for while mapping external or internal clocks.

Clock balancing

In the case of ASIC designs, special techniques must be used to balance clock delays throughout the device. By comparison, FPGAs feature device-wide, low-skew clock routing resources. This means that the design engineer does not need to balance the clocks, because the FPGA vendor has already taken care of it.

1858: Atlantic. First transatlantic telegraph cable is laid (and later failed).

1858: Queen Victoria exchanges transatlantic telegraph messages with President Buchanan in America.

Clock gating versus clock enabling

ASIC designs often use the technique of gated clocks to help reduce power dissipation, as shown in Figure 7-4a. However, these tend to give the design asynchronous characteristics and make it sensitive to glitches caused by inputs switching too closely together on the gating logic. By comparison, FPGA designers tend to use the technique of enabling clocks. Originally this was performed by means of an external multiplexer as illustrated in Figure 7-4b; today, however, almost all FPGA architectures have a dedicated clock enable pin on the register itself, as shown in Figure 7-4c.

Figure 7-4. Clock gating versus clock enabling. (Panels a, b, and c.)

PLLs and clock conditioning circuitry

FPGAs typically include PLL or DLL functions—one for each dedicated global clock (see also the discussions in Chapter 4). If these resources are used for on-chip clock generation, then the design should also include some mechanism for disabling or bypassing them so as to facilitate chip testing and debugging.

Reliable data transfer across multiclock domains

In reality, this topic applies to both ASIC and FPGA designs, the point being that the exchange of data between two independent clock domains must be performed very carefully so as to avoid losing or corrupting data.
Bad synchronization may lead to metastability issues and tricky timing analysis problems. In order to achieve reliable transfers across domains, it is recommended to employ handshaking, double flopping, or asynchronous FIFO techniques.

Register and latch considerations

Latches

ASIC engineers often make use of latches in their designs. As a general rule-of-thumb, if you are designing an FPGA, and you are tempted to use a latch, don’t!

Flip-flops with both “set” and “reset” inputs

Many ASIC libraries offer a wide range of flip-flops, including a selection that offer both set and reset inputs (both synchronous and asynchronous versions are usually available). By comparison, FPGA flip-flops can usually be configured with either a set input or a reset input. In this case, implementing both set and reset inputs requires the use of a LUT, so FPGA design engineers often try to work around this and come up with an alternative implementation.

Global resets and initial conditions

Every register in an FPGA is programmed with a default initial condition (that is, to contain a logic 0 or a logic 1). Furthermore, the FPGA typically has a global reset signal that will return all of the registers (but not the embedded RAMs) to their initial conditions. ASIC designers typically don’t implement anything equivalent to this capability.

1859: Germany. Hittorf and Pucker invent the cathode ray tube (CRT).

TDM is pronounced by spelling it out as “T-D-M.” In the context of communications, TDM refers to a method of taking multiple data streams and combining them into a single signal by dividing the streams into many segments (each having a very short duration) and multiplexing between them. By comparison, in the context of resource sharing, TDM refers to sharing a resource like a multiplier by multiplexing its inputs and letting different data paths use the resource at different times.
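The resource-sharing sense of TDM can be illustrated with a small Python sketch. It is a behavioral analogy only, not a hardware description; the class and function names are invented for this example, and the operand values are arbitrary.

```python
class SharedMultiplier:
    """A single multiplier; a multiplexer picks which operand pair it sees."""

    def __init__(self):
        self.time_slots_used = 0

    def multiply(self, x, y):
        self.time_slots_used += 1   # each use occupies the multiplier for one slot
        return x * y


def run_shared(operand_pairs):
    """Feed several operand pairs through ONE multiplier, one pair per time slot."""
    multiplier = SharedMultiplier()
    results = []
    for x, y in operand_pairs:      # the loop plays the role of the input multiplexer
        results.append(multiplier.multiply(x, y))
    return results, multiplier.time_slots_used


results, slots = run_shared([(2, 3), (4, 5)])
print(results, slots)
```

The trade-off is visible in the return value: both products come from one multiplier, but they take two time slots instead of one.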
Resource sharing (time-division multiplexing)

Resource sharing is an optimization technique that uses a single functional block (such as an adder or comparator) to implement several operations. For example, a multiplier may first be used to process two values called A and B, and then the same multiplier may be used to process two other values called C and D. (A good example of resource sharing is provided in Chapter 12.) Another name for resource sharing is time-division multiplexing (TDM).

Resources on an FPGA are more limited than on an ASIC. For this reason, FPGA designers tend to spend more effort on resource sharing than do their ASIC counterparts.

Use it or lose it!

Actually, things are a little subtler than the brief note above might suggest because there is a fundamental use-it-or-lose-it consideration with regard to FPGA hardware. This means that FPGAs only come in certain sizes, so if you can’t drop down to the next lower size, then you might as well use everything that’s available on the part you have.

For example, let’s assume that you have a design that requires two embedded hard processor cores. In addition to these processors, you might decide that by means of resource sharing, you could squeeze by with say 10 multipliers and 2 megabytes of RAM. But if the only FPGA containing two processors also comes equipped with 50 multipliers and 10 megabytes of RAM, you can’t get a refund, so you might as well make full use of the extra capabilities.

But wait, there’s more

In the case of FPGAs, getting data from LUTs/CLBs to and from special components like multipliers and MACs is usually more expensive (in terms of connectivity) than connecting with other LUTs/CLBs. Since resource sharing increases the amount of connectivity, you need to keep a watchful eye on this situation.

In addition to the big components like multipliers and MACs, you can also share things like adders.
Interestingly enough, in the carry-chain technologies (such as those fielded by Altera and Xilinx), as a first-order approximation, the cost of building an adder is pretty much equivalent to the cost of building a data bus’s worth of sharing logic. For example, implementing two adders “as is” with completely independent inputs and outputs will cost you two adders and no resource-sharing multiplexers. But if you share, you will have one adder and two multiplexers (one for each set of inputs). In FPGA terms, this will be more expensive rather than less (in ASICs, the cost of a multiplexer is far less than the cost of an adder, so you would have a different trade-off point).

In the real world, the interactions between “using it or losing it” and connectivity costs are different for each technology and each situation; that is, Altera parts are different from Xilinx parts and so on.

State machine encoding

The encoding scheme used for state machines is a good example of an area where what’s good for an ASIC design might not be well suited for an FPGA implementation. As we know, every LUT in an FPGA has a companion flip-flop. This usually means that there are a reasonable number of flip-flops sitting around waiting for something to do. In turn, this means that in many cases, a “one-hot” encoding scheme will be the best option for an FPGA-based state machine, especially if the activities in the various states are inherently independent.

The “one-hot” encoding scheme refers to the fact that each state in a state machine has its own state variable in the form of a flip-flop, and only one state variable may be active (“hot”) at any particular time.

Test methodologies

ASIC designers typically spend a lot of time working with tools that perform SCAN chain insertion and automatic test pattern generation (ATPG). They may also include logic in their designs to perform built-in self-test (BIST). A large proportion of these efforts are intended to test the device for
manufacturing defects. By comparison, FPGA designers typically don’t worry about this form of device testing because FPGAs are preverified by the vendor. Similarly, ASIC engineers typically expend a lot of effort inserting boundary scan (JTAG) into their designs and verifying them. By comparison, FPGAs already contain boundary scan capabilities in their fabric.

JTAG is pronounced “J-TAG”; that is, by spelling out the ‘J’ followed by “tag” to rhyme with “nag.”

Chapter 8

Schematic-Based Design Flows

In the days of yore

In order to set the stage, let’s begin by considering the way in which digital ICs were designed in the days of old—circa the early 1960s. This information will interest nontechnical readers, as well as newbie engineers who are familiar with current design tools and flows, but who may not know how they evolved over time. Furthermore, these discussions establish an underlying framework that will facilitate understanding the more advanced design flows introduced in subsequent chapters.

In those days, electronic circuits were crafted by hand. Circuit diagrams—also known as schematic diagrams or just schematics—were hand-drawn using pen, paper, and stencils (or the occasional tablecloth should someone happen to have a brilliant idea while in a restaurant). These diagrams showed the symbols for the logic gates and functions that were to be used to implement the design, along with the connections between them.

Each design team usually had at least one member who was really good at performing logic minimization and optimization, which ultimately boils down to replacing one group of gates with another that will perform the same task faster or using less real estate on the silicon.
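As a small illustration of what “replacing one group of gates with another that will perform the same task” means, the following Python sketch checks two versions of a logic function against each other by trying every input combination. The particular function is a made-up example, not one taken from the book.

```python
from itertools import product


def original(a, b, c):
    # Two AND gates, an inverter, and a 3-input OR gate (hypothetical example).
    return (a and b) or (a and not b) or c


def minimized(a, b, c):
    # (a AND b) OR (a AND NOT b) collapses to a, leaving a single OR gate.
    return a or c


def equivalent(f, g, n_inputs):
    """Exhaustively compare two functions over all 2**n_inputs input patterns."""
    return all(f(*bits) == g(*bits)
               for bits in product((False, True), repeat=n_inputs))


print(equivalent(original, minimized, 3))
```

An exhaustive check like this is only practical for a handful of inputs, which is roughly the scale at which minimization was done by hand.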
Checking that the design would work as planned insofar as its logical implementation—functional verification—was typically performed by a group of engineers sitting around a table working their way through the schematics saying, “Well, that looks OK.” Similarly, timing verification—checking that the design met its required input-to-output and internal path delays and that no timing constraints (such as setup and hold parameters) associated with any of the internal registers were violated—was performed using a pencil and paper (if you were really lucky, you might also have access to a mechanical or electromechanical calculator).

The wires connecting the logic gates on an integrated circuit may be referred to as wires, tracks, or interconnect, and all of these terms may be used interchangeably. In certain cases, the term metallization may also be used to refer to these tracks because they are predominantly formed by means of the IC’s metal (metallization) layers.

1865: England. James Clerk Maxwell predicts the existence of electromagnetic waves that travel in the same way as light.

Finally, a set of drawings representing the structures used to form the logic gates (or, more accurately, the transistors forming the logic gates) and the interconnections between them were drawn by hand. These drawings, which were formed from groups of simple polygons such as squares and rectangles, were subsequently used to create the photo-masks, which were themselves used to create the actual silicon chip.

The early days of EDA

Front-end tools like logic simulation

Not surprisingly, the handcrafted way of designing discussed above was time-consuming and prone to error. Something had to be done, and a number of companies and universities leapt into the fray in a variety of different directions.
In the case of functional verification, for example, the late 1960s and early 1970s saw the advent of special programs in the form of rudimentary logic simulators. In order to understand how these work, let’s assume that we have a really simple gate-level design whose schematic diagram has been hand-drawn on paper (Figure 8-1). By “gate-level” we mean that the design is represented as a collection of primitive logic gates and functions and the connections between them.

Figure 8-1. A simple schematic diagram (on paper). (SET_A and SET_B drive G1 = NAND to produce SET; DATA drives G2 = NOT to produce N_DATA; CLEAR_A and CLEAR_B drive G3 = OR to produce CLEAR; SET, N_DATA, CLOCK, and CLEAR drive G4 = DFF, whose outputs are Q and N_Q.)

1865: Atlantic cable links Valencia (Ireland) and Trinity Bay (Newfoundland).

In order to use the logic simulator, the engineers first need to create a textual representation of the circuit called a gate-level netlist. In those far-off times, the engineers would typically have been using a mainframe computer, and the netlist would have been captured as a set of punched cards called a deck (“deck of cards” … get it?). As computers (along with storage devices like hard disks) became more accessible, netlists began to be stored as text files (Figure 8-2).

    BEGIN CIRCUIT=TEST
    INPUT  SET_A, SET_B, DATA, CLOCK, CLEAR_A, CLEAR_B;
    OUTPUT Q, N_Q;
    WIRE   SET, N_DATA, CLEAR;
    GATE G1=NAND (IN1=SET_A, IN2=SET_B, OUT1=SET);
    GATE G2=NOT  (IN1=DATA, OUT1=N_DATA);
    GATE G3=OR   (IN1=CLEAR_A, IN2=CLEAR_B, OUT1=CLEAR);
    GATE G4=DFF  (IN1=SET, IN2=N_DATA, IN3=CLOCK, IN4=CLEAR,
                  OUT1=Q, OUT2=N_Q);
    END CIRCUIT=TEST;

Figure 8-2. A simple gate-level netlist (text file).

It was also possible to associate delays with each logic gate. These delays—which are omitted here in order to keep things simple—were typically referenced as integer multiples of some core simulation time unit (see also Chapter 19).

Note that the format shown in Figure 8-2 was made up purely for the purposes of this example.
This was in keeping with the times because—just to keep everyone on their toes—anyone who created a tool like a logic simulator also tended to invent his or her own proprietary netlist language.

All of the early logic simulators had internal representations of primitive gates like AND, NAND, OR, NOR, etc. These were referred to as simulation primitives. Some simulators also had internal representations of more sophisticated functions like D-type flip-flops. In this case, the G4=DFF function in Figure 8-2 would map directly onto this internal representation. Alternatively, one could create a subcircuit called DFF, whose functionality was captured as a netlist of primitive AND, NAND, etc. gates. In this case, the G4=DFF function in Figure 8-2 would actually be seen by the simulator as a call to instantiate a copy of this subcircuit.

Next, the user would create a set of test vectors—also known as stimulus—which were patterns of logic 0 and logic 1 values to be applied to the circuit’s inputs. Once again, these test vectors were textual in nature, and they were typically presented in a tabular form looking something like that shown in Figure 8-3 (anything after a “;” character is considered to be a comment).

Instead of calling our test vectors “stimulus,” we really should have said “stimuli,” but we were engineers, not English majors!

                    C C
                    L L
            S S   C E E
            E E D L A A
            T T A O R R
            _ _ T C _ _
     TIME   A B A K A B
    -----   -----------
        0   1 1 1 0 0 0   ; Set up initial values
      500   1 1 1 1 0 0   ; Rising edge on clock (load 0)
     1000   1 1 1 0 0 0   ; Falling edge on clock
     1500   1 1 0 0 0 0   ; Set data to 0 (N_data = 1)
     2000   1 1 0 1 0 0   ; Rising edge on clock (load 1)
     2500   1 1 0 1 0 1   ; Clear_B goes active (load 0)
        :
     etc.

Figure 8-3. A simple set of test vectors (text file).

The times at which the stimulus values were to be applied were shown in the left-hand column.
The names of the input signals are presented vertically to save space.

As we know from Figures 8-1 and 8-2, there is an inverting (NOT) gate between the DATA input and the D-type flip-flop. Thus, when the DATA input is presented with 1 at time zero, this value will be inverted to a 0, which is the value that will be loaded into the register when the clock undergoes a rising (0-to-1) edge at time 500. Similarly, when the DATA input is presented with 0 at time 1,500, this value will be inverted to a 1, which is the value that will be loaded into the register when the clock undergoes its next rising (0-to-1) transition at time 2,000.

In today’s terminology, the file of test vectors shown in Figure 8-3 would be considered a rudimentary testbench. Once again, time values were typically specified as integer multiples of some core simulation time unit.

The engineer would then invoke the logic simulator, which would read in the gate-level netlist and construct a virtual representation of the circuit in the computer’s memory. The simulator would then read in the first test vector (the first line from the stimulus file), apply those values to the appropriate virtual inputs, and propagate their effects through the circuit. This would be repeated for each of the subsequent test vectors forming the testbench (Figure 8-4).

The simulator would also use one or more control files (or online commands) to tell it which internal nodes (wires) and output pins to monitor, how long to simulate for, and so forth. The results, along with the original stimulus, would be stored in tabular form in a textual output file.

Let’s assume that we’ve just travelled back in time and run one of the old simulators using the circuit represented in Figures 8-1 and 8-2 along with the stimulus shown in Figure 8-3.
We will also assume that the NOT gate has a delay of five simulator time units associated with it, which means that a change on that gate’s input will take five time units to propagate through the gate and appear on its output. Similarly, we’ll assume that both the NAND and OR gates have associated delays of 10 time units, while the D-type flip-flop has associated delays of 20 time units.

1866: Ireland/USA. First permanent transatlantic telegraph cable is laid.

1869: William Stanley Jevons invents the Logic Piano.

Figure 8-4. Running the logic simulator. (The logic simulator reads the textual gate-level netlist of Figure 8-2 and the textual (tabular) stimulus of Figure 8-3, and writes a textual (tabular) results file containing both stimulus and response.)
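To make the effect of these delays concrete, here is a minimal event-driven sketch in Python of the four-gate circuit from Figure 8-2, driven by the vectors from Figure 8-3. It is not how any particular historical simulator worked. The flip-flop behavior (active-high asynchronous clear and set, loading N_DATA on a rising clock edge) is an assumption inferred from the example, and the handling of unknown X values is deliberately simplified.

```python
import heapq
import itertools

X = "X"  # unknown logic value

INPUTS = ["SET_A", "SET_B", "DATA", "CLOCK", "CLEAR_A", "CLEAR_B"]
NODES = ["SET", "N_DATA", "CLEAR", "Q", "N_Q"]

# Stimulus from Figure 8-3: (time, input values in INPUTS order).
VECTORS = [(0, "111000"), (500, "111100"), (1000, "111000"),
           (1500, "110000"), (2000, "110100"), (2500, "110101")]


def gate(fn, *ins):
    """Combinational gate; any unknown input gives an unknown output (simplified)."""
    return X if X in ins else fn(*ins)


def simulate(vectors):
    values = {name: X for name in INPUTS + NODES}
    events = []                      # heap of (time, sequence, signal, new value)
    counter = itertools.count()      # tie-breaker so heap entries stay comparable

    def schedule(time, signal, value):
        heapq.heappush(events, (time, next(counter), signal, value))

    for time, bits in vectors:
        for name, bit in zip(INPUTS, bits):
            schedule(time, name, int(bit))

    rows = []
    while events:
        now = events[0][0]
        old = {}                     # previous values of signals changing now
        while events and events[0][0] == now:
            _, _, signal, value = heapq.heappop(events)
            if values[signal] != value:
                old.setdefault(signal, values[signal])
                values[signal] = value
        if not old:
            continue                 # nothing changed, so nothing to report
        v = values
        if {"SET_A", "SET_B"} & old.keys():              # G1 = NAND, delay 10
            schedule(now + 10, "SET",
                     gate(lambda a, b: 1 - (a & b), v["SET_A"], v["SET_B"]))
        if "DATA" in old:                                # G2 = NOT, delay 5
            schedule(now + 5, "N_DATA", gate(lambda a: 1 - a, v["DATA"]))
        if {"CLEAR_A", "CLEAR_B"} & old.keys():          # G3 = OR, delay 10
            schedule(now + 10, "CLEAR",
                     gate(lambda a, b: a | b, v["CLEAR_A"], v["CLEAR_B"]))
        if {"SET", "N_DATA", "CLOCK", "CLEAR"} & old.keys():  # G4 = DFF, delay 20
            rising = old.get("CLOCK") == 0 and v["CLOCK"] == 1
            if v["CLEAR"] == 1:      # assumed: asynchronous, active-high clear
                q = 0
            elif v["SET"] == 1:      # assumed: asynchronous, active-high set
                q = 1
            elif rising:
                q = v["N_DATA"]
            else:
                q = None             # hold the stored value
            if q is not None:
                schedule(now + 20, "Q", q)
                schedule(now + 20, "N_Q", X if q == X else 1 - q)
        rows.append((now,
                     "".join(str(v[n]) for n in INPUTS),
                     "".join(str(v[n]) for n in NODES)))
    return rows


for time, inputs, nodes in simulate(VECTORS):
    print(f"{time:5d}  {inputs}  {nodes}")
```

Each printed row gives the time, the six input values, and the five monitored nodes (SET, N_DATA, CLEAR, Q, N_Q). A row is produced only at times when at least one signal changes.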
In this case, if the simulator were instructed to monitor all of the internal nodes and output pins, the output file containing the simulation results would look something like that shown in Figure 8-5. For the purposes of our discussions, any changes to a signal’s value are shown in bold font in this illustration, but this was not the case in the real world.

                    C C
                    L L     N
            S S   C E E     _ C
            E E D L A A     D L
            T T A O R R   S A E   N
            _ _ T C _ _   E T A   _
     TIME   A B A K A B   T A R Q Q
    -----   -----------   ---------
        0   1 1 1 0 0 0   X X X X X   ; Set up initial values
        5   1 1 1 0 0 0   X 0 X X X
       10   1 1 1 0 0 0   0 0 0 X X

      500   1 1 1 1 0 0   0 0 0 X X   ; Rising edge on clock
      520   1 1 1 1 0 0   0 0 0 0 1

     1000   1 1 1 0 0 0   0 0 0 0 1   ; Falling edge on clock

     1500   1 1 0 0 0 0   0 0 0 0 1   ; Set data to 0
     1505   1 1 0 0 0 0   0 1 0 0 1

     2000   1 1 0 1 0 0   0 1 0 0 1   ; Rising edge on clock
     2020   1 1 0 1 0 0   0 1 0 1 0

     2500   1 1 0 1 0 1   0 1 0 1 0   ; Clear_B goes active
     2510   1 1 0 1 0 1   0 1 1 1 0
     2530   1 1 0 1 0 1   0 1 1 0 1
        :
     etc.

Figure 8-5. Output results (text file).

In this example, the initial values are applied to the input pins at time 0. At this time, all of the internal nodes and output pins show X values, which indicates unknown states. After five time units, the initial logic 1 that was applied to the DATA input propagates through the inverting NOT gate and appears as a logic 0 on the internal N_DATA node. Similarly, at time 10, the initial values that were applied to the SET_A and SET_B inputs propagate through the NAND gate to the internal SET node, while the values on the CLEAR_A and CLEAR_B inputs propagate through the OR gate to the internal CLEAR node.

At time 500, a rising (0-to-1) edge on the CLOCK input causes the D-type flip-flop to load the value from the N_DATA node. The result appears on the Q and N_Q output pins 20 time units later. And so it goes.

Blank lines in the output file, such as the one shown between time 10 and time 500, were used to separate related groups of actions.
For example, setting the initial values at time 0 caused signal changes at times 5 and 10. Then the transition on the CLOCK input at time 500 caused signal changes at time 520. As these two groups of actions were totally independent of each other, they were separated by a blank line.

It wasn’t long before engineers were working with circuits that could contain thousands of gates and internal nodes along with simulation runs that could encompass thousands of time steps. (Oh, the hours I spent poring over files like this (a) trying to see if a circuit was working as expected, and (b) desperately attempting to track down the problem if it wasn’t!)

1872: First simultaneous transmission from both ends of a telegraph wire.

Back-end tools like layout

As opposed to tools like logic simulators that were intended to aid the engineers who were defining the function of ICs (and circuit boards), some companies focused on creating tools that would help in the process of laying the ICs out. In this context, “layout” refers to determining where to place the logic gates (actually, the transistors forming the logic gates) on the surface of the chip and how to route the wires between them.

In the early 1970s, companies like Calma, ComputerVision, and Applicon created special computer programs that helped personnel in the drafting department capture digital representations of hand-drawn designs. In this case, a design was placed on a large-scale digitizing table, and then a mouse-like tool was used to digitize the boundaries of the shapes (polygons) used to define the transistors and the interconnect. These digital files were subsequently used to create the photo-masks, which were themselves used to create the actual silicon chip.

The drafting department is referred to as the “drawing office” in the UK.
Over time, these early computer-aided drafting tools evolved into interactive programs called polygon editors that allowed users to draw the polygons directly onto the computer screen. Descendants of these tools eventually gained the ability to accept the same netlist used to drive the logic simulator and to perform the layout (place-and-route) tasks automatically.

CAE + CAD = EDA

Tools like logic simulators that were used in the front-end (logical design capture and functional verification) portion of the design flow were originally gathered together under the umbrella name of computer-aided engineering (CAE). By comparison, tools like layout (place-and-route) that were used in the back-end (physical) portion of the design flow were originally gathered together under the name of computer-aided design (CAD).

CAE is pronounced by spelling it out as “C-A-E.” CAD is pronounced to rhyme with “bad.”

For historical reasons that are largely based on the origins of the terms CAE and CAD, the term design engineer—or simply engineer—typically refers to someone who works in the front-end of the design flow; that is, someone who performs tasks like conceiving and describing (capturing) the functionality of an IC (what it does and how it does it). By comparison, the term layout designer—or simply designer—commonly refers to someone who is ensconced in the back-end of the design flow; that is, someone who performs tasks such as laying out an IC (determining the locations of the gates and the routes of the tracks connecting them together).

Sometime during the 1980s, all of the CAE and CAD tools used to design electronic components and systems were gathered under the name of electronic design automation, or EDA, and everyone was happy (apart from the ones who weren’t, but no one listened to their moaning and groaning, so that was alright).
The term CAD is also used to refer to computer-aided design tools used in a variety of other engineering disciplines, such as mechanical and architectural design.

EDA is pronounced by spelling it out as “E-D-A.”

A simple (early) schematic-driven ASIC flow

Toward the end of the 1970s and the beginning of the 1980s, companies like Daisy, Mentor, and Valid started providing graphical schematic capture programs that allowed engineers to create circuit (schematic) diagrams interactively. Using the mouse, an engineer could select symbols representing such entities as I/O pins and logic gates and functions from a special symbol library and place them on the screen. The engineer could then use the mouse to draw lines (wires) on the screen connecting the symbols together.

Once the circuit had been entered, the schematic capture package could be instructed to generate a corresponding gate-level netlist. This netlist could first be used to drive a logic simulator in order to verify the functionality of the design. The same netlist could then be used to drive the place-and-route software (Figure 8-6).

1873: England. James Clerk Maxwell describes the electromagnetic nature of light and publishes his theory of radio waves.

Figure 8-6. Simple (early) schematic-driven ASIC flow. (Schematic capture produces the gate-level netlist; the netlist drives the logic simulator for functional verification, and then place-and-route followed by extraction and timing analysis; each of these stages feeds back to detect and fix problems.)
Any timing information that was initially used by the logic simulator would be estimated (particularly in the case of the tracks), and accurate timing analysis was only possible once all of the logic gates had been placed and the tracks connecting them had been routed. Thus, following place-and-route, an extraction program would be used to calculate the parasitic resistance and capacitance values associated with the structures (track segments, vias, transistors, etc.) forming the circuit. A timing analysis program would then use these values to generate a timing report for the device. In some flows, this timing information was also fed back to the logic simulator in order to perform a more accurate simulation.

The important thing to note here is that, when creating the original schematic, the user would access the symbols for the logic gates and functions from a special library that was associated with the targeted ASIC technology.[1] Similarly, the simulator would be instructed to use a corresponding library of simulation models with the appropriate logical functionality[2] and timing for the targeted ASIC technology.

The end result was that the gate-level netlist presented to the place-and-route software directly mapped onto the logic gates and functions being physically implemented on the silicon chip (this is a tad different from the FPGA flow, as is discussed in the following topic).

[1] There are always different ways to do things. For example, some flows were based on the concept of using a generic symbol library containing a subset of logic functions common to all ASIC cell libraries. The netlist generated from the schematic capture application could then be run through a translator that converted the generic cell names to their equivalents in the targeted ASIC library.

[2] With regard to functionality, one might expect a primitive logical entity like a 2-input AND gate to function identically across multiple libraries. This is certainly the case when “good” (logic 0 and 1) values are applied to the inputs, but things may vary when high-impedance ‘Z’ values or unknown ‘X’ values are applied to the inputs. And even with good 0 and 1 values applied to their inputs, more complex functions like D-type latches and flip-flops can behave very differently for “unusual” cases such as the set and clear inputs being driven active at the same time.

A simple (early) schematic-driven FPGA flow

When the first FPGAs arrived on the scene in 1984, it was natural that their design flows would be based on existing schematic-driven ASIC flows. Indeed, the early portions of the flows were very similar in that, once again, a schematic capture package was used to represent the circuit as a collection of primitive logic gates and functions and to generate a corresponding gate-level netlist. As before, this netlist was subsequently used to drive the logic simulator in order to perform the functional verification.

The differences began with the implementation portion of the flow because the FPGA fabric consisted of an array of configurable logic blocks (CLBs), each of which was formed from a number of LUTs and registers. This required the introduction of some additional steps called mapping and packing into the flow (Figure 8-7).

[Timeline, 1874: America. Alexander Graham Bell conceives the idea of the telephone.]

[Timeline, 1875: America. Edison invents the Mimeograph.]
[Figure 8-7 shows schematic capture producing a gate-level netlist (the same TEST circuit used in Figure 8-6), which passes through mapping, packing, and place-and-route to yield a fully routed physical (CLB-level) netlist. Timing analysis then produces a timing report, and the tools also generate a gate-level netlist and SDF timing information for simulation.]

Figure 8-7. Simple (early) schematic-driven FPGA flow.

Mapping

In this context, mapping refers to the process of associating entities such as the gate-level functions in the gate-level netlist with the LUT-level functions available on the FPGA. Of course, this isn’t a one-for-one mapping because each LUT can be used to represent a number of logic gates (Figure 8-8).

[Figure 8-8 shows a portion of a gate-level netlist next to the contents of a 3-input LUT. In the netlist, an XOR gate combines a and b to produce d, a NOT gate inverts c to produce e, and an XNOR gate combines d and e to produce y. The LUT holds the corresponding truth table:]

    a b c | y
    0 0 0 | 0
    0 0 1 | 1
    0 1 0 | 1
    0 1 1 | 0
    1 0 0 | 1
    1 0 1 | 0
    1 1 0 | 0
    1 1 1 | 1

Figure 8-8. Mapping logic gates into LUTs.

Mapping (which is still performed today, but elsewhere in the flow, as will be discussed in later chapters) is a nontrivial problem because there are a large number of ways in which the logic gates forming a netlist can be partitioned into the smaller groups to be mapped into LUTs. As a simple example, the functionality of the NOT gate shown in Figure 8-8 might have been omitted from this LUT and instead incorporated into the upstream LUT driving wire c.

Packing

Following the mapping phase, the next step was packing, in which the LUTs and registers were packed into the CLBs. Once again, packing (which is still performed today, but elsewhere in the flow, as will be discussed in later chapters) is a nontrivial problem because there are myriad potential combinations and permutations.
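Returning for a moment to the mapping example of Figure 8-8, the following Python sketch shows how three gates collapse into the eight stored bits of one 3-input LUT. It is an illustration only, not how any vendor's mapper works. The names gates, lut, and lut_lookup are made up for the example, and treating input a as the most significant address bit is an assumption chosen to match the row order of the truth table.

```python
from itertools import product


def gates(a, b, c):
    """Gate-level view: d = a XOR b, e = NOT c, y = d XNOR e."""
    d = a ^ b
    e = 1 - c
    return 1 - (d ^ e)


# "Program" the LUT: one stored bit per input combination,
# in the order abc = 000, 001, ..., 111.
lut = [gates(a, b, c) for a, b, c in product((0, 1), repeat=3)]


def lut_lookup(a, b, c):
    """LUT view: the inputs simply form an address into the stored bits."""
    return lut[(a << 2) | (b << 1) | c]


print(lut)  # prints: [0, 1, 1, 0, 1, 0, 0, 1]
```

Once the eight bits are stored, the LUT no longer "knows" that it replaced an XOR, a NOT, and an XNOR; any other 3-input function would occupy the same resource with different contents.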
For example, assume an incredibly simple design comprising only a couple of handfuls of logic gates that end up being mapped onto four 3-input LUTs that we’ll call A, B, C, and D. Now assume that we’re dealing with an FPGA whose CLBs can each contain two 3-input LUTs. In this case we’ll need two CLBs (called 1 and 2) to contain our four LUTs. As a first pass, there are 4! (factorial four = 4 × 3 × 2 × 1 = 24) different ways in which our LUTs can be packed into the two CLBs (Figure 8-9).

[Figure 8-9 lists different permutations of the four LUTs across the two CLBs and marks examples of functionally equivalent pairings:]

    CLB 1:  AB  AB  AC  AC  AD  AD  BA  BA  BC  BC  BD  BD  etc.
    CLB 2:  CD  DC  BD  DB  BC  CB  CD  DC  AD  DA  AC  CA  etc.

Figure 8-9. Packing LUTs into CLBs.

Only 12 of the 24 possible permutations are shown here (the remainder are left as an exercise for the reader). Furthermore, in reality there are actually only 12 permutations of significance because each has a “mirror image” that is functionally its equivalent, such as the AC-BD and BD-AC pairs shown in Figure 8-9. The reason for this is that when we come to place-and-route, the relative locations of the two CLBs can be exchanged.

[Timeline, 1875: England. James Clerk Maxwell states that atoms must have a structure.]

[Margin note: Prior to the advent of FPGAs, the equivalent functionality to place-and-route in “CPLD land” was performed by an application known as a “fitter.” When FPGAs first arrived on the scene, people used the same “fitter” appellation, but over time they migrated to using the term “place-and-route” because this more accurately reflected what was actually occurring.]

[Margin note: As opposed to using a symbol library of primitive logic gates and registers, an interesting alternative circa the early 1990s was to use a symbol library corresponding to slightly more complex logical functions (say around 70 functions). The output from the schematic was a netlist of functional blocks that were already de facto mapped onto LUTs and packed into CLBs. This had the advantage of giving a better idea of the number of levels of logic between register stages, but it limited such activities as optimization and swapping.]

Place-and-route

Following packing, we move to place-and-route. With regard to the previous point, let’s assume that our two CLBs need to be connected together, but that (purely for the purposes of this portion of our discussions) they can only be placed horizontally or vertically adjacent to each other, in which case there are four possibilities (Figure 8-10).

Figure 8-10. Placing the CLBs.

In the case of placement (i) for example, if CLB 1 contained LUTs A-C and CLB 2 contained LUTs B-D, then this would be identical to swapping the positions of the two CLBs and exchanging their contents.

If we only had the two CLBs shown in Figure 8-10, it would be easy to determine their optimal placement with respect to each other (which would have to be one of the four options shown above) and the absolute placement of this two-CLB group with respect to the entire chip.

The placement problem is much more complex in the real world because a real design can contain extremely large numbers of CLBs (hundreds or thousands in the early days, and hundreds of thousands by 2004). In addition to CLBs 1 and 2 being connected together, they will almost certainly need to be connected to other CLBs. For example, CLB 1 may also need to be connected to CLBs 3, 5, and 8, while CLB 2 may need to be connected to CLBs 4, 6, 7, and 8. And each of these new CLBs may need to be connected to each other or to yet more CLBs.
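To put some numbers on this connectivity example, the toy Python sketch below scores a placement by summing the Manhattan distances between connected CLBs and then tries every arrangement of the eight CLBs on a small grid. The 4-by-2 grid, the Manhattan cost, and the names NETS, SLOTS, wirelength, and best_placement are all assumptions made for illustration. Real place-and-route tools use far richer cost models (timing, congestion, routing resources) and heuristic search rather than exhaustive enumeration.

```python
from itertools import permutations

# Connections from the running example: CLB 1 talks to 2, 3, 5 and 8;
# CLB 2 talks to 4, 6, 7 and 8.
NETS = [(1, 2), (1, 3), (1, 5), (1, 8), (2, 4), (2, 6), (2, 7), (2, 8)]
CLBS = [1, 2, 3, 4, 5, 6, 7, 8]
SLOTS = [(x, y) for y in range(2) for x in range(4)]  # hypothetical 4 x 2 grid


def wirelength(placement):
    """Sum of Manhattan distances between connected CLBs."""
    total = 0
    for a, b in NETS:
        (xa, ya), (xb, yb) = placement[a], placement[b]
        total += abs(xa - xb) + abs(ya - yb)
    return total


def best_placement():
    """Try all 8! = 40,320 assignments of CLBs to slots and keep the cheapest."""
    best_cost, best = None, None
    for slots in permutations(SLOTS):
        placement = dict(zip(CLBS, slots))
        cost = wirelength(placement)
        if best_cost is None or cost < best_cost:
            best_cost, best = cost, placement
    return best_cost, best


# One hand-made candidate with CLBs 1 and 2 side by side.
hand_made = {3: (0, 0), 1: (1, 0), 2: (2, 0), 4: (3, 0),
             5: (0, 1), 8: (1, 1), 6: (2, 1), 7: (3, 1)}
print("hand-made cost:", wirelength(hand_made))  # prints: hand-made cost: 11

cost, placement = best_placement()
print("best cost found:", cost)
```

Eight CLBs already mean 40,320 candidates. Because the count grows factorially, exhaustive search of this kind is out of the question for designs with thousands of CLBs.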
Thus, although placing CLBs 1 and 2 next to each other would be best for them, it might be detrimental to their relationships with the other CLBs, and the best solution overall might actually be to separate CLBs 1 and 2 by some amount.

Although placement is difficult, deciding on the optimal way to route the signals between the various CLBs poses an even more Byzantine problem. The complexity of these tasks is mind-boggling, so we’ll leave it to those guys and gals who write the place-and-route algorithms (they are the ones sporting size-16 extra-wide brains with go-faster stripes) and quickly move on to other things.

Timing analysis and post-place-and-route simulation

Following place-and-route, we have a fully routed physical (CLB-level) netlist, as was illustrated in Figure 8-7. At this point, a static timing analysis (STA) utility will be run to calculate all of the input-to-output and internal path delays and also to check for any timing violations (setup, hold, etc.) associated with any of the internal registers.

One interesting point occurs if the design engineers wish to resimulate their design with accurate (post-place-and-route) timing information. In this case, they have to use the FPGA tool suite to generate a new gate-level netlist along with associated timing information in the form of an industry-standard file format called, perhaps not surprisingly, standard delay format (SDF). The main reason for generating this new gate-level netlist is that, once the original netlist has been coerced into its CLB-level equivalent, it simply isn’t possible to relate the timings associated with this new representation back into the original gate-level incarnation.

[Margin note: STA is pronounced by spelling it out as “S-T-A” (see also Chapter 19). SDF is pronounced by spelling it out as “S-D-F” (see also Chapter 10).]

[Timeline, 1876: America. 10th March. Intelligible human speech heard over Alexander Graham Bell’s telephone for the first time.]
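The path-delay and setup checks performed by an STA utility can be sketched in a few lines of Python. This is a deliberately simplified model: every delay value and node name below is invented, each edge lumps gate and track delay into one number, the nodes are assumed to be supplied in topological order, and hold checks and clock skew are ignored. The functions arrival_times and setup_slack are hypothetical names for this example, not part of any real tool.

```python
# Hypothetical delays (ns) for a tiny register-to-register path graph.
# Each edge is (from_node, to_node, delay).
EDGES = [
    ("regA/Q", "lut1", 0.4),
    ("regB/Q", "lut1", 0.6),
    ("lut1", "lut2", 1.1),
    ("regB/Q", "lut2", 0.9),
    ("lut2", "regC/D", 0.5),
]
ORDER = ["regA/Q", "regB/Q", "lut1", "lut2", "regC/D"]  # topological order


def arrival_times(edges, order, start_time=0.0):
    """Latest arrival time at each node, with data launched at start_time."""
    arrival = {node: start_time for node in order}
    for node in order:
        for src, dst, delay in edges:
            if src == node:
                arrival[dst] = max(arrival[dst], arrival[src] + delay)
    return arrival


def setup_slack(arrival, clock_period, setup_time):
    """Positive slack means the data arrives early enough to be captured."""
    return clock_period - setup_time - arrival


times = arrival_times(EDGES, ORDER)
slack = setup_slack(times["regC/D"], clock_period=3.0, setup_time=0.3)
print(round(times["regC/D"], 2), round(slack, 2))  # prints: 2.2 0.5
```

The worst path here runs from regB/Q through lut1 and lut2 (0.6 + 1.1 + 0.5 = 2.2 ns). Shortening the clock period to 2.0 ns in this model would turn the slack negative, which is what a timing report would flag as a setup violation.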
Flat versus hierarchical schematics

Clunky flat schematics

The very first schematic packages essentially allowed a design to be captured as a humongous, flat circuit diagram split into a number of “pages.” In order to visualize this, let’s assume that you wish to draw a circuit diagram comprising 1,000 logic gates on a piece of paper. If you created a single large diagram, you would end up with a huge sheet of paper (say eight-feet square) with primary inputs to the circuit on the left, primary outputs from the circuit on the right, and the body of the circuit in the middle.

Carrying this circuit diagram around and showing it to your friends would obviously be a pain. Instead, you might want to cut it up into a number of pages and store them all together in a folder. In this case, you would make sure that your partitioning was logical such that each page contained all of the gates relating to a particular function in the design. Also, you would use interpage connectors (sort of like pseudo inputs and outputs) to link signals between the various pages.

This is the way the original schematic capture packages worked. You created a single flat schematic as a series of pages linked together by interpage connector symbols, where the names you gave these symbols told the system which ones were to be connected together. For example, consider a simple circuit sketched on a piece of paper (Figure 8-11).

Figure 8-11. Simple schematic drawn on a piece of paper.

Assume that the gates on the left represent some control logic, while the four registers on the right are implementing a 4-bit shift register. Obviously, this is a trivial example, and a real circuit would have many more logic gates. We’re just trying to tie down some underlying concepts here, such as the fact that when you entered this circuit into your schematic capture system, you might split it into two pages (Figure 8-12).

[Figure 8-12 shows the schematic capture system holding two pages: Page 1 (control logic) and Page 2 (shift register).]

Figure 8-12. Simple two-page flat schematic.

[Timeline, 1876: America. Alexander Graham Bell patents the telephone.]

Sleek hierarchical (block-based) schematics

There were a number of problems associated with the flat schematics discussed above, especially when dealing with real-world circuits requiring 50 or more pages:

■ It was difficult to visualize a high-level, top-down view of the design.
■ It was difficult to save and reuse portions of the design in future projects.
■ In the case of designs in which some portion of the circuit was repeated multiple times (which is very common), that portion would have to be redrawn or copied onto multiple pages. This became really painful if you subsequently realized that you had to make a change because you would have to make the same change to all of the copies.

[Timeline, 1877: America. First commercial telephone service goes into operation.]

The answer was to enhance schematic capture packages to support the concept of hierarchy. In the case of our shift register circuit, for example, you might start with a top-level page in which you would create two blocks called control and shift, each with the requisite number of input and output pins. You would then connect these blocks to each other and also to some primary inputs and outputs.

Next, you would instruct the system to “push down” into the control block, which would open up a new schematic page. If you were lucky, the system would automatically prepopulate this page with input and output connector symbols (and with associated names) corresponding to the pins on its parent block. You would then create the schematic corresponding to that block as usual (Figure 8-13).

[Figure 8-13 shows a top-level page containing the “Control” and “Shift” blocks, alongside the contents of the “Control” block and the contents of the “Shift” block.]

Figure 8-13. Simple hierarchical schematic.
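The "define a block once, instantiate it many times" idea behind hierarchy can be sketched as a small Python data structure. Everything here is invented for illustration: the block names, the choice of two gates for the control logic, and the flatten helper. Connectivity between pins is left out so that only the hierarchy itself is visible.

```python
# Hypothetical block library: each block lists its contents as
# (instance name, kind), where kind is a primitive gate or another block.
BLOCKS = {
    "shift_stage": [("ff", "DFF")],
    "shift": [("s0", "shift_stage"), ("s1", "shift_stage"),
              ("s2", "shift_stage"), ("s3", "shift_stage")],
    "control": [("g1", "NAND"), ("g2", "OR")],
    "top": [("control", "control"), ("shift", "shift")],
}


def flatten(block, prefix=""):
    """Expand a block into a flat list of (hierarchical name, gate type)."""
    flat = []
    for name, kind in BLOCKS[block]:
        path = prefix + name
        if kind in BLOCKS:          # another block: push down into it
            flat.extend(flatten(kind, path + "/"))
        else:                       # a primitive gate: keep it
            flat.append((path, kind))
    return flat


for path, kind in flatten("top"):
    print(path, kind)
```

Because all four stages of the shift register refer to the single shift_stage definition, editing that one entry changes every instance, which is the property that made hierarchical schematics so much easier to maintain than copied pages.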
In fact, each block could contain a further block-level schematic, or a gate-level schematic, or (very commonly) a mixture of both. These hierarchical block-based schematics answered the problems associated with flat schematics:

■ They made it easier to visualize a high-level, top-down view of the design and to work one’s way through the design.
■ They made it easier to save and reuse portions of the design in future projects.
■ In the case of designs in which some portion of the circuit was repeated multiple times, it was only necessary to create that portion (as a discrete block) once and then to instantiate (call) that block multiple times. This made things easy if you subsequently realized that you had to make a change because you would only have to modify the contents of the initial block.

Schematic-driven FPGA design flows today

All of the original schematic, mapping, packing, and place-and-route applications were typically created and owned by the FPGA companies themselves. However, the general feeling is that a company can either be good at creating EDA tools or it can be good at creating silicon chips, but not both.

Another facet of the problem is that design tools were originally extremely expensive in the ASIC world (even tools like schematic capture, which today are commonly regarded as commodity products). By comparison, the FPGA vendors were focused on selling chips, so right from the get-go they offered their tools at a very low cost (in fact, if you were a big enough customer, they’d give you the entire design tool suite for free). While this had its obvious attractions to the end user, the downside was that the FPGA vendors weren’t too keen on spending vast amounts of money enhancing tools for which they received little recompense.
Over time, therefore, external EDA vendors started to supply portions of the puzzle, starting with schematic capture and then moving into mapping and packing (via logic synthesis as discussed in Chapters 9 and 19). Having said this, the FPGA vendors still typically provide internally developed, less sophisticated (compared to the state-of-the-art) versions of tools like schematic capture as part of their basic tool suite, and they also maintain a Vulcan Death Grip on their crown jewels (the place-and-route software).

For many engineers today, driving a design using schematic capture at the gate level of abstraction is but a distant memory. In some cases, FPGA vendors offer little support for this type of flow for their latest devices to the extent that they only provide schematic libraries for older component generations. However, schematic capture does still find a role with some older engineers and also with folks who need to make minor functional changes to legacy designs. Furthermore, graphical entry mechanisms that are descended from early schematic capture packages still find a place in modern design flows, as is discussed in the next chapter.

[Timeline, 1877: America. Thomas Watson devises a “thumper” to alert users of incoming telephone calls.]

[Timeline, 1878: America. First public long-distance telephone lines between Boston and Providence become operational.]

Chapter 9

HDL-Based Design Flows

Schematic-based flows grind to a halt

Toward the end of the 1980s, as designs grew in size and complexity, schematic-based ASIC flows began to run out of steam. Visualizing, capturing, debugging, understanding, and maintaining a design at the gate level of abstraction became increasingly difficult and inefficient when juggling 5,000 or more gates and reams of schematic pages. In addition to the fact that capturing a large design at the gate level of abstraction is prone to error, it is also extremely time-consuming.
Thus, some EDA vendors started to develop design tools and flows based on the use of hardware description languages, or HDLs.

The advent of HDL-based flows

The idea behind a hardware description language is, perhaps not surprisingly, that you can use it to describe hardware. In a wider context, the term hardware is used to refer to any of the physical portions of an electronics system, including the ICs, printed circuit boards, cabinets, cables, and even the nuts and bolts holding the system together. In the context of an HDL, however, “hardware” refers only to the electronic portions (components and wires) of ICs and printed circuit boards. (The HDL may also be used to provide limited representations of the cables and connectors linking circuit boards together.)

In the early days of electronics, almost anyone who created an EDA tool created his or her own HDL to go with it. Some of these were analog HDLs in that they were intended to represent circuits in the analog domain, while others were focused on representing digital functionality. For the purposes of this book, we are interested in HDLs only in the context of designing digital ICs in the form of ASICs and FPGAs.

[Margin note: HDL is pronounced by spelling it out as “H-D-L.”]

Different levels of abstraction

Some of the more popular digital HDLs are introduced later in this chapter. For the nonce, however, let’s focus more on how a generic digital HDL is used as part of a design flow. The first thing to note is that the functionality of a digital circuit can be represented at different levels of abstraction and that different HDLs support these levels of abstraction to a greater or lesser extent (Figure 9-1).
[Figure 9-1 shows the levels of abstraction from highest to lowest: behavioral (algorithmic), covering loops and processes; functional, covering RTL and Boolean equations; and structural, covering the gate and switch levels.]

Figure 9-1. Different levels of abstraction.

[Margin note: Largely self-taught, George Boole made significant contributions in several areas of mathematics, but was immortalized for two works published in 1847 and 1854 in which he represented logical expressions in a mathematical form now known as Boolean algebra. In 1938, Claude Shannon published an article based on his master’s thesis at MIT, in which he showed how Boole’s concepts could be used to represent the functions of switches in electronic circuits.]

The lowest level of abstraction for a digital HDL would be the switch level, which refers to the ability to describe the circuit as a netlist of transistor switches. A slightly higher level of abstraction would be the gate level, which refers to the ability to describe the circuit as a netlist of primitive logic gates and functions. Thus, the early gate-level netlist formats generated by schematic capture packages as discussed in the previous chapter were in fact rudimentary HDLs.

Both switch-level and gate-level netlists may be classed as structural representations. It should be noted, however, that “structural” can have different connotations because it may also be used to refer to a hierarchical block-level netlist in which each block may have its contents specified using any of the levels of abstraction shown in Figure 9-1.

The next level of HDL sophistication is the ability to support functional representations, which covers a range of constructs. At the lower end is the capability to describe a function using Boolean equations.
For example, assuming that we had already declared a set of signals called Y, SELECT, DATA-A, and DATA-B, we could capture the functionality of a simple 2:1 multiplexer using the following Boolean equation:

    Y = (SELECT & DATA-A) | (!SELECT & DATA-B);

Note that this is a generic syntax that does not favor any particular HDL and is used only for the purposes of this example. (As we discussed in Chapter 3, the “&” character represents a logical AND, the “|” character represents an OR, and the “!” character represents a NOT.)

The functional level of abstraction also encompasses register transfer level (RTL) representations. The term RTL covers a multitude of manifestations, but the easiest way to wrap one’s brain around the underlying concept is to consider a design formed from a collection of registers linked by combinational logic. These registers are often controlled by a common clock signal, so assuming that we have already declared two signals called CLOCK and CONTROL, along with a set of registers called REGA, REGB, REGC, and REGD, then an RTL-type statement might look something like the following:

    when CLOCK rises
        if CONTROL == "1" then REGA = REGB & REGC;
        else REGA = REGB | REGD;
        end if;
    end when;

[Margin note: RTL is pronounced by spelling it out as “R-T-L.”]

In this case, symbols like when, rises, if, then, else, and the like are keywords whose semantics are defined by the owners of the HDL. Once again, this is a generic syntax that does not favor any particular HDL and is used only for the purposes of this example.

The highest level of abstraction sported by traditional HDLs is known as behavioral, which refers to the ability to describe the behavior of a circuit using abstract constructs like loops and processes.
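For readers who would like to poke at the two generic statements above, the following Python sketch mimics their behavior on single bits. Python is used here purely as executable pseudocode, not as an HDL, and the names mux2, Registers, and tick are hypothetical. The rising edge is modelled by remembering the previous clock value, which is an assumption of this sketch rather than anything defined by the generic syntax.

```python
def mux2(select, data_a, data_b):
    """Y = (SELECT & DATA-A) | (!SELECT & DATA-B), on single bits."""
    return (select & data_a) | ((1 - select) & data_b)


class Registers:
    """Four 1-bit registers; REGA is updated on a rising clock edge."""

    def __init__(self, rega=0, regb=0, regc=0, regd=0):
        self.rega, self.regb, self.regc, self.regd = rega, regb, regc, regd
        self._last_clock = 0

    def tick(self, clock, control):
        """Apply the generic RTL statement when CLOCK goes from 0 to 1."""
        if self._last_clock == 0 and clock == 1:
            if control == 1:
                self.rega = self.regb & self.regc
            else:
                self.rega = self.regb | self.regd
        self._last_clock = clock


regs = Registers(regb=1, regc=0, regd=1)
regs.tick(clock=1, control=0)   # rising edge: REGA = REGB | REGD
print(mux2(1, 0, 1), mux2(0, 0, 1), regs.rega)  # prints: 0 1 1
```

Calling tick again with the clock still high changes nothing, because the update is tied to the 0-to-1 transition rather than to the clock level.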
The behavioral level also encompasses using algorithmic elements like adders and multipliers in equations; for example:

    Y = (DATA-A + DATA-B) * DATA-C;

We should note that there is also a system level of abstraction (not shown in Figure 9-1) that features constructs intended for system-level design applications, but we’ll worry about this level a little later.

Many of the early digital HDLs supported only structural representations in the form of switch or gate-level netlists. Others such as ABEL, CUPL, and PALASM were used to capture the required functionality for PLD devices. These languages (which were introduced in Chapter 3) supported different levels of functional abstraction, such as Boolean equations, text-based truth tables, and text-based finite state machine (FSM) descriptions.

[Margin note: FSM is pronounced by spelling it out as “F-S-M.”]

The next generation of HDLs, which were predominantly targeted toward logic simulation, supported more sophisticated levels of abstraction such as RTL and some behavioral constructs. It was these HDLs that formed the core of the first true HDL-based design flows as discussed below.

A simple (early) HDL-based ASIC flow

The key feature of HDL-based ASIC design flows is their use of logic synthesis technology, which began to appear on the market around the mid-1980s. These tools could accept an RTL representation of a design along with a set of timing constraints. In this case, the timing constraints were presented in a side-file containing statements along the lines of “the maximum delay from input X to output Y should be no greater than N nanoseconds” (the actual format would be a little drier and more boring).
The logic synthesis application automatically converted the RTL representation into a mixture of registers and Boolean equations, performed a variety of minimizations and optimizations (including optimizing for area and timing), and then generated a gate-level netlist that would (or at least, should) meet the original timing constraints (Figure 9-2).

[Figure 9-2 shows the register transfer level (RTL) description being checked by a logic simulator (RTL functional verification) and passed through logic synthesis to produce a gate-level netlist, which is checked by a logic simulator (gate-level functional verification) and then passed to place-and-route.]

Figure 9-2. Simple HDL-based ASIC flow.

There were a number of advantages to this new type of flow. First of all, the productivity of the design engineers rose dramatically because it was a lot easier to specify, understand, discuss, and debug the required functionality of the design at the RTL level of abstraction as opposed to working with reams of gate-level schematics. Also, logic simulators could run designs described in RTL much more quickly than their gate-level counterparts.

One slight glitch was that logic simulators could work with designs specified at high levels of abstraction that included behavioral constructs, but early synthesis tools could only accept functional representations up to the level of RTL. Thus, design engineers were obliged to work with a synthesizable subset of their HDL of choice.

[Timeline, 1878: England. Sir Joseph Wilson Swan demonstrates a true incandescent light bulb.]

[Timeline, 1878: England. William Crookes invents his version of a cathode ray tube called the Crookes’ Tube.]

Once the synthesis tool had generated a gate-level netlist, the flow became very similar to the schematic-based ASIC flows discussed in the previous chapter. The gate-level netlist could be simulated to ensure its functional validity, and it could also be used to perform timing analysis based on estimated values for tracks and other circuit elements.
The netlist could then be used to drive the place-and-route software, following which a more accurate timing analysis could be performed using extracted resistance and capacitance values.

A simple (early) HDL-based FPGA flow

It took some time for HDL-based flows to flourish within the ASIC community. Meanwhile, design engineers were still coming to grips with the concept of FPGAs. Thus, it wasn’t until the very early 1990s that HDL-based flows featuring logic synthesis technology became fully available in the FPGA world (Figure 9-3).

[Figure 9-3 shows the register transfer level (RTL) description being checked by a logic simulator (RTL functional verification) and passed through logic synthesis to produce a gate-level netlist, which is checked by a logic simulator (gate-level functional verification) and then passed through mapping, packing, and place-and-route.]

Figure 9-3. Simple HDL-based FPGA flow.

As before, once the synthesis tool had generated a gate-level netlist, the flow became very similar to the schematic-based FPGA flows discussed in the previous chapter. The gate-level netlist could be simulated to ensure its functional validity, and it could also be used to perform timing analysis based on estimated values for tracks and other circuit elements. The netlist could then be used to drive the FPGA’s mapping, packing, and place-and-route software, following which a more accurate timing report could be generated using real-world (physical) values.

Architecturally aware FPGA flows

The main problem besetting the original HDL-based FPGA flows was that their logic synthesis technologies were derived from the ASIC world. Thus, these tools “thought” in terms of primitive logic gates and registers. In turn, this meant that they output gate-level netlists, and it was left to the FPGA vendor to perform the mapping, packing, and place-and-route functions.

Sometime around 1994, synthesis tools were equipped with knowledge about different FPGA architectures. This meant that they could perform mapping functions (and some level of packing) internally and output a LUT/CLB-level netlist.
This netlist would subsequently be passed to the FPGA vendor’s place-and-route software. The main advantage of this approach was that these synthesis tools had a better idea about timing estimations and area utilization, which allowed them to generate a better quality of results (QoR). In real terms, FPGA designs generated by architecturally aware synthesis tools were 15 to 20 percent faster than their counterparts created using traditional (gate-level) synthesis offerings.

[Margin note: QoR is pronounced by spelling it out as “Q-o-R.”]

[Timeline, 1878: Ireland. Denis Redmond demonstrates capturing an image using selenium photocells.]

Logic versus physically aware synthesis

We’re jumping a little bit ahead of ourselves here, but this is as good a place as any to briefly introduce this topic. The original logic synthesis tools were designed for use with the multimicron ASIC technologies of the mid-1980s. In these devices, the delays associated with the logic gates far outweighed the delays associated with the tracks connecting those gates together. In addition to being relatively small in terms of gate-count (by today’s standards), these designs featured relatively low clock frequencies and correspondingly loose design constraints.

The combination of all of these factors meant that early logic synthesis tools could employ relatively simple algorithms to estimate the track delays, but that these estimations would be close enough to the real (post-place-and-route) values that the device would work.

Over the years, ASIC designs increased in size (number of gates) and complexity. At the same time, the dimensions of the structures on the silicon chip were shrinking with two important results:

■ Delay effects became more complex in general.
■ The delays associated with tracks began to outweigh the delays associated with gates.
By the mid-1990s, ASIC designs were orders of magnitude larger than those for which the original logic synthesis tools had been designed, and their delay effects were significantly more sophisticated. The result was that the estimated delays used by the logic synthesis tool had little relation to the final post-place-and-route delays. In turn, this meant that achieving timing closure (tweaking the design to make it achieve its original performance goals) became increasingly difficult and time-consuming.

For this reason, ASIC flows started to see the use of physically aware synthesis somewhere around 1996. The ways in which physically aware synthesis performs its magic are discussed in more detail in Chapter 19. For the moment, we need only note that, during the course of performing its machinations, the physically aware synthesis engine makes initial placement decisions for the logic gates and functions. Based on these placements, the tool can generate more accurate timing estimations.

Ultimately, the physically aware synthesis tool outputs a placed (but not routed) gate-level netlist. The ASIC’s physical implementation (place-and-route) tools use this initial placement information as a starting point from which to perform local (fine-grained) placement optimizations followed by detailed routing. The end result is that the estimated delays used by the physically aware synthesis application more closely correspond to the post-place-and-route delays. In turn, this means that achieving timing closure becomes a less taxing process.

“But what of FPGAs,” you cry. Well, these devices were also increasing in size and complexity throughout the 1990s. By the end of the millennium, FPGA designers were running into significant problems with regard to timing closure. Thus, around 2000, EDA vendors started to provide FPGA-centric, physically aware synthesis offerings that could output a mapped, packed, and placed LUT/CLB-level netlist.
In this case, the FPGA’s physical implementation (place-and-route) tools use this initial placement information as a starting point from which to perform local (fine-grained) placement optimizations followed by detailed routing.

Graphical design entry lives on

When the first HDL-based flows appeared on the scene, many folks assumed that graphical design entry and visualization tools, such as schematic capture systems, were poised to exit the stage forever. Indeed, for some time, many design engineers prided themselves on using text editors like VI (from Visual Interface) or EMACS as their only design entry mechanism.

In an expert’s hands, the VI editor (pronounced by spelling it out as “V-I”) was (and still is) an extremely powerful tool, but it can be very frustrating for new users.

1879: America. Thomas Alva Edison invents an incandescent light bulb (a year after Sir Joseph Wilson Swan in England).

But a picture tells a thousand words, as they say, and graphical entry techniques remain popular at a variety of levels. For example, it is extremely common to use a block-level schematic editor to capture the design as a collection of high-level blocks that are connected together. The system might then be used to automatically create a skeleton HDL framework with all of the block names and inputs and outputs declared. Alternatively, the user might create a skeleton framework in HDL, and the system might use this to create a block-level schematic automatically.

From the user’s viewpoint, “pushing” down into one of these schematic blocks might automatically open an HDL editor. This could be a pure text-and-command–based editor like VI, or it might be a more sophisticated HDL-specific editor featuring the ability to show language keywords in different colors, automatically complete statements, and so forth.
Furthermore, when pushing down into a schematic block, modern design systems often give you a choice between entering and viewing the contents of that block as another, lower-level block-level schematic, raw HDL code, a graphical state diagram (used to represent an FSM), a graphical flowchart, and so forth. In the case of the graphical representations like state diagrams and flowcharts, these can subsequently be used to generate their RTL equivalents automatically (Figure 9-4).

Figure 9-4. Mixed-level design capture environment. [Figure: a top-level block-level schematic alongside the alternative views of a block: a lower-level block-level schematic, a graphical state diagram, a graphical flowchart, and textual HDL reading “When clock rises If (s == 0) then y = (a & b) | c; else y = c & !(d ^ e);”.]

Furthermore, it is common to have a tabular file containing information relating to the device’s external inputs and outputs. In this case, both the top-level block diagram and the tabular file will (hopefully) be directly linked to the same data and will simply provide different views of that data. Making a change in any view will update the central data and be reflected immediately in all of the views.

A positive plethora of HDLs

Life would be so simple if there were only a single HDL to worry about, but no one said that living was going to be easy. As previously noted, in the early days of digital IC electronics design (circa the 1970s), anyone who created an HDL-based design tool typically felt moved to create his or her own language to accompany it. Not surprisingly, the result was a morass of confusion (you had to be there to fully appreciate the dreadfulness of the situation). What was needed was an industry-standard HDL that could be used by multiple EDA tools and vendors, but where was such a gem to be found?

Verilog HDL

Sometime around the mid-1980s, Phil Moorby (one of the original members of the team that created the famous HILO logic simulator) designed a new HDL called Verilog.
In 1985, the company he was working for, Gateway Design Automation, introduced this language to the market along with an accompanying logic simulator called Verilog-XL.

One very cool concept that accompanied Verilog and Verilog-XL was the Verilog programming language interface (PLI). The more generic name for this sort of thing is application programming interface (API). An API is a library of software functions that allow external software programs to pass data into an application and access data from that application. Thus, the Verilog PLI is an API that allows users to extend the functionality of the Verilog language and simulator.

PLI is pronounced by spelling it out as “P-L-I.” API is pronounced by spelling it out as “A-P-I.”

As one simple example, let’s assume that an engineer is designing a circuit that makes use of an existing module to perform a mathematical function such as an FFT. A Verilog representation of this function might take a long time to simulate, which would be a pain if all the engineer really wanted to do was verify the new portion of the circuit. In this case, the engineer might create a model of this function in the C programming language, which would simulate, say, 1,000 times faster than its Verilog equivalent. This model would incorporate PLI constructs, allowing it to be linked into the simulation environment. The model could subsequently be accessed from the Verilog description of the rest of the circuit by means of a PLI call providing a bidirectional link to pass data back and forth between the main circuit (represented in Verilog) and the FFT (captured in C).

FFT is pronounced by spelling it out as “F-F-T.” SDF is pronounced by spelling it out as “S-D-F.”

Yet one more really useful feature associated with Verilog and Verilog-XL was the ability to have timing information specified in an external text file known as a standard delay format (SDF) file.
This allowed tools like post-place-and-route timing analysis packages to generate SDF files that could be used by the simulator to provide more accurate results.

As a language, the original Verilog was reasonably strong at the structural (switch and gate) level of abstraction (especially with regard to delay modeling capability); it was very strong at the functional (Boolean equation and RTL) level of abstraction; and it supported some behavioral (algorithmic) constructs (Figure 9-5).

Figure 9-5. Levels of abstraction (Verilog). [Figure: the System, Behavioral (Algorithmic), Functional (RTL, Boolean), and Structural (Gate, Switch) levels of abstraction, showing the range covered by Verilog.]

1879: England. William Crookes postulates that cathode rays may be negatively charged particles.

In 1989, Gateway Design Automation along with Verilog (the HDL) and Verilog-XL (the simulator) were acquired by Cadence Design Systems. The most likely scenario at that time was for Verilog to remain as just another proprietary HDL. However, with a move that took the industry by surprise, Cadence put the Verilog HDL, Verilog PLI, and Verilog SDF specifications into the public domain in 1990.

This was a very daring move because it meant that anybody could develop a Verilog simulator, thereby becoming a potential competitor to Cadence. The reason for Cadence’s largesse was that the VHDL language (introduced later in this section) was starting to gain a significant following. The upside of placing Verilog in the public domain was that a wide variety of companies developing HDL-based tools, such as logic synthesis applications, now felt comfortable using Verilog as their language of choice.

Having a single design representation that could be used by simulation, synthesis, and other tools made everyone’s life a lot easier.
It is important to remember, however, that Verilog was originally conceived with simulation in mind; applications like synthesis were something of an afterthought. This means that when creating a Verilog representation to be used for both simulation and synthesis, one is restricted to using a synthesizable subset of the language (which is loosely defined as whatever collection of language constructs your particular logic synthesis package understands and supports).

The formal definition of Verilog is encapsulated in a document known as the language reference manual (LRM), which details the syntax and semantics of the language. In this context, the term syntax refers to the grammar of the language (such as the ordering of the words and symbols in relation to each other), while the term semantics refers to the underlying meaning of the words and symbols and the relationships between the things they denote … phew!

LRM is pronounced by spelling it out as “L-R-M.” OVI is pronounced by spelling it out as “O-V-I.”

In an ideal world, an LRM would define things so rigorously that there would be no chance of any misinterpretation. In the real world, however, there were some ambiguities with respect to the Verilog LRM. Admittedly, these were corner-case conditions along the lines of “if a control signal on this register goes inactive at the same time as the clock signal triggers, which signal will be evaluated by the simulator first?” But the end result was that different Verilog simulators might generate different results, which is always somewhat disconcerting to the end user.

Verilog quickly became very popular. The problem was that different companies started to extend the language in different directions. In order to curtail this sort of thing, a nonprofit body called Open Verilog International (OVI) was established in 1991.
With representatives from all of the major EDA vendors of the time, OVI’s mandate was to manage and standardize Verilog HDL and the Verilog PLI.

The popularity of Verilog continued to rise exponentially, with the result that OVI eventually asked the IEEE to form a working committee to establish Verilog as an IEEE standard. Known as IEEE 1364, this committee was formed in 1993. May 1995 saw the first official IEEE Verilog release, which is formally known as IEEE 1364-1995, and whose unofficial designation has come to be Verilog 95.

Minor modifications were made to this standard in 2001; hence, it is often referred to as the Verilog 2001 (or Verilog 2K1) release. At the time of this writing, the IEEE 1364 committee is working feverishly on a forthcoming Verilog 2005 offering, while the design world holds its breath in dread anticipation (see also the section on “Superlog and SystemVerilog” later in this chapter).

VHDL and VITAL

In 1980, the U.S. Department of Defense (DoD) launched the very high speed integrated circuit (VHSIC) program, whose primary objective was to advance the state of the art in digital IC technology. This program sought to address, among other things, the fact that it was difficult to reproduce ICs (and circuit boards) over the long life cycles of military equipment because the function of the parts wasn’t documented in a rigorous fashion. Furthermore, different components forming a system were often designed and verified using diverse and incompatible simulation languages and design tools.

In order to address these issues, a project to develop a new hardware description language called VHSIC HDL (or VHDL for short) was launched in 1981. One unique feature of this process was that industry was involved from a very early stage. In 1983, a team comprising Intermetrics, IBM, and Texas Instruments was awarded a contract to develop VHDL, the first official release of which occurred in 1985.
Also of interest is the fact that in order to encourage acceptance by the industry, the DoD subsequently donated all rights to the VHDL language definition to the IEEE in 1986. After making some modifications to address a few known problems, VHDL was released as official standard IEEE 1076 in 1987. The language was further extended in a 1993 release and again in 1999.

Don’t ask me how VHSIC is pronounced (it’s been a long day). VHDL is pronounced by spelling it out as “V-H-D-L.”

Initially, VHDL didn’t have an equivalent to Verilog’s PLI. Today, different simulators have their own ways of doing this sort of thing, such as ModelSim’s foreign language interface (FLI). We can but hope that these diverse offerings will eventually converge on a common standard.

As a language, VHDL is very strong at the functional (Boolean equation and RTL) and behavioral (algorithmic) levels of abstraction, and it also supports some system-level design constructs. However, VHDL is a little weak when it comes to the structural (switch and gate) level of abstraction, especially with regard to its delay modeling capability.

It quickly became apparent that VHDL had insufficient timing accuracy to be used for sign-off simulation. For this reason, the VITAL initiative was launched at the Design Automation Conference (DAC) in 1992. VHDL Initiative toward ASIC Libraries (VITAL) was an effort to enhance VHDL’s abilities for modeling timing in ASIC and FPGA design environments. The end result encompassed both a library of ASIC/FPGA primitive functions and an associated method for back-annotating delay information into these library models, where this delay mechanism was based on the same underlying tabular format used by Verilog (Figure 9-6).

DAC may be pronounced to rhyme with “sack,” or it may be spelled out as “D-A-C.”
Figure 9-6. Levels of abstraction (Verilog versus VHDL). [Figure: the System, Behavioral (Algorithmic), Functional (RTL, Boolean), and Structural (Gate, Switch) levels of abstraction, showing the ranges covered by Verilog, VHDL, and VITAL, together with the following comparison.]

Verilog                          VHDL
Relatively easy to learn         Relatively difficult to learn
Fixed data types                 Abstract data types
Interpreted constructs           Compiled constructs
Good gate-level timing           Less good gate-level timing
Limited design reusability       Good design reusability
Limited design management        Good design management
No structure replication         Supports structure replication

Mixed-language designs

Once upon a time, it was fairly common for an entire design to be captured using a single HDL (Verilog or VHDL). As designs increased in size and complexity, however, it became more common for different portions of the design to be created by different teams. These teams might be based in different companies or even reside in different countries, and it was not uncommon for the different groups to be using different design languages.

Another consideration was the increasing use of legacy design blocks or third-party IP, where the latter refers to a design team purchasing a predefined function from an external supplier. As a general rule of thumb related to Murphy’s Law, if you were using one language, then the IP you wanted was probably available only in the other language.

The early 1990s saw a period known as the HDL Wars, in which the proponents of one language (Verilog or VHDL) stridently predicted the imminent demise of the other … but the years passed and both languages retained strong followings. The end result was that EDA vendors began to support mixed-language design environments featuring logic simulators, logic synthesis applications, and other tools that could work with designs composed from a mixture of Verilog and VHDL blocks (or modules, depending on your language roots).

UDL/I

As previously noted, Verilog was originally designed with simulation in mind.
Similarly, VHDL was created as a design documentation and specification language that took simulation into account. As a result, one can use both of these languages to describe constructs that can be simulated, but not synthesized.

In order to address these problems, the Japan Electronic Industry Development Association (JEIDA) introduced its own HDL, the unified design language for integrated circuits (UDL/I) in 1990.

Murphy’s Law (if anything can go wrong, it will) is attributed to Capt. Edward Murphy, an engineer working at Edwards Air Force Base in 1949.

1880: America. Alexander Graham Bell patents an optical telephone system called the Photophone.

The key advantage of UDL/I was that it was designed from the ground up with both simulation and synthesis in mind. The UDL/I environment includes a simulator and a synthesis tool and is available for free (including the source code). However, by the time UDL/I arrived on the scene, Verilog and VHDL already held the high ground, and UDL/I never really managed to attract much interest outside of Japan.

Superlog and SystemVerilog

In 1997, things started to get complicated because that’s when a company called Co-Design Automation was formed. Working away furiously, the folks at Co-Design developed a “Verilog on steroids” called Superlog.

Superlog was an amazing beast that combined the simplicity of Verilog with the power of the C programming language. It also included things like temporal logic, sophisticated design verification capabilities, a dynamic API, and the concept of assertions that are key to the formal verification strategy known as model checking. (VHDL already had a simple assert construct, but the original Verilog had nothing to boast about in this area.)
The two main problems with Superlog were (a) it was essentially another proprietary language, and (b) it was so much more sophisticated than Verilog 95 (and later Verilog 2001) that getting other EDA vendors to enhance their tools to support it would have been a major feat.

Meanwhile, while everyone was scratching their heads wondering what the future held, the OVI group linked up with their equivalent VHDL organization called VHDL International to form a new body called Accellera. The mission of this new organization was to focus on identifying new standards and formats, to develop these standards and formats, and to foster the adoption of new methodologies based on these standards and formats.

In the summer of 2002, Accellera released the specification for a hybrid language called SystemVerilog 3.0 (don’t even ask me about 1.0 and 2.0). The great advantage to this language was that it was an incremental enhancement to the existing Verilog, rather than the death-defying leap represented by a full-up Superlog implementation. Actually, SystemVerilog 3.0 featured many of Superlog’s language constructs donated by Co-Design. It included things like the assertion and extended synthesis capabilities that everyone wanted and, being an Accellera standard, it was well placed to quickly gain widespread adoption.

The current state of play (at the time of this writing) is that Co-Design was acquired by Synopsys in the fall of 2002. Synopsys maintained the policy of donating language constructs from Superlog to SystemVerilog, but no one is really talking about Superlog as an independent language anymore. After a little pushing and pulling, all of the mainstream EDA vendors officially endorsed SystemVerilog and augmented their tools to accept various subsets of the language, depending on their particular application areas and requirements.
SystemVerilog 3.1 hit the streets in the summer of 2003, followed by a 3.1a release (to add a few enhancements and fix some annoying problems) around the beginning of 2004. Meanwhile, the IEEE is set to release the next version of Verilog in 2005. In order to avert a potential schism between Verilog 2005 and SystemVerilog, Accellera has promised to donate their SystemVerilog copyright to the IEEE by the summer of 2004.

SystemC

And then we have SystemC, which some design engineers love and others hate with a passion. SystemC (discussed in more detail in chapter 11) can be used to describe designs at the RTL level of abstraction.¹ These descriptions can subsequently be simulated 5 to 10 times faster than their Verilog or VHDL counterparts, and synthesis tools are available to convert the SystemC RTL into gate-level netlists.

¹ SystemC can support higher levels of abstraction than RTL, but those levels are outside the scope of this chapter; instead, they are discussed in more detail in chapter 11.

1880: France. Pierre and Jacques Curie discover piezoelectricity.

1881: Alan Marquand invents a graphical technique of representing logic problems.

One big argument for SystemC is that it provides a more natural environment for hardware/software codesign and coverification. One big argument against it is that the majority of design engineers are very familiar with Verilog or VHDL, but they are not familiar with the object-orientated aspects of SystemC. Another consideration is that the majority of today’s synthesis offerings represent hundreds of engineer years of development in translating Verilog or VHDL into gate-level netlists. By comparison, there are far fewer SystemC-based synthesis tools, and those that are available tend to be somewhat less sophisticated than their more traditional counterparts.

In reality, SystemC is more applicable to a system-level versus an RTL design environment.
Having said this, SystemC seems to be gaining a lot of momentum in Asia and Europe, and the debate on SystemC versus SystemVerilog versus VHDL will doubtless be with us for quite some time.

Points to ponder

Be afraid, be very afraid

Most software engineers throw up their hands in horror when they look at another programmer’s code, and they invariably start a diatribe as to the lack of comments, consistency, whatever … you name it, and they aren’t happy about it. They don’t know how lucky they are because the RTL source code for a design often sets new standards for awfulness! Sad to relate, the majority of designs described in RTL are almost unintelligible to another designer. In an ideal world, the RTL description of a design should read like a book, starting with a “table of contents” (an explanation of the design’s structure), having a logical flow partitioned into “chapters” (logical breaks in the design), and having lots of “commentary” (comments explaining the structure and operation of the design).

It’s also important to note that coding style can impact performance (this typically affects FPGAs more than ASICs). One reason for this is that, although they might be logically equivalent, different RTL statements can yield different results. Also, tools are part of the equation because different tools can yield different results.

The various FPGA vendors and EDA vendors are in a position to provide their customers with reams of information on particular coding styles and considerations with regard to their chips and tools, respectively. However, the following points are reasonably generic and will apply to most situations.

Serial versus parallel multiplexers

When creating RTL code, it is useful to understand what your synthesis tool is going to do in certain circumstances. For example, every time you use an if-then-else statement, the result will be a 2:1 multiplexer.
This becomes interesting in the case of nested if-then-else statements, which will be synthesized into a priority structure. For example, assume that we have already declared signals Y, A, B, C, D, and SEL (for select) and that we use them to create a nested if-then-else (Figure 9-7).

    if     SEL == "00" then Y = A;
    elseif SEL == "01" then Y = B;
    elseif SEL == "10" then Y = C;
    else                    Y = D;
    end if;

Figure 9-7. Synthesizing nested if-then-else statements. [Figure: a chain of three 2:1 multiplexers. The multiplexer controlled by SEL == 10 selects between C and D; its output feeds the multiplexer controlled by SEL == 01, whose other input is B; that output in turn feeds the multiplexer controlled by SEL == 00, whose other input is A and whose output drives Y.]

1883: America. William Hammer and Thomas Alva Edison discover the “Edison Effect”.

1884: Germany. Paul Gottlieb Nipkow uses spinning disks to scan, transmit, and reproduce images.

As before, the syntax used here is a generic one that doesn’t really reflect any of the mainstream languages. In this case, the innermost if-then-else will be the fastest path, while the outermost if-then-else will be the critical signal (in terms of timing). Having said this, in some FPGAs all of the paths through this structure will be faster than using a case statement. Speaking of which, a case statement implementation of the above will result in a 4:1 multiplexer, in which all of the timing paths associated with the inputs will be (relatively) equal (Figure 9-8).

    case SEL of;
        "00":      Y = A;
        "01":      Y = B;
        "10":      Y = C;
        otherwise: Y = D;
    end case;

Figure 9-8. Synthesizing a case statement. [Figure: a single 4:1 multiplexer with inputs A, B, C, and D, selected by SEL values 00, 01, 10, and 11 respectively, driving Y.]

Beware of latch inference

Generally speaking, it’s a good idea to avoid the use of latches in FPGA designs unless you really need them. One other thing to watch out for: If you use an if-then-else statement, but neglect to complete the “else” portion, then most synthesis tools will infer a latch.

Use constants wisely

Adders are the most used of the more complex operators in a typical design. In certain cases, ASIC designers sometimes employ special versions using combinations of half-adders and full-adders.
This may work very efficiently in the case of a gate array device, for example, but it will typically result in a very bad FPGA implementation.

When using an adder with constants, a little thought goes a long way. For example, “A + 2” can be implemented more efficiently as “A + 1 with carry-in,” while “A – 2” would be better implemented as “A – 1 with carry-in.”

Similarly, when using multipliers, “A * 2” can be implemented much more efficiently as “A SHL 1” (which translates to “A shifted left by one bit”), while “A * 3” would be better implemented as “(A SHL 1) + A.”

In fact, a little algebra also goes a long way in FPGAs. For example, replacing “A * 9” with “(A SHL 3) + A” results in at least a 40 percent reduction in area.

Consider resource sharing

Resource sharing is an optimization technique that uses a single functional block (such as an adder or comparator) to implement several operators in the HDL code. If you do not use resource sharing, then each RTL operation is built using its own logic. This results in better performance, but it uses more logic gates, which equates to silicon real estate. If you do decide to use resource sharing, the result will be to reduce the gate-count, but you will typically take a hit in performance. For example, consider the statement illustrated in Figure 9-9.

Note that frequency values shown in Figure 9-9 are of interest only for the purposes of this comparison, because these values will vary according to the particular FPGA architecture, and they will change as new process nodes come online.

The following operators can be shared with other instances of the same operator or with related operators on the same line:

    *
    + –
    > < >= <=

For example, a + operator can be shared with instances of other + operators or with – operators, while a * operator can be shared only with other * operators.
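The two structures compared in Figure 9-9 can be sketched in software to show that they compute the same function while using different numbers of adders. What follows is only a minimal behavioral model in Python, not synthesizable code and not the output of any particular synthesis tool; the 4-bit operand width and the names `adder`, `unshared`, `shared`, and `equivalent_everywhere` are assumptions made purely for illustration.

```python
from itertools import product

WIDTH = 4                     # assumed operand width (not taken from the text)
MASK = (1 << WIDTH) - 1


def adder(x: int, y: int) -> int:
    """Model a single WIDTH-bit adder; the sum wraps around like hardware."""
    return (x + y) & MASK


def unshared(a: int, b: int, c: int) -> int:
    """Resource sharing OFF: two adders, with the multiplexer on their outputs."""
    sum_ab = adder(a, b)      # first adder
    sum_ac = adder(a, c)      # second adder
    return sum_ab if b > c else sum_ac


def shared(a: int, b: int, c: int) -> int:
    """Resource sharing ON: the multiplexer picks an operand for one adder."""
    operand = b if b > c else c
    return adder(a, operand)  # the only adder


def equivalent_everywhere() -> bool:
    """Exhaustively compare both structures over every WIDTH-bit input."""
    values = range(1 << WIDTH)
    return all(
        shared(a, b, c) == unshared(a, b, c)
        for a, b, c in product(values, repeat=3)
    )


if __name__ == "__main__":
    print(shared(3, 5, 2), unshared(3, 5, 2))
    print(equivalent_everywhere())
```

In the shared version the comparator and multiplexer sit in front of the adder, so the three operate in series; in the unshared version both additions can proceed in parallel with the comparison. That is one plausible reading of why sharing tends to trade speed for area.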
    if (B > C) then
        Y = A + B;
    else
        Y = A + C;
    end if;

    Resource Sharing = ON (one adder):    Total LUTs = 32, clock frequency = 87.7 MHz
    Resource Sharing = OFF (two adders):  Total LUTs = 64, clock frequency = 133.3 MHz (+52%!)

Figure 9-9. Resource sharing.

For not-so-technical readers, the circles with “>” symbols indicate comparators (circuits that compare two numbers to determine which is the larger); the circles with “+” symbols indicate adders; and the wedge-shaped blocks are 2:1 multiplexers that select between their inputs based on the value of the control signals coming out of the comparators.

1886: Reverend Charles Lutwidge Dodgson (Lewis Carroll) publishes a diagrammatic technique for logic representation in The Game of Logic.

If nothing else, it’s a good idea to check whether or not your synthesis application has resource sharing enabled or disabled by default. And one final point is that resource sharing in ASICs can alleviate routing congestion, but it may actually cause routing problems in FPGAs.

Last but not least

Internal tri-state buses are slow in most FPGAs and should be avoided unless you are 100 percent confident that you know what you’re doing. If at all possible, use tri-state buffers only at the top-most level of the design. If you do wish to use internal tri-state buffers, then in the case of FPGA families that don’t support these gates, the majority of today’s synthesis tools provide automatic tri-state-to-multiplexer conversion (this basically involves converting the tri-state buffers specified in the RTL into corresponding LUT/CLB-based logic).

Also, bidirectional buffers can cause timing loop problems, so if you use them, make sure that any false paths are clearly marked.

Chapter 10

Silicon Virtual Prototyping for FPGAs

Just what is an SVP?
Before we leap headfirst into the concept of silicon virtual prototyping for FPGAs, it’s probably worth reminding ourselves how the silicon virtual prototype (SVP) concept originated in the ASIC world, some of the alternative SVP manifestations one might see in that world, and some of the problems associated with those manifestations.

SVP is pronounced by spelling it out as “S-V-P.”

As high-end ASIC devices containing tens of millions of logic gates appeared on the scene, capacity and complexity issues associated with these megadesigns caused design flows to become a little wobbly around the edges.

The problem is that, with traditional flows, many design issues do not become apparent until accurate timing analysis can be performed following extraction of realistic physical values (capacitance, resistance, and sometimes inductance), based on the results from place-and-route. This requires the engineers to go all of the way through the flow (including synthesis and place-and-route) before they discover a major problem that would have been better detected and resolved earlier in the process.

This is extremely irritating, and the end result often involves numerous time-consuming iterations that can so delay a design that it completely misses its time-to-market window. (In many cases there is only room in the market for the winner, and there’s no such thing as second place!)

One solution is to create an SVP, which is a representation of the design that can be generated relatively quickly, but which (hopefully) contains sufficient information to allow the designers to identify and address a large proportion of potential problems before they undergo the time-consuming portions of the design flow. In theory, the time taken to iterate a design using an SVP can be measured in hours, as opposed to days or weeks using conventional design flows.

1887: England. J. Thomson discovers the electron.
ASIC-based SVP approaches

As was discussed in the previous chapter, the role of logic synthesis is to accept an RTL representation of a design along with a set of timing constraints. The logic synthesis application automatically converts this RTL representation into a mixture of registers and Boolean equations, performs a variety of minimizations and optimizations (including optimizing for area and timing), and then generates a gate-level netlist that hopefully meets the original timing constraints.

Conventional logic synthesis solutions operate in the gate-size versus delay plane, which means they are constantly making trade-offs with regard to the size of gates and the delays associated with those gates. Due to their underlying modus operandi, these tools perform tremendous amounts of compute-intensive, time-consuming evaluations. Even worse, many of the optimization decisions performed by the synthesis tool are often rendered meaningless when the design is handed over to the physical implementation (place-and-route) portion of the flow.

Gate-level SVPs (from fast-and-dirty synthesis)

One key aspect of an SVP is the ability to generate it quickly and easily. The majority of current ASIC SVPs are based on the use of a gate-level netlist representation of the design that is subsequently placed using a rough-and-ready placement algorithm. Unfortunately, conventional synthesis tools consume too much time and computational resources to meet the speed demands of prototyping. Thus, some ASIC-based SVP flows make use of a fast-and-dirty synthesis engine (Figure 10-1).

Figure 10-1. SVP based on fast-and-dirty synthesis. [Figure: the RTL feeds two separate flows that use different synthesis engines. In the SVP world, fast-and-dirty synthesis is followed by prototype exploration and timing analysis, and iterations take hours. In the implementation world, logic synthesis is followed by place-and-route and timing analysis, and iterations take days or weeks.]
This fast-and-dirty engine is typically based on completely different algorithms from the main synthesis application, for example, direct RTL mapping. Thus, the ensuing gate-level netlist used to form the SVP is not as accurate a representation of the design’s final implementation as one might hope for. In turn, this means that once the SVP has been used to perform RTL exploration and timing analysis, engineers still have to perform a full-up logic synthesis (or physically aware synthesis) step using a completely different synthesis engine in order to generate the real netlist to be passed on to the physical implementation (place-and-route) tools.

So, the big problem with this SVP-based approach is that the prototyping tools and their methodologies are separate and distinct from the implementation tools and their methodologies. This leads to unpredictability of design convergence due to lack of correlation, which can result in time-consuming back-end-to-front-end iterations, which sort of defeats the whole purpose of using an SVP in the first place!

1887: England. William Crookes demonstrates that cathode rays travel in a straight line.
1887: Germany. Heinrich Hertz demonstrates the transmission, reception, and reflection of radio waves.

Gate-level SVPs (from gain-based synthesis)

As opposed to conventional logic synthesis that is based in the gate-size versus delay plane, a concept known as gain-based synthesis [1] is a kettle of fish of a different color (I never metaphor I didn’t like). This form of synthesis is derived from ideas put forward by Ivan Sutherland, Bob Sproull, and David Harris in their 1999 book Logical Effort: Designing Fast CMOS Circuits [2]. In this case, the synthesis engine uses logical effort concepts to establish a fixed-timing plane, and the physical implementation (place-and-route) tools subsequently work within this plane.
This means that all timing optimizations are completed and all circuit delays are determined and frozen by the end of the synthesis step. When the placement engine performs its task, it uses a size-driven algorithm in which all of the cells are dynamically sized to meet their timing budgets based on the actual loads they see. Following placement, a load-driven routing engine is used to tune the width and spacing of the tracks so as to maintain the original timing budgets and to ensure signal integrity.

One interesting point with regard to the gain-based approach is that the amounts of computer memory and computational effort required to perform this type of synthesis are a fraction of those demanded by conventional synthesis tools. As a result, proponents of gain-based synthesis claim an order-of-magnitude increase in capacity over conventional synthesis approaches.

Another interesting point is that the gain-based synthesis engine automatically uses up any slack in path delays. This means that each gate is given the smallest possible size that will just meet its timing budget. Thus, the resulting implementation occupies the smallest amount of silicon real estate, which significantly reduces congestion, power consumption, and noise problems.

“But,” you cry, “what does all of this have to do with SVPs?” Well, the speed and capacity inherent to gain-based synthesis mean that the same synthesis engine can be used for both prototyping and implementation (Figure 10-2).

[1] At the time of this writing, one of the chief proponents of gain-based synthesis is Magma Design Automation (www.magma-da.com).
[2] Ivan Sutherland is internationally renowned for his pioneering work on logic design.
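To make the fixed-timing idea a little more concrete, here is a minimal C++ sketch based on the logical effort delay model from the Sutherland, Sproull, and Harris book, in which the delay of a stage is d = g·h + p (g is the logical effort of the gate type, h is its electrical gain, that is, load capacitance divided by input capacitance, and p is its parasitic delay). If synthesis freezes h for every gate, the delay is frozen too, and sizing a cell later is just a matter of dividing the load it actually sees by that gain. The names GateType, stage_delay, and input_cap_for_load are hypothetical, and this is only an illustration of the principle, not a description of how any particular gain-based tool is implemented.

```cpp
#include <stdexcept>

// Logical effort view of one gate type (hypothetical type for this example).
struct GateType {
    double logical_effort;   // g: how much worse than an inverter at driving a load
    double parasitic_delay;  // p: intrinsic delay with no external load
};

// Stage delay d = g * h + p, in units of the process time constant.
// With the gain h frozen by synthesis, this value no longer depends on the load.
double stage_delay(const GateType& gate, double gain) {
    return gate.logical_effort * gain + gate.parasitic_delay;
}

// Size-driven step: once placement reveals the real load, choose the input
// capacitance (a stand-in for cell size) that preserves the frozen gain.
double input_cap_for_load(double load_cap, double gain) {
    if (gain <= 0.0) {
        throw std::invalid_argument("gain must be positive");
    }
    return load_cap / gain;
}
```

For instance, a 2-input NAND (g = 4/3 and p = 2 in the usual logical effort tables) held at a gain of 3 has a delay of 6 units whatever it drives; a load of 12 units of capacitance calls for an input capacitance of 4, and doubling the load simply doubles the cell size while the delay stays put.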
Figure 10-2. SVP based on gain-based synthesis. (Diagram labels: RTL feeding two flows built on identical engines. SVP world: gain-based synthesis, prototype exploration, timing analysis; iterations take hours. Implementation world: gain-based synthesis, place-and-route, timing analysis; iterations take days/weeks.)

Basing both the prototyping and implementation environments on the same algorithms, tools, and methodologies provides high correlation and predictable design convergence and significantly reduces time-consuming back-end-to-front-end iterations.

1888: America. First coin-operated public telephone invented.
1889: America. Almon Brown Strowger invents the first automatic telephone exchange.

Cluster-level SVPs

As discussed earlier, the majority of today’s SVPs are based on full-blown gate-level netlist representations of the design. Even though these representations may be generated using fast-and-dirty synthesis, they can still contain millions upon millions of logic gates, which can strain the capacity of the SVP’s placement and analysis engines.

One solution is to use the concept of clustering as a basis for the SVP’s placement decisions and track-delay estimations. In this case the cells (gates and registers) generated by fast-and-dirty or gain-based synthesis are automatically gathered into groups called clusters. Each cluster typically consists of tens to hundreds of cells, which means that they are small enough to preserve overall placement quality; however, the number of clusters is orders of magnitude smaller than the number of cells, providing extremely significant run-time improvements. The actual number of cells may vary from cluster to cluster so as to keep the areas of the clusters as uniform as possible. In order to streamline computational complexity and capacity requirements, optimization and analysis are performed on the clustered data.
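As a rough illustration of the area-balancing part of clustering, the following C++ sketch walks a list of cell areas and starts a new cluster whenever the next cell would push the current one past a target area. This is a deliberately naive greedy scheme assumed purely for the example (the Cluster type and the cluster_by_area function are hypothetical); real clustering engines also take connectivity into account, and nothing here reflects a specific vendor’s algorithm.

```cpp
#include <cstddef>
#include <vector>

// One cluster: which cells it holds and their combined area.
struct Cluster {
    std::vector<std::size_t> cell_ids;  // indices into the input area list
    double area = 0.0;
};

// Greedy grouping (illustrative assumption): keep adding cells to the current
// cluster until the next cell would exceed target_area, then open a new one.
// A single cell larger than target_area simply ends up in a cluster by itself.
std::vector<Cluster> cluster_by_area(const std::vector<double>& cell_areas,
                                     double target_area) {
    std::vector<Cluster> clusters;
    for (std::size_t i = 0; i < cell_areas.size(); ++i) {
        if (clusters.empty() ||
            clusters.back().area + cell_areas[i] > target_area) {
            clusters.emplace_back();
        }
        clusters.back().cell_ids.push_back(i);
        clusters.back().area += cell_areas[i];
    }
    return clusters;
}
```

With cell areas of 1, 1, 2, 4, 1, and 3 and a target of 4, this yields three clusters holding three, one, and two cells, each with a total area of 4, which shows how the cell count can vary while the cluster areas stay uniform.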
Furthermore, in cases where two clusters are linked by multiple wires (which is a common occurrence), these wires are considered to be a single “weighted” wire for the purposes of estimating routing resource utilization, which has an effect on cluster placement.

RTL-based SVPs

A well-accepted engineering rule of thumb states that detecting, isolating, and resolving a problem at any stage of the design, implementation, or deployment process costs 10 times more than addressing the same problem at the previous stage in the process. In the case of digital ICs, there are three major breakpoints in the design flow with respect to analyzing area, timing, and so forth (Figure 10-3).

Figure 10-3. Major breakpoints with respect to analyzing area, timing, and so forth. (Chart: relative cost of timing analysis and debug versus level of design abstraction: 1x at RTL (pre-synthesis), ~10x at gate level (post-IPO), ~100x at gate level (post-layout).)

The term timing closure refers to analyzing a design or architecture to detect and correct any problematic timing paths. Irrespective of the level it is performed at, timing closure is an iterative process, which means that the analyze-detect-correct steps typically need to be run a number of times in order to achieve convergence.

With regard to the levels of abstraction shown in Figure 10-3, postlayout timing analysis is the most accurate by far, but it is extremely expensive with regard to cost and time. Iterating at the postlayout level is a painful proposition, and design teams try very hard to avoid making changes at this level.

In the case of conventional flows, the first breakpoint for relatively accurate timing analysis occurs at the gate level following synthesis and in-place optimization (IPO). The problem is that getting to this post-IPO breakpoint using conventional flows requires the use of physically aware synthesis to provide a placed gate-level netlist.
This approach is therefore extremely compute-intensive and time-consuming, and large blocks can take days to go through the full physical synthesis and timing analysis process. Not only does this stretch out the design and timing closure process, but it also ties up expensive EDA tools that could otherwise be used for implementing chips rather than analyzing their timing.

(IPO means that, after the placement algorithm has performed its initial pass, it is possible to make certain “tweaks” (optimizations), such as changing the size of cells based on updated estimates of the length of the tracks they will see.)

One alternative is to use a gate-level SVP as discussed above; but, once again, these representations have their own problems, including requiring the use of some form of compute-intensive and time-consuming synthesis and placement.

Another approach is to work with an RTL-based SVP [3], which allows engineers to quickly identify and address paths that will cause downstream timing problems. In order to wrap one’s brain around how this works, it’s first necessary to understand that there’s a related application that takes the logical and physical (LEF and DEF) definition files associated with an ASIC cell library and generates a corresponding design kit database to be used by the RTL-based SVP (Figure 10-4).

(LEF stands for “library exchange format,” where this file details the logical functionality of the cells in the library. Similarly, DEF stands for “design exchange format,” where this file details the physical aspects of the cells in the library, such as their resistance and capacitance values and their physical dimensions.)

Figure 10-4. Generating a design kit. (Diagram: the LEF and DEF files feed a design kit generator, which produces the design kit.)

[3] At the time of this writing, one of the chief proponents of RTL-based SVPs is InTime Software (www.intimesw.com).
It’s important to note that such a design kit is not a library of characterized gates, but is instead a database of characterized logical functions (such as counters, XOR trees, etc.). The design kit generator captures the behavior of these logical functions, including timing and area estimations.

The RTL-based SVP generator and analysis engine subsequently accepts the RTL code for the design, the timing constraints associated with the design block (in industry-standard SDF format), and the design kit associated with the target cell library.

As the SVP generator reads in the RTL, it converts it into a netlist of entities called work functions. Each work function is an abstraction that directly maps onto an equivalent function in the design kit. Once the RTL has been converted into a netlist of work functions, the SVP generator performs identical logical operations to those that are typically performed at the gate level, including common subexpression elimination, constant propagation, loop unraveling, the removal of all redundant functional computations, and so forth.

The SVP generator and analysis engine uses the resulting minimal irredundant network of work functions to perform a “virtual placement” of these functions. This placement is then used to generate accurate area estimates, which are subsequently used to generate accurate timing estimates. In conjunction with the design kit, the SVP generator and analysis engine understands how the various synthesis engines will weight various factors and modify their implementation strategies (such as swapping counter realizations) in order to meet the specified timing constraints. All of these factors are taken into account when performing the analysis.

Proponents of RTL-based SVPs claim a 40-fold speed increase as compared to generating a post-IPO, pre-place-and-route gate-level netlist using a physically aware synthesis approach.
In an example 4.5-million-gate design circa 2003, this equated to a 2.5-hour iteration to generate and analyze an RTL-based SVP as compared to 99 hours to generate and analyze a post-IPO gate-level netlist (99 / 2.5 ≈ 40).

Of course the big question is, just how accurate are RTL-based SVPs? The supporters of this form of SVP claim that its timing analysis results typically correlate to post-IPO delays with an error of 20 percent or less (worst-case errors may rise to 30 percent). Although this may sound pretty dire, the latest generation of synthesis tools is capable of closing timing on RTL that is within 20 to 30 percent of the desired timing (it’s the paths that are off by 80, 150, 200 percent, and higher that cause problems). Thus, RTL-based SVPs are accurate enough to allow design engineers to generate RTL code that can subsequently be fully implemented by the downstream synthesis and layout engines.

I know, I know. We’ve digressed again, although you have to admit that this is all interesting stuff! But now let’s return to pondering FPGAs.

1890: America. Census is performed using Herman Hollerith’s punched cards and automatic tabulating machines.
1890s: John Venn proposes logic representation using circles and ellipses.

FPGA-based SVPs

Not surprisingly, multimillion-gate FPGA designs are hitting the same problems that befell ASICs, including the fact that it takes ages to place, route, and perform timing analysis on the little rascals. One particularly painful aspect of this process is that, although the original RTL representation of the design is almost invariably hierarchical [4], the FPGA’s place-and-route tools typically end up working on flattened representations of the design. This means that even if you make the smallest of changes to a single block of RTL code and resynthesize only that block, you end up having to rerun place-and-route on the entire design.
In turn, this means that you can grow to be old and gray before you finally get to achieve timing closure on your design.

In order to address these problems, some EDA vendors have started to provide tools that support the concept of an FPGA SVP by providing a mixture of floor planning and pre-place-and-route timing analysis. This is coupled with the ability to perform place-and-route on individual design blocks, which dramatically speeds up the implementation process [5].

This form of SVP commences with a graphical top-down view of the target FPGA device showing all of the internal logical resources, such as LUTs, registers, slices, CLBs, embedded RAMs, multipliers, and so forth. Following the logic synthesis step (using the synthesizer of your choice), the SVP generator loads the ensuing hierarchical LUT/CLB-level netlist, along with any associated timing and physical constraints, and automatically creates an initial floor plan.

This auto-generated floor plan shows a collection of square and/or rectangular blocks, each of which corresponds to a top-level module in the design. Furthermore, if any of these top-level modules itself contains submodules, then these are shown as embedded blocks in the floor plan (and so on down through the hierarchy).

The SVP generator performs its own initial placement of the resources (LUTs, registers, RAMs, multipliers, etc.) used by each block. These resources are also shown in the top-down view of the device, along with graphical representations of the amount of routing resources required to link the various blocks together.

[4] By “hierarchical,” we mean that the top level of the design is typically formed from a number of functional modules, which may themselves call submodules and so forth.
[5] At the time of this writing, one of the chief proponents of FPGA SVPs, in the form described here, is Hier Design (www.hierdesign.com).
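One way to picture the hierarchical floor plan described above is as a tree of blocks, each carrying the resources its own module uses plus any embedded submodule blocks. The C++ sketch below is an assumption-laden illustration (the Resources and FloorPlanBlock types and the total_resources function are made up for this example and do not correspond to any vendor’s data model); it simply adds up resource counts down through the hierarchy.

```cpp
#include <string>
#include <vector>

// Hypothetical tally of the FPGA resources a block consumes.
struct Resources {
    int luts;
    int registers;
    int rams;
    int multipliers;
};

// Element-wise sum of two resource tallies.
Resources add(const Resources& x, const Resources& y) {
    return Resources{x.luts + y.luts, x.registers + y.registers,
                     x.rams + y.rams, x.multipliers + y.multipliers};
}

// One floor-plan block: a module plus its embedded submodule blocks.
struct FloorPlanBlock {
    std::string name;
    Resources own;                         // used directly by this module
    std::vector<FloorPlanBlock> children;  // submodules shown as embedded blocks
};

// Resources needed by a block including everything below it in the hierarchy.
Resources total_resources(const FloorPlanBlock& block) {
    Resources total = block.own;
    for (const FloorPlanBlock& child : block.children) {
        total = add(total, total_resources(child));
    }
    return total;
}
```

A top-level cpu block that uses 30 LUTs itself and embeds an alu block (120 LUTs) and a regfile block (40 LUTs) would therefore report 190 LUTs in total.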
Interactive manipulation

The initial placement of the design in the SVP allows it to provide accurate timing estimations on a block-by-block basis prior to running place-and-route. If any potential problem areas are detected, you can interactively modify the floor plan in order to address them.

The simplest form of manipulation is to reshape the rectangular blocks in the floor plan by pulling their sides to make them taller, thinner, shorter, or fatter. Alternatively, you can create more complex outlines such as “L,” “U,” and “T” shapes (pretty much any contour you can form out of squares and rectangles).

Next, you can move the blocks around. When you grab a block and start to drag it across the face of the device, the system will provide a graphical indication as to whether or not there are the necessary resources required to implement that block at its current location (you can only drop the block in an area where there are sufficient resources). Furthermore, as you manipulate a block by reshaping it or moving it around, the system dynamically displays the utilization of resources (LUTs, registers, RAMs, multipliers, etc.) inside that block relative to the total amount of each resource type currently encompassed by that block.

You can also split existing blocks into two or more subblocks, which you can then manipulate independently. Alternatively, you can merge two or more blocks into a single block. Also, in some cases (say, areas of control logic), you might wish to pull one or more subblocks out of their parent blocks and move them up to the top level of the design, at which point you can reshape them, merge them together, move them around, and so forth.

Much of this reflects a different philosophy from the way in which one might use an ASIC floor-planning tool. In the case of an ASIC, for example, if you have two blocks with lots of interconnect between them, you would typically place them side by side. By comparison, in the case of an FPGA, merging the blocks (thereby allowing the place-and-route tools to do a much better job of optimization using local versus global routing resources) might provide a more efficacious solution.

Furthermore, you aren’t limited to manipulating blocks only as described in the original RTL hierarchy. You can actually manipulate individual FPGA resources like LUTs, registers, slices, CLBs, and the like. This includes dragging them around and repositioning them within their current hierarchical block, dragging them from one hierarchical block to another, creating new blocks, and dragging groups of LUTs from one or more existing blocks into this new block, and so forth.

Where things start to get really clever is that, if you go back to make changes to your original RTL and resynthesize those modules, then when you reimport the resulting LUT/CLB-level netlist(s), the SVP generator sorts everything out for you and loads the right logic into the appropriate floor-plan blocks. (How do they do it? I don’t have a clue!)

1892: America. First automatic telephone switchboard comes into service.
1894: Germany. Heinrich Hertz discovers that radio waves travel at the speed of light and can be refracted and polarized.

Incremental place-and-route

As soon as you are ready to rock and roll, you can select one or more floor-plan blocks and kick off the FPGA vendor’s place-and-route software. Each block is treated as an individual entity, so once you’ve laid out a block, it will remain untouched unless you decide you want to change it.

This has a number of advantages. First of all, place-and-route run times for individual blocks are extremely small compared to the traditional times associated with full-up multimillion-gate designs. And even if you add up the place-and-route times for running all of the blocks individually, the total elapsed time is much less than it would be if one were performing place-and-route on the design in its entirety.
This is because the complexity (and associated run times) of place-and-route increases in a nonlinear (faster-than-linear) manner as the size of the block being processed increases. Furthermore, once you’ve run place-and-route on all of the blocks, you can make changes to individual blocks and rerun place-and-route only on those blocks without affecting the rest of the chip.

An additional advantage associated with this SVP approach is that it lends itself to creating and preserving IP. That is, once a block has undergone place-and-route, you can lock it down and export it as a new structural LUT/CLB-level netlist along with its associated physical and timing constraints. This block can subsequently be used in other designs (its placement is relative, which means that it can be dragged around the chip and relocated as discussed above).

RTL-based FPGA SVPs

In an ideal world, it would be nice to be able to work with RTL-based FPGA SVPs. The various FPGA and EDA vendors do provide RTL-level floor-planning tools with varying degrees of sophistication. At the time of this writing, however, there is no FPGA equivalent to the state-of-the-art in RTL-based ASIC SVP technology (but we will doubtless see such a beast in the not-so-distant future).

1894: Italy. Guglielmo Marconi invents wireless telegraphy.

Chapter 11
C/C++ etc.–Based Design Flows

Problems with traditional HDL-based flows

With regard to the traditional HDL-based flows introduced in chapter 9, a design commences with an original concept, whose high-level definition is determined by system architects and system designers. It is at this stage that macroarchitecture decisions are made, such as partitioning the design into hardware and software components (see also chapter 13).
The resulting specification is then handed over to the hardware design engineers, who commence their portion of the development process by performing microarchitecture definition tasks such as detailing control structures, bus structures, and primary data path elements. These microarchitecture definitions, which are often performed in brainstorming sessions on a whiteboard, may include performing certain operations in parallel versus sequentially, pipelining portions of the design versus not pipelining them, sharing common resources (for example, two operations sharing a single multiplier) versus using dedicated resources, and so forth.

Eventually, the design intent is captured by writing RTL VHDL/Verilog. Following verification via simulation, this RTL is then synthesized down to a structural netlist suitable for use by the target technology’s place-and-route applications (Figure 11-1).

(Note that this chapter focuses on C/C++ flows in the context of generic digital designs. Considerations such as quantization (commencing with floating-point representations which are subsequently coerced into their fixed-point counterparts) are covered in the discussions on DSP-centric designs in chapter 12.)

(In the case of an FPGA target, the LUT/CLB-level netlist may be presented in EDIF, VHDL, or Verilog depending on the FPGA vendor. With regard to physically aware synthesis-based flows, EDIF remains the “netlist of choice.” In this case, the placement information may be incorporated in the EDIF itself or presented in an external “constraints” side-file.)

At the time of this writing, these VHDL or Verilog-based flows account for around 95 percent of all ASIC and FPGA designs; however, there are a number of problems associated with these flows:
Figure 11-1. Traditional (simplified) HDL-based flows. (Diagram: for both the ASIC target and the FPGA target, the flow runs from the original concept through microarchitecture (uA) definition, capture RTL, simulate, and synthesize, ending in a gate-level netlist for the ASIC and a LUT/CLB-level netlist for the FPGA. Annotations: implementation-specific microarchitecture definition; implementation-specific RTL (time-consuming to create, slow to simulate, difficult to modify).)

■ Capturing the RTL is time-consuming: Even though Verilog and VHDL are intended to represent hardware, it is still time-consuming to use these languages to capture the functionality of a design.

■ Verifying RTL is time-consuming: Using simulation to verify large designs represented in RTL is computationally expensive and time-consuming.

■ Evaluating alternative implementations is difficult: Modifying and reverifying RTL to perform a series of what-if evaluations of alternative microarchitecture implementations is difficult and time-consuming. This means that the number of evaluations the design team can perform may be limited, which can result in a less-than-optimal implementation.

■ Accommodating specification changes is difficult: If any changes to the specification are made during the course of the project, folding these changes into the RTL and performing any necessary reverification can be painful and time-consuming. This is a significant consideration in certain application areas, such as wireless projects, because broadcast standards and protocols are constantly evolving and changing.

■ The RTL is implementation specific: Realizing a design in an FPGA typically requires a different RTL coding style from that used for an ASIC implementation (see also the discussions in Chapters 7, 9, and 18). This means that it can be extremely difficult to retarget a complex design represented in RTL from one implementation technology to another.
This is of concern when one is migrating an existing ASIC design into an FPGA equivalent or creating an FPGA design to be used as a prototype for a future ASIC implementation.

One way to view this is that all of the implementation intelligence associated with the design is hardcoded into the RTL, which therefore becomes implementation specific. It’s important to understand that this implementation specificity goes beyond the coarse ASIC-versus-FPGA boundary, which dictates that RTL intended for an FPGA implementation is not suitable for an optimal ASIC realization, and vice versa. Even assuming a single target device architecture, the way in which a set of algorithms is used to process data may require a number of different microarchitecture implementations, depending on the target application areas.

Actually, to be scrupulously fair, we should probably note that the same RTL may be used to drive both ASIC and FPGA implementations. The reason for doing this is to avoid the risk of introducing a functional bug into the RTL when retargeting the code, but there is typically a penalty to be paid. That is, if code originally targeted toward an FPGA implementation is subsequently used to drive an ASIC implementation, the resulting ASIC will typically require more silicon real estate and have higher power consumption as compared to using RTL created with an ASIC architecture in mind. Similarly, if code originally targeted toward an ASIC implementation is subsequently used to drive an FPGA implementation, the ensuing FPGA will typically take a significant performance hit as compared to using RTL created with an FPGA architecture in mind.

■ RTL is less than ideal for hardware-software codesign: System-on-chip (SoC) devices are generally understood to be those that include microprocessor cores. Irrespective of whether these designs are to be realized using ASICs or FPGAs, today’s SoCs are exhibiting an ever-increasing amount of software content. When coupled with increased design reuse on the hardware side, in many cases it is necessary to verify the software and hardware concurrently so as to completely validate such things as the system diagnostics, RTOS, device drivers, and embedded application software. Generally speaking, it can be painful verifying (simulating) the hardware represented in VHDL or Verilog in conjunction with the software represented in C/C++ or assembly language.

(RTOS is pronounced “R-tos.” That is, by spelling out the “R” followed by “TOS” to rhyme with “boss.” Real-time systems are those in which the correctness of a computation or action depends not only on how it is performed but also when it is performed.)

One approach that addresses the issues enumerated above is to perform the initial design capture at a higher level of abstraction than can be achieved with RTL VHDL/Verilog. The first such level is to use some form of C/C++, but as usual nothing is simple because there are a variety of alternatives, including SystemC, augmented C/C++, and pure C/C++.

1895: America. Dial telephones go into Milwaukee’s city hall.

C versus C++ and concurrent versus sequential

Before we leap into the fray, we should tie down a couple of points to ensure that we’re all marching in step to the same beat. First, there is a wide variety of programming languages available, but (excepting specialist application areas) the most commonly used by far are traditional C and its object-oriented offspring C++. For our purposes here, we will refer to these collectively as C/C++.

The next point of import is that, by default, statements in languages like C/C++ are executed sequentially.
For example, assuming that we have already declared three integer variables called a, b, and c, then the following statements

    a = 6;    /* Statement in C/C++ program */
    b = 2;    /* Statement in C/C++ program */
    c = 9;    /* Statement in C/C++ program */

would, perhaps not surprisingly, occur one after the other. However, this has certain implications; for example, if we now assume that the following statements occur sometime later in the program

    a = b;    /* Statement in C/C++ program */
    b = a;    /* Statement in C/C++ program */

then a (which initially contained 6) will be loaded with the value currently stored in b (which is 2). Next, b (which initially contained 2) will be loaded with the value currently stored in a (which is now 2), so both a and b will end up containing the same value.

The sequential nature of programming languages is the way in which software engineers think. However, hardware design engineers have quite a different view of the world. Let’s assume that a piece of hardware contains two registers called a and b that are driven by a common clock signal. Let’s further assume that these registers have previously been loaded with values of 6 and 2, respectively. Finally, let’s assume that at some point in the HDL code, we see the following statements:

    a = b;    /* Statement in VHDL/Verilog Code */
    b = a;    /* Statement in VHDL/Verilog Code */

As usual, the above syntax doesn’t actually represent VHDL or Verilog; it’s just a generic syntax used only for the purposes of this example. Generally speaking, hardware engineers would expect both of these statements to be executed concurrently (at the same time). This means that a (which initially contained 6) will be loaded with the value stored in b (which was 2) while, at the same time, b (which initially contained 2) will be loaded with the value stored in a (which was 6). The end result is that the initial contents of a and b will be exchanged.
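The difference between the two views can be mimicked in plain C++. In the sketch below (the Regs type and both function names are hypothetical), sequential_update performs the two assignments one after the other, while concurrent_update imitates a pair of clocked registers by first sampling the old values and only then updating both, which is roughly how a simulator might handle concurrent assignments.

```cpp
// Two values standing in for the variables (or registers) a and b.
struct Regs {
    int a;
    int b;
};

// Software view: the statements run one after the other, so the second
// assignment sees the value that the first one has just written.
Regs sequential_update(Regs r) {
    r.a = r.b;  // a takes b's value
    r.b = r.a;  // b reads the NEW a, so nothing is exchanged
    return r;
}

// Hardware view: both registers sample their inputs on the same clock edge
// and then update together, so each one sees the other's OLD value.
Regs concurrent_update(Regs r) {
    const Regs sampled = r;  // values present just before the clock edge
    r.a = sampled.b;
    r.b = sampled.a;
    return r;
}
```

Starting from a = 6 and b = 2, sequential_update leaves both values at 2, whereas concurrent_update returns a = 2 and b = 6.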
As usual, of course, the above is something of a simplification. However, it’s fair to say that HDL statements will execute concurrently by default, unless sequential behavior is forced by means of techniques like blocking assignments. Thus, by default, RTL-based logic simulators will execute the statements shown above in this concurrent manner; similarly, RTL-based logic synthesis tools will generate hardware that handles these two activities simultaneously. By comparison, unless explicitly directed to do otherwise (by means of the techniques introduced later in this chapter), C/C++ statements will execute sequentially.

1895: Germany. Wilhelm Conrad Roentgen discovers X-rays.

SystemC-based flows

(SystemC is “managed” by the Open SystemC Initiative (OSCI). This is an independent not-for-profit organization composed of companies, universities, and individuals dedicated to promoting SystemC as an open-source standard for system-level design. The code for SystemC, along with an integrated simulator and design environment, is available from www.systemc.org.)

What is SystemC (and where did it come from)?

Before we come to consider SystemC-based flows, it is probably a good idea to briefly summarize just what SystemC is, because there is typically some confusion on this point (not the least in the mind of the author).

SystemC 1.0

One of the underlying concepts behind SystemC is that it is an open-source environment to which everyone contributes. As an example, consider Linux, which was rough around the edges at first. Based on contributions from different folks, however, Linux eventually became a real operating system (OS) with the potential to challenge Microsoft. In this spirit, a relatively undocumented SystemC 1.0 was let loose to roam wild and free circa 2000.

SystemC 1.0 was a C++ class library that facilitated the representation of notions such as concurrency (things happening at the same time), timing, and I/O pins.
By means of this class library, engineers could capture designs at the RTL level of abstraction. One advantage of this early incarnation was that it facilitated hardware/software codesign environments. Another was that SystemC representations at the RTL level of abstraction might simulate 5 to 10 times faster than their VHDL and Verilog counterparts.¹ On the downside, it was harder and more time-consuming to capture an RTL-level design in SystemC 1.0 than with VHDL or Verilog. Furthermore, there was a scarcity of design tools that could synthesize SystemC 1.0 representations into netlist-level equivalents with any degree of sophistication.

¹ This is design-dependent; in reality, some SystemC RTL-level simulation run times are at parity with their HDL counterparts.

SystemC 2.0

Later, in 2002, SystemC 2.0 arrived on the scene. This augmented the 1.0 release with some high-level modeling constructs such as FIFOs (a form of memory that can accept and subsequently make available a series of words of data and that operates on a first-in first-out principle). The 2.0 release also included a variety of behavioral, algorithmic, and system-level modeling capabilities, such as the concepts of transactions and channels (which are used to describe the communication of data between blocks at an abstract level).

In order to gain a little perspective on all of this, let’s first consider a typical scenario of how things would have worked using the original SystemC 1.0. As a simple example, let’s assume that we have two functions called f(x) and g(x) that have to communicate with each other (Figure 11-2).

Figure 11-2. Interfacing in SystemC 1.0. (Two functions, f(x) and g(x), captured in high-level C/C++; the interface between the functions has to be defined as pins.)

In this case, the interface between the blocks would have to be defined at the pin level. The real problem with this approach occurs when you are in the early stages of a design, because you are already defining implementation details such as bus widths. This makes things difficult to change if you wish to experiment with different what-if architectural scenarios. This aspect of things became much easier with SystemC 2.0, which allowed abstract interfaces to be declared between the blocks (Figure 11-3).

Figure 11-3. Interfacing in SystemC 2.0. (Two functions, f(x) and g(x), captured in high-level C/C++ and connected through interfaces; the interface can be at the level of abstract records.)

Now, the interfacing between the blocks can be performed at the level of abstract records on the basis that, in the early stages of the design cycle, we don’t really care how data gets from point a to point b, just that it does get there somehow. These abstract interfaces facilitate performing architectural evaluation early in the design cycle. Once the architecture starts to firm up, you can start refining the interface by using high-level constructs such as a FIFO to which one would assign attributes like width and depth and characteristics like blocking write, nonblocking read, and how to behave when empty or full. Still later, this logical interface can be replaced by a completely specified (pin-level) interface that binds the functional blocks together at a more physical level.

Levels of abstraction

Truth to tell, this is where things start to become a little fuzzy around the edges, not the least because one runs into different definitions depending on to whom one is talking. As a first pass, however, we might take a stab at capturing the different levels of SystemC abstraction, as shown in Figure 11-4.
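Returning briefly to the interface refinement described earlier in this section, the idea of starting with an abstract interface and later refining it into a FIFO can be sketched in plain C++. This is not the SystemC class library: the Channel and Fifo classes, the try_write and try_read functions, and the bodies of f and g are all invented for illustration, and both operations are nonblocking purely to keep the sketch short.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical abstract interface: the blocks only know that words can be
// written and read, not how the data travels between them.
class Channel {
public:
    virtual ~Channel() = default;
    virtual bool try_write(int word) = 0;  // returns false if the channel is full
    virtual bool try_read(int& word) = 0;  // returns false if the channel is empty
};

// One possible refinement: a bounded FIFO with an explicit depth attribute.
class Fifo : public Channel {
public:
    explicit Fifo(std::size_t depth) : depth_(depth) {}

    bool try_write(int word) override {
        if (data_.size() >= depth_) return false;  // behavior when full
        data_.push_back(word);
        return true;
    }

    bool try_read(int& word) override {
        if (data_.empty()) return false;           // behavior when empty
        word = data_.front();
        data_.pop_front();
        return true;
    }

private:
    std::size_t depth_;
    std::deque<int> data_;
};

// Producer f and consumer g depend only on the abstract interface, so the
// channel implementation can be swapped without touching either function.
bool f(Channel& out, int x) { return out.try_write(x * 2); }

bool g(Channel& in, int& result) {
    int word = 0;
    if (!in.try_read(word)) return false;
    result = word + 1;
    return true;
}
```

Because f and g see only Channel, replacing Fifo with a different implementation (a deeper FIFO, or eventually a pin-level model) would not require either function to change, which is the property that makes early what-if experiments cheap.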
Figure 11-4. Levels of SystemC abstraction. (Levels shown, from untimed to timed: system, algorithmic, behavioral/transaction-level, and RTL. SystemC 1.0 is associated with the RTL level; SystemC 2.0 extends to the higher levels.)

This is why things become confusing, because SystemC can mean all things to all people. To some it’s a replacement for RTL VHDL/Verilog, while to others it’s a single language that can be used for system-level specification, algorithmic and architectural analysis, behavioral design, and testbenches for use in verification.

One area of confusion comes when you start to talk about behavioral synthesis. This encompasses certain aspects of both the algorithmic and transactional levels (in the latter case, however, you have to be careful as to how to define your transactions).

SystemC-based design-flow alternatives

This is a tricky one because one might go various ways here. For example, many of today’s designs begin life as complex algorithms. In this case, it is very common to start off by creating a C or C++ representation. This representation can be used to validate the algorithms by compiling it into a form that can be run (simulated) 1,000 or more times faster than an RTL equivalent.

In the case of the HDL-based flows discussed in Chapter 9, this C/C++ representation of the algorithms would then be hand-translated into RTL VHDL/Verilog. The C/C++ representation will typically continue to be used as a golden model, which means it can be linked into the RTL simulator and run in parallel with the RTL simulation. The results from the C/C++ and RTL models can be compared so as to ensure that they are functionally equivalent.
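The golden-model comparison can be sketched as follows. This is a simplified stand-in under stated assumptions: in a real flow the second model would be RTL running in a simulator, whereas here both models are ordinary C++ functions, and the function names and the multiply-by-ten algorithm are invented for the example.

```cpp
#include <cassert>
#include <cstdint>

// Golden model: the algorithm as it might first be written.
std::uint32_t golden_scale(std::uint32_t x) {
    return x * 10u;
}

// Stand-in for the refined, hardware-oriented version: the same function
// built from shifts and adds (10x = 8x + 2x).
std::uint32_t refined_scale(std::uint32_t x) {
    return (x << 3) + (x << 1);
}

// Drive both models with the same stimulus and count the mismatches.
// Note: 'last' must be less than the maximum 32-bit value, otherwise the
// loop counter would wrap around and the loop would never terminate.
int count_mismatches(std::uint32_t first, std::uint32_t last) {
    int mismatches = 0;
    for (std::uint32_t x = first; x <= last; ++x) {
        if (golden_scale(x) != refined_scale(x)) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

A call such as count_mismatches(0, 1000) applies identical inputs to both versions; a nonzero return value would indicate that the refined version has drifted away from the golden model.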
Alternatively, in one flavor of a SystemC-based flow, the original C/C++ model could be incrementally modified by adding timing, concurrency, pin definitions, and so forth to transform it to a level at which it would be amenable to SystemC-based RTL or behavioral synthesis.

In another flavor of a SystemC-based flow, the design might be initially captured in SystemC using system, algorithmic, or transaction-level constructs that could be used for verification at a high level of abstraction. This representation could then be incrementally modified to bring it down to a level at which it would be amenable to SystemC-based RTL or behavioral synthesis.

Irrespective of the actual route by which one might get there, let’s assume that we are in possession of a SystemC representation of a design that is suitable for SystemC-based behavioral or RTL synthesis. In this case, there are two main design-flow alternatives, which are (1) to translate the SystemC into RTL VHDL/Verilog automatically and then to use conventional RTL synthesis technology, or (2) to use SystemC-based synthesis to generate an implementation-level netlist directly.

There are two schools of thought here. One says that synthesizing the SystemC directly into the implementation-level netlist offers the cleanest, fastest, and most efficient route. However, another view is that it’s better to translate the SystemC into RTL VHDL/Verilog first because RTL is the way design engineers really visualize their world; that this level is a natural staging point for integrating design blocks (including third-party IP) originating from multiple sources; and that Verilog/VHDL synthesis technology is extremely mature and powerful (as compared to SystemC-based synthesis technology).

But we digress. Both of these flows can be applied to ASIC or FPGA targets (Figure 11-5).
Figure 11-5. Alternative SystemC flows. (For both ASIC and FPGA targets, implementation-specific SystemC code follows one of two paths: auto-RTL translation to Verilog/VHDL RTL followed by RTL synthesis, or direct SystemC synthesis. The result is a gate-level netlist for an ASIC target or a LUT/CLB-level netlist for an FPGA target.)

The first SystemC synthesis applications were predominantly geared toward ASIC flows, so they didn’t do a very good job at inferring FPGA-specific entities such as embedded RAMs, embedded multipliers, and so forth. More recent incarnations do a much better job of this, but the level of sophistication exhibited by different tools is a moving target, so the prospective user is strongly advised to perform some in-depth evaluations before slapping a bundle of cash onto the bargaining table.

Note that Figure 11-5 shows the use of implementation-specific SystemC to drive the ASIC versus FPGA flows. As soon as you start coding at the RTL level and adding timing concepts, be it in VHDL, Verilog, or SystemC, then achieving an optimal implementation requires that the code be written with a specific target architecture in mind.

Once again, having said this, the same SystemC can be used to drive both ASIC and FPGA flows, but there is typically a penalty to be paid. If SystemC code originally targeted toward an FPGA implementation is subsequently used to drive an ASIC flow, the resulting ASIC will typically require more silicon real estate and have higher power consumption as compared to using code created with an ASIC architecture in mind. Similarly, if code originally targeted toward an ASIC implementation is subsequently used to drive an FPGA flow, the ensuing FPGA will typically take a significant performance hit as compared to using code created with an FPGA architecture in mind.
This is primarily a result of hard-coding the microarchitecture definition in the source.

Love it or loathe it

Depending on whom you are talking to, folks either love SystemC or they loathe it. Most would agree that SystemC 2.0 is very promising and that there’s no other language that provides the same capabilities (some of these capabilities are being added into SystemVerilog, but not all of them).

On the downside, a lot of design engineers are reasonably proficient at writing C, but most of them are significantly less familiar with the object-oriented aspects of C++. So requiring them to use SystemC means giving them more power on the one hand, while thrusting them into a world they don’t like or understand on the other.

It’s also true that while SystemC can be very useful for verification and high-level system modeling, in some respects it’s still relatively immature toolwise with regard to actual implementation flows. One school of thought says that, although SystemC is difficult to write by hand and also difficult to synthesize, which makes it a somewhat clumsy specification language, it does provide a powerful framework for simulation across languages and levels of abstraction.

At the time of this writing, a number of companies that were strong supporters of SystemC in the United States have grown somewhat less vocal over the last few years. On the other hand, SystemC is gaining some ground in Europe and Asia. What does the future hold? Wait a few years, and I’ll be happy to tell you!

Augmented C/C++-based flows

What do we mean by augmented C/C++?

There are two ways in which standard C/C++ can be augmented to extend its capabilities and the things it can be used to represent. The first is to include special comments, known by some as commented directives or pragmas, into the pure C/C++ code.
These comments can subsequently be recognized and interpreted by parsers, precompilers, compilers, and other tools and used to add constructs to the code or modify the way in which it is processed.² One significant drawback to this approach is that simulation requires the use of proprietary C/C++ compilers as opposed to using standard off-the-shelf compilers. This limits the options customers have and is only viable if standards are developed for multiple EDA vendors to leverage.

² One example of this form of C/C++ augmentation is demonstrated by 0-In Design Automation (www.0in.com) for use with its assertion-based verification (ABV) technology. Another example of particular relevance here is Future Design Automation (www.future-da.com), which employs this technique with its C/C++ to RTL synthesis engine.

The other way in which C/C++ can be augmented is to add special keywords and statements into the language. This is a very popular technique, and there are a veritable plethora of such language variants roaming wild and free around the world, each tailored toward a different application area. One downside of this approach is that, once again, it requires proprietary C/C++ compilers; otherwise, tools such as simulators that have not been enhanced to understand these new keywords and statements will crash and burn. A common solution to this problem is to wrap standard #ifdef directives around the new keywords and statements such that a precompiler can be used to discard them as required (this is somewhat inelegant, but it works).
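A minimal sketch of the #ifdef wrapping technique is shown below. The macro name HYPOTHETICAL_HW_SYNTHESIS and the keywords parallel and sequential are assumptions that stand in for whatever a particular augmented language actually provides; no specific tool is being described.

```cpp
#include <cassert>

// HYPOTHETICAL_HW_SYNTHESIS is an invented macro name. A standard compiler
// never defines it, so the preprocessor discards the augmented keywords and
// only the plain C/C++ statements remain.
int compute() {
    int a, b, c, d;

#ifdef HYPOTHETICAL_HW_SYNTHESIS
    parallel;      // hypothetical augmented keyword, hidden from standard compilers
#endif
    a = 6;
    b = 2;
    c = 9;

#ifdef HYPOTHETICAL_HW_SYNTHESIS
    sequential;    // hypothetical augmented keyword, hidden from standard compilers
#endif
    d = a + b;

    return d + c;  // 8 + 9
}
```

When compiled with a standard compiler, compute() behaves as ordinary sequential code and returns 17; a tool that understood the extra keywords would be invoked with the macro defined so that it could see them.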
In the case of capturing the functionality of hardware for ASIC and FPGA designs, it is necessary to augment standard C/C++ with special statements to support such concepts as clocks, pins, concurrency, synchronization, and resource sharing.³

³ A big player in this form of C/C++ augmentation for ASIC and FPGA design capture, simulation, and synthesis is Celoxica (www.celoxica.com) with its Handel-C language.

Assuming that you have an initial model represented in pure C/C++, the first step would be to augment it with clock statements, along with interface statements used to define the input and output pins. You could then use an appropriate synthesis tool to generate an implementation (as discussed below). However, because C/C++ is by nature sequential, the resulting hardware can be horribly slow and inefficient if the synthesis tool is not capable of locating potential parallelisms and exploiting them. For example, assume that we have the following statements in a C/C++ representation of the design:

    a = 6;        /* Standard C/C++ statement */
    b = 2;        /* Standard C/C++ statement */
    c = 9;        /* Standard C/C++ statement */
    d = a + b;    /* Standard C/C++ statement */
    : etc

By default, each = sign is assumed by the synthesis application to represent one clock cycle. Thus, if the above code were left as is, the augmented C/C++ synthesis tool would generate hardware that loaded variable (register) a with 6 on the first clock, then b with 2 on the next clock, then c with 9 on the next clock, and so forth. Thus, by hardware standards, this would run horribly slowly.

Of course, most synthesis tools would be capable of locating and exploiting the potential parallelisms in the above example, but they might well miss more complex cases that require human consideration and intervention. For the purposes of these discussions, however, we shall continue to work with this simple test case.
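To put rough numbers on this test case, the following sketch compares the one-assignment-per-clock rule with a simple as-soon-as-possible schedule based on data dependencies. The Assign structure and the function names are invented for this illustration, and the scheduling rule is a textbook simplification rather than the algorithm used by any particular synthesis tool.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One assignment; 'deps' lists the indices of earlier assignments whose
// results this assignment reads.
struct Assign {
    std::vector<std::size_t> deps;
};

// Naive rule: every assignment costs one clock cycle.
std::size_t cycles_one_per_clock(const std::vector<Assign>& prog) {
    return prog.size();
}

// As-soon-as-possible schedule: an assignment runs one cycle after the
// latest assignment it depends on; independent assignments share a cycle.
std::size_t cycles_asap(const std::vector<Assign>& prog) {
    std::vector<std::size_t> cycle(prog.size(), 0);
    std::size_t total = 0;
    for (std::size_t i = 0; i < prog.size(); ++i) {
        std::size_t start = 1;
        for (std::size_t d : prog[i].deps) {
            start = std::max(start, cycle[d] + 1);
        }
        cycle[i] = start;
        total = std::max(total, start);
    }
    return total;
}

// The test case from the text: a = 6; b = 2; c = 9; d = a + b;
std::vector<Assign> example_program() {
    std::vector<Assign> prog(4);  // a, b, and c have no dependencies
    prog[3].deps = {0, 1};        // d reads a (index 0) and b (index 1)
    return prog;
}
```

For the four statements above, the naive rule gives four clock cycles, whereas the dependency-based schedule needs only two: a, b, and c are loaded together, and d follows one cycle later.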
The point is that an augmented C/C++ language will have keywords like “parallel” (or “par”) and “sequential” (or “seq”) that will instruct the downstream synthesis application as to which statements should be executed in parallel, and so forth. For example:

    parallel;      /* Augmented C/C++ statement */
    a = 6;         /* Standard C/C++ statement */
    b = 2;         /* Standard C/C++ statement */
    c = 9;         /* Standard C/C++ statement */
    sequential;    /* Augmented C/C++ statement */
    d = a + b;     /* Standard C/C++ statement */
    : etc

In this case, the parallel statement instructs the synthesis tool that the following statements can be implemented concurrently, while the sequential statement implies that the preceding operations must occur prior to any subsequent actions taking place.

Of course, these parallel and sequential statements can be nested as required. Things become more complex in the case of loops, depending on whether the designer wishes to unravel them partially or fully. Just to give a point of reference, we might visualize a loop as being something like “for i = 1 to 10 in increments of 1 do xxxx, yyyy, and zzzz”. In some cases, it may be possible to simply associate a parallel or sequential statement with the loop, but if more subtlety is required, the designer may be obliged to completely rewrite these constructs.

It may also be necessary to add “share” statements if resource sharing is required, and “channel” statements to share signals between expressions, and the list goes on.

As was previously noted, tools such as simulators that have not been enhanced to understand these new keywords and statements will “crash-and-burn” when presented with this representation. One solution is to “wrap” standard “#ifdef” directives around the new keywords and statements such that a precompiler can be used to discard them as required. However, this means that the simulator and synthesis engines will be working on different views of the design, which is typically not a good idea.
The other solution is to use a proprietary simulator, but this may not have the power, capacity, or capabilities of your existing simulation technology.

Augmented C/C++ design-flow alternatives

As usual, one might go various ways here. As we previously discussed, in the case of a design that begins life as a suite of algorithms, it is very common to start off by creating a C or C++ representation. Following verification, this C/C++ model can be incrementally modified by adding statements for clocks, pins, concurrency, synchronization, and resource sharing so as to make the model suitable for the appropriate synthesis utility.

Alternatively, the design might be captured using the augmented C/C++ language from the get-go. Irrespective of the actual route we might take to get there, let’s assume that we are in possession of an augmented C/C++ representation of a design that is suitable for synthesis. Once again, there are two main design-flow alternatives, which are (1) to translate the augmented C/C++ into Verilog or VHDL at the RTL level of abstraction automatically and to then use conventional RTL synthesis technology, or (2) to use an appropriate augmented C/C++ synthesis engine.

And, once again, one school of thought says that synthesizing the augmented C/C++ directly into the implementation-level netlist offers the cleanest, fastest, and most efficient route. Others say that the RTL Verilog/VHDL level is the natural staging post for design integration and that today’s RTL synthesis technology is extremely mature and powerful. Both of these flows can be applied to ASIC or FPGA targets (Figure 11-6).

The first augmented C/C++ synthesis applications were predominantly geared toward ASIC flows. This meant that these early incarnations didn’t do a tremendous job when it came to inferring FPGA-specific entities such as embedded RAMs, embedded multipliers, and so forth.
More recent versions of these tools do a much better job at this, but, as usual, the prospective user is strongly advised to perform some in-depth evaluations before handing over any hard-earned cash.

Figure 11-6. Alternative augmented C/C++ flows. (For both ASIC and FPGA targets, implementation-specific augmented C/C++ code follows one of two paths: auto-RTL translation to Verilog/VHDL RTL followed by RTL synthesis, or direct augmented C/C++ synthesis. The result is a gate-level netlist for an ASIC target or a LUT/CLB-level netlist for an FPGA target.)

Note that Figure 11-6 shows the use of implementation-specific code to drive the ASIC versus FPGA flows because achieving an optimal implementation requires that the code be written with a specific target architecture in mind. In reality, the same code can be used to drive both ASIC and FPGA flows, but there is usually a penalty to be paid (see the discussions on SystemC for more details).

Pure C/C++-based flows

Last, but not least, we come to pure C/C++-based flows.⁴ In reality, the term pure C/C++ actually refers to industry-standard C/C++ that is minimally augmented with SystemC data types to allow specific bit widths to be associated with variables and constants.

⁴ At the time of this writing, perhaps the best example of a pure C/C++-based flow is provided by Precision C Synthesis from Mentor (www.mentor.com). Also of interest is the SPARK C-to-VHDL synthesis tool developed at the Center for Embedded Computer Systems, University of California, San Diego and Irvine (www.cecs.uci.edu/~spark).

Although relatively new, pure C/C++-based flows offer a number of advantages as compared to other C-based flows and traditional Verilog-/VHDL-based flows:

■ Creating pure C/C++ is fast and efficient: Pure untimed C/C++ representations are more compact and easier to create and understand than equivalent SystemC and augmented C/C++ representations (and they are much more compact than their RTL equivalents, requiring perhaps 1/10th to 1/100th of the code).

■ Verifying C/C++ is fast and efficient: A pure untimed C/C++ representation will simulate significantly faster than a timed SystemC or augmented C/C++ model and 100 to 10,000 times faster than an equivalent RTL representation. In fact, pure C/C++ models are already widely created and used by system designers for algorithm and system validation.

■ Evaluating alternative implementations is fast and efficient: Modifying and reverifying pure untimed C/C++ to perform a series of what-if evaluations of alternative microarchitecture implementations is fast and efficient. This facilitates the design team’s ability to arrive at fundamentally superior microarchitecture solutions. In turn, this can result in significantly smaller and faster designs as compared to flows based on traditional hand-coded RTL methods.

■ Accommodating specification changes is relatively easy: If any changes to the specification are made during the course of the project, it’s relatively easy to implement and evaluate these changes in a pure untimed C/C++ representation, thereby allowing the changes to be folded into the resulting implementation.

Furthermore, as noted earlier in this chapter, one of the most significant problems associated with existing SystemC and augmented C/C++-based design flows is that the implementation intelligence associated with the design has to be hard-coded into the model, which therefore becomes implementation specific.
A key aspect associated with a pure untimed C/C++-based design flow is that the code presented to the synthesis engine is just what someone would write if he or she didn’t have any preconceived hardware implementation or target device architecture in mind. This means that the C/C++ code that system designers write today is an ideal input to this form of synthesis.

The only modification typically required to use a pure C/C++ model with the synthesis engine is to add a single special comment to the source code to indicate the top of the functional portion of the design (anything conceptually above this point is considered to form part of the testbench).

As opposed to adding intelligence to the source code (thereby locking it into a target implementation), all of the intelligence is provided by the user controlling and guiding the synthesis engine itself (Figure 11-7).

Figure 11-7. A pure untimed C/C++-based design flow. (Pure C/C++ that is non-implementation-specific, easy to create, fast to simulate, and easy to modify is processed by pure C/C++ synthesis under user interaction and guidance. This yields auto-generated, implementation-specific Verilog/VHDL RTL, which RTL synthesis converts into a gate-level netlist for an ASIC target or a LUT/CLB-level netlist for an FPGA target.)

Once the synthesis engine has parsed the source code, the user can use it to perform microarchitecture trade-offs and evaluate their effects in terms of size and speed. The synthesis engine analyzes the code, identifies its various constructs and operators, along with their associated data and memory dependencies, and automatically provides for parallelism wherever possible.

The engine also provides a graphical interface that allows the user to specify how different elements should be handled. For example, the interface allows the user to associate ports with registers or RAM blocks; it identifies constructs like loops and allows the user to specify on an individual basis whether they should be fully unraveled, partially unraveled, or left alone; it allows the user to specify whether or not loops and other constructs should be pipelined; it allows the user to perform resource sharing on specific entities; and so forth.

These evaluations are performed on the fly, and the synthesis engine reports total size/area and latency in terms of clock cycles and I/O delays (or throughput time/cycles in the case of pipelined designs). The user-defined configuration associated with each what-if scenario can be named, saved, and reused as required (it would be almost impossible to perform these trade-offs in a timely manner using a conventional hand-coded RTL-based flow).

The fact that the pure untimed C/C++ source code used by the synthesis engine is not required to contain any implementation intelligence and that all such intelligence is supplied by controlling the engine itself means that the same source code can be easily retargeted to alternative microarchitectures and different implementation technologies.

Once the user’s evaluations are completed, clicking the “Go” button causes the synthesis engine to generate corresponding RTL VHDL. This code can subsequently be used by conventional logic synthesis or physically aware synthesis applications to generate the netlist used to drive the downstream implementation (place-and-route, etc.) tools.

As usual, it would be possible to synthesize the pure untimed C/C++ directly into a gate-level netlist (this alternative is not shown in Figure 11-7). However, generating the intermediate RTL provides a comfort zone for the engineers by allowing them to check that they are satisfied with the implementation decisions that have been made during the course of the C/C++ to RTL translation.
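The loop-unraveling trade-off mentioned above can be illustrated with a small sketch. The two summation functions show the same computation in rolled and fully unraveled form. The estimate function is a toy cost model invented for this example, not the report produced by any real synthesis engine: it assumes that the iterations are independent, that each takes one cycle, and that unraveling by a given factor needs that many adders.

```cpp
#include <cassert>

// Rolled form: what a designer would naturally write.
int sum_rolled(const int (&x)[4]) {
    int acc = 0;
    for (int i = 0; i < 4; ++i) {
        acc += x[i];
    }
    return acc;
}

// Fully unraveled form: same result, but every addition is visible at once,
// so the two inner additions could be placed side by side in hardware.
int sum_unrolled(const int (&x)[4]) {
    return (x[0] + x[1]) + (x[2] + x[3]);
}

// Toy cost model (an assumption for illustration only): unraveling by
// 'factor' uses 'factor' adders and needs ceil(iterations / factor) cycles.
struct Cost {
    int adders;
    int cycles;
};

Cost estimate(int iterations, int factor) {
    return Cost{factor, (iterations + factor - 1) / factor};
}
```

Under this toy model, a ten-iteration loop left alone costs one adder and ten cycles, a factor of two costs two adders and five cycles, and full unraveling costs ten adders and one cycle, which is the kind of size-versus-speed comparison a what-if evaluation is meant to expose.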
Furthermore, generating intermediate RTL is useful because this is the level of abstraction where hardware design engineers generally stitch together the various functional blocks forming their designs. Large portions of today’s designs are typically presented in the form of IP blocks represented in RTL. This means that the intermediate RTL step shown in Figure 11-7 is a useful point in the design flow for integrating and verifying the entire hardware system. The design engineers can then take full advantage of their existing RTL synthesis technology, which is mature, robust, and well understood.

Different levels of synthesis abstraction

The fundamental difference between the various C/C++-based flows presented in this chapter is the level of synthesis abstraction each can support. For example, although SystemC offers significant system-level, algorithmic, and transaction-level modeling capabilities, its synthesizable subset is at a relatively low level of abstraction. Similarly, although augmented C/C++ representations are closer to pure C/C++ than are their SystemC counterparts, which means that they simulate much more quickly, their synthesizable subset remains at a significantly lower level of abstraction than would be ideal.

This lack of synthesis abstraction causes the timed SystemC and augmented C/C++ representations to be implementation specific. In turn, this makes them difficult to create and modify and significantly reduces their flexibility with regard to performing what-if evaluations and retargeting them toward alternative implementation technologies (Figure 11-8).

By comparison, the latest generation of pure untimed C/C++ synthesis technology supports a high level of synthesis abstraction. Non-implementation-specific C/C++ models are very compact and can be quickly and easily created and modified.
By means of the synthesis engine itself, the user can quickly and easily perform what-if evaluations and retarget the design toward alternative implementation technologies. The end result is that a pure C/C++-based design flow can dramatically speed implementation and increase design flexibility as compared to other C/C++-based flows.

Figure 11-8. Different levels of C/C++ synthesis abstraction. (From more abstract and less implementation-specific to less abstract and more implementation-specific: pure C/C++ in the untimed C domain (non-implementation-specific); SystemC and augmented C/C++ in the timed C domain (implementation-specific); Verilog and VHDL in the RTL domain (implementation-specific).)

Before anyone starts to pen irate letters claiming the author is anti-SystemC, it should be reiterated that the discussions presented here are focused on the use of the various flavors of C/C++ in the context of FPGA implementation flows. In this case, the tool-chain used to progress SystemC representations through to actual implementations is relatively immature and unsophisticated. When it comes to system-level modeling and verification applications, however, SystemC can be extremely efficacious (many observers see SystemC and SystemVerilog being used in conjunction with each other, with SystemC being employed for the initial system-level design representation, and then SystemVerilog being used to “flesh out” the implementation-level details). Similarly, if one is coming from a software background and is working on embedded software applications and hardware/software co-design and coverification, then SystemC is considered by many to be “the bee’s knees” as it were.

One point that we haven’t really considered is that, when you create a representation of your design in one of the flavors of C/C++ discussed here, you often create a testbench in the same language.
Such a testbench typically employs language constructs that aren’t understood by any of the downstream tools like C/C++ to RTL generators. So in the past, you typically had to hand-translate the testbench from your C/C++ representation into an RTL equivalent for use with your VHDL/Verilog simulator.

Mixed-language design and verification environments

Last, but not least, we should note that a number of EDA companies can provide mixed-level design and verification environments that can support the cosimulation of models specified at multiple levels of abstraction.

In some cases, this may simply involve linking a C/C++ model to a Verilog simulator via its programming language interface (PLI) or to a VHDL simulator via its foreign language interface (FLI). Alternatively, one might find a SystemC environment with the ability to accept blocks represented in Verilog or VHDL.

And then there are very sophisticated environments that start off with a graphical block-based editor showing the design’s major functional units, where the contents of each block can be represented using the following:

■ VHDL
■ Verilog
■ SystemVerilog
■ SystemC
■ Handel-C
■ Pure C/C++

The top-level design might be in a traditional HDL that calls submodules in the various HDLs and in one or more flavors of C/C++. Alternatively, the top-level design might be in one of the flavors of C/C++ that calls submodules in the various languages.

In this type of environment, the VHDL, Verilog, and SystemVerilog representations are usually handled by a single-kernel simulation engine. This engine is then cosimulated with appropriate engines for the various flavors of C/C++.
Furthermore, this type of environment will incorporate source-code debuggers that support the various flavors of C/C++; it will allow testbenches to be created using any of the languages; and supporting tools like graphical waveform displays will be capable of displaying signals and variables associated with any of the language blocks.[5]

In reality, the various mixed-language design and verification environment solution combinations and permutations change on an almost weekly basis, so you need to take a good look at what’s out there before you leap into the fray.

One advantage of a mixed-language design and verification environment is that you can continue to use your original C/C++ testbench to drive the downstream version of your design in VHDL/Verilog at the RTL and gate levels of abstraction. (You may need to “tweak” a few things, but that’s a lot better than rewriting everything from the ground up.)

[5] A good example of a mixed-language simulation and verification environment of this type that is focused on FPGA (and, to a lesser extent, ASIC) designs is offered by Aldec Inc. (www.aldec.com). Another good example is ModelSim® from Mentor; this includes native SystemC support, thereby allowing single-kernel simulation between VHDL, Verilog, and SystemC.

Chapter 12

DSP-Based Design Flows

Introducing DSP

The term digital signal processing, or DSP, refers to the branch of electronics concerned with the representation and manipulation of signals in digital form. This form of processing includes compression, decompression, modulation, error correction, filtering, and otherwise manipulating audio (voice, music, etc.), video, image, and similar data for such applications as telecommunications, radar, and image processing (including medical imaging).

In many cases, the data to be processed starts out as a signal in the real (analog) world.
This analog signal is periodically sampled, with each sample being converted into a digital equivalent by means of an analog-to-digital (A/D) converter (Figure 12-1). These samples are then processed in the digital domain. In many cases, the processed digital samples are subsequently converted into an analog equivalent by means of a digital-to-analog (D/A) converter.

Figure 12-1. What is DSP? (Figure labels: analog input signal; A/D; digital input samples; DSP; modified output samples; D/A; analog output signal. The A/D and D/A converters mark the boundaries between the analog domain and the digital domain.)

DSP is pronounced by spelling it out as “D-S-P.”

Analog is spelled “analogue” in England (and it is also pronounced with a really cool accent over there).

Analog-to-digital (A/D) converters may also be referred to as ADCs. Digital-to-analog (D/A) converters may also be referred to as DACs.

The term CODEC is often bandied around by folks working in the DSP arena. This sometimes stands for COmpressor/DECompressor; that is, something that compresses and decompresses data. In telecommunications, however, it typically stands for COder/DECoder; that is, something that encodes and decodes a signal. CODECs can be implemented in software, hardware, or as a mixture of both.

DSP occurs all over the place: in cell phones and telephone systems; CD, DVD, and MP3 players; cable desktop boxes; wireless and medical equipment; electronic vision systems; … the list goes on. This means that the overall DSP market is huge; in fact, some estimates put it at $10 billion in 2003!

Alternative DSP implementations

Pick a device, any device, but don’t let me see which one

As usual, nothing is simple because DSP tasks can be implemented in a number of different ways:

■ A general-purpose microprocessor (µP): This may also be referred to as a central processing unit (CPU) or a microprocessor unit (MPU). The processor can perform DSP by running an appropriate DSP algorithm.
■ A digital signal processor (DSP): This is a special form of microprocessor chip (or core, as discussed below) that has been designed to perform DSP tasks much faster and more efficiently than can be achieved by means of a general-purpose microprocessor.
■ Dedicated ASIC hardware: For the purposes of these discussions, we will assume that this refers to a custom hardware implementation that executes the DSP task. However, we should also note that the DSP task could be implemented in software by including a microprocessor or DSP core on the ASIC.
■ Dedicated FPGA hardware: For the purposes of these discussions, we will assume that this refers to a custom hardware implementation that executes the DSP task. Once again, however, we should also note that the DSP functionality could be implemented in software by means of an embedded microprocessor core on the FPGA (at the time of this writing, dedicated DSP hard cores do not exist for FPGAs).

System-level evaluation and algorithmic verification

Irrespective of the final implementation technology (µP, DSP, ASIC, FPGA), if one is creating a product that is to be based on a new DSP algorithm, it is common practice to first perform system-level evaluation and algorithmic verification using an appropriate environment (we consider this in more detail later in this chapter).

Although this book attempts to avoid focusing on companies and products as far as possible, it would be rather coy of us not to mention that, at the time of this writing, the de facto industry standard for DSP algorithmic verification is MATLAB®[1] from The MathWorks (www.mathworks.com).[2] For the purposes of these discussions, therefore, we shall refer to MATLAB as necessary. However, it should be noted that there are a number of other very powerful tools and environments available to DSP developers.
For example, Simulink® from The MathWorks has a certain following; the Signal Processing Worksystem (SPW) environment from CoWare[3] (www.coware.com) is very popular, especially in telecom markets; and tools from Elanix (www.elanix.com) also find favor with many designers.

[1] MATLAB and Simulink are registered trademarks of The MathWorks Inc.

[2] It should be noted that MATLAB and Simulink can be used for a wide range of tasks, including control system design and analysis, image processing, financial modeling, and so forth.

[3] EDA is a fast-moving beast. For example, SPW came under the auspices of Cadence when I first started penning this chapter, but it fell under the purview of CoWare by the time I was half-way through!

1904: Telephone answering machine is invented.

1906: America. First radio program of voice and music is broadcast.

Software running on a DSP core

Let’s assume that our new DSP algorithm is to be implemented using a microprocessor or DSP chip (or core). In this case, the flow might be as shown in Figure 12-2.

Figure 12-2. A simple design flow for a software DSP realization. (Figure labels: Original Concept; Algorithmic Verification; Auto C/C++ Generation; Handcrafted C/C++; Handcrafted Assembly; Compile/Assemble; Machine Code.)

The process commences with someone having an idea for a new algorithm or suite of algorithms. This new concept typically undergoes verification using tools such as MATLAB as discussed above. In some cases, one might leap directly from the concept into handcrafting C/C++ (or assembly language).

Once the algorithms have been verified, they have to be regenerated in C/C++ or in assembly language. MATLAB can be used to generate C/C++ tuned for the target DSP core automatically, but in some cases, design teams may prefer to perform this translation step by hand because they feel that they can achieve a more optimal representation this way.
As yet another alternative, one might first auto-generate C/C++ code from the algorithmic verification environment, analyze and profile this code to determine any performance bottlenecks, and then recode the most critical portions by hand. (This is a good example of the old 80:20 rule, in which you spend 80 percent of your time working on the most critical 20 percent of the design.)

Once you have your C/C++ (or assembly language) representation, you compile it (or assemble it) into the machine code that will ultimately be executed by the microprocessor or DSP core.

This type of implementation is very flexible because any desired changes can be addressed relatively quickly and easily by simply modifying and recompiling the source code. However, this also results in the slowest performance for the DSP algorithm because microprocessor and DSP chips are both classed as Turing machines. This means that their primary role in life is to process instructions, so both of these devices operate as follows:

■ Fetch an instruction.
■ Decode the instruction.
■ Fetch a piece of data.
■ Perform an operation on the data.
■ Store the result somewhere.
■ :
■ Fetch another instruction and start all over again.

Of course, the DSP algorithm actually runs on hardware in the form of the microprocessor or DSP, but we consider this to be a software implementation because the actual (physical) manifestation of the algorithm is the program that is executed on the chip.
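To make the cost of this instruction-by-instruction style concrete, the following is a minimal sketch of a toy sequential machine that evaluates a small sum of products. The opcode names (LOAD, MUL, ADD), the register layout, and the memory contents are all hypothetical and invented for illustration; they do not correspond to any real microprocessor or DSP instruction set.

```python
# Toy model of a sequential processor. Everything here (opcodes, registers,
# memory contents) is a hypothetical illustration, not a real instruction set.

def run(program, memory):
    """Execute (opcode, operand) pairs one at a time.

    Returns the accumulated result and the number of instructions that
    had to be fetched, decoded, and executed to produce it.
    """
    acc = 0        # running total
    product = 0    # most recent multiplication result
    operand = 0    # most recent value loaded from memory
    steps = 0
    pc = 0         # program counter
    while pc < len(program):
        opcode, arg = program[pc]          # fetch and decode the instruction
        if opcode == "LOAD":               # fetch a piece of data
            operand = memory[arg]
        elif opcode == "MUL":              # perform an operation on the data
            product = operand * memory[arg]
        elif opcode == "ADD":              # store the result somewhere
            acc += product
        else:
            raise ValueError(f"unknown opcode {opcode!r}")
        steps += 1
        pc += 1                            # fetch another instruction
    return acc, steps


def build_program(pairs):
    """Emit LOAD/MUL/ADD for every pair of names to be multiplied."""
    program = []
    for left, right in pairs:
        program += [("LOAD", left), ("MUL", right), ("ADD", None)]
    return program


memory = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7, "H": 8}
program = build_program([("A", "B"), ("C", "D"), ("E", "F"), ("G", "H")])
y, steps = run(program, memory)
print(y, steps)
```

Even in this simplified model, four multiplications and their additions cost twelve separately fetched instructions, all executed one after another; a real processor would also spend cycles on addressing, loop control, and memory access.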
In 1937, while still a graduate student, the eccentric English genius Alan Turing wrote his ground-breaking paper “On Computable Numbers, with an Application to the Entscheidungsproblem.” Since Turing did not have access to a real computer (not unreasonably as they did not exist at the time), he invented his own as an abstract “paper exercise.” This theoretical model, which became known as a Turing machine, subsequently inspired many “thought experiments.”

1906: Dunwoody and Pickard build a crystal-and-cat-whisker radio.

Dedicated DSP hardware

There are myriad ways in which one might implement a DSP algorithm in an ASIC or FPGA (the latter option being the focus of this chapter, of course). But before we hurl ourselves into the mire, let’s first consider how different architectures can affect the speed and area (in terms of silicon real estate) of the implementation.

DSP algorithms typically require huge numbers of multiplications and additions. As a really simple example, let’s assume that we have a new DSP algorithm that contains an expression something like the following:

Y = (A * B) + (C * D) + (E * F) + (G * H);

As usual, this is a generic syntax that does not favor any particular HDL and is used only for the purposes of these discussions.

Of course, this would be a minuscule element in a horrendously complex algorithm. But, at the end of the day, DSP algorithms tend to contain a lot of this type of thing.

For the nontechnical reader, each of the variable names (A, B, C, etc.) in this equation is assumed to represent a bus (group) of binary signals. Also, when you multiply two binary values of the same width together, the result is twice the width (so if A and B are each 16 bits wide, the result of multiplying them together will be 32 bits wide).
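The width-doubling rule can be checked with a few lines of arithmetic. The sketch below assumes unsigned 16-bit operands (an assumption; the text does not say whether the buses are signed) and also looks at the sum of four such products, which is a step beyond what the text states: adding values together can require a few extra bits on top of the product width.

```python
# Worst-case bit widths for Y = (A*B) + (C*D) + (E*F) + (G*H),
# assuming every operand is an unsigned 16-bit value (an assumption).

def max_unsigned(width):
    """Largest value that fits in an unsigned bus of the given width."""
    return (1 << width) - 1


operand_width = 16
largest_operand = max_unsigned(operand_width)

# Multiplying two 16-bit values: the largest possible product.
largest_product = largest_operand * largest_operand
product_bits = largest_product.bit_length()

# Adding four such products: the largest possible total.
largest_total = 4 * largest_product
total_bits = largest_total.bit_length()

print(product_bits, total_bits)
```

Under these assumptions the product needs 32 bits, as the text says, and the four-term total needs 34 bits if no overflow is to be tolerated.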
The point is that we can exploit the parallelism inherent in hardware to perform DSP functions much more quickly than can be achieved by means of software running on a DSP core. For example, suppose that all of the multiplications were performed in parallel (simultaneously) followed by two stages of additions (Figure 12-3).

Figure 12-3. A parallel implementation of the function. (Figure labels: inputs A to H; four multipliers forming A x B, C x D, E x F, and G x H; adders combining the products into Y; speed and area indicators.)

Remembering that multipliers are relatively large and complex and that adders are sort of large, this implementation will be very fast, but will consume a correspondingly large amount of chip resources.

As an alternative, we might employ resource sharing (sharing some of the multipliers and adders between multiple operations) and opt for a solution that is a mixture of parallel and serial (Figure 12-4).

Figure 12-4. An in-between implementation of the function. (Figure labels: 2:1 muxes fed by inputs A to H and controlled by sel; two multipliers; adders; a register with D and Q ports driven by clock; output Y; speed and area indicators.)

This solution requires the addition of four 2:1 multiplexers and a register (remember that each of these will be the same multibit width as their respective signal paths). However, multiplexers and registers consume much less area than the two multipliers and adder that are no longer required as compared to our initial solution.

On the downside, this approach is slower, because we must first perform the (A * B) and (C * D) multiplications, add the results together, add this total to the existing contents of the register (which will have been initialized to contain zero), and store the result in the register. Next, we must perform the (E * F) and (G * H) multiplications, add these results together, add this total to the existing contents of the register (which currently contains the results from the first set of multiplications and additions), and store this result in the register.

As yet another alternative, we might decide to use a fully serial solution (Figure 12-5).
This latter implementation is very efficient in terms of area because it requires only a single multiplier and a single adder. This is the slowest implementation, however, because we must first perform the (A * B) multiplication, add the result to the existing contents of the register (which will have been initialized to contain zero), and store the total in the register. Next, we must perform the (C * D) multiplication, add this result to the existing contents of the register, and store this new total in the register. And so forth for the remaining multiplication operations. (Note that when we say “this is the slowest implementation,” we are referring to these hardware solutions, but even the slowest hardware implementation remains much, much faster than a software equivalent running on a microprocessor or DSP.)

Figure 12-5. A serial implementation of the function. (Figure labels: 4:1 muxes fed by inputs A, C, E, G and B, D, F, H and controlled by sel; one multiplier; one adder; a register with D and Q ports driven by clock; output Y; speed and area indicators.)

The process of trading off different datapath and control implementations is commonly known as microarchitecture exploration (see also chapter 11 for more discussions on this point).

1906: First tungsten-filament lamps are introduced.

DSP-related embedded FPGA resources

As previously discussed in chapter 4, some functions like multipliers are inherently slow if they are implemented by connecting a large number of programmable logic blocks together inside an FPGA. Because these functions are required by a lot of applications, many FPGAs incorporate special hard-wired multiplier blocks. (These are typically located in close proximity to embedded RAM blocks because these functions are often used in conjunction with each other.) Similarly, some FPGAs offer dedicated adder blocks.

One operation that is very common in DSP-type applications is called a multiply-and-accumulate.
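The three implementations of the function discussed above (fully parallel, partly shared, and fully serial) can be modeled in software as a rough sketch of the speed-versus-area trade-off. The model below is an assumption-laden simplification: it counts one “pass” each time the register (or, in the parallel case, the output) is updated, and the multiplier and adder counts in the comments are taken from the descriptions in the text; real timing depends on the device and the tools. The serial version performs exactly one multiply-and-accumulate per pass.

```python
# Software model of three hardware schedules for
# Y = (A*B) + (C*D) + (E*F) + (G*H). Sample values are arbitrary.

PAIRS = [(1, 2), (3, 4), (5, 6), (7, 8)]   # (A,B), (C,D), (E,F), (G,H)


def parallel(pairs):
    """Four multipliers and three adders; everything happens in one pass."""
    products = [x * y for x, y in pairs]        # all multiplications at once
    first_stage = (products[0] + products[1],   # first stage of additions
                   products[2] + products[3])
    return first_stage[0] + first_stage[1], 1   # second stage, one pass


def shared(pairs):
    """Two multipliers and two adders, reused over two passes."""
    register = 0                                # initialized to zero
    passes = 0
    for i in range(0, len(pairs), 2):
        (a, b), (c, d) = pairs[i], pairs[i + 1]
        register += a * b + c * d               # two products per pass
        passes += 1
    return register, passes


def serial(pairs):
    """One multiplier and one adder; one multiply-and-accumulate per pass."""
    register = 0                                # initialized to zero
    passes = 0
    for x, y in pairs:
        register += x * y
        passes += 1
    return register, passes


results = {
    "parallel": parallel(PAIRS),
    "shared": shared(PAIRS),
    "serial": serial(PAIRS),
}
for name, (value, passes) in results.items():
    print(f"{name}: Y = {value} after {passes} pass(es)")
```

All three schedules compute the same Y; what changes is how many multipliers and adders are needed and how many passes it takes to get there.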
As its name would suggest, this function multiplies two numbers together and adds the result into a running total stored in an accumulator (register). Hence, it is commonly referred to as a MAC, which stands for multiply, add, and accumulate (Figure 12-6).

Figure 12-6. The functions forming a MAC. (Figure labels: inputs A[n:0] and B[n:0]; multiplier; adder; accumulator; output Y[(2n - 1):0].)

Note that the multiplier, adder, and register portions of the serial implementation of our function shown in Figure 12-5 offer a classic example of a MAC.

If the FPGA you are working with supplies only embedded multipliers, you would be obliged to implement this function by combining the multiplier with an adder formed from a number of programmable logic blocks, while the result would be stored in a block RAM or in a number of distributed RAMs. Life becomes a little easier if the FPGA also provides embedded adders, and some FPGAs provide entire MACs as embedded functions.

FPGA-centric design flows for DSPs

Arrgggh! I’m quivering with fear (but let’s call it anticipation) as I’m poised to pen these words. This is because, at the time of this writing, the idea of using FPGAs to perform DSP is still relatively new. Thus, there really are no definitive design flows or methodologies here; everyone seems to have his or her unique way of doing things, and whichever option you choose, you’ll almost certainly end up breaking new ground one way or another.

1907: America. Lee de Forest creates a three-element vacuum tube amplifier (the triode).

Domain-specific languages

DSL is pronounced by spelling it out as “D-S-L.”

FFT is pronounced by spelling it out as “F-F-T.”

The input stimulus to a MATLAB simulation might come from one or more mathematical functions such as a sine-wave generator, or it might be provided in the form of real-world data (for example, an audio or video file).
The way of the world is that electronic designs increase in size and complexity over time. In order to manage this problem while maintaining (or, more usually, increasing) productivity, it is necessary to keep raising the level of abstraction used to capture the design’s functionality and verify its intent. For this reason the gate-level schematics discussed in chapter 8 were superseded by the RTL representations in VHDL and Verilog discussed in chapter 9. Similarly, the drive toward C-based flows as discussed in chapter 11 is powered by the desire to capture complex concepts quickly and easily while facilitating architectural analysis and exploration.

In the case of specialist areas such as DSPs, system architects and design engineers can achieve a dramatic improvement in productivity by means of domain-specific languages (DSLs), which provide more concise ways of representing specific tasks than do general-purpose languages such as C/C++ and SystemC. One such language is MATLAB, which allows DSP designers to represent a signal transformation, such as an FFT, that can potentially take up an entire FPGA, using a single line of code[4] along the lines of

y = fft(x);

Actually, the term MATLAB refers both to a language and an algorithmic-level simulation environment. In order to avoid confusion, it is common to talk about M-code (meaning “MATLAB code”) and M-files (files containing MATLAB code). Some engineers in the trenches occasionally refer to the “M language,” but this is not argot favored by the folks at The MathWorks.

[4] Note that the semicolon shown in this example MATLAB statement is optional. If present, it serves to suppress the output display.

In addition to sophisticated transformation operators like the FFT shown above, there are also much simpler transformations like adders, subtractors, multipliers, logical operators, matrix arithmetic, and so forth.
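To give a feel for how much arithmetic hides behind that single line, here is a minimal sketch of a textbook recursive radix-2 FFT written using nothing but additions, subtractions, and multiplications on complex numbers. It is an illustration only: it is not how MATLAB implements fft, it handles only power-of-two lengths, and the function name is simply borrowed from the example above.

```python
import cmath


def fft(x):
    """Recursive radix-2 decimation-in-time FFT (power-of-two lengths only)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])                 # transform of the even-indexed samples
    odd = fft(x[1::2])                  # transform of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # One complex multiplication (the "twiddle factor") ...
        twiddled = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        # ... followed by one addition and one subtraction (a "butterfly").
        out[k] = even[k] + twiddled
        out[k + n // 2] = even[k] - twiddled
    return out


constant = fft([1, 1, 1, 1])   # a constant signal: all energy lands in bin 0
impulse = fft([1, 0, 0, 0])    # a single impulse: equal energy in every bin
print(constant)
print(impulse)
```

Every butterfly in the loop corresponds to multipliers and adders that a hardware implementation has to provide or share, which is why a large FFT can consume so much of a device.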
The more complex transformations like an FFT can be formed from these fundamental entities if required. The output from each transformation can be used as the input to one or more downstream transformations, and so forth, until the entire system has been represented at this high level of abstraction.

One important point is that such a system-level representation does not initially imply a hardware or software implementation. In the case of a DSP core, for example, it could be that the entire function is implemented in software as discussed earlier in this chapter. Alternatively, the system architects could partition the design such that some functions are implemented in software, while other performance-critical tasks are implemented in hardware using dedicated ASIC or FPGA fabric. In this case, one typically needs to have access to a hardware/software codesign environment (see also chapter 13). For the purposes of these discussions, however, we shall assume pure hardware implementations.

M-files can contain scripts (actions to be performed) or transformations or a mixture of both. Also, M-files can call other M-files in a hierarchical manner. The primary (top-level) M-file typically contains a script that defines the simulation run. This script might prompt the user for information like the values of filter coefficients that are to be used, the name of an input stimulus file, and so forth, and then call other M-files and pass them these user-defined values as required.

System-level design and simulation environments

System-level design and simulation environments are conceptually at a higher level than DSLs. One well-known example of this genre is Simulink from The MathWorks. Depending on whom one is talking to, there may be a perception that Simulink is simply a graphical user interface to MATLAB. In reality, however, it is an independent dynamic modeling application that works with MATLAB.

If you are using Simulink, you typically commence the design process by creating a graphical block diagram of your system showing a schematic of functional blocks and the connections between them. Each of these blocks may be user-defined, or they may originate in one of the libraries supplied with Simulink (these include DSP, communications, and control function block sets). In the case of a user-defined block, you can “push” into that block and represent its contents as a new graphical block diagram. You can also create blocks containing MATLAB functions, M-code, C/C++, FORTRAN … the list goes on.

First developed in the mid-1950s, FORTRAN (whose name was derived from its original use: formula translation) was one of the earliest high-level programming languages.

Once you’ve captured the design’s intent, you use Simulink to simulate and verify its functionality. As with MATLAB, the input stimulus to a Simulink simulation might come from one or more mathematical functions, such as sine-wave generators, or it might be provided in the form of real-world data such as audio or video files. In many cases, it comes as a mixture of both; for example, real-world data might be augmented with pseudorandom noise supplied by a Simulink block.

The point here is that there’s no hard-and-fast rule. Some DSP designers prefer to use MATLAB as their starting point, while others opt for Simulink (this latter case is much rarer in the scheme of things). Some folks say that this preference depends on the user’s background (software DSP development versus ASIC/FPGA designs), but others say that this is a load of tosh. And it really doesn’t matter, because, if the truth is told, the reasons behind who does what in this regard pale into insignificance compared to the horrors that are to come.
Floating-point versus fixed-point representations

Irrespective of whether one opts for Simulink or MATLAB (or a similar environment from another vendor) as a starting point, the first-pass model of the system is almost invariably described using floating-point representations. In the context of the decimal number system, this refers to numbers like 1.235 × 10^3 (that is, a fractional number multiplied by some power of 10). In the context of applications like MATLAB, equivalent binary values are represented inside the computer using the IEEE standard for double-precision floating-point numbers (IEEE 754).

Floating-point numbers of this type have the advantage of providing extremely accurate values across a tremendous dynamic range. However, implementing floating-point calculations of this type in dedicated FPGA or ASIC hardware requires a humongous amount of silicon resources, and the result is painfully slow (in hardware terms).

Thus, at some stage, the design will be migrated over to use fixed-point representations, which refers to numbers having a fixed number of bits to represent their integer and fractional portions. This process is commonly referred to as quantization. This is totally system/algorithm dependent, and it may take some considerable amount of experimentation to determine the optimum balance between using the fewest bits to represent a set of values (thereby decreasing the amount of silicon resources required and speeding the calculations), while maintaining sufficient accuracy to perform the task in hand. (One can think of this trade-off in terms of how much noise the designer is willing to accept for a given number of bits.) In some cases, designers may spend days deciding “should we use 14, 15, or 16 bits to represent these particular values?” And, just to increase the fun, it may be best to vary the number of bits used to represent values at different locations in the system/algorithm.
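The “14, 15, or 16 bits?” question can be explored with a very small experiment. The sketch below is one possible way to do it, under stated assumptions: values are signed, lie in the range -1 to just under +1, are rounded to the nearest representable step, and saturate at the ends of the range. The function names and the test signal are made up for illustration; they are not MATLAB quantizers or Simulink fixed-point blocks.

```python
import math


def quantize(value, total_bits, frac_bits):
    """Round a real value to signed fixed point and return the value it represents.

    total_bits is the full word width (including the sign bit); frac_bits is
    the number of bits to the right of the binary point. Out-of-range values
    saturate rather than wrap around.
    """
    scale = 1 << frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    code = max(lowest, min(highest, round(value * scale)))
    return code / scale


def rms_error(samples, total_bits):
    """RMS quantization noise when every bit except the sign bit is fractional."""
    frac_bits = total_bits - 1
    errors = [s - quantize(s, total_bits, frac_bits) for s in samples]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# An arbitrary floating-point test signal that stays inside -1 to +1.
samples = [0.9 * math.sin(0.37 * k) for k in range(200)]

noise = {bits: rms_error(samples, bits) for bits in (14, 15, 16)}
for bits, error in noise.items():
    print(f"{bits} bits: RMS error = {error:.3e}")
```

Each extra bit roughly halves the step size and therefore the noise, at the cost of wider multipliers and adders downstream; the designer’s job is to find the narrowest width whose noise the system can tolerate.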
Things start to get really fun in that the conversion from floating-point to fixed-point representations may take place upstream in the system/algorithmic design and verification environment, or downstream in the C/C++ code. This is shown in more detail in the “System/algorithmic level to C/C++” section below. Suffice it to say that if one is working in a MATLAB environment, these conversions can be performed by passing the floating-point signals through special transformation functions called quantizers. Alternatively, if one is working in a Simulink environment, the conversions can be performed by running the floating-point signals through special fixed-point blocks.

1907: Lee de Forest begins regular radio music broadcasts.

1908: Charles Fredrick Cross invents cellophane.

System/algorithmic level to RTL (manual translation)

At the time of this writing, many DSP design teams commence by performing their system-level evaluations and algorithmic validation in MATLAB (or the equivalent) using floating-point representations. (It is also very common to include an intermediate step in which a fixed-point C/C++ model is created for use in rapid simulation/validation.) At this point, many design teams bounce directly into hand-coding fixed-point RTL equivalents of the design in VHDL or Verilog (Figure 12-7a). Alternatively, they may first transition the floating-point representations into their fixed-point counterparts at the system/algorithmic level, and then hand-code the RTL in VHDL or Verilog (Figure 12-7b).

Figure 12-7. Manual RTL generation. (Figure labels: Original Concept; System/Algorithmic Verification (Floating-point); System/Algorithmic Verification (Fixed-point); paths (a) and (b); Handcraft Verilog/VHDL RTL (Fixed-point); to standard RTL-based simulation and synthesis.)
There are, of course, a number of problems with this flow, not the least being that there is a significant conceptual and representational divide between the system architects working at the system/algorithmic level and the hardware design engineers working with RTL representations in VHDL or Verilog.

Because the system/algorithmic and RTL domains are so different, manual translation from one to the other is time-consuming and prone to error. There is also the fact that the resulting RTL is implementation specific because realizing the optimal design in an FPGA requires a different RTL coding style from that used for an optimal ASIC implementation.

Another consideration is that manually modifying and reverifying RTL to perform a series of what-if evaluations of alternative microarchitecture implementations is extremely time-consuming (such evaluations may include performing certain operations in parallel versus sequential, pipelining portions of the design versus nonpipelining, sharing common resources, for example, two operations sharing a single multiplier, versus using dedicated resources, etc.).

Similarly, if any changes are made to the original specification during the course of the project, it’s relatively easy to implement and evaluate these changes in the system-/algorithmic-level representations, but subsequently folding these changes into the RTL by hand can be painful and time-consuming.

Of course, once an RTL representation of the design has been created, we can assume the use of the downstream logic-synthesis-based flows that were introduced in chapter 9.

System/algorithmic level to RTL (automatic generation)

As was noted in the previous section, performing system-/algorithmic-level-to-RTL translation manually is time-consuming and prone to error. There are alternatives, however, because some system-/algorithmic-level design environments offer direct VHDL or Verilog RTL code generation (Figure 12-8).
As usual, the system-/algorithmic-level design would commence by using floating-point representations. In one version of the flow, the system/algorithmic environment is used to migrate these representations into their fixed-point counterparts and then to generate the equivalent RTL in VHDL or Verilog automatically (Figure 12-8a).[5]

Alternatively, a third-party environment might be used to take the floating-point system-/algorithmic-level representation, auto-interactively quantize it into its fixed-point counterpart, and then automatically generate the equivalent RTL in VHDL or Verilog (Figure 12-8b).[6]

Figure 12-8. Direct RTL generation. (Figure labels: Original Concept. Path (a), entirely within the system/algorithmic environment: System/Algorithmic Verification (Floating-point); System/Algorithmic Verification (Fixed-point); Auto-generate Verilog/VHDL RTL (Fixed-point). Path (b): System/Algorithmic Verification (Floating-point) in the system/algorithmic environment, followed in a third-party environment by Auto-interactive quantization (Fixed-point) and Auto-generate Verilog/VHDL RTL (Fixed-point). Both paths lead to standard RTL-based simulation and synthesis.)

As before, once an RTL representation of the design has been created, we can assume the use of the downstream logic-synthesis-based flows that were introduced in chapter 9.

[5] A good example of this type of environment is offered by Elanix Inc. (www.elanix.com).

[6] An example of this type of environment is offered by AccelChip Inc. (www.accelchip.com), whose environment can accept floating-point MATLAB M-files, output their fixed-point equivalents for verification, and then use these new M-files to auto-generate RTL.

1909: General Electric introduces the world’s first electrical toaster.

1909: Leo Baekeland patents an artificial plastic that he calls Bakelite.

System/algorithmic level to C/C++ etc.
Due to the problems associated with exploring the design at the RTL level, there is an increasing trend to use a stepping-stone approach. This involves transitioning from the system-/algorithmic-level domain into some sort of C/C++ representation, which itself is subsequently migrated into an RTL equivalent. One reason this is attractive is that the majority of DSP design teams already generate a C/C++ model for use as a golden (reference) model, in which case this sort of comes for free as far as the downstream RTL design engineer is concerned.

Of course, the first thing to decide is when and where in the flow one should transition from floating-point to fixed-point representations (Figure 12-9).

Figure 12-9. Migrating from floating point to fixed point. (Figure labels: Original Concept; within Simulink/MATLAB (or equivalent): System/Algorithmic Verification (Floating-point) and System/Algorithmic Verification (Fixed-point); Handcraft C/C++ (Floating-point); Auto-generate C/C++ (Floating-point); Hand-convert C/C++ (Fixed-point); Handcraft C/C++ (Fixed-point); Auto-generate C/C++ (Fixed-point); then direct to pure C/C++ synthesis, or hand-convert to Handel-C then Handel-C synthesis, or hand-convert to SystemC then SystemC synthesis, or ...)

Frighteningly enough, Figure 12-9 shows only a subset of the various potential flows. For example, in the case of the handcrafted options, as opposed to first hand-coding the C/C++ and then gradually transmogrifying this representation into Handel-C or SystemC, one could hand-code directly into these languages.

However, the main thing to remember is that once we have a fixed-point representation in one of the flavors of C/C++, we can assume the use of the downstream C/C++ flows introduced in chapter 11 (one flow of particular interest in this area is the pure untimed C/C++ approach used by Precision C from Mentor).

It is somewhat difficult to qualify the relative effort associated with alternative paths through these flows. As a rule of thumb, one might make the following points:

a) Manual MATLAB to C/C++ translation is relatively easy, being in the order of hours to days (automatic translation is typically used only for simulation or DSP code generation depending on how critical the performance is).

b) Manual exploration of quantization effects is relatively easy, especially for experienced system designers (auto-interactive quantization is used less frequently). Also, many designers rely on noise analysis to guide them in this process.

c) Manual MATLAB or C/C++ to RTL translation is relatively hard, being in the order of weeks to months. Automation in this area provides a lot of value assuming it is possible to achieve sufficient quality of results.

d) MATLAB/Simulink-based automated flows that rely on IP core generation are typically not well suited to designs that include substantial original content.

Block-level IP environments

Nothing is simple in this world because there is always just one more way to do things. As an example, one might create a library of DSP functional blocks at the system/algorithmic level of abstraction along with a one-to-one equivalent library of blocks at the RTL level of abstraction in VHDL or Verilog.

The idea here is that you could then capture and verify your design using a hierarchy of functional blocks specified at the system/algorithmic level of abstraction. Once you were happy with your design, you could then generate a structural netlist instantiating the RTL-level blocks, and use this to drive downstream simulation and synthesis tools. (These blocks would have to be parameterized at all levels of abstraction so as to allow you to specify such things as bus widths and so forth.)
As an alternative, the larger FPGA vendors typically offer IP core generators (in this context, the term core is considered to refer to a block that performs a specific logical function; it does not refer to a microprocessor or DSP core). In several cases, these core generators have been integrated into system-/algorithmic-level environments. This means that you can create a design based on a collection of these blocks in the system-/algorithmic-level environment, specify any parameters associated with these blocks, and perform your system-/algorithmic-level verification.

Later, when you’re ready to rock and roll, the core generator will automatically generate the hardware models corresponding to each of these blocks.[7] (The system-/algorithmic-level models and the hardware models ensuing from the core generator are bit identical and cycle identical.) In some cases the hardware blocks will be generated as synthesizable RTL in VHDL or Verilog. Alternatively, they may be presented as firm cores at the LUT/CLB level of abstraction, thereby making the maximum use of the targeted FPGA’s internal resources.

One big drawback associated with this approach is that, by their very nature, IP blocks are based on hard-coded microarchitectures. This means that the ability to create highly tuned implementations to address specific design goals is somewhat diminished. The end result is that IP-based flows may achieve an implementation faster with less risk, but such an implementation may be less optimal in terms of area, performance, and power as compared to a custom hardware implementation.

Don’t forget the testbench!

One point that the folks selling you DSP design tools often neglect to mention is the testbench. For example, let’s assume that your flow involves taking your system-/algorithmic-level design and hand-translating it into RTL. In that case, you are going to have to do the same with your testbench.
In many cases, this is a nontrivial task that can take days or weeks! Or let’s say that your flow is based on taking your floating-point system-/algorithmic-level design and hand-translating it into floating-point C/C++, at which point you will wish to verify this new representation. Then you might take your floating-point C/C++ and hand-translate it into fixed-point C/C++, at which point you will wish to verify this representation. And then you might take your fixed-point C/C++ and (hopefully) automatically synthesize an equivalent RTL representation, at which point … but you get my drift.[8] The problem is that at each stage you are going to have to do the same thing with your testbench[9] (unless you do something cunning, as discussed in the next (and last, hurray!) section).

[7] A good example of this type of approach is the integration of Simulink with the System Generator utility from Xilinx (www.xilinx.com).

1909: Marconi shares Nobel Prize in physics for his contribution to telegraphy.
1909: Radio distress signals save 1900 lives after two ships collide.

Mixed DSP and VHDL/Verilog etc. environments

In the previous chapter, we noted that a number of EDA companies can provide mixed-level design and verification environments that can support the cosimulation of models specified at multiple levels of abstraction. For example, one might start off with a graphical block-based editor showing the design’s major functional units, where the contents of each block can be represented using
■ VHDL
■ Verilog
■ SystemVerilog
■ SystemC
■ Handel-C
■ Pure C/C++

In this case, the top-level design might be in a traditional HDL that calls submodules represented in the various HDLs and in one or more flavors of C/C++. Alternatively, the top-level design might be in one of the flavors of C/C++ that calls submodules in the other languages.
More recently, integrations between system-/algorithmic-level and implementation-level environments have become available. The way in which this works depends on who is doing what and what that person is trying to do (sorry, I don’t mean to be cryptic). For example, a system architect working at the system/algorithmic level (e.g., in MATLAB) might decide to replace one or more blocks with equivalent representations in VHDL or Verilog at the RTL level of abstraction. Alternatively, a design engineer working in VHDL or Verilog at the RTL level of abstraction might decide to call one or more blocks at the system/algorithmic level of abstraction. Both of these cases require cosimulation between the system-/algorithmic-level environment and the VHDL/Verilog environment, the main difference being who calls whom.

[8] Don’t laugh, because I personally know of one HUGE system house that does things in just this way!
[9] With regard to the C/C++ to RTL stage of the process, even if you have a C/C++ to RTL synthesis engine, your testbench will typically contain language constructs that aren’t amenable to synthesis, which means that you’re back to doing things by hand.

Of course, this sounds easy if you say it quickly, but there is a whole host of considerations to be addressed, such as synchronizing the concept of time between the two domains and specifying how different signal types are translated as they pass from one domain to the other (and back again).

This really is a case of treating any canned demonstration with a healthy amount of suspicion. If you are planning on doing this sort of thing, you need to sit down with the vendor’s engineer and work your own example through from beginning to end. Call me an old cynic if you will, but my advice is to let their engineer guide you, while keeping your hands firmly on the keyboard and mouse.
(You’d be amazed how much activity can go on in just a few seconds should you turn your head in response to the age-old question, “Good grief! Did you see what just flew by the window?”)

1910: America. First installation of teleprinters on postal lines between New York City and Boston.

Chapter 13
Embedded Processor-Based Design Flows

Introduction

For the purposes of this book, we are concerned only with electronic systems that include one or more FPGAs on the printed circuit board (PCB). The vast majority of such systems also make use of a general-purpose microprocessor, or µP, to perform a variety of control and data-processing applications.[1] This is often referred to as the central processing unit (CPU) or microprocessor unit (MPU).

[1] Alternatively, one might use a microcontroller (µC) device, which combines a CPU core with selected peripherals and specialized inputs and outputs.

PCB is pronounced by spelling it out as “P-C-B.” CPU and MPU are pronounced by spelling them out as “C-P-U” and “M-P-U,” respectively.

Until recently, the CPU and its peripherals typically appeared in the form of discrete chips on the circuit board. There is an almost infinite number of possible scenarios here, but the two main ones involve the way in which the CPU is connected to its memory (Figure 13-1).

Figure 13-1. Two scenarios at the circuit board level: (a) memory connected to CPU via general-purpose processor bus; (b) tightly coupled memory (TCM) connected to CPU via dedicated bus.

1910: First electric washing machines are introduced.

In both of these scenarios, the CPU is connected to an FPGA and some other stuff via a general-purpose processor bus.
(By “stuff” we predominantly mean peripheral devices such as counter timers, interrupt controllers, communications devices, etc.) In some cases, the main memory (MEM) will also be connected to the CPU by means of the main processor bus, as shown in Figure 13-1a (actually, this connection will be via a special peripheral called a memory controller, which is not shown here because we’re trying to keep things simple). Alternatively, the memory may be connected directly to the CPU by means of a dedicated memory bus, as shown in Figure 13-1b.

The point is that presenting the CPU and its various peripheral devices in the form of dedicated chips on the circuit board costs money and occupies real estate. It also impacts the reliability of the board because every solder joint (connection point) is a potential failure mechanism. One alternative is to embed the CPU along with some of its peripherals in the FPGA itself (Figure 13-2).[2]

Figure 13-2. Two scenarios at the FPGA level: (a) memory connected to CPU via general-purpose processor bus; (b) tightly coupled memory (TCM) connected to CPU via dedicated bus.

[2] Another alternative would be to embed a microprocessor core in an ASIC, but that’s a tale for another book!

It is common for a relatively small amount of memory used by the CPU to be included locally in the FPGA. At the time of this writing, however, it is rare for all of the CPU’s memory to be included in the FPGA.

Creating an FPGA design of this type brings a whole slew of new problems to the table.
First of all, the system architects have to decide which functions will be implemented in software (as instructions to be executed by the CPU) and which functions will be implemented in hardware (using the main FPGA fabric). Next, the design environment must support the concept of coverification, in which the hardware and embedded software portions of the system can be verified together to ensure that everything works as it should. Both of these topics are considered in more detail later in this chapter.

Hard versus soft cores

Hard cores

A hard microprocessor core is one that is implemented as a dedicated, predefined (hardwired) block (these cores are only available in certain device families). Each of the main FPGA vendors has opted for a particular processor type to implement its hard cores. For example, Altera offer embedded ARM processors, QuickLogic have opted for MIPS-based solutions, and Xilinx sports PowerPC cores. Of course, each vendor will be delighted to explain at great length why its implementation is far superior to any of the others (the problem of deciding which one actually is better is only compounded by the fact that different processors may be better suited to different tasks).

As noted in chapter 4, there are two main approaches for integrating such cores into the FPGA. The first is to locate the core in a strip to the side of the main FPGA fabric (Figure 13-3). In this scenario, all of the components are typically formed on the same silicon chip, although they could also be formed on two chips and packaged as a multichip module (MCM).

In addition to the microprocessor core itself, each FPGA vendor also supports an associated processor bus. For example, Altera and QuickLogic both support the AMBA bus from ARM (this is an open specification that can be downloaded from www.arm.com free of any charges). By comparison, Xilinx embedded cores make use of the CoreConnect bus from IBM. CoreConnect has two flavors.
The main 64-bit bus is known as the processor local bus (PLB). This can be used in conjunction with one or more 32-bit on-chip peripheral busses (OPBs).

MCM is pronounced by spelling it out as “M-C-M.”

1910: France. Georges Claude introduces neon lamps.

Figure 13-3. Bird’s-eye view of chip with embedded core outside of the main fabric. (The figure shows the main FPGA fabric alongside “the stripe,” which contains the microprocessor core, special RAM, peripherals and I/O, etc.)

One advantage of this implementation is that the main FPGA fabric is identical for devices with and without the embedded microprocessor core, which can make things easier for the design tools used by the engineers. The other advantage is that the FPGA vendor can bundle a whole load of additional functions in the strip to complement the microprocessor core, such as memory and special peripherals.[3]

[3] This approach is favored by vendors such as Altera (www.altera.com) and QuickLogic (www.quicklogic.com).

An alternative is to embed one or more microprocessor cores directly into the main FPGA fabric. One-, two-, and even four-core implementations are available at the time of this writing (Figure 13-4). In this case, the design tools have to be able to take account of the presence of these blocks in the fabric; any memory used by the core is formed from embedded RAM blocks, and any peripheral functions are formed from groups of general-purpose programmable logic blocks. Proponents of this scheme can argue that there are inherent speed advantages to be gained from having the microprocessor core in intimate proximity to the main FPGA fabric.[4]

Figure 13-4. Bird’s-eye view of chips with embedded cores inside the main fabric: (a) one embedded core; (b) four embedded cores.

[4] This approach is favored by Xilinx (www.xilinx.com), who also provide a multitude of peripherals in the form of soft IP cores.

Soft microprocessor cores

As opposed to embedding a microprocessor physically into the fabric of the chip, it is possible to configure a group of programmable logic blocks to act as a microprocessor. These are typically called “soft cores,” but they may be more precisely categorized as either soft or firm, depending on the way in which the microprocessor’s functionality is mapped onto the logic blocks. For example, if the core is provided in the form of an RTL netlist that will be synthesized with the other logic, then this truly is a soft implementation. Alternatively, if the core is presented in the form of a placed and routed block of LUTs/CLBs, then this would typically be considered a firm implementation.

In both of these cases, all of the peripheral devices like counter timers, interrupt controllers, memory controllers, communications functions, and so forth are also implemented as soft or firm cores (the FPGA vendors are typically able to supply a large library of such cores).

One tool of interest in the soft core arena is LisaTek from CoWare Inc. (www.coware.com). Using a special language, you define a required instruction set and microarchitecture (resources, pipelining, cycle timing) associated with a desired microprocessor. LisaTek takes this definition and generates the corresponding RTL for your soft core, along with associated software tools such as a C compiler, assembler, linker, and instruction set simulator (ISS).

Soft cores are slower and simpler than their hard-core counterparts (of course they are still incredibly fast in human terms). However, in addition to being practically free, they also have the advantages that you only have to implement a core if you need it and that you can instantiate as many cores as you require until you run out of resources in the form of programmable logic blocks.

Once again, each of the main FPGA vendors has opted for a particular processor type to implement its soft cores. For example, Altera offers the Nios, while Xilinx sports the MicroBlaze. The Nios has both 16-bit and 32-bit architectural variants, which operate on 16-bit or 32-bit chunks of data, respectively (both variants share the same 16-bit-wide instruction set). By comparison, the MicroBlaze is a true 32-bit machine (that is, it has 32-bit-wide instruction words and performs its magic on 32-bit chunks of data). Once again, each of the vendors will be more than happy to tell you why its soft core rules and how its competitors’ offerings fail to make the grade (sorry, you’re on your own here).

The Nios is based on a SPARC architecture using the concept of register windows, while the MicroBlaze is based on a classical RISC architecture.

QuickLogic offer a 9-bit soft microcontroller that goes under the catchy name of Q90C1xx. (Having a 9-bit data word can be useful for certain communication functions.)

One cool thing about the integrated development environment (IDE) fielded by Xilinx is that it treats the PowerPC hard core and the MicroBlaze soft core identically. This includes both processors being based on the same CoreConnect processor bus and sharing common soft peripheral IP cores. All of this makes it relatively easy to migrate from one processor to the other.

IDE is pronounced by spelling it out as “I-D-E.” Depending on who you are talking to and the FPGA or RTOS vendor in question, the ‘D’ in IDE can stand for “design” or “development.”
Also of interest is the fact that Xilinx offers a small 8-bit soft core called the PicoBlaze, which can be implemented using only 150 logic cells (give or take a handful). By comparison, the MicroBlaze requires around 1,000 logic cells[5] (which is still extremely reasonable for a 32-bit processor implementation, especially when one is playing with FPGAs that can contain 70,000[6] or more such cells).

[5] For the purposes of these discussions, a logic cell can be assumed to contain a 4-input LUT, a register element, and various other bits and pieces like multiplexers and fast carry logic.

Partitioning a design into its hardware and software components

As noted in chapter 4, almost any portion of an electronic design can be realized in hardware (using logic gates and registers, etc.) or software (as instructions to be executed on a microprocessor). One of the main partitioning criteria is how fast you wish the various functions to perform their tasks:
■ Picosecond and nanosecond logic: This has to run insanely fast, which mandates that it be implemented in hardware (in the FPGA fabric).
■ Microsecond logic: This is reasonably fast and can be implemented either in hardware or software (this type of logic is where you spend the bulk of your time deciding which way to go).
■ Millisecond logic: This is the logic used to implement interfaces such as reading switch positions and flashing light-emitting diodes, or LEDs. It’s a pain slowing the hardware down to implement this sort of function (using huge counters to generate delays, for example). Thus, it’s often better to implement these tasks as microprocessor code (because processors give you lousy speed, compared to dedicated hardware, but fantastic complexity).

The trick is to solve every problem in the most cost-effective way.
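To see why millisecond-scale functions imply “huge counters” when built in hardware, the following Python sketch works out how many clock cycles, and therefore how many counter bits, a given delay requires. The 100 MHz clock and the 500 ms LED blink interval are assumed values chosen only for illustration, and `counter_bits_for_delay` is a hypothetical helper name.

```python
def counter_bits_for_delay(clock_hz, delay_s):
    """Return (cycles, bits) for a counter that times delay_s at clock_hz."""
    cycles = round(clock_hz * delay_s)          # clock ticks that fit in the delay
    bits = max(1, (cycles - 1).bit_length())    # bits needed to count 0 .. cycles-1
    return cycles, bits


# Assumed example: a 100 MHz fabric clock timing a 500 ms LED toggle.
cycles, bits = counter_bits_for_delay(100_000_000, 0.5)
print(f"{cycles} cycles -> {bits}-bit counter")
```

Under these assumptions the delay spans 50,000,000 clock cycles and needs a 26-bit counter just to blink an LED, whereas the same delay in software is typically a timer setting or a short loop.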
Certain functions belong in hardware, others cry out for a software realization, and some functions can go either way depending on how you feel you can best use the resources (both chip-level resources and hardware/software engineers) available to you.

[6] This 70,000 value was true when I ate my breakfast this morning, but it will doubtless have increased by the time you come to read this book.

Some cynics say that those aspects of the design that are well understood are implemented in hardware, while any portions of the design that are somewhat undefined at the beginning of the design process are often relegated to a software realization (on the basis that the software can be “tweaked” right up until the last minute).

RTOS is pronounced by spelling it out as “R-T-O-S.” Real-time systems are those in which the correctness of a computation or action depends not only on how it is performed but also when it is performed.

It is possible to envisage an “ideal” electronic system level (ESL) environment in which the system architects initially capture the design via a graphical interface as a collection of functional blocks that are connected together. Each of these blocks could then be provided with a system-/algorithmic-level SystemC representation, for example, and the entire design could be verified prior to any decisions being made as to which portions of the design were to be implemented in hardware and software.

When it comes to the partitioning process itself, we might dream of having the ability to tag each graphical block with the mouse and select a hardware or software option for its implementation. All we would then have to do would be to click the “Go” button, and the environment would take care of synthesizing the hardware, compiling the software, and pulling everything together.

And then we return to the real world with a resounding thud.
Actually, a number of next-generation design environments show promise, and new tools and techniques are arriving on an almost daily basis. At the time of this writing, however, it is still very common for system architects to partition a design into its hardware and software portions by hand, and to then pass these top-level functions over to the appropriate engineers and hope for the best.

With regard to the software portion of the design, this might be something as simple as a state machine used to control a human-level interface (reading the state of switches and controlling display devices). Although the state machine itself may be quite tricky, this level of software is certainly not rocket science. At the other end of the spectrum, one might have incredibly complex software requirements, including
■ System initialization routines and a hardware abstraction layer
■ A hardware diagnostic test suite
■ A real-time operating system (RTOS)
■ RTOS device drivers
■ Any embedded application code

This code will typically be captured in C/C++ and then compiled down to the machine instructions that will be run on the processor core (in extreme cases where one is attempting to squeeze the last drop of performance out of the design, certain routines may be handcrafted in assembly code).

At the same time, the hardware design engineers will typically be capturing their portions of the design at the RTL level of abstraction using VHDL or Verilog (or SystemVerilog).

Today’s designs are so complex that their hardware and software portions have to be verified together. Unfortunately, wrapping one’s brain around the plethora of coverification alternatives and intricacies can make a grown man (well, me actually) break down and weep.
Hardware versus software views of the world

One of the biggest problems to overcome when it comes to the coverification of the hardware and software portions of a design is the two totally different worldviews of their creators.

The hardware folks typically visualize their portion of the design as blocks of RTL representing such things as registers, logical functions, and the wires connecting them together. When hardware engineers are debugging their portion of the design, they think in terms of an editor showing their RTL source code, a logic simulator, and a graphical waveform display showing signals changing values at specific times. In a typical hardware design environment, clicking on a particular event in the waveform display will automatically locate the corresponding line of RTL code that caused this event to occur.

By comparison, the software guys and gals think in terms of C/C++ source code, of registers in the CPU (and in the peripherals), and of the contents of various memory locations. When software engineers are debugging a program, they often wish to single-step through the code one line at a time and watch the values in the various registers changing. Or they might wish to set one or more breakpoints (this refers to placing markers at specific points in the code), run the program until they hit one of those breakpoints, and then pause to see what’s going on. Alternatively, they might wish to specify certain conditions such as a register containing a particular value, then run the program until this condition is met, and once again pause to see what’s happening.

1911: Dutch physicist Heike Kamerlingh Onnes discovers superconductivity.
1912: America. Dr. Sidney Russell invents the electric blanket.

When a software developer is writing application code such as a game, he or she has the luxury of being reasonably confident that the hardware (say, a home computer) is reasonably robust and bug-free.
However, it’s a different ball game when one is talking about a software engineer creating embedded applications intended to run on hardware that’s being designed at the same time. When a problem occurs, it can be mega tricky determining if it was a fault in the software or if the hardware was to blame. The classic joke is a conversation between the two camps:

Software Engineer: “I think I may have hit a hardware problem while running my embedded application.”
Hardware Engineer: “At what time did the error occur? Can you give me a test case that isolates the problem?”
Software Engineer: “The error occurred at 9:30 this morning, and the test case is my application!”

In the case of today’s state-of-the-art coverification environments, the hardware and software worlds are tightly coupled. This means that if the software engineers detect a potential hardware bug, identifying the particular line of code being executed will take the hardware engineers directly to the corresponding simulation time frame in the graphical waveform display. Similarly, if the hardware engineers detect a potential software bug (such as code requesting an illegal hardware transaction), they can use their interface to guide the software team to the corresponding line of source code.

Unfortunately, this type of environment can cost a lot of money, so sometimes you have to opt for a less sophisticated solution.

Using an FPGA as its own development environment

Perhaps the simplest place to start is the scenario where the FPGA is used as its own development environment. The idea here is that you have an SRAM-based FPGA with an embedded processor (hard or soft) mounted on a development board that’s connected to your computer. In addition to the FPGA, this development board will also have a memory device that will be used to store the software programs that are to be run by the embedded CPU (Figure 13-5).

Figure 13-5. Using an FPGA as its own development environment.

Once the system architects have determined which portions of the design are to be implemented in hardware and software, the hardware engineers start to capture their RTL blocks and functions and synthesize them down to a LUT/CLB-level netlist. Meanwhile, the software engineers start to capture their C/C++ programs and routines and compile them down to machine code. Eventually, the LUT/CLB-level netlist will be loaded into the FPGA via a configuration file, the linked machine code image will be loaded into the memory device, and then you let the system run wild and free (Figure 13-6).

Figure 13-6. A (very) simple design flow.

Also, any of the machine code that is to be embedded in the FPGA’s on-chip RAM blocks would actually be loaded via the configuration file.

1912: Feedback and heterodyne systems usher in modern radio reception.
1912: The Titanic sends out radio distress signals when it collides with an iceberg and sinks on its maiden voyage.

Improving visibility in the design

The main problem with the scenario discussed in the previous section is lack of “visibility” as to what is happening in the hardware portion of the design. One way to mitigate this is to use a virtual logic analyzer to observe what’s happening in the hardware (this is discussed in more detail in chapter 16).

Things can be a little trickier when it comes to determining what’s happening with the software. One point to remember is that, as discussed in chapter 5, an embedded CPU core will have its own dedicated JTAG boundary scan chain (Figure 13-7).

Figure 13-7. Embedded processor JTAG boundary scan chain. (Labels in the figure: JTAG data in, JTAG data out, primary scan chain, internal (core) scan chain, CPU, FPGA.)

This is true of both hard cores and the more sophisticated soft cores.
In this case, the coverification environment can use the scan chain to monitor the activity on the buses and control signals connecting the CPU to the rest of the system. The CPU’s internal registers can also be accessed via the JTAG port, thereby allowing an external debugger to take control of the device and single-step through instructions, set breakpoints, and so forth.

A few coverification alternatives

If you really want to get visibility into what’s happening in the hardware portions of the design, one approach is to use a logic simulator. In this case, the majority of the system will be modeled and simulated in VHDL or Verilog/SystemVerilog at the RTL level of abstraction. When it comes to the CPU core, however, there are various ways in which to represent this (Figure 13-8).

Figure 13-8. Alternative representations of the CPU.

1913: William D. Coolidge invents the hot-tungsten filament X-ray tube. This Coolidge Tube becomes the standard generator for medical X-rays.
1914: America. Traffic lights are used for the first time (in Cleveland, Ohio).

Irrespective of the type of model used to represent the CPU, the embedded software (machine code) portion of the design will be loaded into some form of memory, either embedded memory in the FPGA or external memory devices, and the CPU model will then execute those machine code instructions.

Note that Figure 13-8 shows a high-level representation of the contents of the FPGA only. If the machine code is to be stored in external memory devices, then these devices would also have to be part of the simulation. In fact, as a general rule of thumb, if the software talks to any stuff, then that stuff needs to be part of the coverification environment.

RTL (VHDL or Verilog)

Perhaps the simplest option here is when one has an RTL model of the CPU, in which case all of the activity takes place in the logic simulator.
One disadvantage of this approach is that a CPU performs tremendous numbers of internal operations in order to perform the simplest task, which equates to incredibly slow simulation runs (you’ll be lucky to be able to simulate 10 to 20 system clocks per second in real time).

The other disadvantage is that you have no visibility into what the software is doing at the source code level. All you’ll be able to do is to observe logic values changing on wires and inside registers.

And there’s always the fact that whoever supplies the real CPU doesn’t want you to know how it works internally because that supplier may be using cunning proprietary tricks and wish to preserve their IP. In this case, you may well find it very difficult to lay your hands on an RTL model of the CPU at all.

C/C++, SystemC, etc.

As opposed to using an RTL model, it is very common to have access to some sort of C/C++ model of the CPU. (The proponents of SystemC have a vision of a world in which the CPU and the main peripheral devices all have SystemC models provided as standard for use in this type of design environment.) The compiled version of this CPU model would be linked into the simulation via the programming language interface (PLI) in the case of a Verilog simulator or the foreign language interface (FLI), or equivalent, in the case of a VHDL simulator.

The advantages of such a model are that it will run much faster than its RTL counterpart; that it can be delivered in compiled form, thereby preserving any secret IP; and that, at least in FPGA circles, such a model is usually provided for free (the FPGA vendors are trying to sell chips, not models).

One disadvantage of this approach is that the C/C++ model may not provide a 100 percent cycle-accurate representation of the CPU, which has the potential to cause problems if you aren’t careful.
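As an aside on simulation speed, the figure of 10 to 20 simulated system clocks per second quoted above for an RTL CPU model can be put in perspective with a quick calculation of the wall-clock time needed to simulate a short stretch of real operation. The 100 MHz system clock and the 1 ms of simulated activity are assumptions chosen for illustration, and `wall_clock_seconds` is a hypothetical helper name.

```python
def wall_clock_seconds(clock_hz, simulated_s, sim_clocks_per_s):
    """Wall-clock time needed to simulate simulated_s of real operation."""
    cycles = clock_hz * simulated_s       # system clocks that must be simulated
    return cycles / sim_clocks_per_s      # seconds the simulator will take


# Assumed example: 1 ms of activity for a design clocked at 100 MHz.
for rate in (10, 20):
    secs = wall_clock_seconds(100_000_000, 1e-3, rate)
    print(f"{rate} clocks/s: about {secs / 3600:.1f} hours to simulate 1 ms")
```

Under these assumptions, 1 ms of operation is 100,000 clock cycles, which works out to somewhere between roughly 1.4 and 2.8 hours of simulation. This is why the faster CPU representations discussed in this section are attractive.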
But, once again, the main disadvantage of such a model is that its only purpose is to provide an engine to execute the machine code program, which means that you have no visibility into what the software is doing at the source code level. All you’ll be able to do is observe logic values changing on wires and inside registers.

Way back in the mists of time, the Logic Modeling Corporation (LMC), which was subsequently acquired by Synopsys, defined an interface for connecting behavioral models of hardware blocks to logic simulators. This is known as the SWIFT interface, and models (such as CPUs) that comply with this specification may be referred to as SWIFT models.

Physical chip in hardware modeler

Yet another possibility is to use a physical device to represent a hard CPU core. For example, if you are using a PowerPC core in a Xilinx FPGA, you can easily lay your hands on a real PowerPC chip. This chip can be installed in a box called a hardware modeler, which can then be linked into the logic simulation system. The advantage of this approach is that you know the physical model (chip) is going to functionally match your hard core as closely as possible. Some disadvantages are that hardware modelers aren’t cheap and they can be a pain to use. The majority of hardware-modeler-based solutions don’t support source-level debugging, which, once again, means that you have no visibility into what the software is doing at the source code level.[7] All you’ll be able to do is to observe logic values changing on wires and inside registers.

Instruction set simulator

ISS is pronounced by spelling it out as “I-S-S.”

As previously noted, in certain cases, the role of the software portion of a design may be somewhat limited. For example, the software may be acting as a state machine used to control some interface.
Alternatively, the software’s role may be to initialize certain aspects of the hardware and then sit back and watch the hardware do all of the work. If this is the case, then a C/C++ model or a physical model is probably sufficient, at least as far as the hardware design engineer is concerned.

At the other extreme, the hardware portions of the design may exist mainly to act as an interface with the outside world. For example, the hardware may read in a packet of data and store it in the FPGA’s memory, and then the CPU may perform huge amounts of complex processing on this data. In cases like these, it is necessary for the software engineer to have sophisticated source-level debugging capabilities. This requires the use of an instruction set simulator (ISS), which provides a virtual representation of the CPU. Although an ISS will almost certainly be created in C/C++, it will be architected very differently from the C/C++ models of the CPU discussed earlier in this section. This is because the ISS is created at a very high level of abstraction; it thinks in terms of transactions like “get me a word of data from location x in the memory,” and it doesn’t concern itself with details like how signals will behave in the real world. The easiest way to explain how this works is by means of an illustration (Figure 13-9).

Figure 13-9. How an ISS fits into the picture.

First of all, the software engineers capture their program as C/C++ source code. This is then compiled using the -d (debug) option, which generates a symbol table and other debug-specific information along with the executable machine code image.

[7] Actually, some hardware modelers do provide a certain amount of source-level debug capability, for example, Simpod Inc. (www.simpod.com) offers an interesting solution.
When we come to perform the coverification, there are a number of pieces to the puzzle. At one end we have the source-level debugger, whose interface is used by the software engineer to talk to the environment. At the other end we have the logic simulator, which is simulating representations of the memory, stuff like peripheral devices, general-purpose logic, and so forth (for the sake of simplicity, this illustration assumes that all of the program memory resides in the FPGA itself). In the case of the CPU, however, the logic simulator essentially sees a hole where this function should be. To be more precise, the simulator actually sees a set of inputs and outputs corresponding to the CPU. These inputs and outputs are connected to an entity called a bus interface model (BIM), which acts as a translator between the simulator and the ISS.

Both the source code and the executable image (along with the symbol table and other debug-centric information) are loaded into the source-level debugger. At the same time, the executable image is loaded into the MEM block. When the user requests the source-level debugger to perform an action like stepping through a line of source code, it issues commands to the ISS. In turn, the ISS will execute high-level transactions such as an instruction fetch, or a memory read/write, or an I/O command. These transactions are passed to the BIM, which causes the appropriate pins to “wiggle” in the simulation world.

Similarly, when something connected to the processor bus in the FPGA attempts to talk to the CPU, it will cause the pins driving the BIM to “wiggle.” The BIM will translate these low-level actions into high-level transactions that it passes to the ISS, which will in turn inform the source-level debugger what’s happening. The source-level debugger will then display the state of the program variables, the CPU registers, and other information of this ilk.
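The division of labor between the ISS and the BIM can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not how any commercial coverification tool is built: the class names, the two-cycle bus protocol, and the signal names ADDR, WE, and DATA are all hypothetical, and a plain dictionary stands in for the MEM block that the logic simulator would really be simulating.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    """A high-level request as the ISS sees it (hypothetical structure)."""
    kind: str          # "read" or "write"
    address: int
    data: int = 0


class BusInterfaceModel:
    """Translates ISS transactions into per-cycle pin activity.

    The two-cycle address/data protocol is an assumption made for this
    sketch; a real BIM follows the actual processor bus specification.
    """

    def __init__(self, memory):
        self.memory = memory      # dict standing in for the simulated MEM block
        self.cycle = 0
        self.pin_trace = []       # (cycle, signal name, value) tuples

    def _drive(self, signal, value):
        self.pin_trace.append((self.cycle, signal, value))

    def execute(self, txn):
        # Cycle 1: wiggle the address and write-enable pins.
        self._drive("ADDR", txn.address)
        self._drive("WE", 1 if txn.kind == "write" else 0)
        self.cycle += 1
        # Cycle 2: data appears on the bus.
        if txn.kind == "write":
            self.memory[txn.address] = txn.data
            value = txn.data
        else:
            value = self.memory.get(txn.address, 0)
        self._drive("DATA", value)
        self.cycle += 1
        return value


class InstructionSetSimulator:
    """Thinks only in transactions; knows nothing about pins or clocks."""

    def __init__(self, bim):
        self.bim = bim

    def load_word(self, address):
        return self.bim.execute(Transaction("read", address))

    def store_word(self, address, value):
        self.bim.execute(Transaction("write", address, value))


memory = {}
bim = BusInterfaceModel(memory)
iss = InstructionSetSimulator(bim)
iss.store_word(0x20, 7)
loaded = iss.load_word(0x20)
```

The point of the sketch is that the ISS side contains no notion of clock cycles at all; every signal-level detail lives in the BIM, which is why an ISS can run so much faster than a pin-accurate model.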
There are a variety of incredibly sophisticated (often frighteningly expensive) environments of this type on the market.[8] Each has its own cunning tricks and capabilities, and some are more appropriate for ASIC designs than FPGAs or vice versa. As usual, however, this is a moving target, so you need to check around to see who is doing what before putting any of your precious money on the table.

A rather cunning design environment

As far as possible (and insofar as makes sense), this book attempts to steer away from discussing specific companies and products. But there’s an exception to every rule, and this is it, because a company called Altium Ltd. (www.altium.com) has come up with a rather cunning FPGA design environment called Nexar that deserves mention. It’s difficult to know where to start, so let’s kick off by saying that we’re talking about a complete FPGA hardware/software codesign and coverification environment for around $7,995.[9] This environment targets engineers designing things like simple controllers for domestic appliances like washing machines and is based on the fact that you can now purchase FPGAs containing more than 1 million system gates for around $20.[10] Nexar includes a hardware development board that plugs into the back of your PC. This development board comes equipped with two daughter cards: one carrying a Xilinx FPGA and the other equipped with an Altera device. Nexar also features a number of soft microprocessor cores that replicate the functionality of industry-standard 8-bit devices like the 8051, Z80, and PIC microcontrollers (a range of 16-bit and 32-bit processor and DSP cores are planned for the future).

[8] For example, Seamless from Mentor (www.mentor.com), Incisive from Cadence (www.cadence.com), and XoC from Axis Systems (www.axissystems.com).
[9] This price was true circa November 2003.
[10] Again, this gate-count and price are circa November 2003.

Also
included are a library of peripheral devices, a library of around 1,500 component blocks that range from simple gates to more complex functions such as counters, and a small RTOS. By means of a schematic capture interface, the user places blocks representing the processors, peripherals, and various logic functions and wires them together. All of the blocks supplied with Nexar are provided royalty-free. These blocks have been presynthesized, so when you are ready to rock and roll, they can be directly downloaded into the FPGA on the development board. (If necessary, you can also create your own blocks and capture their contents in RTL. These will subsequently be processed by the synthesis engine bundled with Nexar.)

Clicking on a processor block allows you to enter the C/C++ source code program to be associated with that processor. This will subsequently be processed by one of the compilers bundled with Nexar. The idea is that everything associated with the design (hardware and software) will be downloaded into the FPGA on the development board. In order to see what’s happening in the hardware, you can include a variety of virtual instrument blocks in your schematic, including logic analyzers, frequency counters, frequency generators, and so forth. When it comes to the software, Nexar provides a source-level debugger that allows you to perform all of the usual tasks like setting breakpoints, specifying watch expressions, single-stepping, stepping over, stepping into, and so on.

What can I say? I’ve actually seen one of these little rascals performing its magic, and I was impressed. I really like the fact that this is essentially a turnkey solution, and you get everything (no costly add-ons required) in a package the size of a shoebox.
And for the class of designs it is targeting, I personally think that Nexar is going to be a hard act to follow for some time to come.

Chapter 14
Modular and Incremental Design

Handling things as one big chunk

In order to provide a basis for these discussions, let’s consider an FPGA as containing a series of columns, each of which comprises large numbers of programmable logic blocks along with some blocks of RAM and other hard-wired elements such as multipliers or MACs (Figure 14-1).

Figure 14-1. A column-based architecture (columns of multipliers, RAM blocks, and configurable logic blocks).

Of course, this illustration is a gross simplification, because a modern device can contain more columns than you can swing a stick at and each column can contain humongous amounts of programmable logic, and so forth.

When we initially discussed the programming of SRAM-based FPGAs in chapter 5, we stated that we could visualize all of the SRAM configuration cells as comprising a single (long) shift register. For example, consider a simple bird’s-eye view of the surface of the chip showing only the I/O pins/pads and the SRAM configuration cells (Figure 14-2).

Figure 14-2. SRAM configuration cells presented as a single (long) register chain (legend: I/O pins/pads and SRAM configuration cells).

Once again, we can think of the SRAM configuration cells as a series of columns, each of which maps onto one of the columns of programmable logic shown in figure 14-1. This, too, is a grossly simplified representation because an FPGA can contain tens of millions of these configuration cells, but it will serve our purposes here. The ways in which the two ends of this register chain are made accessible to the outside world will depend on the selected programming mode (this is not relevant to these discussions).
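To make the single-long-register-chain picture concrete, here is a minimal Python sketch of bits being clocked into such a chain and then viewed column by column. The function names, the tiny chain length, and the assumption that consecutive cells along the chain belong to the same column are all invented for illustration; real devices have their own bitstream formats and cell orderings.

```python
def load_configuration(chain_length, bitstream):
    """Shift a bitstream into one long chain of configuration cells.

    On every clock each cell takes the value of its predecessor and the
    new bit enters at cell 0, so the first bit sent ends up at the far
    end of the chain.
    """
    if len(bitstream) != chain_length:
        raise ValueError("bitstream must fill the whole chain")
    cells = [0] * chain_length
    for bit in bitstream:
        cells = [bit] + cells[:-1]
    return cells


def split_into_columns(cells, cells_per_column):
    """View the chain as a series of equally sized columns (an assumption)."""
    return [cells[i:i + cells_per_column]
            for i in range(0, len(cells), cells_per_column)]


cells = load_configuration(4, [1, 0, 1, 1])
columns = split_into_columns(cells, 2)
```

Because the whole device is one chain in this model, changing a single cell means shifting in a complete new bitstream, which is exactly why a monolithic configuration file fits this arrangement so naturally.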
In the early days of FPGA-based designs (circa the mid to late 1980s), devices were relatively small in terms of their logic capacity. One by-product of this was that a single design engineer was typically in charge of creating all of the RTL for the device. This RTL was subsequently synthesized, and the ensuing netlist was passed to the place-and-route software, which processed the design in its entirety. The result was a monolithic configuration file that defined the function of the entire device and would be loaded as one big chunk. This obviously worked well with having the configuration cells presented as a single long register chain, so everyone was happy.

Partitioning things into smaller chunks

Over time, FPGAs have grown larger and more sophisticated, while the size and complexity of designs have increased by leaps and bounds. One way to address this is to partition the design into functional blocks and to give each block to one or more design engineers. Each of these blocks can be synthesized in isolation. At the end of the day, however, all of the netlists associated with the blocks are gathered back together before being handed over to the place-and-route applications. Once again, place-and-route typically works on the design in its entirety, which can require an overnight run when you’re talking about multimillion-gate designs.

Somewhere around 2002, some FPGA vendors started to offer larger devices in which the SRAM configuration cells are presented as multiple (relatively short) register chains (Figure 14-3). The idea of presenting the device with these multiple chains may have been conceived with the concepts of modular and incremental design practices in mind.
Alternatively, it may have come about for some mundane hardware-related reason, and then some bright spark said, “Just a minute, now that we have these multiple chains, what if we started to support the concepts of modular and incremental design?” If I were a betting man, I’d probably put my money on the latter option, but let’s be charitable and assume that someone somewhere actually knew what he or she was doing (hey, it could happen). However it came about, the end result of this architecture is that, along with associated software applications, it can support the concepts of modular and incremental design.

Figure 14-3. SRAM configuration cells presented as multiple (relatively short) register chains (legend: I/O pins/pads and SRAM configuration cells).

Modular design

The terms “block-based” and “bottom-up” may also be associated with modular design. Known as team design by some, this refers to the concept of partitioning a large design into functional blocks and giving each block, along with its associated timing constraints, to a different design engineer or group of engineers. The RTL for each model is captured and synthesized independently, and the final physical netlist is handed over to a system integrator.

Ultimately, each block (or a small group of blocks) will be assigned to a specific area in the device. The system integrator is responsible for “stitching” all of these areas together. In a way, this is similar to having a design split across multiple FPGAs, except that everything is in the same device. The primary advantage of this scenario is that the netlist for each area can be run through the place-and-route applications in isolation (these tools will be given constraints restricting them to specific, predefined areas).
This means that each team member can complete his or her portion of the design to the point that it fully meets its timing requirements after implementation, not just after synthesis.

Incremental design

This refers to the fact that, so long as you’ve tied down the interfaces between blocks/columns, you can modify the RTL associated with a particular block, resynthesize that block, and rerun place-and-route on that block in isolation. This is much, much faster than having to rerun place-and-route on the entire design.

Actually, the term isolation as used in the previous paragraph is possibly a tad misleading. It may be more appropriate to say that the incremental design tools “freeze” all of the unchanged blocks in place, and they only reimplement the changed block(s) in the context of the entire design. This provides an advantage over modular design in which the other blocks aren’t present (of course, the modular and incremental design techniques may be used in conjunction with each other).

On the downside

One problem with the techniques described here is that they can lead to substantial waste of resources because, at the time of this writing, the finest resolution is that of an entire column, so if a particular functional block only occupies, say, 75 percent of the logic in that column, the remaining 25 percent will remain unused and go to waste. (FPGA vendors who support these architectures are talking about providing mechanisms to support finer resolutions in the future.)

Another problem is that the methodology described here is almost bound to result in “tall-and-thin” implementations for each functional block because you are essentially restricting the blocks to one or more vertical columns. This is obviously a pain in the case of those functional blocks that would benefit from a “short-and-fat” realization (spanning multiple columns and using small portions of those columns).
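The resource-waste point can be checked with a little arithmetic. The following Python sketch assumes, purely for illustration, that every column offers the same number of logic cells and that a block must be given whole columns; the function name and the figure of 1,000 cells per column are made up.

```python
def column_allocation(block_logic, column_capacity):
    """Return (columns needed, cells wasted, wasted fraction) for one block.

    Assumes whole-column granularity: a block that needs any part of a
    column is given the entire column.
    """
    if block_logic <= 0 or column_capacity <= 0:
        raise ValueError("sizes must be positive")
    # Integer ceiling division: round up to a whole number of columns.
    columns = (block_logic + column_capacity - 1) // column_capacity
    allocated = columns * column_capacity
    wasted = allocated - block_logic
    return columns, wasted, wasted / allocated


# A block filling 75 percent of one hypothetical 1,000-cell column.
single = column_allocation(750, 1000)
# A block that just spills over into a third column.
spill = column_allocation(2100, 1000)
```

The first case reproduces the 75 percent example from the text: one column is allocated and a quarter of it goes unused. The second shows that a block only slightly larger than two columns wastes most of a third.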
Perhaps the most significant problem with the early releases of tools and flows using these architectures to support modular and incremental design practices is that someone (say the system integrator) is obliged to create a floor plan by hand. This poor soul also has to define and place special interface blocks called bus macros that are used to link buses and individual signals crossing from one block to another (Figure 14-4).

Figure 14-4. Placing bus macros (bus macros sit on the inter-block connections).

The initial implementation of the tools made creating the floor plan and defining and locating the bus macros awkward to say the least. The rumor on the streets is that changes to the software are in the offing that will greatly simplify this process (on the bright side, it couldn’t get any harder).

There’s always another way

Way back in chapter 10, we introduced the concept of FPGA-centric silicon virtual prototypes, or SVPs. At that time, we noted that some EDA vendors have started to provide tools that support the concept of an FPGA SVP by providing a mixture of floor planning and pre-place-and-route timing analysis. This is coupled with the ability to perform place-and-route on individual design blocks, which dramatically speeds up the implementation process.[1]

The point is, if you go back and reread that chapter, you’ll find that the implementation of an FPGA SVP described there fully supports the concepts of modular and incremental design without any of the problems associated with the techniques presented in this chapter. The only problem is that, being much more sophisticated, the tools from an EDA vendor will be substantially more expensive than the offerings from the FPGA vendors.
As always, it’s a case of “you pay your money and you make your choice.”

[1] At the time of writing, one of the chief proponents of FPGA SVPs (in the form described in this book) is Hier Design (www.hierdesign.com).

Chapter 15
High-Speed Design and Other PCB Considerations

Before we start

If you are desperately seeking information on FPGAs containing gigabit serial I/O transceivers, then you’re in the wrong place, and you need to bounce over to chapter 21.

We were all so much younger then

In many respects, life was so much simpler for FPGA design engineers in the not-so-distant past (let’s say 1990, just to stick a stake in the ground). In those halcyon days, no one gave much thought to the lot of the poor old layout designer tasked with creating the PCB.

Here’s the way things went. First of all, even the highest-end FPGAs only had around 200 pins, which is relatively few by today’s standards. If these pins were presented in a pin grid array (PGA) package, the pin pitch (the distance between pins) on these devices was around 1/10 inch (2.5 mm), which is absolutely huge by today’s standards. Last but not least, signal delays through devices like FPGAs were massively large compared to the signal delays along circuit board tracks.

All of these points led to a fairly simplistic design flow. The process would commence with the system architects creating a very rough floor plan of the circuit board by hand, usually on a whiteboard or a scrap of paper. In fact, “floor plan” is probably too strong a term for what we’re talking about, which was really more of a sketch showing the major components and the major connection paths between them (Figure 15-1).

PGA is pronounced by spelling it out as “P-G-A.” A PGA package has an array of pins presented across the bottom face of the device. The circuit board is created with a corresponding set of holes or vias.
These devices are attached to the circuit board by pushing each pin through a corresponding hole or via in the board. Circa 1990s, FPGAs presented in PGA packages were predominantly used for military applications; the norm for commercial applications was the plastic quad flat package (PQFP) with pins presented around the perimeter of the device.

Figure 15-1. The system architects sketch a rough floor plan (a circuit board with the FPGA surrounded by other chips).

Based on this floor plan, the system architects would wave their hands around, make educated guesses about a whole range of things, and eventually pull some input-to-output timing constraints for the FPGA out of the air.

Armed with these timing constraints and a specification of the function the FPGA was to perform, the design engineer (remember, there was typically only one engineer per device) would wander off into the sunset to perform his or her machinations. Generally speaking, it was relatively rare for design engineers to worry too much about FPGA pin assignments. To a large extent, they would let the place-and-route software run wild and free, and they would accept any pin assignments it decided upon.

Once the FPGA design, including the pin assignments, had been finalized, someone would be tasked with creating a graphical symbol of the device for use with schematic capture, along with a graphical representation of the device’s physical footprint for use in the circuit board layout environment. These symbols would include details as to the signal names associated with the physical pins (and the physical locations of the pins in the case of the layout representation).

Meanwhile, the circuit board designer would have been working away in the background placing the other devices and, as far as possible, routing them.
It was only after the FPGA design had been finalized and the symbol created that the FPGA could be fully integrated into the circuit board environment and the routing completed. This meant that, at the end of the day, it was largely left up to the circuit board designer to make everything work.

The bad news was that when we said that the FPGA design had been “finalized,” we really meant that hopefully it was getting close. In the real world, it was almost invariably the case that as soon as the circuit board designer had routed the final track, the FPGA engineer remembered a tweak that just had to be made. Implementing this tweak often ended up modifying the pin assignments, which left the circuit board designer feeling somewhat blue (it was not unknown for strong words to ensue).

The times they are a-changing

Frightening as it may seem, the simplistic flow discussed above persisted throughout most of the 1990s, but the size and complexity of today’s FPGA devices means that this flow simply can’t stand up under the strain. At the time of this writing, we’re talking about high-end FPGAs containing as many as 1,700 pins presented in ball grid array (BGA) packages with pin pitches of only 1 mm. Furthermore, today’s ICs (including FPGAs) are as fast as lightning compared to their ancestors, which makes the delays associated with the circuit board tracks much more significant.

The bottom line is that it is no longer acceptable for the system architects to assign timing constraints to the FPGA in a fairly arbitrary manner and then leave it up to the circuit board designer to make things work at the end of the day. This scenario just won’t fly. Instead, the process needs to start at the board level with the FPGA being treated as a black box (Figure 15-2).
In this case, the circuit board layout designer performs board-level timing based on a preliminary placement, and this information is used to calculate realistic constraints to feed to the FPGA design engineer. In the case of modern designs, there could be hundreds or thousands of such timing constraints, and it simply wouldn’t be possible to generate and prioritize them without performing this board-level analysis.

BGA is pronounced by spelling it out as “B-G-A.” A BGA package has an array of pads presented across the bottom face of the device. The circuit board is created with a corresponding set of pads. Each pad on the FPGA has a small ball of solder attached to it. These devices are attached to the circuit board by placing them in the correct location and then melting the solder balls to form good ball-to-pad connections.

Figure 15-2. The circuit board designer performs preliminary placement (the FPGA is treated as a “black box” among the other chips on the circuit board).

But wait; we have to go farther than this. In order to ensure that the FPGA can be routed successfully, it’s now the board designer who has to perform the initial assignments of signals to the FPGA’s I/O pins. In order to do this, new tools are becoming available to the board designer. These tools provide a graphical representation of the physical footprint for the device along with an interactive interface that allows the user to declare signal names and associate them with specific device pins.[1]

These tools also provide for the auto-generation of the schematic symbol. In the case of devices with 1,000 or more pins, the tool can partition the symbol into multiple parts. One popular push-button option is to create these partitions based on the FPGA’s I/O banks, but it’s also possible to define partitions by hand on a pin-by-pin basis.
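As a rough illustration of the bank-based partitioning option, the following Python sketch groups a pin list by I/O bank so that each bank could become one part of a multipart schematic symbol. The function name, the tuple layout, and the pin, bank, and signal names are hypothetical; they are not taken from any particular board tool or device data sheet.

```python
from collections import defaultdict


def partition_by_bank(pin_assignments):
    """Group (pin, bank, signal) tuples into one symbol part per I/O bank."""
    parts = defaultdict(list)
    for pin, bank, signal in pin_assignments:
        parts[bank].append((pin, signal))
    # Sort banks and the pins within each bank so the output is repeatable.
    return {bank: sorted(pins) for bank, pins in sorted(parts.items())}


# Hypothetical pin assignments made by the board designer.
assignments = [
    ("B2", 1, "data[1]"),
    ("A1", 0, "clk"),
    ("B1", 1, "data[0]"),
    ("A2", 0, "reset_n"),
]
symbol_parts = partition_by_bank(assignments)
```

A hand-defined partition would simply replace the bank number with whatever grouping key the designer chooses for each pin.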
[1] At the time of this writing, a good example of the current state-of-the-art is the BoardLink Pro application from Mentor (www.mentor.com).

Once the circuit board designer has performed this upfront work, it’s necessary to have some mechanism by which to transfer these pin assignments over to the FPGA design engineer, who will use them as physical constraints to guide the FPGA’s place-and-route applications. In the real world, there may still be a number of iterations if the FPGA engineer finds it necessary to make modifications to the original pin assignments, but these tend to be minor compared to the horrors seen when using the flow of yesteryear as was introduced earlier in this chapter.

FPGA Xchange

Until recently, the passing of data back and forth between the board designer and the FPGA design engineer has typically involved a substantial amount of hands-on tweaking to get things to work. That is set to change because a new ASCII file format called FPGA Xchange is being defined by Mentor in conjunction with Altera, Xilinx, and the other major FPGA players. This format will allow the circuit board tools and the FPGA tools to share common definitions of device aspects, such as how signal names have been assigned to physical device pins. This will allow board designers and FPGA engineers to pass data between their two domains quickly and easily.

For example, the board designer may create the original pin assignments and use the associated FPGA Xchange file to pass these as constraints to the FPGA engineer’s place-and-route tools. The board designer can then proceed to lay out the circuit board. Meanwhile, the FPGA engineer may find it necessary to modify some of the pin assignments. These changes would be incorporated into the original FPGA Xchange file, which would subsequently be used by the board-level layout software to rip up any tracks associated with pins that had changed.
These tracks can then be rerouted automatically or auto-interactively.

Other things to think about

High-speed designs

SI is pronounced by spelling it out as “S-I.”

There is a common misconception that the term high-speed design means having a fast system clock. In reality, high-speed effects are associated with the speed of edges (the rate at which signals transition from logic 0 to logic 1, and vice versa). The faster the edge, the more significant are signal integrity (SI) issues such as noise, crosstalk, and the like. Now it’s certainly true that as the frequency of the system clock increases, the speed of edges also has to increase, but you can run into high-speed design problems with even a one megahertz clock if your signals have fast edge rates (and the vast majority of signals have fast edge rates these days).

SI analysis

One of the nice things about FPGAs is that the vendor has already dealt with the vast majority of SI issues inside the chip; however, it is becoming increasingly important to perform SI analysis at the board level. The best tools aren’t cheap, but neither is creating a board that doesn’t work. So you have to choose whether to perform the SI analysis or just roll the dice and see what happens.

SPICE versus IBIS

SPICE, which is pronounced like the seasoning, stands for simulation program with integrated circuit emphasis. This analog simulation program was developed by the University of California at Berkeley and was made available for widespread use around the beginning of the 1970s.

Performing SI analysis at the board level using SPICE models can be time-consuming. In the early 1990s, Intel created and promoted the input/output buffer information specification (IBIS), which is a modeling format that describes the analog characteristics of drivers and receivers.
The reason for Intel’s largesse was that they didn’t want to give detailed SPICE models to customers because these models are at the transistor-capacitor-resistor level, and they can provide a lot of information that a component vendor might not wish its competitors to be aware of.

IBIS models are behavioral in nature and any process-related information is hidden. However, these models are only accurate up to some maximum frequency, which can range from 500 megahertz to 1 gigahertz, depending on who you are talking to. After that point you are obliged to use a more accurate model such as SPICE.

Another problem is that the language has to be extended in order to accommodate new technologies. For example, IBIS has no mechanism to model the effects of pre-emphasis (see also chapter 21). However, the IBIS syntax is not inherently extensible, and augmenting the language via the various open forum committees is a long-winded process (by the time you get anything done, there’s a new technology to worry about).

In late 2002, a proposal was made to augment the IBIS standard (this proposal was called BIRD75, where “BIRD” stands for buffer information resolution document). This proposal would allow calls to external models on a pin-by-pin basis. If adopted, this will allow extensibility, because the external models can be represented in languages such as SPICE, VHDL-AMS, Verilog-A, and so forth.

Startup power

Some FPGAs can have substantial power supply requirements due to high transient startup currents. Board-level designers need to check with the FPGA team to determine these requirements so as to ensure that the board can supply sufficient power to avoid any problems.
IBIS is pronounced “eye-bis.”

Use of internal termination impedances

Nearly all modern high-speed I/O standards require that the tracks on the circuit board have specific impedances and associated termination resistors (having the correct values eliminates the reflection and ringing effects that degrade SI and affect system performance). Using termination resistors that are external to the device may necessitate additional layers in the board, resulting in higher costs and longer development times. In the case of FPGAs with hundreds or thousands of pins, it is almost impossible to place these termination resistors within reasonable proximity to the device (distances greater than 1 cm from the pin cause problems).

DCI is pronounced by spelling it out as “D-C-I.”

For these reasons, some FPGAs include digitally controlled impedance (DCI) capability. Available on both inputs and outputs, DCI termination can be configured to support parallel and series termination schemes. These on-chip resistor values are completely user definable, and the digital implementation of this technology means that their values do not vary with changes in temperature or supply voltages. A simple rule of thumb is that for any signals with rise/fall times of 500 picoseconds or less, external termination resistors cause discontinuities in the signal, in which case you should be using their on-chip counterparts.

Pushing data around in parallel versus serial

It is common in electronic systems to process groups of bits, called words, in parallel, where the width of the word depends on the system. In the case of 8-bit microprocessors and microcontrollers, for example, words are, perhaps not surprisingly, 8 bits wide. In the days of yore, when device manufacturers agonized over every additional pin on an IC package, it was common for chips to include a function called a universal asynchronous receiver transmitter (UART).
Assuming 8-bit words, if the chip wished to send information to the outside world, the UART would convert an 8-bit byte of data from the internal bus into a series of pulses for transmission. Similarly, if the chip wished to access information from the outside world, the UART would accept that information as a series of pulses, collect it into an 8-bit byte, and place that byte on the internal bus. Thus, chips using this technique required only two pins to write and read data: transmit data (TXD) and receive data (RXD).

As packaging technologies improved, increasing the number of pins became less of a burden, so it became more and more common to pass entire words of data around. In the case of an 8-bit system, this would require eight tracks on the circuit board and eight pins on each chip that was connected to this bus.

Over time, it became necessary to push more information around the system and to do so faster. Thus, bus widths increased to 16 bits, then 32 bits, then 64 bits, and so on. At the same time, clock speeds increased from a few megahertz, to tens of megahertz, to hundreds of megahertz, to thousands of megahertz, where 1,000 megahertz equates to 1 gigahertz.

As the speed of the system clock increases, it becomes more and more problematic to route wide buses around a circuit board with any hope of getting the signals where you want them to be, at the time you want them to be there, without running into all sorts of SI problems in the form of noise and crosstalk.² Thus, for the highest bandwidth applications, designers are turning back to serial data transmission in the form of gigabit transceivers. These are introduced in more detail in Chapter 21.

² I know this is a long sentence, but that’s appropriate because it’s been a long day!

1921: Canadian-American John Augustus Larson invents the polygraph lie detector.
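To make the UART idea described earlier concrete, here is a minimal Python sketch of turning an 8-bit byte into a series of pulses and collecting those pulses back into a byte. The framing it uses (one low start bit, eight data bits sent least-significant bit first, one high stop bit) is a common convention that the text does not specify, so treat it as an assumption; the function names are hypothetical.

```python
def uart_serialize(byte):
    """Turn one 8-bit byte into a list of line levels for the TXD pin.

    Assumed framing (not taken from the text): start bit 0, eight data
    bits with the least-significant bit first, stop bit 1.
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected an 8-bit value")
    data_bits = [(byte >> i) & 1 for i in range(8)]
    return [0] + data_bits + [1]


def uart_deserialize(frame):
    """Collect a 10-level frame seen on the RXD pin back into a byte."""
    if len(frame) != 10 or frame[0] != 0 or frame[-1] != 1:
        raise ValueError("framing error")
    return sum(bit << i for i, bit in enumerate(frame[1:9]))


if __name__ == "__main__":
    frame = uart_serialize(0x5A)
    print(frame)
    print(hex(uart_deserialize(frame)))
```

The point of the sketch is the pin count: however wide the internal bus, the data crosses the chip boundary on one wire per direction, at the cost of taking several bit times per word.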
Chapter 16

Observing Internal Nodes in an FPGA

Lack of visibility

One of the problems associated with debugging any chip design, be it an ASIC or an FPGA, is the lack of visibility as to what activity is taking place inside the device. Purely for the sake of these discussions, let’s assume that we have a really simple pipelined design comprising a few registers and logic gates (Figure 16-1).

Figure 16-1. A very simple pipelined circuit.

Obviously, this is something of a nonsense circuit (you’d be amazed how tricky it can be to make something up that doesn’t cloud the issue), but it will serve our purposes here. The problem is that we only have access to the chip’s primary inputs and outputs, so we can’t see what’s happening inside. This isn’t particularly important when the design has been completed and verified, but it’s a pain when we are trying to debug the chip to determine why it isn’t doing what we expected it to.

1921: Czech author Karel Čapek coins the term robot in his play R.U.R.

1921: First use of quartz crystals to keep radios from wandering off station.

One obvious solution would be to make the internal nodes visible by connecting them to primary output pins from the device (Figure 16-2).

Figure 16-2. Connecting internal nodes to primary outputs.

The downside to this scheme is that most designs are “I/O limited,” which means that the bottleneck is the number of primary I/O pins available on the package. In fact, even if we don’t use any I/O pins to access internal nodes, many FPGA designs already leave a pile of internal resources unused because there aren’t enough I/O pins available to convey all of the required control and data signals into and out of the device.
Multiplexing as a solution

One simple alternative is to multiplex the main outputs with the internal signals and bring them all out through the same set of output pins (Figure 16-3).

Figure 16-3. Multiplexing signals. (The internal signals RA0 to RA3 and RB0 to RB3, along with the original outputs RC0 to RC3, feed a multiplexer that drives the primary output pins Out0 to Out3 under the direction of the select control.)

1922: First commercial broadcast ($100 for a 10-minute advert).

Of course, the select control would also require the use of some primary I/O pins. In the example shown here, the simplest case would be to bring the two select control signals directly to the outside world, which would therefore require two I/O pins. Alternatively, we might use a small portion of logic to implement a simple state machine that required only a single I/O to act as a sort of clock to cycle between states, each of which would cause the multiplexer to select a different set of signals (see also the discussions on VirtualWires later in this chapter).

The main advantages of this scheme are that it offers great visibility and it’s relatively fast. The main disadvantages are that it’s relatively inflexible and time-consuming to implement because if you wish to change the internal signals that are being monitored, you have to modify the design’s source code and then resynthesize it. Similarly, if you wish to change any trigger conditions that might be used by a state machine to determine which set of signals is to be selected by the multiplexer at any particular time, you once again have to change the design source code.

Another point to consider is that if, once you’ve debugged everything, you delete these test structures from your source code, you may introduce new problems into the design, not the least of which is that the routing will change, along with its associated delays.

1923: First neon advertising signs are introduced.
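The multiplexing scheme of Figure 16-3 can be sketched behaviorally in Python. This is only an illustration of the idea, not a hardware description: the group names follow the figure, while the signal values, the function name, and the class name are made up for the example. It models both options from the text, namely a 2-bit select brought straight out to two pins and a small state machine stepped by a single pin.

```python
# Group names follow Figure 16-3; RC holds the design's original outputs.
SIGNAL_GROUPS = ("RA", "RB", "RC")


def multiplex(select, signals):
    """Return the 4-bit group routed to pins Out0..Out3.

    select is the 2-bit select control (0, 1, or 2 here); a value of 3
    has no group attached in this sketch and raises IndexError.
    """
    return signals[SIGNAL_GROUPS[select]]


class SelectStateMachine:
    """Single-pin alternative: each pulse on the step pin selects the next group."""

    def __init__(self):
        self.state = 0

    def pulse(self):
        self.state = (self.state + 1) % len(SIGNAL_GROUPS)
        return self.state


if __name__ == "__main__":
    # Hypothetical snapshot of the internal nodes on one clock cycle.
    snapshot = {"RA": [1, 0, 1, 1], "RB": [0, 0, 1, 0], "RC": [1, 1, 0, 0]}
    fsm = SelectStateMachine()
    for _ in range(3):
        print(SIGNAL_GROUPS[fsm.state], multiplex(fsm.state, snapshot))
        fsm.pulse()
```

Note that changing which signals appear in the groups means editing the model itself, which mirrors the inflexibility the text describes: in the real design, that edit is a source-code change followed by resynthesis.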
Special debugging circuitry

Some FPGAs include special debugging circuitry that allows you to observe internal nodes. For example, FPGAs from Actel feature two special pins called PRA (“Probe A”) and PRB (“Probe B”). By means of the embedded debugging circuitry combined with the use of a special debugging utility,¹ any internal signal can be connected to either of these pins, allowing the values on that node to be observed and analyzed.

¹ This used to be called Actionprobe®. Then a “new and improved” version called Silicon Explorer became available. At the time of this writing, Silicon Explorer II is the flavor of the day, and as for tomorrow …

The big advantage of this type of scheme is that you don’t need to touch your source code. The disadvantage is that having only two probe pins might be considered a tad limiting when you have potentially hundreds of thousands of internal signals to worry about.

Virtual logic analyzers

OCI is pronounced by spelling it out as “O-C-I.”

Although the schemes discussed above are useful, it is often desirable to have access to more extensive logic analyzer instrumentation to allow for the tracing and debugging of groups of embedded signals, along with the ability to analyze signals in the context of other related signals or under specific triggering events.

On-chip instrumentation (OCI) is an analysis approach that facilitates logic debugging by allowing the user to embed diagnostic IP blocks such as virtual logic analyzer applications into their designs. The idea here is to use some of the FPGA’s resources to implement one or more virtual logic analyzer blocks that capture the activity of selected signals. This data will be stored in one or more of the FPGA’s embedded RAM blocks, from which it can be accessed by the external logic analyzer software by means of the device’s JTAG port (Figure 16-4).

Figure 16-4. Virtual logic analyzers. (The virtual logic analyzer comprises control logic and an embedded RAM block; its inputs are the signals we wish to monitor and the start/stop conditions to trigger on, and its JTAG connections run from and to the external virtual logic analyzer program or another internal logic analyzer block.)

1923: First photoelectric cell is introduced.

1923: First ship-to-ship communications (people on one ship can talk to people on another).

A portion of the virtual logic analyzer will be devoted to detecting trigger conditions on specific signals, where these conditions will be used to start and stop the data capture on the signals being monitored. Depending on the particular virtual logic analyzer implementation you are working with, you may or may not have to modify your design’s source code to include this functionality. The big advantage of this type of scheme is that, even if you do have to include special macros in your source code, it’s relatively easy to implement extremely sophisticated debugging capabilities in your design.

In some cases, the FPGA vendor will offer this sort of capability. Good examples of debugging tools of this ilk are ChipScope™ Pro from Xilinx (www.xilinx.com) and SignalTap® II from Altera (www.altera.com). Alternatively, if you are working with devices from a vendor who doesn’t offer this capability, one option is to go to a third party such as First Silicon Solutions (www.fs2.com), which specializes in OCI and debugging for FPGA logic and embedded processors.

With regard to a virtual logic analyzer for use in tracing, analyzing, and debugging embedded signals in FPGAs, First Silicon Solutions boasts its configurable logic analyzer module (CLAM). This little scallywag consists of an OCI block (available in both Verilog and VHDL) that is configured and synthesized as part of the design. This block (or blocks if you use more than one) is used in conjunction with control, probe, and display software that resides on your host PC.

1925: America. Scientist, engineer, and politician Vannevar Bush designs an analogue computer called the Product Integraph.

1925: First commercial picture/facsimile radio service across the USA.

VirtualWires

Sometime around the early 1990s, a company called Virtual Machine Works introduced a technology they called VirtualWires™. Originally intended as a technique for implementing massive multi-FPGA systems, VirtualWires provided a basis for a variety of FPGA-based emulation systems. One reason for mentioning it here is that it bears some similarities to the multiplexing solutions discussed earlier in this chapter. Another reason is that it’s a really cool idea.

The problem

The starting point for the VirtualWires concept is that you have a large design that is too big to fit into a single FPGA, so you wish to split it across a number of devices. As a simple example, let’s assume we have a design that equates to some number of system gates, but that the largest FPGA available offers only half this number of gates. Thus, an initial knee-jerk solution would almost certainly be to split the design across two devices (Figure 16-5).

Note that the logic in this figure is shown as comprising a number of subblocks labeled A through H. This is intended only to provide an aid in visualizing the way in which the logic might be partitioned across the devices. The problem is that the chips typically won’t have enough I/O pins to satisfy the requirements of the main inputs and outputs to the design along with the signals linking the two portions of the design. Prior to VirtualWires (or any similar concept), the only option was to further partition the design across more devices (Figure 16-6).
But now we have a new problem in that we are wasting huge amounts of each FPGA’s logic resources, with the result that we are using way too many chips.

Figure 16-5. Not enough pins if we try to split the design across two devices.

Figure 16-6. Lots of wasted FPGA logic resources when partitioning across multiple devices.

The VirtualWires solution

In order to see how the VirtualWires concept addresses our problem, let’s first assume an extreme case in which we have access to some very strange FPGAs that can boast only three pins (two inputs and one output, where one of the inputs assumes the role of a clock). In this case, we would probably end up using only a very small amount of each FPGA’s logic resources (Figure 16-7).

Figure 16-7. An extreme case in which each FPGA has only three pins.

1926: America. Dr. Julius Edgar Lilienfeld from New York files a patent for what we would now recognize as an npn junction transistor being used in the role of an amplifier.

1926: America. First pop-up bread toaster is introduced.

1926: First commercial picture/facsimile radio service across the Atlantic.

The idea behind VirtualWires is that, since we are wasting so much of each device’s internal resources anyway, we might as well use some of these resources to implement some special circuitry that allows our single data input to be latched into a number of registers, each of which can be used to drive its own block of logic. Similarly, the outputs from each of the blocks of logic can be multiplexed together and registered (Figure 16-8).
Figure 16-8. A simple example of VirtualWires.

Note that our original system clock has been superseded by a virtual clock, which subdivides each “beat” of the system clock into some number of “ticks.” Also note that a state machine is implemented inside each FPGA. These state machines are used to enable and disable individual registers and also to control the multiplexers and so forth. (Of course, Figure 16-8 is not to scale; the state machines and other VirtualWires structures would actually consume relatively little of the logic resources in each device compared to the number of logic blocks that are actually implementing the real design.)

On each tick of the virtual clock, the state machine inside each FPGA will enable a register driving one of the logic blocks, thereby allowing the data from the input pin to be loaded into that register. At the same time, the state machine will cause the multiplexer to select the output from one of the logic blocks, and it will store that data in a register driving the output pin, which in turn drives the input to the next FPGA in the chain.

In the real world, of course, our FPGAs will have hundreds or thousands of pins. Each input may be used to drive several blocks of logic, and each output will be driven by its own VirtualWires multiplexer that selects data from a number of blocks of logic. To cut a long story short, things will become much more complicated, but the underlying principle remains the same.
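The tick-by-tick behavior described above can be sketched as a small Python model. It is deliberately idealized: it ignores the output register's one-tick delay and all real timing, and the class and function names are hypothetical rather than anything defined by VirtualWires. Each state machine is modeled as a simple counter that advances once per virtual-clock tick.

```python
class VirtualWiresSender:
    """FPGA (n): a multiplexer puts one logic-block output on the pin per tick."""

    def __init__(self, block_outputs):
        self.block_outputs = list(block_outputs)
        self.state = 0  # state machine: which block the multiplexer selects

    def tick(self):
        pin = self.block_outputs[self.state]
        self.state = (self.state + 1) % len(self.block_outputs)
        return pin


class VirtualWiresReceiver:
    """FPGA (n + 1): one input register is enabled per tick."""

    def __init__(self, num_registers):
        self.registers = [None] * num_registers
        self.state = 0  # state machine: which register is enabled

    def tick(self, pin):
        self.registers[self.state] = pin
        self.state = (self.state + 1) % len(self.registers)


def system_clock_beat(sender, receiver):
    """Run one beat of the system clock as a burst of virtual-clock ticks."""
    for _ in range(len(receiver.registers)):
        receiver.tick(sender.tick())
    return receiver.registers


if __name__ == "__main__":
    sender = VirtualWiresSender([1, 0, 0, 1])  # outputs of four logic blocks
    receiver = VirtualWiresReceiver(4)
    print(system_clock_beat(sender, receiver))
```

In this model four logical signals cross between the devices on one physical pin, at the cost of four ticks per beat of the system clock. That trade of speed for pins is the essence of the scheme.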
Last but not least, a key element to the VirtualWires concept is a compiler that takes the original design in the form of a gate-level netlist, partitions this design across multiple FPGAs, automatically creates the state machines and other VirtualWires-related structures, and then generates the configuration files that will be used to load each of the FPGAs.

1926: John Logie Baird demonstrates an electromechanical TV system.

1927: First five-electrode vacuum tube (the Pentode) is introduced.

Chapter 17

Intellectual Property

Sources of IP

IP is pronounced by spelling it out as “I-P.”

Today’s FPGA designs are so big and complex that it would be impractical to create every portion of the design from scratch. One solution is to reuse existing functional blocks for the boring stuff and spend the bulk of your time and resources creating the cunning new portions of the design that feature your “secret sauce” and that will differentiate your design from any competitor offerings.

Any existing functional blocks are typically referred to as IP. The three main sources of such IP are (1) internally created blocks from previous designs, (2) FPGA vendors, and (3) third-party IP providers. For the purposes of these discussions, we shall concentrate on the latter two categories.

Handcrafted IP

One scenario is that the IP provider has handcrafted an IP block starting with an RTL description (the provider might also have used an IP block/core generator application, as discussed later in this chapter). In this case, there are several ways in which the end user might purchase and use such a block (Figure 17-1).

Figure 17-1. Alternative potential IP acquisition points.

IP at the unencrypted RTL level

In certain cases, FPGA designers can purchase IP at the RTL level as blocks of unencrypted source code. These blocks can then be integrated into the RTL code for the body of the design (Figure 17-1a). (Note that the IP provider would already have simulated, synthesized, and verified the IP blocks before handing over the RTL source code.)

NDA is pronounced by spelling it out as “N-D-A.”

Generally speaking, this is an expensive option because IP providers typically don’t want anyone to see their RTL source code. Certainly, FPGA vendors are usually reluctant to provide unencrypted RTL because they don’t want anyone to retarget it toward a competitor’s device offering. So if you really wish to go this route, whoever is providing the IP will charge you an arm and a leg, and you’ll end up signing all sorts of licensing and nondisclosure agreements (NDAs).

Assuming you do manage to lay your hands on unencrypted RTL, one advantage of this approach is that you can modify the code to remove any functions you don’t require in your design (or in some cases you might add new functions). Another advantage, assuming that you purchase the IP from a third party rather than from an FPGA vendor, is that you can quickly and easily retarget the IP across different device families and FPGA vendors. The big disadvantage is that the resulting implementation will typically be less efficient in terms of resource requirements and performance when compared to an optimized version delivered at the netlist level as discussed below.

IP at the encrypted RTL level

Unfortunately, at the time of this writing, there is no industry-standard encryption technique for RTL that has popular tool support.
This has led companies like Altera and Xilinx to develop their own encryption schemes and tools. RTL encrypted by a particular FPGA vendor’s tools can only be processed by that vendor’s own synthesis tools (or sometimes by a third-party synthesis tool that has been OEM’d by the FPGA vendor).

IP at the unplaced-and-unrouted netlist level

EDIF is pronounced “E-DIF;” that is, by spelling out the ‘E’ followed by “dif” to rhyme with “miff.”

Perhaps the most common scenario is for FPGA designers to purchase IP at the unplaced-and-unrouted LUT/CLB netlist level (Figure 17-1b). Such netlists are typically provided in encrypted form, either as encrypted EDIF or using some FPGA vendor-specific format. In this case, the IP vendor may also provide a compiled cycle-accurate C/C++ model to be used for functional verification because such a model will simulate much faster than the LUT/CLB netlist-level model.

The main advantage of this scenario is that the IP provider has often gone to a lot of effort tuning the synthesis engine and handcrafting certain portions of the function so as to achieve an optimal implementation in terms of resource utilization and performance. One disadvantage is that the FPGA designer doesn’t have any ability to remove unwanted functionality. Another disadvantage is that the IP block is tied to a particular FPGA vendor and device family.

IP at the placed-and-routed netlist level

In certain cases, the FPGA designer may purchase IP at the placed-and-routed LUT/CLB netlist level (Figure 17-1c). Once again, such netlists are typically provided in encrypted form, either as encrypted EDIF or using some FPGA vendor-specific format.

1927: First public demonstration of long-distance television transmission (basically a Nipkow disk).

The reason for having placed-and-routed representations is to obtain the highest levels of performance.
In some cases the placements will be relative, which means that the locations of all of the LUT, CLB, and other elements forming the block are fixed with respect to each other, but the block as a whole may be positioned anywhere (suitable) within the FPGA. Alternatively, in the case of IP blocks such as communications or bus protocol functions with specific I/O pin requirements, the placements of the elements forming the block may be absolute, which means that they cannot be changed in any way.

Once again, the IP vendor may also provide a compiled cycle-accurate C/C++ model to be used for functional verification because such a model will simulate much faster than the LUT/CLB netlist-level model.

IP core generators

Another very common practice is for FPGA vendors (and sometimes EDA vendors, IP providers, and even small, independent design houses) to provide special tools that act as IP block/core generators. These generator applications are almost invariably parameterized, thereby allowing you to specify the widths, depths, or both, of buses and functional elements.

First, you get to select from a list of different blocks/cores, and then you get to specify the parameters to be associated with each. Furthermore, in the case of some blocks/cores, the generator application may allow you to select from a list of functional elements that you wish to be included or excluded from the final representation. In the case of a communications block, for example, it might be possible to include or exclude certain error-checking logic. Or in the case of a CPU core, it might be possible to omit certain instructions or addressing modes. This allows the generator application to create the most efficient IP block/core in terms of its resource requirements and performance.
Depending on the origin of the generator application (or sometimes the licensing option you’ve signed up for), its output may be in the form of encrypted or unencrypted RTL source code, an unplaced-and-unrouted netlist, or a placed-and-routed netlist. In some cases, the generator may also output a cycle-accurate C/C++ model for use in simulation (Figure 17-2).

Figure 17-2. IP block/core generator. (Driven by input from the FPGA designer, the generator outputs RTL for the IP block, an unplaced-and-unrouted netlist, a placed-and-routed netlist, or a cycle-accurate C/C++ model.)

Miscellaneous stuff

There is currently a push by the main FPGA vendors to provide special system generator utilities. These tools are essentially IP integrators that allow you to quickly build up very sophisticated designs using the various IP building blocks available from the respective FPGA vendor.

These system generator tools essentially spit out netlists for systems defined in some abstract form (as opposed to detailed end-user RTL coding). These tools aim to change the FPGA design model by providing a system-level design paradigm that sits on top of the standard RTL-based design flow. This concept is of particular interest for designers who don’t write RTL or who prefer to work at a higher level of abstraction (see also Chapter 12).

IDE is pronounced by spelling it out as “I-D-E.” Depending on whom you are talking to, the ‘D’ in IDE can stand for “design” or “development.”

In addition to providing system generators, FPGA vendors are also working to simplify the use of IP by incorporating IP-based design-flow capabilities into their integrated development environments (IDEs).

Last, but not least, some IP that used to be “soft” is now becoming “hard.” For example, the most current generation of FPGAs contains hard processor, clock manager, Ethernet, and gigabit I/O blocks, among others. These help bring high-end ASIC functionality into standard FPGAs.
Over time, it is likely that additional functions of this ilk will be incorporated into the FPGA fabric.

Chapter 18

Migrating ASIC Designs to FPGAs and Vice Versa

Alternative design scenarios

When it comes to creating an FPGA design, there are a number of possible scenarios depending on what you are trying to do (Figure 18-1).

Scenario        Existing design   New design   Final implementation
FPGA only       N/A               FPGA         FPGA
FPGA-to-FPGA    FPGA              FPGA         FPGA
FPGA-to-ASIC    N/A               FPGA         ASIC
ASIC-to-FPGA    ASIC              FPGA         FPGA

Figure 18-1. Alternative design scenarios.

FPGA only

This refers to a design that is intended for an FPGA implementation only. In this case, one might use any of the design flows and tools introduced elsewhere in this book.

FPGA-to-FPGA

This refers to taking an existing FPGA-based design and migrating it to a new FPGA technology (the new technology is often presented in the form of a new device family from the same FPGA vendor you used to implement the original design, but you may be moving to a new vendor also).

With this scenario, it is rare that you will be performing a simple one-to-one migration, which means taking the contents of an existing component and migrating them directly to a new device. It is much more common to migrate the functionality from multiple existing FPGAs to a single new FPGA. Alternatively, you might be gathering the functionality of one or more existing FPGAs, plus a load of surrounding discrete logic, and bundling it all into a new device.

In these cases, the typical route is to gather all of the RTL code describing the original devices and discrete logic into a single design. The code may be tweaked so as to take advantage of any new features available in the targeted device and then resynthesized.

FPGA-to-ASIC

This refers to using one or more FPGAs to prototype an ASIC design. One big issue here is that, unless you’re working with a small to medium ASIC, it is often necessary to partition the design across multiple FPGAs. Some EDA and FPGA vendors either have (or used to have) applications that will perform this partitioning automatically,¹ but tools like this come and go with the seasons. Also, their features and capabilities, along with the quality of their results, can change on an almost weekly basis (which is my roundabout way of telling you that you’ll have to evaluate the latest offerings for yourself).

¹ A good example of an application that provides this sort of functionality is Certify® from Synplicity (www.synplicity.com).

Literally as this book was heading to press, Synopsys (www.synopsys.com) made a rather interesting announcement. Using its well-known Design Compiler® ASIC synthesis engine as a base, they’ve created an FPGA-optimized version called Design Compiler FPGA. Among other things, DC FPGA features some innovative new Adaptive Optimization™ Technology that looks to be very interesting. But the main point is that DC ASIC and DC FPGA can use the same RTL source code, constraints, etc. to create both ASIC and FPGA implementations of the same design. (Each engine can be instructed to use different microarchitecture schemes such as resource sharing and the number of pipeline stages. Furthermore, DC FPGA can perform automatic transformation on any ASIC-centric clock-gating embedded in the RTL.) All of this makes the combination of DC FPGA and DC ASIC very interesting in the context of using FPGAs as prototypes for final ASIC implementations.

Another consideration is that functions like RAMs configured to act as FIFO memories or dual-port memories have specific realizations when they are implemented using embedded RAM blocks in FPGAs. These realizations are typically different from the way in which these functions will be implemented in an ASIC, which may cause problems. One solution is to create your own RTL library of ASIC functions for such things as multipliers, comparators, memory blocks, and the like that will give you a one-for-one mapping with their FPGA counterparts. Unfortunately, this means instantiating these elements in the RTL code for your design, as opposed to using generic RTL and letting the synthesis engine handle everything (so it’s a balancing act like everything else in engineering).

As we discussed in Chapter 7, a design intended for an FPGA implementation typically contains fewer levels of logic between register stages than would a pure ASIC design. In some cases, it’s best to create the RTL code associated with the design with the final ASIC implementation in mind and just take the hit with regard to reduced performance in the FPGA prototype.

Alternatively, one might generate two flavors of the RTL: one for use with the FPGA prototype and the other to provide the final ASIC. But this is generally regarded to be a horrible way to do things because it’s easy for the two representations to lose synchronization and end up going in two totally different directions.

One way around this might be to use the pure C/C++ based tools introduced in Chapter 11. As you may recall, the idea here is that, as opposed to adding intelligence to the RTL source code by hand (thereby locking it into a target implementation), all of the intelligence is provided by your controlling and guiding the C/C++ synthesis engine itself (Figure 18-2).

Figure 18-2. A pure C/C++–based design flow. (Under user interaction and guidance, pure C/C++ synthesis turns a single pure C/C++ description, which is non-implementation-specific, easy to create, fast to simulate, and easy to modify, into auto-generated, implementation-specific Verilog/VHDL RTL. RTL synthesis then produces a LUT/CLB-level netlist for the FPGA target or a gate-level netlist for the ASIC target.)
1927: Harold Stephen Black conceives the idea of negative feedback, which, amongst other things, makes Hi-Fi amplifiers possible.

Once the synthesis engine has parsed the C/C++ source code, you can use it to perform microarchitecture tradeoffs and evaluate their effects in terms of size and speed. The user-defined configuration associated with each “what-if” scenario can be named, saved, and reused as required. Thus, you could first create a configuration for use as an FPGA prototype and, once this had been verified, you could create a second configuration to be used for the final ASIC implementation. The key point is that the same C/C++ source code is used to drive both flows.

Another point to ponder is that a modern ASIC design can contain an unbelievable number of clock domains and subdomains (we’re talking about hundreds of domains/subdomains here). By comparison, an FPGA has a limited number of primary clock domains (on the order of 10). This means that if you’re using one or more FPGAs to prototype your ASIC, you’re going to have to put a lot of thought into how you handle your clocks.

Last but not least, there’s an interesting European Patent numbered EP0437491 (B1), which, when you read it (and, good grief, it’s soooo boring), seems to lock down the idea of using multiple programmable devices like FPGAs to temporarily realize a design intended for final implementation as an ASIC. In reality, I think this patent was probably targeted toward using FPGAs to create a logic emulator, but the way it’s worded would prevent anyone from using two or more FPGAs to prototype an ASIC.

ASIC-to-FPGA

This refers to taking an existing ASIC design and migrating it to an FPGA. The reasons for doing this are wide and varied, but they often involve the desire to tweak an existing ASIC’s functionality without spending vast amounts of money.
Alternatively, the original ASIC technology may have become obsolete, but parts might still be required to support ongoing contracts (this is often the case with regard to military programs). One point of interest is that the latest generation of FPGAs has usually jumped so far so fast that it’s possible to place an entire ASIC design from just a few years ago into a single modern FPGA (if you do have to partition the design across multiple FPGAs, then there are tools to aid you in this task, as discussed in the “FPGA-to-ASIC” section above).

First of all, you are going to have to go through your RTL code with a fine-tooth comb to remove (or at least evaluate) any asynchronous logic, combinatorial loops, delay chains, and things of this ilk (see also Chapter 7). In the case of flip-flops with both set and reset inputs, you might wish to recode these to use only one or the other (see also Chapter 7). You might also wish to look for any latches and redesign the circuit to use flip-flops instead. Also, you should keep a watchful eye open for statements like if-then-else without the else clause because, in these cases, synthesis tools will infer latches (see also Chapter 9).

In the case of clocks, you will have to ensure that your target FPGA provides enough clock domains to handle the requirements of the original ASIC design; otherwise, you’ll have to redesign your clock circuitry. Furthermore, if your original ASIC design made use of clock-gating techniques, you will have to strip these out and possibly replace them with clock-enable equivalents (see also Chapter 7).
Once again, some FPGA and EDA vendors provide synthesis tools that can automatically convert an ASIC design using gated clocks to an equivalent FPGA design using clocks with enables.[2]

In the case of complex functional elements such as memory blocks (e.g., FIFOs and dual-port RAMs), it will probably be necessary to tweak the RTL code to fit the design into the FPGA. In some cases, this will involve replacing generic RTL statements (that will be processed by the synthesis engine) with calls to instantiate specific subcircuits or FPGA elements.

Last, but not least, the original pipelined ASIC design probably had more levels of logic between register elements than you would like in the FPGA implementation if you wish to maintain performance. Most modern logic synthesis and physically aware tools provide retiming capability, which allows them to move logic back and forth across pipeline register boundaries to achieve better timing (the physically aware synthesis engines typically do a much better job at this; see also Chapter 19). It’s also true that your modern FPGA is probably based on a later technology node (say, 130 nm) than your original ASIC design (say, 250 nm). This gives the FPGA an inherent speed advantage, which serves to offset its inherent track-delay disadvantages. At the end of the day, however, you may still end up having to hand-tweak the code to add in more pipeline stages.

[2] A good example of an application that provides this sort of functionality is Amplify® from Synplicity (www.synplicity.com).

1927: America. Philo Farnsworth assembles a complete electronic TV system.

1928: America. First quartz crystal clock is introduced.

Chapter 19

Simulation, Synthesis, Verification, etc. Design Tools

Introduction

Design engineers typically need to use a tremendous variety of tools to capture, verify, synthesize, and implement their designs.
Introducing all of these tools would require a book in itself,[1] so this chapter focuses on some of the more significant contenders in the context of FPGA designs (along with a couple I threw in for interest’s sake):

■ Simulation (cycle-based, event-driven, etc.)
■ Synthesis (logic/HDL versus physically aware)
■ Timing analysis (static versus dynamic)
■ Verification in general
■ Formal verification
■ Miscellaneous

[1] I’d be more than happy to write such a book if anyone would be prepared to fund the effort!

Simulation (cycle-based, event-driven, etc.)

What are event-driven logic simulators?

Logic simulation is currently one of the main verification tools in the design (or verification) engineer’s arsenal. The most common form of logic simulation is known as event driven because, perhaps not surprisingly, these tools see the world as a series of discrete events. As an example, consider a very simple circuit comprising an OR gate driving both a BUF (buffer) gate and a brace of NOT (inverting) gates, as shown in Figure 19-1.

1928: John Logie Baird demonstrates color TV on an electronic TV system.

[Figure 19-1. An example circuit. OR gate g1 (inputs in1 and in2) drives wire w1; BUF gate g2 drives output out1 from w1; NOT gate g3 drives wire w2 from w1; NOT gate g4 drives output out2 from w2.]

Just to keep things simple, let’s assume that NOT gates have a delay of 5 picoseconds (ps), BUF gates have a delay of 10 ps, and OR gates have a delay of 15 ps. On this basis, let’s consider what will happen when a signal change occurs on one of the input pins (Figure 19-2).

[Figure 19-2. Results from an event-driven simulation. Waveforms for in1, in2, w1, w2, out1, and out2 at times t1 to t4, with 15 ps between t1 and t2, 5 ps between t2 and t3, and 5 ps between t3 and t4.]

Internally, the simulator maintains something called an event wheel onto which it places events that are to be “actioned” at some time in the future. When the first event occurs on input in1 at a time we might refer to as t1, the simulator looks to see what this input is connected to, which happens to be our OR gate.
We are assuming that the OR gate has a delay of 15 ps, so the simulator schedules an event on the output of the OR gate (a rising, 0 to 1, transition on wire w1) for 15 ps in the future at time t2. The simulator then checks if any further actions need to be performed at the current time (t1), then it looks at the event wheel to see what is to occur next. In the case of our example, the next event happens to be the one we just scheduled at time t2, which was for a rising transition on wire w1. At the same time as the simulator is performing this action, it looks to see what wire w1 is connected to, which is BUF gate g2 and NOT gate g3. As NOT gate g3 has a delay of 5 ps, the simulator schedules a falling (1 to 0) transition on its output, wire w2, for 5 ps in the future at time t3. Similarly, as BUF gate g2 has a delay of 10 ps, the simulator schedules a rising (0 to 1) transition on its output, output out1, for 10 ps in the future at time t4. And so it goes until all of the events triggered by the initial transition on input in1 have been satisfied.

The advantage of this event-driven approach is that simulators based on this technique can be used to represent almost any form of design, including synchronous and asynchronous circuits, combinatorial feedback loops, and so forth. These simulators also offer extremely good visibility into the design for debugging purposes, and they can evaluate the effects of delay-related narrow pulses and glitches that are very difficult to find using other techniques (see also the discussions on delays in the next section). The big disadvantage associated with these simulators is that they are extremely compute-intensive and correspondingly slow.

A brief overview of the evolution of event-driven logic simulators

As we discussed in Chapter 8, the first event-driven digital logic simulators (circa the late 1960s and early 1970s) were based on the concept of simulation primitives.
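The event wheel behind the walk-through for Figure 19-1 can be sketched using a handful of just such primitives. What follows is a minimal Python sketch, not how any production simulator is written. The gate delays are the ones assumed in the text; the connectivity of g4 (taken here to be the second NOT gate, driving out2 from w2), the names GATES, INITIAL, and simulate, and the use of a heap as the event wheel are illustrative assumptions. The sketch does no narrow-pulse handling: it simply drops any scheduled event that would not change a wire's value.

```python
import heapq
from itertools import count

# Gate delays in picoseconds, as assumed in the text.
DELAY_PS = {"OR": 15, "BUF": 10, "NOT": 5}

# Evaluation functions for the three primitives used in Figure 19-1.
EVAL = {
    "OR": lambda a, b: a | b,
    "BUF": lambda a: a,
    "NOT": lambda a: 1 - a,
}

# Netlist: gate name -> (primitive, input wires, output wire).
# The g4 connection is an assumption inferred from the figure.
GATES = {
    "g1": ("OR", ("in1", "in2"), "w1"),
    "g2": ("BUF", ("w1",), "out1"),
    "g3": ("NOT", ("w1",), "w2"),
    "g4": ("NOT", ("w2",), "out2"),
}

# Settled wire values for in1 = in2 = 0.
INITIAL = {"in1": 0, "in2": 0, "w1": 0, "w2": 1, "out1": 0, "out2": 0}


def simulate(initial, stimulus):
    """Run the event wheel; return a list of (time_ps, wire, value) changes."""
    values = dict(initial)
    order = count()  # tie-breaker so simultaneous events stay in schedule order
    wheel = [(time, next(order), wire, value) for time, wire, value in stimulus]
    heapq.heapify(wheel)
    trace = []
    while wheel:
        time, _, wire, value = heapq.heappop(wheel)
        if values[wire] == value:
            continue  # nothing changes, so nothing to propagate
        values[wire] = value
        trace.append((time, wire, value))
        # Schedule a future event on the output of every gate this wire drives.
        for kind, inputs, output in GATES.values():
            if wire in inputs:
                new_value = EVAL[kind](*(values[i] for i in inputs))
                heapq.heappush(
                    wheel, (time + DELAY_PS[kind], next(order), output, new_value)
                )
    return trace


for event in simulate(INITIAL, [(0, "in1", 1)]):
    print(event)
```

Taking t1 as time 0, the sketch reports w1 rising at 15 ps, w2 falling at 20 ps, and both out1 and out2 rising at 25 ps, which corresponds to t2, t3, and t4 in the walk-through.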
At a minimum, these primitive elements would include logic gates such as BUF, NOT, AND, NAND, OR, NOR, XOR, and XNOR, along with a number of tri-state buffers. Some simulators also offered a selection of registers and latches as primitive elements, while others required you to create these functions as subcircuits formed from a collection of the more primitive logic gates.

1928: John Logie Baird invents a videodisc to record television programs.

1929: Joseph Schick invents the electric razor.

At that time, the functionality of the design would be captured using a standard text editor as a gate-level netlist. Similarly, the testbench would be captured as a textual (tabular) stimulus file. The simulator would accept the netlist and testbench along with any control files and command-line instructions; it would use the netlist to build a model of the circuit in the computer’s memory; it would apply the stimulus from the testbench to this model; and it would output results in the form of a textual (tabular) file (Figure 19-3).

Textual gate-level netlist:

    BEGIN CIRCUIT=TEST
      INPUT  SET_A, SET_B, DATA, CLOCK, CLEAR_A, CLEAR_B;
      OUTPUT Q, N_Q;
      WIRE   SET, N_DATA, CLEAR;
      GATE G1=NAND (IN1=SET_A, IN2=SET_B, OUT1=SET);
      GATE G2=NOT  (IN1=DATA, OUT1=N_DATA);
      GATE G3=OR   (IN1=CLEAR_A, IN2=CLEAR_B, OUT1=CLEAR);
      GATE G4=DFF  (IN1=SET, IN2=N_DATA, IN3=CLOCK, IN4=CLEAR, OUT1=Q, OUT2=N_Q);
    END CIRCUIT=TEST;

Textual (tabular) stimulus:

    TIME  SET_A  SET_B  DATA  CLOCK  CLEAR_A  CLEAR_B
    ----  -----  -----  ----  -----  -------  -------
       0    1      1     1      0       0        0     ; Set up
     500    1      1     1      1       0        0     ; Rising edge
    1000    1      1     1      0       0        0     ; Falling edge
    1500    1      1     0      0       0        0     ; Set data
    2000    1      1     0      1       0        0     ; Rising edge
    2500    1      1     0      1       0        1     ; Clear active
       :  etc.

Textual (tabular) results file (stimulus and response), as output by the logic simulator:

    TIME  SET_A  SET_B  DATA  CLOCK  CLEAR_A  CLEAR_B    SET  N_DATA  CLEAR  Q  N_Q
    ----  -----  -----  ----  -----  -------  -------    ---  ------  -----  -  ---
       0    1      1     1      0       0        0        X     X       X    X   X
       5    1      1     1      0       0        0        X     0       X    X   X
      10    1      1     1      0       0        0        0     0       0    X   X
     500    1      1     1      1       0        0        0     0       0    X   X
     520    1      1     1      1       0        0        0     0       0    0   1
    1000    1      1     1      0       0        0        0     0       0    0   1
    1500    1      1     0      0       0        0        0     0       0    0   1
    1505    1      1     0      0       0        0        0     1       0    0   1
    2000    1      1     0      1       0        0        0     1       0    0   1
    2020    1      1     0      1       0        0        0     1       0    1   0
    2500    1      1     0      1       0        1        0     1       0    1   0
    2510    1      1     0      1       0        1        0     1       1    1   0
    2530    1      1     0      1       0        1        0     1       1    0   1
       :  etc.

Figure 19-3. Running a logic simulator.

RTL is pronounced by spelling it out as “R-T-L.”

Over time things started to become a little more sophisticated. First, schematic capture packages were used to capture the design and to generate the gate-level netlist. Next, special display tools were used to read in the textual results files and to present the results as graphical waveforms. In some cases, these waveform tools were also used to capture the testbench in a graphical manner and to generate the tabular stimulus file.

Still later, the creators of digital simulators started to experiment with more sophisticated languages that could describe logical functions at higher levels of abstraction such as the register transfer level, or RTL. A good example of such a language was the GenRad Hardware Description Language (GHDL) used by the System HILO simulator. Similarly, more sophisticated testbench languages started to evolve, such as the GenRad Waveform Description Language (GWDL). Languages of this type could support complex constructs like loops, and they could even access the current state of the circuit and vary their tests accordingly (along the lines of, “If this output is a logic 0, then jump to Test B or else jump to Test C”).
In some respects, these early languages were ahead of their time. For example, GWDL had a really useful feature in that, in addition to specifying the input stimulus (e.g., “input-A = 0”), you could also specify the expected output response (e.g., “output-Y == 1”). (Note the use of one equals sign to assign a value to an input and of a pair of equals signs to indicate an expected response.) If you then used a special STROBE statement, the simulator would check to see if the actual response (from the circuit) matched the expected response (specified in the waveform) and generate a warning if there was a discrepancy between the two.

As the years passed by, industry-standard HDLs such as Verilog and VHDL started to appear. These had the advantage that the same language could be used to represent both the functionality of the circuit and the testbench.[2] (See also the discussions on special verification languages like e in the “Verification in general” section later in this chapter.)

[2] The chief architect of the Verilog language, Phil Moorby, was also one of the designers of the original HILO language and simulator.

VCD is pronounced by spelling it out as “V-C-D.” FSDB is pronounced by spelling it out as “F-S-D-B.” SDF is pronounced by spelling it out as “S-D-F.”

As opposed to using the ‘X’ character to represent “unknown” or “don’t know,” data books typically use it to represent “don’t care.” By comparison, hardware description languages tend to use ‘?’ or ‘-’ to represent “don’t care” values. Also, “don’t care” values cannot be assigned to outputs as driven states. Instead, they are used to specify how a model’s inputs should respond to different combinations of signals.
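The idea of pairing each stimulus with an expected response and checking it at a strobe point, as described above for GWDL, can be sketched in a few lines of Python. This is only an illustration of the concept and makes no attempt to reproduce GWDL syntax: the names run_vectors, and_gate, and VECTORS are hypothetical, and the circuit under test is reduced to a plain function.

```python
def run_vectors(model, vectors):
    """Apply each stimulus, then compare actual outputs with expected ones.

    Each vector is (time, inputs, expected). Mismatches are collected as
    warning strings, in the spirit of a STROBE check.
    """
    warnings = []
    for time, inputs, expected in vectors:
        actual = model(**inputs)  # apply the stimulus to the circuit model
        for name, want in expected.items():
            if actual[name] != want:
                warnings.append(
                    f"t={time}: {name} expected {want}, got {actual[name]}"
                )
    return warnings


def and_gate(a, b):
    """Stand-in for the circuit under test: a 2-input AND gate with output y."""
    return {"y": a & b}


VECTORS = [
    (0, {"a": 0, "b": 1}, {"y": 0}),
    (10, {"a": 1, "b": 1}, {"y": 1}),
    (20, {"a": 1, "b": 0}, {"y": 1}),  # deliberately wrong expectation
]

for warning in run_vectors(and_gate, VECTORS):
    print(warning)
```

The last vector carries a deliberately wrong expected value, so the sketch reports a single discrepancy at t=20.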
Also, standard file formats for capturing simulation output results, such as the value change dump (VCD) format, started to appear on the scene. This facilitated third-party EDA companies creating sophisticated waveform display and analysis tools that could work with the outputs from multiple simulators. (A more recent entry here is the Fast Signal Database™ (FSDB) format from Novas Software (www.novas.com), which provides much smaller file sizes than VCD while offering extremely fast information-retrieval capabilities.) Similarly, innovations like the standard delay format (SDF) specification facilitated third-party EDA companies’ creating sophisticated timing analysis tools that could evaluate circuits, generate timing reports highlighting potential problems, and output SDF files that could be used to provide more accurate timing simulations (see also the discussion on alternative delay formats below).

Logic values and different logic value systems

The overwhelming majority of today’s digital electronics systems are based on binary logic with digits called bits; that is, logic gates using two different voltages to represent the binary digits 0 and 1 or the Boolean logic values True and False. Some experiments have been performed on ternary logic, which is based on three different logic levels and whose digits are referred to as trits. Thus far, however, this technology hasn’t made any inroads into commercial applications (for which what’s left of my brain is truly thankful).

But we digress. The minimum set of logic values required to represent the operation of binary logic gates is 0 and 1. The next step is the ability to represent unknown values, for which we typically use the character X. These unknown values may be used to represent a variety of conditions, such as the contents of an uninitialized register or the clash resulting from two gates driving the same wire with opposing logical values. And it’s also nice to be able to represent high-impedance values driven by the outputs of tri-state gates, for which we typically use the character Z.

But the 0, 1, X, and Z states are only the tip of the iceberg. More advanced logic simulators have ways to associate different drive strengths with the outputs of different gates. This is combined with ways in which to resolve and represent situations where multiple gates are driving the same wire with different logic values of different strengths. Just to make life fun, of course, VHDL and Verilog handle this sort of thing in somewhat different ways.

Digital simulation logic value systems (such as the cross-product versus interval-value approaches) and various aspects of unknown X values are introduced in more detail in my book Designus Maximus Unleashed (Banned in Alabama), ISBN 0-7506-9089-5.

Mixed-language simulation

The problem with having two industry-standard languages like Verilog and VHDL is that it’s not long before you find yourself with different portions of a design represented in different languages. Anything you design from scratch will obviously be written in the language du jour favored by your company. However, problems can arise if you wish to reuse legacy code that is in the other language. Similarly, you may wish to purchase blocks of IP from a third party, but this IP may be available only in the language you aren’t currently using yourself. And there’s also the case where your company merges with, or commences a joint project with, another company, where the two companies are entrenched in design flows using disparate languages.

This leads to the concept of mixed-language simulation, of which there have historically been several flavors. One technique used in the early days was to translate the “foreign” language (the one you weren’t using) into the language you were working with. This was painful to say the least because the different languages supported different logic states and language constructs (even similar language statements had different semantics). The end result was that when you simulated the translated design, it rarely behaved the way you expected it to, so this approach is rarely used today.

Another technique was to have both a VHDL simulator and a Verilog simulator and to cosimulate the two simulation kernels. In this case the performance of the ensuing simulation was sadly lacking because each kernel was forever stopping while it waited for the other to complete an action. Thus, once again, this approach is rarely used today.

The optimum solution is to have a single-kernel simulator that supports designs represented as a mixture of VHDL and Verilog blocks. All of the big boys in EDA have their own version of such a tool, and some go far beyond anything envisaged in the past because they can support multiple languages such as Verilog, SystemVerilog, VHDL, SystemC, and PSL (where PSL is introduced in more detail in the “Formal verification” section in this chapter).[3]

STA is pronounced by spelling it out as “S-T-A.”

Alternative delay formats

How you decide to represent delays in the models you are creating for use with an event-driven simulator depends on two things: (a) the delay modeling capabilities of the simulator itself and (b) where in the flow (and with what tools) you intend to perform your timing analysis.

A very common scenario is for static timing analysis (STA) to be performed externally from the simulation (this is discussed in more detail later in this chapter). In this case, logic gates (and more complex statements) may be modeled with zero (0 timebase unit) delays or unit (1 timebase unit) delays, where the term timebase unit refers to the smallest time segment recognized by the simulator.
Alternatively, we might associate more sophisticated delays with logic gates (and more complex statements) for use in the simulation itself. The first level of complexity is to separate rising delays from falling delays at the output from the gate (or more complex statement). For historical reasons, a rising (0-to-1) delay is often referred to as LH (standing for “low-to-high”). Correspondingly, a falling (1-to-0) delay may be referred to as HL (meaning “high-to-low”). For example, consider what happens if we were to apply a 12 ps positive-going (0-1-0) pulse to the input of a simple buffer gate with delays of LH = 5 ps and HL = 8 ps (Figure 19-4).

[Figure 19-4. Separating LH and HL delays. A BUF gate with LH = 5 ps and HL = 8 ps turns a 12 ps positive-going pulse on in1 into a 15 ps pulse on out1.]

[3] A good example of this type of single-kernel solution is ModelSim® from Mentor Graphics (www.mentor.com).

1929: British mechanical TVs roll off the production line.

Not surprisingly, the output of the gate rises 5 ps after the rising edge is applied to the input, and it falls 8 ps after the falling edge is applied to the input. The really interesting point is that, due to the unbalanced delays, the 12 ps input pulse has been stretched to 15 ps at the output of the gate, where the additional 3 ps reflect the difference between the LH and HL values. Similarly, if a negative-going 12 ps (1-0-1) pulse were applied to the input of this gate, the corresponding pulse at the output would shrink to only 9 ps (try sketching this out on a piece of paper for yourself).

In addition to LH and HL delays, simulators also support minimum:typical:maximum (min:typ:max) values for each delay. For example, consider a positive-going pulse of 16 ps presented to the input of a buffer gate with rising and falling delays specified as 6:8:10 ps and 7:9:11 ps, respectively (Figure 19-5).
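The edge arithmetic behind both of these examples (unbalanced LH and HL delays, and min:typ:max triples) can be checked with a few lines of Python. This is a sketch under simple assumptions: an ideal buffer, an input pulse whose first edge arrives at t = 0, and a pulse wide enough that it is not swallowed by the gate. The function name buffer_response is hypothetical.

```python
def buffer_response(width_ps, lh_ps, hl_ps, positive=True):
    """Return (first_edge, second_edge, output_width) in ps for a buffer.

    A positive-going (0-1-0) pulse has its first edge delayed by LH and its
    second edge by HL; a negative-going (1-0-1) pulse is the other way round.
    """
    first_delay, second_delay = (lh_ps, hl_ps) if positive else (hl_ps, lh_ps)
    first_edge = first_delay              # input edge at t = 0
    second_edge = width_ps + second_delay  # input edge at t = width
    return first_edge, second_edge, second_edge - first_edge


# The 12 ps pulse with LH = 5 ps and HL = 8 ps from the text.
print(buffer_response(12, 5, 8))                  # positive-going pulse
print(buffer_response(12, 5, 8, positive=False))  # negative-going pulse

# The 16 ps pulse with LH = 6:8:10 ps and HL = 7:9:11 ps.
for label, lh, hl in zip(("min", "typ", "max"), (6, 8, 10), (7, 9, 11)):
    print(label, buffer_response(16, lh, hl))
```

For the first example this gives output widths of 15 ps and 9 ps, as stated in the text. For the min:typ:max example the output edges move later from min to max, but because HL exceeds LH by 1 ps in every case, the output pulse is 17 ps wide in all three.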
[Figure 19-5. Supporting min:typ:max delays. A BUF gate with LH = 6:8:10 ps and HL = 7:9:11 ps is driven by a 16 ps pulse on in1. The out1 (min) waveform uses delays of 6 ps and 7 ps, out1 (typ) uses 8 ps and 9 ps, and out1 (max) uses 10 ps and 11 ps.]

This range of values is intended to accommodate variations in the operating conditions such as temperature and voltage. It also covers variations in the manufacturing process because some chips may run slightly faster or slower than others of the same type. Similarly, gates in one area of a chip (e.g., an ASIC or an FPGA) may switch faster or slower than identical gates in another area of the chip. (See also the discussions on timing analysis, particularly dynamic timing analysis, later in this chapter.)

TTL (which is pronounced by spelling it out as “T-T-L”) refers to bipolar junction transistors (BJTs) connected together in a certain fashion. BJT is pronounced by spelling it out as “B-J-T.”

In the early days, all of the input-to-output delays associated with a multi-input gate (or more complex statement) were identical. For example, consider a 3-input AND gate with an output called y and inputs a, b, and c. In this case, any LH and HL delays would be identical for the paths a-to-y, b-to-y, and c-to-y. Initially, this didn’t cause any problems because it matched the way in which delays were specified in data books. Over time, however, data books began to specify individual input-to-output delays, so simulators had to be enhanced to support this capability.

Another point to consider is what will happen when a narrow pulse is applied to the input of a gate (or more complex statement). By “narrow” we mean a pulse that is shorter than the propagation delay of the gate. The first logic simulators were largely targeted toward simple ICs implemented in transistor-transistor logic (TTL) being used at the circuit board level. These chips typically rejected narrow pulses, so that’s what the simulators did.
This became known as the inertial delay model. As a simple example, consider two positive-going pulses of 8 ps and 4 ps applied to a buffer gate whose min:typ:max rising and falling delays are all set to 6 ps (Figure 19-6).

[Figure 19-6. The inertial delay model rejects any pulse that is narrower than the gate’s propagation delay. With LH = HL = 6:6:6 ps, the 8 ps pulse on in1 passes to out1 after 6 ps and the 4 ps pulse is rejected.]

By comparison, logic gates implemented in later technologies such as emitter-coupled logic (ECL) would pass pulses that were narrower than the propagation delay of the gate. In order to accommodate this, some simulators were equipped with a mode called the transport delay model. Once again, consider two positive-going pulses of 8 ps and 4 ps applied to a buffer gate whose min:typ:max rising and falling delays are all set to 6 ps (Figure 19-7).

[Figure 19-7. The transport delay model propagates any pulse, irrespective of its width. With LH = HL = 6:6:6 ps, both the 8 ps pulse and the 4 ps pulse on in1 pass to out1 after 6 ps.]

ECL (which is pronounced by spelling it out as “E-C-L”) refers to bipolar junction transistors connected together in a different fashion to TTL. Logic gates implemented in ECL switch faster than their TTL counterparts, but they also consume more power (and thus dissipate more heat).

1929: Experiments begin on electronic colour television.

The problem with both the inertial and transport delay models is that they only provide for extreme cases, so the creators of some simulators started to experiment with more sophisticated narrow-pulse handling techniques, such as the three-band delay model.[4] In this case, each delay may be qualified with two values called r (for “reject”) and p (for “pass”), specified as percentages of the total delay.
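The narrow-pulse models in this section can be compared with a small Python sketch. It assumes the three-band rule as the book states it (a pulse at least as wide as the p threshold passes, a pulse narrower than the r threshold is rejected, and anything in between becomes an X) and treats the inertial and transport models as its two extreme settings. The function names are hypothetical, and the pulse width is compared against a single delay value rather than separate LH and HL delays.

```python
def classify_pulse(width_ps, delay_ps, reject_pct, pass_pct):
    """Classify an input pulse as 'pass', 'reject', or 'X' (ambiguous)."""
    reject_limit = delay_ps * reject_pct / 100
    pass_limit = delay_ps * pass_pct / 100
    if width_ps >= pass_limit:
        return "pass"
    if width_ps < reject_limit:
        return "reject"
    return "X"


def inertial(width_ps, delay_ps):
    """Inertial model: r = p = 100 percent of the delay."""
    return classify_pulse(width_ps, delay_ps, 100, 100)


def transport(width_ps, delay_ps):
    """Transport model: r = p = 0 percent, so every pulse propagates."""
    return classify_pulse(width_ps, delay_ps, 0, 0)


# The 8 ps and 4 ps pulses applied to a 6 ps buffer.
print([inertial(w, 6) for w in (8, 4)])
print([transport(w, 6) for w in (8, 4)])

# A 6 ps buffer with r = 33 percent and p = 66 percent.
print([classify_pulse(w, 6, 33, 66) for w in (5, 3, 1)])
```

With a 6 ps delay, the 33 and 66 percent thresholds fall at roughly 2 ps and 4 ps, so a 5 ps pulse passes, a 3 ps pulse is ambiguous, and a 1 ps pulse is rejected.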
For example, assume we have a buffer gate whose min:typ:max delays have all been set to 6 ps qualified by r and p values of 33 percent and 66 percent, respectively (Figure 19-8). Any pulses presented to the input whose width is greater than or equal to the p value will propagate; any pulses that are narrower than the r value will be completely rejected; and any pulses that fall between these two extremes will be propagated as a pulse with an unknown X value to indicate that they are ambiguous because we don’t know whether or not they will propagate through the gate in the real world. (Setting both r and p to 100 percent equates to an inertial delay model, while setting them both to 0 percent reflects a pure transport delay model.)

[Figure 19-8. The three-band delay model. A BUF gate with LH = HL = 6:6:6 ps (33:66%): a 5 ps pulse on in1 passes to out1, a 3 ps pulse is ambiguous and appears on out1 as an X, and a 1 ps pulse is rejected.]

[4] The System HILO simulator from GenRad started to employ the 3-band delay model shortly before it disappeared off the face of the planet.

1929: First ship-to-shore communications (passengers can call relatives at home … at a price).

Cycle-based simulators

An alternative to the event-driven approach is to use a cycle-based simulation technique. This is particularly well suited to pipelined designs in which “islands” of combinational logic are sandwiched between blocks of registers (Figure 19-9).

[Figure 19-9. A simple pipelined design. Data In and Clock feed alternating blocks of registers and combinatorial logic (registers, combinatorial logic, registers, combinatorial logic, registers, etc.).]

In this case, a cycle-based simulator will throw away any timing information associated with the gates forming the combinational logic and convert this logic into a series of Boolean operations that can be directly implemented using the CPU’s logical machine code instructions.

Given an appropriate circuit with appropriate activity, cycle-based simulators may offer significant run-time advantages over their event-driven counterparts. The downside, however, is that they typically only work with 0 and 1 logic values (no X or Z values, and no drive strength representations). Also, cycle-based simulators can’t represent asynchronous logic or combinatorial feedback loops.

1929: Germany. Magnetic sound recording on plastic tape.

These days it’s rare to see anyone using a pure cycle-based simulator. However, several event-driven simulators have been augmented to have hybrid capabilities. In this case, if you instruct the simulator to aim for extreme performance (as opposed to timing accuracy), it will automatically handle some portions of the circuit using an event-driven approach and other portions using cycle-based techniques.

Choosing the best logic simulator in the world!

Choosing a logic simulator is, as with anything else in engineering, a balancing act. If you are a small startup and cost is your overriding metric, for example, then bounce over to the discussions on creating an open-source-based flow in Chapter 25.

One point to consider is whether or not you require mixed-language capability. If you are a small startup, you may be planning on using only one language, but remember that any IP you decide to purchase down the road may not be available in this language. Having a solution that can work with VHDL, Verilog, and SystemVerilog would be a good start, and if it can also handle SystemC along with one or more formal verification languages, then it will probably stand you in good stead for some time to come.

Generally speaking, performance is the number-one criterion for most folks. The trick here is how to determine the performance of a simulator without being bamboozled. The only way to really do this is to have your own benchmark design and to run it on a number of simulators.
Creating a good benchmark design is a nontrivial exercise, but it’s way better than using a design supplied by an EDA vendor (because such a design will be tuned to favor their solution, while delivering a swift knee to the metaphorical groins of competing tools).

However, there’s more to life than raw performance. You also need to look for a good interactive debugging solution such that when you detect a problem, you can stop the simulator and poke around the design. All simulators are not created equal in this department. Different tools have different levels of capability; in some cases, even if the simulator does let you do what you want, you may have to jump through hoops to get there. So the trick here is, after running your performance benchmark, to bring up the same circuit with a known bug and see how easy it is (and how long it takes) to detect and isolate the little rapscallion. In reality, some simulators that give you the performance you require do such a poor job in this department that you are obliged to use third-party postsimulation analysis tools.[5]

Another thing to consider is the capacity of the simulator. The tools supplied by the big boys in EDA essentially have no capacity limitations, but simulators from smaller vendors might be based on ported 32-bit code if you were to look under the hood. Of course, if you are only going to work with smaller designs (say, equivalent to 500,000 gates or less), then you will probably be okay with the simulators supplied by the FPGA vendors (these are typically “lite” versions of the tools supplied by the big EDA vendors).

[5] Novas Software Inc. (www.novas.com) are at the top of the pile here with their Debussy® and Verdi™ tools.

1929: The first car radio is installed.

1930: America. Sliced bread is introduced.
Of course, you will have your own criteria in addition to the topics raised above, such as the quality of the code coverage and performance analysis provided by the various tools. These used to be the province of specialist third-party tools, but most of the larger simulators now provide some level of integrated code coverage and performance analysis in the simulation environment itself. However, different simulators offer different feature sets (see also the discussions on code coverage and performance analysis in the “Miscellaneous” section later in this chapter).

Synthesis (logic/HDL versus physically aware)

Logic/HDL synthesis technology

Traditional logic synthesis tools appeared on the scene around the early to mid-1980s. Depending on whom you are talking to, these tools are now often referred to as HDL synthesis technology.

The role of the original logic/HDL synthesis tools was to take an RTL representation of an ASIC design along with a set of timing constraints and to generate a corresponding gate-level netlist. During this process, the synthesis application performed a variety of minimizations and optimizations (including optimizing for area and timing).

Around the middle of the 1990s, synthesis tools were augmented to understand the concept of FPGA architectures. These architecturally aware applications could output a LUT/CLB-level netlist, which would subsequently be passed to the FPGA vendor’s place-and-route software (Figure 19-10). In real terms, the FPGA designs generated by architecturally aware synthesis tools were 15 to 20 percent faster than their counterparts created using traditional gate-level synthesis offerings.

Physically aware synthesis technology

The problem with traditional logic/HDL synthesis is that it was developed when logic gates accounted for most of the
delays in a timing path, while track delays were relatively insignificant. This meant that the synthesis tools could use simple wire-load models to evaluate the effects of the track delays. (These models were along the lines of, “One load gate on a wire equates to x pF of capacitance; two load gates on a wire equate to y pF of capacitance,” and so on.) The synthesis tool would then estimate the delay associated with each track as a function of its load and the strength of the gate driving the wire. This technique was adequate for the designs of the time, which were implemented in multimicron technologies and which contained relatively few logic gates by today’s standards.

Figure 19-10. Traditional logic/HDL synthesis (RTL → architecturally aware logic/HDL synthesis → unplaced-and-unrouted LUT/CLB netlist → FPGA vendor’s place-and-route → placed-and-routed LUT/CLB netlist).

By comparison, modern designs can contain tens of millions of logic gates, and their deep submicron feature sizes mean that track delays can account for up to 80 percent of a delay path. When using traditional logic/HDL synthesis technology on this class of design, the timing estimations made by the synthesis tool bear so little resemblance to reality that achieving timing closure can be well-nigh impossible. For this reason, ASIC flows started to see the use of physically aware synthesis somewhere around 1996, and FPGA flows began to adopt similar techniques circa 2000 or 2001.

1930: America. Vannevar Bush designs an analogue computer called a Differential Analyzer.

Of course, there are a variety of different definitions as to exactly what the term physically aware synthesis implies. The core concept is to use physical information earlier in the synthesis process, but what does this actually mean? For example, some companies have added interactive floor-planning capabilities
to the front of their synthesis engines, and they class this as being physical synthesis or physically aware synthesis. For most folks, however, physically aware synthesis means taking actual placement information associated with the various logical elements in the design, using this information to estimate accurate track delays, and using these delays to fine-tune the placement and perform other optimizations. Interestingly enough, physically aware synthesis commences with a first-pass run using a relatively traditional logic/HDL synthesis engine (Figure 19-11).

Figure 19-11. Physically aware synthesis (RTL → architecturally aware logic/HDL synthesis → unplaced-and-unrouted LUT/CLB netlist → FPGA vendor’s placement → placed LUT/CLB netlist → physically aware synthesis → placed/optimized LUT/CLB netlist → FPGA vendor’s place-and-route → placed-and-routed LUT/CLB netlist).

1933: Edwin Howard Armstrong conceives a new system for radio communication: wideband frequency modulation (FM).

Retiming, replication, and resynthesis

There are a number of terms that one tends to hear in the context of physical synthesis, including retiming, replication, and resynthesis.6

6 These concepts may also be used with traditional logic/HDL synthesis, but they are significantly more efficacious when applied in the context of physically aware synthesis.

The first, retiming, is based on the concept of balancing out positive and negative slacks throughout the design. In this context, positive slack refers to a path with some delay available that you are not using, while negative slack refers to a path that is using more delay than is available to it. For example, let’s assume a pipelined design whose clock frequency is such that the maximum register-to-register delay is 15 ps. Now let’s assume that we have a situation as shown in Figure 19-12a, whereby the longest timing path in the first block of combinational logic is 10 ps (which means it has a
positive slack of 5 ps), while the longest path in the next block of combinational logic is 20 ps (which means it has a negative slack of 5 ps). Once the initial path timing, including routing delays, has been calculated, combinational logic is moved across register boundaries (or vice versa, depending on your point of view) to steal from paths with positive slack and donate to paths with negative slack (Figure 19-12b). Retiming is very common in physically aware FPGA design flows because registers are plentiful in FPGA devices.

Figure 19-12. Retiming: (a) before retiming, the two blocks of combinational logic have delays of 10 ps and 20 ps; (b) after “pushing” some logic across the register boundary, both blocks have delays of 15 ps.

1934: Half the homes in the USA have radios.

Replication is similar to retiming, but it focuses on breaking up long interconnect. For example, let’s assume that we have a register with 4 ps of positive slack on its input. Now let’s assume that this register is driving three paths, whose loads each see negative slack (Figure 19-13a). By replicating the register and placing the copies close to each load, we can redistribute the slack so as to make all of the timing paths work (Figure 19-13b).

Figure 19-13. Replication: (a) before replication; (b) after replication.

1935: All-electronic VHF television comes out of the lab.

Last, but not least, the concept of resynthesis is based on the fact that there are many different ways of implementing (and placing) different functions. Resynthesis uses the physical placement information to perform local optimizations on critical paths by means of operations like logic restructuring, reclustering, substitution, and possible elimination of gates and wires.

Choosing the best synthesis tool in the world!
Come on, be serious, you didn’t really expect to find the answer to this here, did you? In the real world, the capabilities of the various synthesis engines, along with associated features like auto-interactive floor planning, change on an almost daily basis, and the various vendors are constantly leapfrogging each other. There’s also the fact that different engines may work better (or worse) with different FPGA vendors’ architectures. One thing to look for is the ability (or lack thereof) of the engine to infer things automatically, like clocking elements and embedded functions, from your source code or constraints files without your having to define them explicitly. At the end of the day, however, you are on your own when it comes to evaluating and ranking the various offerings (but please feel free to e-mail me at max@techbites.com to let me know how you get on).

Timing analysis (static versus dynamic)

Static timing analysis

STA is pronounced by spelling it out as “S-T-A.”

The most common form of timing verification in use today is classed as static timing analysis (STA). Conceptually, this is quite simple, although in practice things are, as usual, more complex than they might at first appear. The timing analyzer essentially sums all of the gate and track delays forming each path to give you the total input-to-output delays for each path. (In the case of pipelined designs, the analyzer calculates delays from one bank of registers to the next.) Prior to place-and-route, the analyzer may make estimations as to track delays. Following place-and-route, the analyzer will employ extracted parasitic values (for resistance and capacitance) associated with the physical tracks to provide more accurate results.
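To make the "sum the gate and track delays along each path" idea a little more concrete, here is a minimal Python sketch of the core calculation. It is not taken from any real timing tool: the tiny netlist, the node names, the delay values, and the 12 ps constraint are all invented for illustration, and each edge simply lumps the driving element's delay together with the delay of the track it drives.

```python
from collections import defaultdict


def worst_case_arrivals(edges):
    """Return the latest signal arrival time (in ps) at every node.

    edges: (src, dst, gate_delay_ps, track_delay_ps) tuples forming a
    combinational DAG between one bank of registers and the next.
    """
    fanout = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for src, dst, gate, track in edges:
        fanout[src].append((dst, gate + track))
        indegree[dst] += 1
        nodes.update((src, dst))

    arrival = {node: 0.0 for node in nodes}
    ready = [node for node in nodes if indegree[node] == 0]
    while ready:  # visit nodes in topological order
        node = ready.pop()
        for dst, delay in fanout[node]:
            arrival[dst] = max(arrival[dst], arrival[node] + delay)
            indegree[dst] -= 1
            if indegree[dst] == 0:
                ready.append(dst)
    return arrival


def violations(arrival, endpoints, max_delay_ps):
    """Map each failing endpoint to the amount by which it misses timing."""
    return {e: arrival[e] - max_delay_ps
            for e in endpoints if arrival[e] > max_delay_ps}


# Hypothetical netlist: launching register regA, two gates, capturing register regB.
edges = [
    ("regA", "g1", 2, 1),
    ("g1", "g2", 3, 2),
    ("regA", "g2", 1, 1),
    ("g2", "regB", 2, 4),
]
arrival = worst_case_arrivals(edges)
failing = violations(arrival, ["regB"], max_delay_ps=12)
print(arrival["regB"], failing)
```

With these made-up numbers, the longest path (regA to g1 to g2 to regB) sums to 14 ps, so regB misses the 12 ps constraint by 2 ps. Note that no test bench is involved; every path is covered simply by walking the graph.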
The analyzer will report any paths that fail to meet their original timing constraints, and it will also warn of potential timing problems (e.g., setup and hold violations) associated with signals being presented to the inputs of any registers or latches.

STA is particularly well suited to classical synchronous designs and pipelined architectures. The main advantages of STA are that it is relatively fast, it doesn’t require a test bench, and it exhaustively tests every possible path into the ground. On the other hand, static timing analyzers are little rascals when it comes to detecting false paths that will never be exercised during the course of the design’s normal operation. Also, these tools aren’t at their best with designs employing latches, asynchronous circuits, and combinational feedback loops.

Statistical static timing analysis

STA is a mainstay of modern ASIC and FPGA design flows, but it’s starting to run into problems with the latest process technology nodes. At the time of this writing, the 90-nano node is coming online, with the 45-nano node expected around 2007.

CMP is pronounced by spelling it out as “C-M-P.”

SSTA is so new at the time of writing that no one knows how to pronounce it, but my guess is that folks will say “Statistical S-T-A” (or spell it out as “S-S-T-A”).

DTA is pronounced by spelling it out as “D-T-A.”

As previously discussed, in the case of modern silicon chips, interconnect delays dominate logic delays, especially with respect to FPGA architectures. In turn, interconnect delays are dependent on parasitic capacitance, resistance, and inductance values, which are themselves functions of the topology and cross-sectional shape of the wires. The problem is that, in the case of the latest technology process nodes, photolithographic processes are no longer capable of producing exact shapes.
Thus, as opposed to working with squares and rectangles, we are now working with circles and ellipsoids. Feature sizes like the widths of tracks are now so small that small variations in the etching process cause deviations that, although slight, are significant with relation to the main feature size. (These irregularities are made more significant by the fact that in the case of high-frequency designs, the so-called skin-effect comes into play, which refers to the fact that high-frequency signals travel only through the outer surface, or skin, of the conductor.) Furthermore, there are variations in the vertical plane of the track’s cross section caused by processes like chemical mechanical polishing (CMP).

As an overall result, it’s becoming increasingly difficult to calculate track delays accurately. Of course, it is possible to use the traditional engineering fallback of guard-banding (using worst-case estimations), but excessively conservative design practices result in device performance significantly below the silicon’s full potential, which is an extremely unattractive option in today’s highly competitive marketplace. In fact, the effects of geometry variations are causing the probability distributions of delays to become so wide that worst-case numbers may actually be slower than in an earlier process technology!

One potential solution is the concept of the statistical static timing analyzer (SSTA). This is based on generating a probability function for the delay associated with each signal for each segment of a track, then evaluating the total delay probability functions of signals as they propagate through entire paths. At the time of this writing, there are no commercially deliverable SSTA products, but a number of folks in EDA and the academic arena are looking into this technology.
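The difference between guard-banding and the statistical approach can be illustrated with a few lines of Python. This is only a sketch under two loud assumptions that are mine rather than anything claimed above: each segment's delay is modeled as an independent Gaussian, and "worst case" is taken to mean the mean plus three standard deviations. The three segments and their numbers are invented.

```python
import math


def path_delay_distribution(segments):
    """Combine per-segment (mean_ps, sigma_ps) pairs into a path-level pair.

    Assumes independent Gaussian segment delays, so means add and
    variances (sigma squared) add.
    """
    mean = sum(m for m, _ in segments)
    sigma = math.sqrt(sum(s * s for _, s in segments))
    return mean, sigma


def guard_banded_delay(segments, k=3.0):
    """Traditional guard-banding: add up every segment's own worst case."""
    return sum(m + k * s for m, s in segments)


# Hypothetical path made of three gate/track segments: (mean, sigma) in ps.
segments = [(100.0, 3.0), (150.0, 4.0), (200.0, 12.0)]

mean, sigma = path_delay_distribution(segments)
statistical_bound = mean + 3.0 * sigma
worst_case = guard_banded_delay(segments)
print(mean, sigma, statistical_bound, worst_case)
```

With these numbers the path has a mean of 450 ps and a sigma of 13 ps, giving a 3-sigma statistical bound of 489 ps against 507 ps for the guard-banded sum. The gap arises because it is unlikely that every segment sits at its worst case at the same time, which is exactly the pessimism a statistical analyzer tries to recover.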
Dynamic timing analysis

Another form of timing verification, known as dynamic timing analysis (DTA), really isn’t seen much these days, but it is mentioned here for the sake of interest. This form of verification is based on the use of an event-driven simulator, and it does require the use of a testbench. The key difference between a standard event-driven simulator and a dynamic timing analyzer is that the former only uses a single minimum (min), typical (typ), or maximum (max) delay for each path, while the latter uses a delay pair (either min:typ, typ:max, or min:max). For example, consider how the two simulators would evaluate a simple buffer gate (Figure 19-14).

Figure 19-14. Standard event-driven simulator versus dynamic timing analyzer. (A buffer gate, BUF, with input in1, output out1, and LH and HL delays of 3:5:7 ps. The standard simulator using the ‘typ’ delay switches out1 5 ps after in1; the DTA simulator using a ‘min:max’ delay pair starts the transition on out1 after 3 ps and completes it after 7 ps.)

In the case of the standard simulator, a signal change at the input to the gate will cause an event to be scheduled for some specific time in the future. By comparison, in the case of the dynamic timing analyzer, assuming a min:max delay pair, the gate’s output will begin to transition after the minimum delay, but it won’t end its transition until it reaches the maximum delay. The ambiguity between these two values is different from an unknown X state, because we know that a good 0-to-1 or a 1-to-0 transition is going to take place, we just don’t know when. For this reason, we introduce two new states called “Gone high, but don’t know when” and “Gone low, but don’t know when.”7

DTA can detect subtle, potential problems that are almost impossible to find using any other form of timing analysis. Unfortunately, these tools are so compute intensive that you don’t really see them around much these days, but who knows what the future holds?
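The buffer example can be sketched in Python as follows. This is a toy model rather than how any real dynamic timing analyzer is implemented: the function names and the 10 ps input edge are invented, and the "gone high/low, but don't know when" state is represented simply as an (earliest, latest) time window.

```python
def standard_event(input_edge_ps, delay_ps):
    """Standard simulator: one output event at a single, definite time."""
    return input_edge_ps + delay_ps


def dta_window(input_window_ps, min_ps, max_ps):
    """DTA: the output may start changing after the minimum delay and is
    only guaranteed to have changed after the maximum delay."""
    earliest, latest = input_window_ps
    return (earliest + min_ps, latest + max_ps)


MIN, TYP, MAX = 3, 5, 7            # the buffer's 3:5:7 ps delays
edge = 10                          # hypothetical input edge at t = 10 ps

typ_event = standard_event(edge, TYP)
one_buffer = dta_window((edge, edge), MIN, MAX)
two_buffers = dta_window(one_buffer, MIN, MAX)
print(typ_event, one_buffer, two_buffers)
```

The standard simulator schedules a single event at 15 ps, whereas the min:max model yields a window of 13 to 17 ps after one buffer and 16 to 24 ps after two in series. The window widens with every gate it passes through, which hints at why this style of analysis is so compute intensive.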
7 Dynamic timing analysis is discussed in a tad more detail in my book Designus Maximus Unleashed (Banned in Alabama), ISBN 0-7506-9089-5.

Verification in general

Verification IP

DUT is pronounced by spelling it out as “D-U-T.”

As designs increase in complexity, verifying their functionality consumes more and more time and resources. Such verification includes implementing a verification environment, creating a testbench, performing logic simulations, analyzing the results to detect and isolate problems, and so forth. In fact, verifying one of today’s high-end ASIC, SoC, or FPGA designs can consume 70 percent or more of the total development effort from initial concept to final implementation.

One way to alleviate this problem is to make use of verification IP. The idea here is that the design, which is referred to as the device under test (DUT) for the purposes of verification, typically communicates with the outside world using standard interfaces and protocols. Furthermore, the DUT is typically communicating with devices such as microprocessors, peripherals, arbiters, and the like.

The most commonly used technique for performing functional verification is to use an industry-standard event-driven logic simulator. One way to test the DUT would be to create a testbench describing the precise bit-level signals to be applied to the input pins and the bit-level responses expected at the outputs. However, the protocols for the various interfaces and buses are now so complex that it is simply not possible to create a test suite in this manner. Another technique would be to use RTL models of all of the external devices forming the rest of the system. However, many of these devices are extremely proprietary and RTL models may not be readily available.
Furthermore, simulating an entire system using fully functional models of all of the processor and I/O devices would be prohibitively expensive in terms of time and computing requirements.

The solution is to use verification IP in the form of bus functional models (BFMs) to represent the processors and the I/O agents forming the system under test (Figure 19-15).8

BFM is pronounced by spelling it out as “B-F-M.”

Figure 19-15. Using verification IP in the form of BFMs. (BFMs of processors, I/O agents, arbiters, etc. surround the DUT (RTL). One BFM accepts a high-level transaction request from the testbench or verification environment and drives complex signals at the “bit twiddling” level into the DUT; another BFM, which could be the same BFM, returns a high-level transaction result to the testbench or verification environment.)

8 One source of very sophisticated verification IP is TransEDA PLC (www.transeda.com).

1935: Audio tape recordings go on sale.

A BFM doesn’t replicate the entire functionality of the device it represents; instead, it emulates the way the device works at the bus interface level by generating and accepting transactions. In this context, the term transaction refers to a high-level bus event such as performing a read or write cycle. The verification environment (or testbench) can instruct a BFM to perform a specific transaction like a memory write. The BFM then generates the complex low-level (“bit-twiddling”) signal interactions on the bus driving the DUT’s interface transparently to the user. Similarly, when the DUT (the design) responds with a complex pattern of signals, another BFM (or maybe the original BFM) can interpret these signals and translate them back into corresponding high-level transactions. (See also the discussions on verification environments and creating testbenches below.)

It should be noted that, although they are much smaller and simpler (and hence simulate much faster) than fully functional models of the devices they represent, BFMs are by no means trivial.
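The transaction-to-signal translation at the heart of a BFM can be sketched in Python. Everything about the bus here is hypothetical: the two-phase write protocol, the signal names (valid, write, bus), and the function names are invented purely to show the shape of the idea, and a real bus protocol would be far more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    """A high-level transaction: write `data` to address `addr`."""
    addr: int
    data: int


def drive_write(txn):
    """Driver side of the BFM: expand one write into per-clock pin values.

    Hypothetical protocol: an address cycle, then a data cycle (both with
    `valid` and `write` high), then one idle cycle.
    """
    return [
        {"valid": 1, "write": 1, "bus": txn.addr},   # address phase
        {"valid": 1, "write": 1, "bus": txn.data},   # data phase
        {"valid": 0, "write": 0, "bus": 0},          # idle
    ]


def monitor(cycles):
    """Monitor side of the BFM: fold pin activity back into transactions."""
    txns, pending_addr = [], None
    for c in cycles:
        if not (c["valid"] and c["write"]):
            pending_addr = None                      # idle cycle resets state
        elif pending_addr is None:
            pending_addr = c["bus"]                  # first valid cycle: address
        else:
            txns.append(Write(pending_addr, c["bus"]))
            pending_addr = None
    return txns


# The testbench only ever talks in transactions.
script = [Write(0x10, 0xAB), Write(0x14, 0xCD)]
pins = [cycle for txn in script for cycle in drive_write(txn)]
recovered = monitor(pins)
print(len(pins), recovered == script)
```

Two transactions become six clock cycles of pin activity, and the monitor recovers the original transactions from those cycles. This toy handles a single kind of write on a made-up bus; production BFMs are considerably more elaborate.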
For example, sophisticated BFMs, which are often created as cycle-accurate, bit-accurate C/C++ models, may include internal caches (along with the ability to initialize them), internal buffers, configuration registers, write-back queues, and so forth. Also, BFMs can provide a tremendous range of parameters that provide low-level control of such things as address timing, snoop timing, data wait states for different memory devices, and the like.

Verification environments and creating testbenches

When I was a young man starting out in simulation, we created test vectors (stimulus and response) to be used with our simulations as tabular ASCII text files containing logic 0 and 1 values (or hexadecimal values if you were lucky). At that time, the designs we were trying to test were incredibly simple compared to today’s monsters, so an English translation of our tests would be something along the lines of

At time 1,000 make the reset signal go into its active state.
At time 2,000 make the reset signal go into its inactive state.
At time 2,500 check to see that the 8-bit data bus is 00000000.
At time …

and so it went. Over time, designs became more complex, and the way in which they could be verified became more sophisticated with the advent of high-level languages that could be used to specify stimulus and expected response. These languages sported a variety of features such as loop constructs and the ability to vary the tests depending on the state of the outputs (e.g., “If the status bus has a value of 010, then jump to test xyz”). At some stage, folks started referring to these tests as testbenches.9

The current state of play is that many of today’s designs are now so complex that it’s well-nigh impossible to create an adequate testbench by hand. This has paved the way for sophisticated verification environments and languages.
Perhaps the most sophisticated of the languages, known by some as hardware verification languages (HVLs), is the aspect-oriented e offering from Verisity Design (www.verisity.com).10 In case you were wondering, e doesn’t actually stand for anything now, but originally it was intended to reflect the idea of “English-like” in that it has a natural language feel to it. You can use e to specify directed tests if you wish, but you would typically only wish to do this for special cases. Instead, the concept behind e, which you can think of as a blend of C and Verilog with a hint of Pascal, is more about declaring valid ranges and sequences of input values (along with their invalid counterparts) and high-level verification strategies. This e description is then used by an appropriate verification environment to guide the simulations.

HVL is pronounced by spelling it out as “H-V-L.”

9 To be a tad more pedantic, the term “testbench” really refers to the infrastructure supporting test execution.

10 By and large, the industry tends to view proprietary languages with suspicion, so Verisity are working with the IEEE to make e an industry-standard language. At the time of this writing, the IEEE working group P1647 has been established and the e language reference manual (LRM) has been published.

VCD is pronounced by spelling it out as “V-C-D.”

Speaking of which, the first (and only, at the time of this writing) verification environment to make full use of the power of e is Verisity’s Specman Elite®. We can think of Specman as being a cross between a compiler and an event-driven simulator that links to and controls the standard HDL event-driven simulators you are already using. Specman uses your e program to generate stimuli that are applied to your design (via your HDL simulator) on the fly.
It also monitors the results and the functional coverage of the simulations and reacts to what it sees by dynamically retargeting subsequent stimuli to address any remaining coverage holes.

Analyzing simulation results

Almost every simulator comes equipped with a graphical waveform viewer that can be used to display results interactively (as the simulator runs) or to accept and display postsimulation results from a value change dump (VCD) file. Sad to relate, however, some of these tools are not as effective as one might hope when it comes to really analyzing this information and tracking down problems. In this case, you might wish to use a tool from a third-party vendor.11

11 In the context of classical waveform analysis, debugging, and display tools, one of the acknowledged industry leaders is Novas Software Inc. (www.novas.com) with its Debussy® offering. Another tool from Novas that is well worth looking at is Verdi™, which provides an extremely innovative and powerful way of extracting, visualizing, analyzing, exploring, and debugging a design’s temporal behavior across multiple clock cycles.

Formal verification

In conversation, one almost invariably says “formal verification” (I’ve never heard anyone spelling it out as “F-V”).

Although large computer and chip companies like IBM, Intel, and Motorola have been developing and using formal tools internally for decades (since around the mid-1980s), the whole field of formal verification (FV) is still relatively new to a lot of folks. This is particularly true in the FPGA arena, where the adoption of formal verification is lagging behind its use in ASIC design flows. Having said this, formal verification can be such an incredibly powerful tool that more and more folks are starting to use it in earnest.
One big problem is that formal verification is still so new to mainstream usage that there are a lot of players, all of whom are happily charging around in a bewildering variety of different directions. Also, as opposed to a lack of standards, there are now so many different offerings that the mind boggles. The confusion is only increased by the fact that almost everyone you talk to puts his or her unique spin on things (if, for example, you ask 20 EDA vendors to define and differentiate the terms assertion and property, your brains will leak out of your ears at the diametrically opposing responses).12 Trying to unravel this morass is a daunting task to say the least. However, there is nothing to fear but fear itself, as my dear old dad used to say, so let’s take a stab at rending the veils asunder and describing formal verification in a way that we can all understand.

Formal tools were originally developed for internal use by large computer and chip companies. One of the first commercially available formal tools to be widely accepted was an equivalency checker called Design VERIFYer®, which was introduced in 1993 by Chrysalis Symbolic Design Inc.

Different flavors of formal verification

In the not-so-distant past, the term formal verification was considered synonymous with equivalency checking for the majority of design engineers. In this context, an equivalency checker is a tool that uses formal (rigorous mathematical) techniques to compare two different representations of a design (say, an RTL description with a gate-level netlist) to determine whether or not they have the same input-to-output functionality.

In fact, equivalency checking may be considered a subclass of formal verification called model checking, which refers to techniques used to explore the state-space of a system to test whether or not certain properties, typically specified in the form of assertions, are true.
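To give a feel for what "same input-to-output functionality" means in equivalency checking, here is a small Python sketch that compares a behavioral description of a 2:1 multiplexer against a NAND-only version of it. Be warned that this is a deliberate simplification: it enumerates every input combination by brute force, which only works for tiny functions, whereas real equivalency checkers rely on formal mathematical techniques. All of the function names are invented for the example.

```python
from itertools import product


def rtl_mux(sel, a, b):
    """'RTL' view: a 2:1 multiplexer described behaviorally."""
    return a if sel else b


def nand(x, y):
    """A single 2-input NAND gate on 0/1 values."""
    return 1 - (x & y)


def gate_mux(sel, a, b):
    """'Gate-level' view: the same mux built only from NAND gates."""
    not_sel = nand(sel, sel)
    return nand(nand(sel, a), nand(not_sel, b))


def buggy_gate_mux(sel, a, b):
    """A netlist in which the inversion of `sel` was accidentally omitted."""
    return nand(nand(sel, a), nand(sel, b))


def first_mismatch(f, g, n_inputs):
    """Return the first input vector where f and g differ, or None."""
    for vec in product((0, 1), repeat=n_inputs):
        if f(*vec) != g(*vec):
            return vec
    return None


good = first_mismatch(rtl_mux, gate_mux, 3)
bad = first_mismatch(rtl_mux, buggy_gate_mux, 3)
print(good, bad)
```

The correct netlist produces no mismatch, while the buggy one yields a counterexample (sel=0, a=0, b=1). The counterexample is the useful part: rather than a bare "not equivalent," the designer gets a specific input vector with which to reproduce the difference.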
(Definitions of terms like property and assertion are presented a little later in this section.)

12 I speak from painful experience on this point!

Model checking tools were also first developed by large companies for internal use. The introduction of Design inSIGHT® by Chrysalis in 1996 signaled the first commercial rollout of model checking technology.

For the purposes of the remainder of our discussions here, we shall understand formal verification to refer to model checking. It should be noted, however, that there is another category of formal verification known as automated reasoning, which uses logic to prove, much like a formal mathematical proof, that an implementation meets an associated specification.

But just what is formal verification, and why is it so cool?

ABV is pronounced by spelling it out as “A-B-V.”

In order to provide a starting point for our discussions, let’s assume we have a design comprising a number of subblocks and that we are currently working with one of these blocks, whose role in life is to perform some specific function. In addition to the HDL representation that defines the functionality of this block, we can also associate one or more assertions/properties with that block (these assertions/properties may be associated with signals at the interface to the block or with signals and registers internal to the block).

A very simple assertion/property might be along the lines of “Signals A and B should never be active (low) at the same time.” But these statements can also extend to extremely complex transaction-level constructs, such as “When a PCI write command is received, then a memory write command of type xxxx must be issued within 5 to 36 clock cycles.” Thus, assertions/properties allow you to describe the behavior of a time-based system in a formal and rigorous manner that provides an unambiguous and universal representation of the design’s intent (try saying that quickly).
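Both kinds of statement can be sketched as simple checkers over a recorded, cycle-by-cycle trace. The Python below is purely illustrative and is not written in any real assertion language: the trace format, the signal names (A, B, req, gnt), and the 1-to-2-cycle window are all made up, with req and gnt standing in for the PCI write and memory write commands.

```python
def never_both_active_low(trace):
    """Return the cycles in which signals A and B are both 0 (active low)."""
    return [i for i, s in enumerate(trace) if s["A"] == 0 and s["B"] == 0]


def bounded_response(trace, trigger, response, lo, hi):
    """Return the cycles where `trigger` was seen but `response` did not
    follow within lo..hi cycles (inclusive).

    A trigger too close to the end of the trace is reported as a failure,
    because the response cannot be confirmed from the recorded cycles.
    """
    failures = []
    for i, s in enumerate(trace):
        if s[trigger]:
            window = trace[i + lo : i + hi + 1]
            if not any(w[response] for w in window):
                failures.append(i)
    return failures


# Hypothetical six-cycle trace, one dictionary of signal values per clock.
trace = [
    {"A": 1, "B": 1, "req": 0, "gnt": 0},
    {"A": 0, "B": 1, "req": 1, "gnt": 0},
    {"A": 1, "B": 1, "req": 0, "gnt": 0},
    {"A": 1, "B": 0, "req": 0, "gnt": 1},
    {"A": 0, "B": 0, "req": 1, "gnt": 0},
    {"A": 1, "B": 1, "req": 0, "gnt": 0},
]
overlap = never_both_active_low(trace)
late = bounded_response(trace, "req", "gnt", lo=1, hi=2)
print(overlap, late)
```

Both checkers flag cycle 4: A and B are low together there, and the request issued in that cycle never receives a grant within the window. A checker like this can only judge the cycles that were actually simulated, which is the essential difference from the formal approaches that reason about all possible behavior.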
Furthermore, assertions/properties can be used to describe both expected and prohibited behavior. The fact that assertions/properties are both human- and machine-readable makes them ideal for the purposes of capturing an executable specification, but they go far beyond this.

Let’s return to considering a very simple assertion/property such as “Signals A and B should never be active (low) at the same time.” One term you will hear a lot is assertion-based verification (ABV), which comes in several flavors: simulation, static formal verification, and dynamic formal verification. In the case of static formal verification, an appropriate tool reads in the functional description of the design (typically at the RTL level of abstraction) and then exhaustively analyzes the logic to ensure that this particular condition can never occur. By comparison, in the case of dynamic formal verification, an appropriately augmented logic simulator will simulate up to a certain point, then pause and automatically invoke an associated formal verification tool (this is discussed in more detail below).

Of course, assertions/properties can be associated with the design at any level, from individual blocks, to the interfaces linking blocks, to the entire system. This leads to a very important point, that of verification reuse.

Prior to formal verification, there was very little in the way of verification reuse. For example, when you purchase an IP core, it will typically come equipped with an associated testbench that focuses on the I/O signals at the core’s boundary. This allows you to verify the core in isolation, but once you’ve integrated the core into the middle of your design, its testbench is essentially useless to you.
Now consider purchasing an IP core that comes equipped with a suite of predefined assertions/properties, like “Signal A should never exhibit a rising transition within three clocks of Signal B going active.” These assertions/properties provide an excellent mechanism for communicating interface assumptions from the IP developer to downstream users. Furthermore, these assertions/properties remain true and can be evaluated by the verification environment, even when this IP core is integrated into your design.

With regard to assertions/properties associated with the system’s primary inputs and outputs, the verification environment may use these to automatically create stimuli to drive the design. Furthermore, you can use assertions/properties throughout the design to augment code and functional coverage analysis (see also the “Miscellaneous” section below) so as to ensure that specific sequences of actions or conditions have been performed.

1935: England. First demonstration of Radar at Daventry.

1936: America. Efficiency expert August Dvorak invents a new typewriter layout called the Dvorak Keyboard.

Terminology and definitions

Now that we’ve introduced the overall concept of the model checking aspects of formal verification, we are better equipped to wade through some terminology and definitions. To be fair, this is relatively uncharted water (“Here be dragons”); the following was gleaned from talking with lots of folks and then desperately trying to rationalize the discrepancies between the tales they told.

■ Assertions/properties: The term property comes from the model checking domain and refers to a specific functional behavior of the design that you want to (formally) verify (e.g., “after a request, we expect a grant within 10 clock cycles”).
By comparison, the term assertion stems from the simulation domain and refers to a specific functional behavior of the design that you want to monitor during simulation (and flag a violation if that assertion “fires”). Today, with the use of formal tools and simulation tools in unified environments and methodologies, the terms property and assertion tend to be used interchangeably; that is, a property is an assertion and vice versa. In general, we understand an assertion/property to be a statement about a specific attribute associated with the design that is expected to be true. Thus, assertions/properties can be used as checkers/monitors or as targets of formal proofs, and they are usually used to identify/trap undesirable behavior.

■ Constraints: The term constraint also derives from the model checking space. Formal model checkers consider all possible allowed input combinations when performing their magic and working on a proof. Thus, there is often a need to constrain the inputs to their legal behavior; otherwise, the tool would report false negatives, which are property violations that would not normally occur in the actual design. As with properties, constraints can be simple or complex. In some cases, constraints can be interpreted as properties to be proven. For example, an input constraint associated with one module could also be an output property of the module driving this input. So, properties and constraints may be dual in nature. (The term constraint is also used in the “constrained random simulation” domain, in which case the constraint is typically used to specify a range of values that can be used to drive a bus.)

■ Event: An event is similar to an assertion/property, and in general events may be considered a subset of assertions/properties.
However, while assertions/properties are typically used to trap undesirable behavior, events may be used to specify desirable behavior for the purposes of functional coverage analysis. In some cases, assertions/properties may consist of a sequence of events. Also, events can be used to specify the window within which an assertion/property is to be tested (e.g., “After a, b, c, we expect d to be true, until e occurs,” where a, b, c, and e are all events, and d is the behavior being verified). Measuring the occurrence of events and assertions/properties yields quantitative data as to which corner cases and other attributes of the design have been verified. Statistics about events and assertions/properties can also be used to generate functional coverage metrics for a design.

■ Procedural: The term procedural refers to an assertion/property/event/constraint that is described within the context of an executing process or set of sequential statements, such as a VHDL process or a Verilog “always” block (thus, these are sometimes called “in-context” assertions/properties). In this case, the assertion/property is built into the logic of the design and will be evaluated based on the path taken through a set of sequential statements.

■ Declarative: The term declarative refers to an assertion/property/event/constraint that exists within the structural context of the design and is evaluated along with all of the other structural elements in the design (for example, a module that takes the form of a structural instantiation). Another way to view this is that a declarative assertion/property is always “on/active,” unlike its procedural counterpart that is only “on/active” when a specific path is taken/executed through the HDL code.

1936: America. Psychologist Benjamin Burack constructs the first electrical logic machine (but he doesn’t publish anything about it until 1949).
1936: First electronic speech synthesis (Vodar).
■ Pragma: The term pragma is an abbreviation for “pragmatic information,” which refers to special pseudocomment directives that can be interpreted and used by parsers/compilers and other tools. (Note that this is a general-purpose term, and pragma-based techniques are used in a variety of tools in addition to formal verification technology.)

Alternative assertion/property specification techniques

This is where the fun really starts, because there are various ways in which assertions/properties and so forth can be implemented, as summarized below.

■ Special languages: This refers to using a formal property/assertion language that has been specially constructed for the purpose of specifying assertions/properties with maximum efficiency. Languages of this type, of which Sugar, PSL, and OVA are good examples, are very powerful in creating sophisticated, regular, and temporal expressions, and they allow complex behavior to be specified with very little code (Sugar, PSL, and OVA are introduced in more detail later in this chapter). Such languages are often used to define assertions/properties in “side-files” that are maintained outside the main HDL design representation. These side-files may be accessed during parser/compile time and implemented in a declarative fashion. Alternatively, a parser/compiler/simulator may be augmented so as to allow statements in the special language to be embedded directly in the HDL as in-line code or as pragmas (see the definition of “pragma” in the previous section); in both of these cases, the statements may be implemented in a declarative and/or procedural manner (see the definitions of “declarative” and “procedural” in the previous section).

■ Special statements in the HDL itself: Right from the get-go, VHDL came equipped with a simple assert statement that checks the value of a Boolean expression and displays a user-specified text string if the expression evaluates False.
The original Verilog did not include such a statement, but SystemVerilog has been augmented to include this capability. The advantage of this technique is that these statements are ignored by synthesis engines, so you don’t have to do anything special to prevent them from being physically implemented as logic gates in the final design. The disadvantage is that they are relatively simplistic compared to special assertion/property languages and are not well equipped to specify complex temporal sequences (although SystemVerilog is somewhat better than VHDL in this respect).

■ Models written in the HDL and called from within the HDL: This concept refers to having access to a library of internally or externally developed models. These models represent assertions/properties using standard HDL statements, and they may be instantiated in the design like any other blocks. However, these instantiations will be wrapped by synthesis OFF/ON pragmas to ensure that they aren’t physically implemented. A good example of this approach is the open verification library (OVL) from the Accellera standards committee (www.accellera.org), as discussed in the next section.

1936: Fluorescent lighting is introduced.
1936: The Berlin Olympics are televised.

■ Models written in the HDL and accessed via pragmas: This is similar in concept to the previous approach in that it involves a library of models that represent assertions/properties using standard HDL statements. However, as opposed to instantiating these models directly from the main design code, they are pointed to by pragmas. A good example of this technique is the CheckerWare® library from 0-In Design Automation (www.0-In.com).
For example, consider a design containing the following line of Verilog code:

reg [5:0] STATE_VAR; // 0in one_hot

The left-hand side of this statement declares a 6-bit register called STATE_VAR, which we can assume is going to be used to hold the state variables associated with an FSM. Meanwhile, the right-hand side (“0in one_hot”) is a pragma. Most tools will simply treat this pragma as a comment and ignore it, but 0-In’s tools will use it to call a corresponding “one-hot” assertion/property model from their CheckerWare library. Note that the 0-In implementation means that you don’t need to specify the variable, the clocking, or the bit-width of the assertion; this type of information is all picked up automatically. Also, depending on a pragma’s position in the code, it may be implemented in a declarative or procedural manner.

Static formal versus dynamic formal

This is a little tricky to wrap one’s brain around, so let’s take things step by step. First of all, you can use assertions/properties in a simulation environment. In this case, if you have an assertion/property along the lines of “Signals A and B should never be active (low) at the same time,” then if this illegal case occurs during the course of a simulation, a warning flag will be raised, and the fact this happened can be logged.

Simulators can cover a lot of ground, but they require some sort of testbench or a verification environment that is dynamically generating stimulus. Another consideration is that some portions of a design are going to be difficult to verify via simulation because they are deeply buried in the design, making them difficult to control from the primary inputs. Alternatively, some areas of a design that have large amounts of complex interactions with other state machines or external agents will be difficult to control.

At the other end of the spectrum is static formal verification.
These tools are incredibly rigorous and they examine 100 percent of the state space without having to simulate anything. Their disadvantage is that they can typically be used for small portions of the design only, because the state space increases exponentially with complex properties, and one can quickly run into a “state space explosion.” By comparison, logic simulators, which can also be used to test for assertions, can cover a lot of ground, but they do require stimuli, and they don’t cover every possible case.

In order to address these issues, some solutions combine both techniques. For example, they may use simulation to reach a corner condition and then automatically pause the simulator and invoke a static formal verification engine to exhaustively evaluate that corner condition. (In this context, a general definition of a “corner condition” or “corner case” is a hard-to-exercise or hard-to-reach functional condition associated with the design.) Once the corner condition has been evaluated, control will automatically be returned to the simulator, which will then proceed on its merry way. This combination of simulation and traditional static formal verification is referred to as dynamic formal verification.

As one simple example of where this might be applicable, consider a FIFO memory, whose “Full” and “Empty” states may be regarded as corner cases. Reaching the “Full” state will require a lot of clock cycles, which is best achieved using simulation. But exhaustively evaluating attributes/properties associated with this corner case, such as the fact that it should not be possible to write any more data while the FIFO is full, is best achieved using static techniques.

Once again, a good example of this dynamic formal verification approach is provided by 0-In. Corner cases are explicitly defined as such in their CheckerWare library models. When a corner case is reached during simulation, the simulator is paused, and a static tool is used to analyze that corner case in more detail.

1937: America. George Robert Stibitz, a scientist at Bell Labs, builds a simple digital calculator machine based on relays called the Model K.

Summary of different languages, etc.

This is where things could start to get really confusing if we’re not careful (so let’s be careful). We’ll begin with something called Vera®, which began life with work done at Sun Microsystems in the early 1990s. It was provided to Systems Science Corporation somewhere around the mid-1990s, which was in turn acquired by Synopsys in 1998.

Vera is essentially an entire verification environment, similar to, but perhaps not as sophisticated as, the e verification language/environment introduced earlier in this chapter. Vera encapsulates testbench features and assertion-based capabilities, and Synopsys promoted it as a stand-alone product (with integration into the Synopsys logic simulator). Sometime later, due to popular demand, Synopsys opened things up for third-party use by making OpenVera™ and OpenVera Assertions (OVA) available. Somewhere around this time, SystemVerilog was equipped with its first pass at an assert statement.

OVA is pronounced by spelling it out as “O-V-A.” With regard to OVA, the original version drew on Synopsys’s strength in simulation technologies. The folks at Synopsys subsequently desired to extend OVA to support formal property verification, so they partnered with the guys and gals at Intel to build on their experience in formal verification with their internally developed ForSpec assertion language. The result was OVA 2.0, which included powerful constructs for both static and dynamic formal verification.
OVL is pronounced by spelling it out as “O-V-L.”
PSL is pronounced by spelling it out as “P-S-L.”
Meanwhile, due to the increasing interest in formal verification technology, one of the Accellera standards committees started to look around for a formal verification language it could adopt as an industry standard. A number of languages were evaluated, including OVA, but in 2002, the committee eventually opted for the Sugar language from IBM. Just to add to the fun and frivolity, Synopsys then donated OVA to the Accellera committee in charge of SystemVerilog (this was a different committee from the one evaluating formal property languages). Yet another Accellera committee ended up in charge of something called the open verification library, or OVL, which refers to a library of assertion/property models available in both VHDL and Verilog 2K1.

So now we have the assert statements in VHDL and SystemVerilog, OVL (the library of models), OVA (the assertion language), and the property specification language (PSL), which is the Accellera version of IBM’s Sugar language (Figure 19-16).13

Figure 19-16. Trying to put everything into context and perspective.
[Figure not reproduced. Axes: verification style (black box, gray box, white box) versus design level (block level, design engineer; sub-system level; system level, verification engineer). Labels: PSL (black box at the system level); cone of influence of SystemVerilog (white box at the block level); SystemVerilog with OVA; OVL; IP. Annotations: “A lot of this middle ground is covered by IP and interface protocols in the form of verification IP monitors and protocol checkers” and “The effects start to diminish as we approach the system level, but they are persistent.”]

The advantage of PSL is that it has a life of its own in that it can be used independently of the languages used to represent the functionality of the design itself. The disadvantage is that it doesn’t look like any of the languages design engineers are familiar with, such as VHDL, Verilog, C/C++, and the like. There is some talk of spawning various flavors of PSL, such as a VHDL PSL, a Verilog PSL, a SystemC PSL, and so forth; the syntax would differ among these flavors so as to match the target language, but their semantics would be identical.

It’s important to note that Figure 19-16 just reflects one view of the world, and not everyone will agree with it (some folks will consider this to be a brilliant summation of an incredibly confusing situation, while others will regard it as being a gross simplification at best and utter twaddle at worst).

13 Don’t make the common mistake of referring to “PSL/Sugar” as a single/combined language. There’s PSL and there’s Sugar and they’re not the same thing. PSL is the Accellera standard, while Sugar is the language used inside IBM.

1937: England. Graduate student Alan Turing invents a theoretical (thought experiment) computer called the Turing Machine.
1937: England. Graduate student Alan Turing writes his groundbreaking paper On Computable Numbers with an Application to the Entscheidungsproblem.

Miscellaneous

HDL to C conversion

As we discussed in Chapter 11, there is an increasing push toward capturing designs at higher levels of abstraction such as C/C++. In addition to facilitating architectural exploration, high-level (behavioral and/or algorithmic) C/C++ models can simulate hundreds or thousands of times faster than can their HDL/RTL counterparts.

Having said this, many design engineers still prefer to work in their RTL comfort zone. The problem is that when you are simulating an entire SoC with an embedded processor core, memory, peripherals, and other logic all represented in RTL, you are lucky to achieve simulation speeds of more than a couple of hertz (that is, a few cycles of the main system clock for each second in real time).
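To get a feel for what a couple of hertz means in practice, here is a back-of-the-envelope calculation in Python. The 100 MHz system clock and the 1 ms of target run time are assumptions chosen purely for illustration (the text gives neither), and the function name wall_clock_seconds is hypothetical.

```python
def wall_clock_seconds(real_time_s: float, clock_hz: float, sim_speed_hz: float) -> float:
    """Wall-clock seconds needed to simulate `real_time_s` of system operation.

    clock_hz     -- frequency of the main system clock in the real design
    sim_speed_hz -- simulated clock cycles completed per wall-clock second
    """
    cycles = real_time_s * clock_hz  # clock cycles that must be simulated
    return cycles / sim_speed_hz


if __name__ == "__main__":
    # Assumed figures: 1 ms of operation of a 100 MHz design, simulated at 2 Hz.
    seconds = wall_clock_seconds(real_time_s=1e-3, clock_hz=100e6, sim_speed_hz=2.0)
    print(f"{seconds:.0f} s, or about {seconds / 3600:.1f} hours")
```

Under these assumed figures, a single millisecond of real operation costs roughly 50,000 seconds (about 14 hours) of simulation, which is why running any meaningful amount of software on an RTL model is impractical.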
In order to address this problem, some EDA companies are starting to offer ways to translate your “Golden RTL” models into faster-simulating alternatives that can achieve kilohertz simulation speeds.14 This is fast enough to allow you to run software on your hardware representation for milliseconds of real run time. In turn, this allows you to test critical foundation software, such as drivers, diagnostics, and firmware, so that system validation and verification can proceed much faster than with traditional methods.

14 One interesting solution is the VTOC™ (Verilog-to-C) translator from Tenison Technology Ltd. (www.tenison.com). Another is the SPEEDCompiler™ and DesignPlayer™ concept from Carbon Design Systems Inc. (www.carbondesignsystems.com).

Code coverage, etc.

In the not-so-distant past, code coverage tools were specialist items provided by third-party EDA vendors. However, this capability is now considered important enough that all of the big boys have code coverage integrated into their verification (simulation) environments, but, of course, the feature sets vary among offerings.

By now, it may not surprise you to learn that there are a lot of different flavors of code coverage, summarized briefly in order of increasing sophistication as follows:

■ Basic code coverage: This is just line coverage; that is, how many times each line in the source code is hit (executed).

■ Branch coverage: This refers to conditional statements like if-then-else; how many times do you go down the then path and how many down the else path.

■ Condition coverage: This refers to statements along the lines of “if (a OR b == TRUE) then.” In this case, we are interested in the number of times the then path was taken because variable a was TRUE compared to the number of times variable b was TRUE.

■ Expression coverage: This refers to expressions like “a = (b AND c) OR !d”.
In this case, we are interested in analyzing the expression to determine all of the possible combinations of input values and also which combinations triggered a change in the output and which variables were never tested.

■ State coverage: This refers to analyzing state machines to determine which states were visited and which ones were neglected, as well as which guard conditions and paths between states are taken, and which aren’t, and so forth. You can derive this sort of information from line coverage, but you have to read between the lines (pun intended).

■ Functional coverage: This refers to analyzing which transaction-level events (e.g., memory-read and memory-write transactions) and which specific combinations and permutations of these events have been exercised.

■ Assertion/property coverage: This refers to a verification environment that can gather, organize, and make available for analysis the results from all of the different simulation-driven, static formal, and dynamic formal assertion-/property-based verification engines. This form of coverage can actually be split into two camps: specification-level coverage and implementation-level coverage. In this context, specification-level coverage measures verification activity with respect to items in the high-level functional or macroarchitecture definition. This includes the I/O behaviors of the design, the types of transactions that can be processed (including the relationships of different transaction types to each other), and the data transformations that must occur. By comparison, implementation-level coverage measures verification activity with respect to microarchitectural details of the actual implementation.

1937: Pulse-code modulation points the way towards digital radio transmission.
1938: American Claude E. Shannon publishes an article (based on his master’s thesis at MIT) showing how Boolean Algebra can be used to design digital circuits.
This refers to design decisions that are embedded in the RTL that result in implementation-specific corner cases, for example, the depth of a FIFO buffer and the corner cases for its “high-water mark” and “full” conditions. Such implementation details are rarely visible at the specification level.

Performance analysis

One final feature that’s important in a modern verification environment is its support for performance analysis. This refers to having some way of analyzing and reporting exactly where the simulator is spending its time. This allows you to focus on high-activity areas of your design, which may reap huge rewards in terms of final system performance.

1938: Argentina. Hungarian Laszlo Biro invents and patents the first ballpoint pen.
1938: Germany. Konrad Zuse finishes the construction of the first working mechanical digital computer (the Z1).
1938: John Logie Baird demonstrated live TV in colour.
1938: America. Radio drama War of the Worlds causes widespread panic.
1938: Television broadcasts can be taped and edited.
1938: Walter Schottky discovers the existence of holes in the band structure of semiconductors and explains metal/semiconductor interface rectification.

Chapter 20
Choosing the Right Device

So many choices

Many aspects of life would be so much simpler if we were presented with fewer alternatives. For example, ordering a seemingly simple American Sunday brunch comprising eggs, bacon, hash browns (fried potatoes), and toast can take an inordinate amount of time because there are so many options to choose from. First, your waitress is going to ask you how you want your eggs (sunny-side up, over-easy, over-medium, over-hard, scrambled, poached, hard-boiled, in an omelet, etc.).
Next, you will be asked if you want American or Canadian bacon; should your hash browns be complemented by onions, tomatoes, cheese, ham, chili, or any combination thereof; would you like the bread for your toast to be white, rye, whole wheat, stone ground, sourdough ...

The frightening thing is that the complexity of ordering brunch pales in comparison to choosing an FPGA because there are so many product families from the different vendors. Product lines and families from the same vendor overlap; product lines and families from different vendors both overlap and, at the same time, sport different features and capabilities; and things are constantly changing, seemingly on a daily basis.

If only there were a tool

Before we start, it’s worth noting that size isn’t everything in the FPGA design world. You really need to base your FPGA selection on your design needs, such as number of I/O pins, available logic resources, availability of special functional blocks, and so forth.

1939: America. George Robert Stibitz builds a digital calculator called the Complex Number Calculator.
1939: America. John Vincent Atanasoff (and Clifford Berry) may or may not have constructed the first truly electronic special-purpose digital computer called the ABC (but it didn’t work till 1942).

Another consideration is whether you already have dealings with a certain FPGA vendor and product family, or whether you are plunging into an FPGA design for the very first time. If you already have a history with a vendor and are familiar with using its components, tools, and design flows, then you will typically stay within that vendor’s offerings unless there’s an overriding reason for change.

For the purposes of the remainder of these discussions, however, we’ll assume that we are starting from ground zero and have no particular affiliation with any vendor. In this case, choosing the optimum device for a particular design is a daunting task.
Becoming familiar with the architectures, resources, and capabilities associated with the various product families from the different FPGA vendors demands a considerable amount of time and effort. In the real world, time-to-market pressures are so intense that design engineers typically have sufficient time to make only high-level evaluations before settling on a particular vendor, family, and device. In this case, the selected FPGA is almost certainly not the optimum component for the design, but this is the way of the world.

Given a choice, it would be wonderful to have access to some sort of FPGA selection wizard application (preferably Web based). This would allow you to choose a particular vendor, a selection of vendors, or make the search open to all vendors.

For the purposes of a basic design, the wizard should then prompt you to enter estimates for such things as ASIC equivalent gates or FPGA system gates (assuming there are good definitions as to what equivalent gates and system gates are; see also Chapter 4). The wizard should also prompt for details on I/O pin requirements, I/O interface technologies, acceptable packaging options, and so forth.

In the case of a more advanced design, the wizard should prompt you for any specialist options such as gigabit transceivers or embedded functions like multipliers, adders, MACs, RAMs (both distributed and block RAM), and so forth. The wizard should also allow you to specify if you need access to embedded processor cores (hard or soft) along with selections of associated peripherals.

Last, but not least, it would be nice if the wizard would prompt you as to any IP requirements (hey, since we’re dreaming, let’s dream on a grand scale). Finally, clicking the “Go” button would generate a report detailing the leading contenders and their capabilities (and costs).
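The core of such a wizard would be little more than a filter over a device catalog, as the following Python sketch suggests. Everything here is hypothetical and intended only as an illustration: the Device and Requirements fields are one guess at what the wizard might ask for, and the catalog entries (names, vendors, resource counts, and costs) are invented placeholders rather than real parts.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Device:
    """One catalog entry; every value used below is invented."""
    name: str
    vendor: str
    io_pins: int
    logic_cells: int
    block_rams: int
    multipliers: int
    gigabit_transceivers: int
    unit_cost: float


@dataclass
class Requirements:
    """Minimum resources the design is estimated to need."""
    io_pins: int
    logic_cells: int
    block_rams: int = 0
    multipliers: int = 0
    gigabit_transceivers: int = 0


def select_devices(catalog: List[Device], req: Requirements) -> List[Device]:
    """Return the devices that meet every requirement, cheapest first."""
    fits = [
        d for d in catalog
        if d.io_pins >= req.io_pins
        and d.logic_cells >= req.logic_cells
        and d.block_rams >= req.block_rams
        and d.multipliers >= req.multipliers
        and d.gigabit_transceivers >= req.gigabit_transceivers
    ]
    return sorted(fits, key=lambda d: d.unit_cost)


if __name__ == "__main__":
    # Placeholder catalog: not real parts, vendors, or prices.
    catalog = [
        Device("DEV_A", "VendorX", 200, 10_000, 20, 0, 0, 15.0),
        Device("DEV_B", "VendorX", 400, 40_000, 80, 40, 4, 90.0),
        Device("DEV_C", "VendorY", 350, 30_000, 60, 30, 0, 55.0),
    ]
    req = Requirements(io_pins=300, logic_cells=25_000, multipliers=16)
    for d in select_devices(catalog, req):
        print(f"{d.name} ({d.vendor}): cost {d.unit_cost}")
```

The hard part of a real tool would not be this filtering step but gathering comparable data across vendors and keeping it current.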
Returning to the real world with a sickening thump, we remember that no such utility actually exists at this time,1 so we have to perform all of these evaluations by hand, but wouldn’t it be nice … Of course, creating this sort of application would be nontrivial, and maintaining it would be demanding and time-consuming, but I’m sure that system houses or design engineers would happily pay some sort of fee for such a service should anyone be brave enough to pick up the challenge and run with it.

1 There used to be tools like this to aid in selecting PLDs, but that was a significantly less complex solution space.

1939: Bell Labs begin testing high-frequency radar.
1939: Light-emitting diodes (LEDs) are patented by Messrs. Bay and Szigeti.
1939: England. Regular TV broadcasts begin.
1940: America. George Robert Stibitz performs first example of remote computing between New York and New Hampshire.

Technology

One of your first choices is going to be deciding on the underlying FPGA technology. Your main options are as follows:

■ SRAM based: Although very flexible, this requires an external configuration device and can take up to a few seconds to be configured when the system is first powered up. Early versions of these devices could have substantial power supply requirements due to high transient startup currents, but this problem has been addressed in the current generation of devices. One key advantage of this option is that it is based on standard CMOS technology and doesn’t require any esoteric process steps. This means that SRAM-based FPGAs are among the first components to become available at the most current technology node.

■ Antifuse based: Considered by many to offer the most security with regard to design IP, this also provides advantages like low power consumption, instant-on availability, and no requirement for any external configuration devices (which saves circuit board cost, space, and weight).
Antifuse-based devices are also more radiation hardened than any of the other technologies, which makes them of particular interest for aerospace-type applications. On the downside, this technology is a pain to prototype with because it’s one-time programmable (OTP). Antifuse devices are also typically one or more generations behind the most current technology node because they require additional process steps compared to standard CMOS components.

■ FLASH based: Although considered to be more secure than SRAM-based devices, these are slightly less secure than antifuse components with regard to design IP. FLASH-based FPGAs don’t require any external configuration devices, but they can be reconfigured while resident in the system if required. In the same way as antifuse components, FLASH-based devices provide advantages like instant-on capability, but are also typically one or more generations behind the most current technology node because they require additional process steps compared to standard CMOS components. Also, these devices typically offer a much smaller logic (system) gate-count than their SRAM-based counterparts.

Basic resources and packaging

Once you’ve decided on the underlying technology, you need to determine which devices will satisfy your basic resource and packaging requirements. In the case of core resources, most designs are pin limited, and it’s typically only in the case of designs featuring sophisticated algorithmic processing like color space conversion that you will find yourself logic limited. Regardless of the type of design, you will need to decide on the number of I/O pins you are going to require and the approximate number of fundamental logical entities (LUTs and registers).

As discussed in Chapter 4, the combination of a LUT, register, and associated logic is called a logic element (LE) by some and a logic cell (LC) by others.
It is typically more useful to think in these terms as opposed to higher-level structures like slices and configurable logic blocks (CLBs) or logic array blocks (LABs) because the definition of these more sophisticated structures can vary between device families.

Next, you need to determine which components contain a sufficient number of clock domains and associated PLLs, DLLs, or digital clock managers (DCMs).

Last, but not least, if you have any particular packaging requirements in mind, it would be a really good idea to ensure that the FPGA family that has caught your eye is actually available in your desired package. (I know this seems obvious, but would you care to place a bet that no one ever slipped up on this point before?)

1940: Bell Labs conceives the idea of cell phones (but the technology won’t exist to bring them to market for another 30 years).
1941: First microwave transmissions.
1941: First touch-tone phone system (too expensive for general use).
1942: Germany. Between 1942 and 1945/6, Konrad Zuse develops the idea for a high-level computer programming language called Plankalkül.

General-purpose I/O interfaces

The next point to ponder is which components have configurable general-purpose I/O blocks that support the signaling standard(s) and termination technologies required to interface with the other components on the circuit board.

Let’s assume that way back at the beginning of the design process, the system architects selected one or more I/O standards for use on the circuit board. Ideally, you will find an FPGA that supports this standard and also provides all of the other capabilities you require. If not, you have several options:

■ If your original FPGA selection doesn’t provide any must-have capabilities or functionality, you may decide to opt for another family of FPGAs (possibly from another vendor).
If your original FPGA selection does provide some must-have capabilities or functionality, you may decide to use some external bridging devices (this is expensive and consumes board real estate). Alternatively, in conjunction with the rest of the system team, you may decide to change the circuit board architecture (this can be really expensive if the system design has progressed to any significant level). Embedded multipliers, RAMs, etc. At some stage you will need to estimate the amount of distributed RAM and the number of embedded block RAMs you are going to require (along with the required widths and depths of the block RAMs). Similarly, you will need to muse over the number of special embedded functions (and their widths and capabilities) like multipliers and adders. In the case of DSP-centric designs, some FPGAs may contain embedded functions like MACs that will be particularly useful for this class of design problem and may help to steer your component selection decisions. Embedded processor cores If you wish to use an embedded processor core in your design, you will need to decide whether or not a soft core will suffice (such a core may be implemented across a number of device families) or if a hard core is the order of the day (see also the discussion in Chapter 13). In the case of a soft core, you may decide to use the offering supplied by an FPGA vendor. In this case, you are going to become locked into using that vendor, so you need to evaluate the various alternatives carefully before taking the plunge. Alternatively, you may decide to use a third-party Choosing the Right Device soft-core solution that can be implemented using devices from multiple vendors.2 If you decide on a hard core, you have little option but to become locked into a particular vendor. One consideration that may affect your decision process is your existing experience with different types of processors. 
Let’s say that you hold a black belt in designing systems based around the PowerPC, for example. In such a case, you would want to preserve your investment in PowerPC design tools and flows (and your experience and knowledge in using such tools and flows). Thus, you would probably decide on an FPGA offering from Xilinx because they support the PowerPC. Alternatively, if you are a guru with respect to ARM or MIPS processors, then selecting devices from Altera or QuickLogic, respectively, may be the way to go.

Gigabit I/O capabilities

If your system requires the use of gigabit transceivers, then points to consider are the number of such transceivers in the device and the particular standard that’s been selected by your system architects at the circuit board level (see also Chapter 21).

IP availability

Each of the FPGA vendors has an IP portfolio. In many cases there will be significant overlap between vendors, but more esoteric functions may only be available from selected vendors, which may have an impact on your component selection.

Alternatively, you may decide to purchase your IP from a third-party provider. In such a case, this IP may be available for use with multiple FPGAs from different vendors, or it may only be available for use with a subset of vendors (and a subset of device families from those vendors).

² An example of this type of solution is the Nexar offering from Altium Ltd. (www.altium.com), which was introduced in Chapter 13.
One further point: We commonly think of IP in terms of hardware design functions, but some IP may come in the form of software routines.³ For example, consider a communications function that might be realized as a hardware implementation in the FPGA fabric or as a software stack running on the embedded processor. In the latter case, you might decide to purchase the software stack routines from a third party, in which case you are essentially acquiring software IP.

³ There’s also Verification IP, as discussed in chapter 19.

Speed grades

Once you’ve decided on a particular FPGA component for your design, one final decision is the speed grade of this device. The FPGA vendors’ traditional pricing model makes the performance (speed grade) of a device a major factor with regard to the cost of that device.

As a rule of thumb, moving up a speed grade will increase performance by 12 to 15 percent, but the cost of the device will increase by 20 to 30 percent. Conversely, if you can manipulate the architecture of your design to improve performance by 12 to 15 percent (say, by adding additional pipelining stages), then you can drop a speed grade and save 20 to 30 percent on the cost of your silicon (FPGA).

If you are only contemplating a single device for prototyping applications, then this may not be a particularly significant factor for you. On the other hand, if you are going to be purchasing hundreds or thousands of these little rascals, then you should start thinking very seriously about using the lowest speed grade you can get away with.

The problem is that modifying and reverifying RTL to perform a series of what-if evaluations of alternative implementations is difficult and time-consuming. (Such evaluations may include performing certain operations in parallel versus sequentially, pipelining portions of the design versus nonpipelining, resource sharing, etc.) This means that the design team may be limited in the number of evaluations it can perform, which can result in a less-than-optimal implementation.

As discussed in chapter 11, one alternative is to use a pure untimed C/C++-based flow. Such a flow should feature a C/C++ analysis and synthesis engine that allows you to perform microarchitecture trade-offs and evaluate their effects in terms of size/area and speed/clock cycles. Such a flow facilitates improving the performance of a design, thereby allowing it to make use of a slower speed grade if required.

On a happier note

My friend Tom Dillon said that after scaring everyone with the complexities above, I should end on a happier note. So, on the bright side, once a design team has selected an FPGA vendor and become familiar with a product family, it tends to stick with that family for quite some time, which makes life (in the form of the device selection process) a lot easier for subsequent projects.

Chapter 21
Gigabit Transceivers

Introduction

As we discussed in chapter 4, the traditional way to move large amounts of data between two (or more) devices on the same circuit board is to use a bus, which refers to a collection of signals that carry similar data and perform a common function (Figure 21-1).

Figure 21-1. Using a bus to communicate between devices.

Early microprocessor-based systems circa 1975 used 8-bit buses to pass data around.
As the need to push more data around and to move it faster grew, buses increased to 16 bits in width, then 32 bits, then 64 bits, and so forth. The problem is that this consumes a lot of pins on each device and requires a lot of tracks to connect the devices together. Routing these tracks such that they are all the same length and impedance and so forth becomes increasingly painful as boards grow in complexity. Furthermore, it becomes increasingly difficult to manage SI issues (such as susceptibility to noise and crosstalk effects) when you are dealing with large numbers of bus-based tracks.

For this reason, today’s high-end FPGAs include special hard-wired gigabit transceiver blocks. These high-speed serial interfaces use one pair of differential signals to transmit (TX) data and another pair to receive (RX) data (Figure 21-2).

Figure 21-2. Using high-speed transceivers to communicate between devices.

Note that, unlike a traditional data bus in which you can have lots of devices hanging off the bus, these high-speed serial interfaces are point-to-point connections, which means that each transceiver can only talk to a single transceiver on one other device.

At the time of this writing, relatively few designs (probably only a few percent of total design starts) make use of these high-speed serial interfaces, but this number is expected to rise dramatically over the next few years. Using these gigabit transceivers is something of an art form, but each FPGA vendor will provide detailed user guides and application notes for its particular technology.

One problem with these interfaces is that there are so many nitty-gritty details to wrap one’s brain around.
For the purposes of this book, however, we shall introduce only enough of the main concepts to give the unwary sufficient information to make them dangerous!

Differential pairs

The reason for using differential pairs (which refers to a pair of tracks that always carry complementary logical levels) is that these signals are less susceptible to noise from an external source, such as radio interference or another signal switching in close proximity to these tracks. In order to illustrate this, consider the same amount of noise applied to both a single wire and a differential pair (Figure 21-3).

Figure 21-3. Using high-speed transceivers to communicate between devices.

In the case of the standard input, we have a pin called IN connected to a buffer gate. For the purposes of this example, we aren’t particularly interested in the first noise spike (a), but the second spike (b) could cause problems. If this noise spike crosses the input switching threshold of the buffer gate, it could cause a glitch (pulse) on the output of the gate. In turn, this glitch could cause some undesired activity (such as registers loading incorrect values) inside the FPGA.

Things were somewhat easier in the not-so-distant past when the difference between logic 0 and logic 1 values was 5 volts because a noise spike of, say, 1 volt wouldn’t cause any problems. But the sands of time have slipped through the hourglass as is their wont, and depending on the I/O standard you are using, the difference between a logic 0 and a logic 1 may now be only 1.8 volts, 1.5 volts, or even less. In this case, a noise spike much smaller than 1 volt could be devastating.¹

¹ In the case of differential pairs, one standard has a differential voltage (the difference between a logic 0 and a logic 1) of only 0.175 volts (175 millivolts)!

Now consider the differential pair, whose signals are generated by a special type of driving gate in the transmitting device (Figure 21-4). For the purists among us, we should note that the positive (true) halves of the differential pairs (RXP and TXP in Figures 21-3 and 21-4, respectively) are usually drawn on the top, while the negative (inverse or complementary) halves (RXN and TXN), along with the bobbles (circles) on their buffer symbols, are usually drawn on the bottom. The reason we drew them the other way round was to make the RXP signal match up with the IN signal in Figure 21-3, thereby making this figure a little easier to follow.

Figure 21-4. Generating a differential pair.

Remember that the two signals on a differential pair always carry complementary logical values. So when RXP in Figure 21-3 is a logic 0, RXN will be a logic 1, and vice versa. The point is that, as we see in Figure 21-3, the fact that the two tracks forming the differential pair are routed very closely together means that any noise spikes will affect both tracks identically. The receiving buffer gate is essentially interested only in the difference between the two signals, which means that differential pairs are much less susceptible to the effects of noise than are connections formed from individual wires.

The end result is that, assuming the circuit board is designed appropriately, these transceivers can operate at incredibly high speeds. Furthermore, each FPGA may contain a number of these transceiver blocks and, as we shall see, several transceivers can be “ganged together” to provide even higher data transfer rates.

Multiple standards

Of course, electronics wouldn’t be electronics if there weren’t a variety of standards for this sort of thing.
Each standard defines things from the high-level protocols all the way down to the physical layer (PHY). A few of the more common standards are as follows:

■ Fibre Channel
■ InfiniBand®
■ PCI Express (started and pushed by Intel Corporation)
■ RapidIO™
■ SkyRail™ (from Mindspeed Technologies™)
■ 10-gigabit Ethernet

This situation is further complicated by the fact that, in the case of some of these standards, like PCI Express and SkyRail, device vendors might use the same underlying concepts, but rebrand things using their own names and terminology. Also, implementing some standards requires the use of multiple transceiver blocks (see also the “Ganging multiple transceiver blocks together” section later in this chapter).

Let’s assume that we are building a circuit board and wish to use some form of high-speed serial interface. In this case, the system architects will determine which standard is to be used. Each of the gigabit transceiver blocks in an FPGA can generally be configured to support a number of different standards, but usually not all of them. This means that the system architects will either select a standard that is supported by the FPGAs they intend to use, or they will select FPGAs that will support the interface standard they wish to employ.

If the system under consideration includes creating one or more ASICs, we can of course implement the standard of our choice from the ground up (or, more likely, we would purchase an appropriate block of IP from a third-party vendor). Off-the-shelf (ASSP-type) devices, however, will typically support only one, or a subset, of the above standards. In this case, an FPGA may be used to act as an interface between two (or more) standards (Figure 21-5).
Figure 21-5. Using an FPGA to interface between multiple standards.

8-bit/10-bit encoding, etc.

One problem that rears its ugly head when you are talking about signals with data rates of gigabits per second is that the circuit board and its tracks absorb a lot of the high-frequency content of the signal, which means that the receiver only gets to see a drastically attenuated version of that signal.

Unfortunately, this is something that doesn’t make much sense in words, so let’s take a peek at some illustrations. First, let’s consider an ideal signal that’s alternating between logic 0 and logic 1 values (Figure 21-6).

Figure 21-6. An ideal signal.

Full-blown engineers will immediately spot some errors in this diagram. For example, the signal generated by the transmitting chip is shown as being a pure digital square wave, but in the real world such a signal would actually have significant analog characteristics. In reality, the best you can say at these frequencies is that the signal is horrible coming out (from the transmitting chip), and it’s even worse going in (to the receiving chip). Also, the signal seen by the receiver would be phase shifted from that shown in Figure 21-6, but we’ve aligned the two signals so that we can see which bits at the transmitting and receiving ends of the track are associated with each other.

As this illustration shows, the signal seen at the receiving end of the track has been severely attenuated, but it still oscillates above and below some median level, which will allow the receiver to detect it and pull useful information out of it. Now, let’s consider what would happen if we were to modify the previous sequence such that it commenced by transmitting a series of three consecutive logic 1 values (Figure 21-7).
Figure 21-7. The effects of transmitting a series of identical bit values.

In this case (and remembering that this is an over-the-top, pessimistic scenario intended purely for the purposes of providing an example for us to talk about), the signal seen by the receiver continues to rise throughout the course of the first three bits. This takes the signal above the median value, which means that when the sequence eventually returns to its original 010101… sequence, the receiver will actually continue to see it as a sequence of logic 1 values.

In the context of data communications, the individual binary digits (or sometimes words formed from a collection of digits) are referred to as symbols. The spreading or “smearing” of symbols where the energy from one symbol affects subsequent (downstream) symbols such that the received signal might be interpreted incorrectly is referred to as intersymbol interference (ISI, which is pronounced by spelling it out as “I-S-I”). Another term that you often hear in conjunction with this is consecutive identical digits (CIDs), which refers to occurrences such as our three logic 1 values shown in Figure 21-7.

As we noted earlier, the example shown in Figure 21-7 is overly pessimistic. In reality, it is only necessary to ensure that we never send more than five identical bits in a row. Thus, our high-speed transceiver blocks have to include some form of encoding, such as the 8-bit/10-bit (abbreviated to 8b/10b or 8B/10B) standard, in which each 8-bit chunk of data is augmented by two extra bits to ensure that we never send more than five 0s or five 1s in a row. Furthermore, this standard ensures that the signal is always DC-balanced (that is, it has the same amount of energy above and below the median) over the course of 20 bits (two chunks).
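The two properties just described (a bounded run of identical digits and DC balance) are easy to check on any captured bit sequence. The following Python is only a minimal sketch of such a check; it is not an 8B/10B encoder, and the helper names `max_run_length` and `disparity` are hypothetical, introduced here purely for illustration.

```python
def max_run_length(bits):
    """Return the length of the longest run of consecutive identical digits."""
    longest = 0
    current = 0
    previous = None
    for bit in bits:
        # Extend the current run, or start a new run of length 1.
        current = current + 1 if bit == previous else 1
        longest = max(longest, current)
        previous = bit
    return longest


def disparity(bits):
    """Return (number of 1s) minus (number of 0s); zero means DC-balanced."""
    ones = sum(bits)
    return ones - (len(bits) - ones)


# The pessimistic sequence discussed with Figure 21-7: three 1s, then 010101.
stream = [1, 1, 1, 0, 1, 0, 1, 0, 1]
print(max_run_length(stream))  # longest run of identical digits: 3
print(disparity(stream))       # six 1s minus three 0s: 3
```

A stream that respected the limits quoted above would report a longest run of no more than five and a disparity of zero over each 20-bit window.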
There are alternative encoding schemes to the 8B/10B standard, including 64B/66B (or 64b/66b) and SONET Scrambling. The “scrambling” portion of the latter appellation comes from the fact that, like all of the schemes discussed here, this standard serves to randomize (“scramble”) the patterns of 0s and 1s to prevent long strings of all 0s or all 1s.

One last point worth noting while we are here is that, in addition to addressing the problem presented in Figure 21-7, one of the main reasons for using these encoding schemes is to ease the task of recovering the clock signal from the data stream (see also the discussions on “Clock recovery, jitter, and eye diagrams” later in this chapter).

Delving into the transceiver blocks

Now that we’ve introduced the concept of 8B/10B encoding, we’re in a better position to take a slightly closer look at the main elements comprising a transceiver block (Figure 21-8).

Figure 21-8. The main elements composing a transceiver block.

As usual, this is a highly simplified representation that omits a lot of bits and pieces, but it serves to cover the points of interest to us here. With regard to the annotations on “pre-emphasis” and “equalization,” these topics are introduced later in this chapter.

On the transmitter side, bytes of data are presented to the transceiver from user-defined logic in the main FPGA fabric via an 8-bit bus. This is passed through an 8B/10B encoder and handed over into a FIFO buffer, which is used to store data temporarily when too many words arrive too closely together.

The output from the FIFO passes through a polarity flipper, which may be used to pass the data through unmodified or to flip each bit from a 0 to a 1 and vice versa (polarity flipping will only be required if the device we’re passing data to is expecting to see flipped data). In turn, the output from the polarity flipper is passed to a serializer, which converts the parallel input data into a serial stream of bits. This serial stream is then handed over to a special output driver/buffer that generates a differential signal pair.

Similarly, on the receiver side, a serial data stream presented as a differential signal pair is passed through a special input buffer into a deserializer, which converts the serial data into 10-bit words. These words are passed into a polarity flipper, which may be used to pass the data through unmodified or to flip each bit from a 0 to a 1 and vice versa (polarity flipping will only be required if the device we’re receiving data from is sending us flipped data).

The output from the polarity flipper is handed over to an 8B/10B decoder, which descrambles the data. The resulting 8-bit bytes are passed via a FIFO buffer into the main FPGA fabric, where they can be processed by whatever logic the design engineers decide to implement.

Note that, depending on the FPGA technology you are using, some transceiver blocks may support a variety of encoding standards, such as 8B/10B, 64B/66B, SONET Scrambling, and so forth. Others may support only a single standard like 8B/10B, but in this case it may be possible to switch out these blocks and implement your own encoding scheme in the main FPGA fabric if required.

Ganging multiple transceiver blocks together

The term baud rate refers to the number of times a signal in a communications link changes (or can change) per second.
Depending on the encoding technique used, a communications link can transmit one data bit (or fewer or more bits) with each baud, or change in state.

At the time of this writing, the current state of play is that each transceiver channel can transmit and receive 8B/10B-encoded data (or data encoded using a similar scheme) at baud rates up to 3.125 gigabits per second (Gbps).² This translates to 2.5 Gbps of real, raw data if we ignore the overhead of the additional bits added by the 8B/10B-encoding scheme (that is, a baud rate of 3.125 Gbps divided by 10 bits and multiplied by 8 bits equals a true data rate of 2.5 Gbps).

² Once we go over baud rates of 3.125 Gbps, the overhead associated with the 8B/10B-encoding scheme becomes too high, which means we have to go to another scheme such as 64B/66B encoding.

The problem is that, by definition, standards such as 10-gigabit Ethernet have data transfer requirements of 10 Gbps. For this reason, there are additional standards like the 10-gigabit attachment unit interface (XAUI, which is pronounced “zow-ee”) approach that defines how to achieve 10 Gbps of data throughput using four differential signal pairs in each direction (Figure 21-9).

Figure 21-9. Ganging multiple transceiver blocks together.

In this case, the four transceiver blocks are linked using special channel bonding control signals so that each block knows what it is supposed to do and when it is supposed to do it.
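The arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustration of the numbers quoted above; the function names `payload_rate_gbps` and `coding_overhead` are hypothetical, and the 64B/66B comparison simply applies the same formula to that scheme's block sizes.

```python
def payload_rate_gbps(line_rate_gbps, data_bits=8, coded_bits=10):
    """Rate of real data once the extra line-coding bits are discounted."""
    return line_rate_gbps * data_bits / coded_bits


def coding_overhead(data_bits, coded_bits):
    """Fraction of the transmitted bits that carry no payload."""
    return (coded_bits - data_bits) / coded_bits


lane_rate = payload_rate_gbps(3.125)   # one 8B/10B channel at 3.125 Gbps
ganged_rate = 4 * lane_rate            # four channels bonded together

print(lane_rate)     # 2.5
print(ganged_rate)   # 10.0
print(f"8B/10B overhead:  {coding_overhead(8, 10):.1%}")
print(f"64B/66B overhead: {coding_overhead(64, 66):.1%}")
```

Four channels at 2.5 Gbps of payload each account for the 10 Gbps that 10-gigabit Ethernet requires, which is why four transceiver blocks are ganged together in this case.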
At some stage in the future, largely dictated by the rate of adoption of high-speed serial interface technology at the circuit board level, it is likely that the functions currently embodied by the external interface chip will be incorporated into the FPGA itself, which will then have the ability to transmit and receive optical signals directly (see also the discussions in Chapter 26).

Configurable stuff

The gigabit transceiver blocks embedded in FPGAs typically have a number of configurable (programmable) features. Different vendors and device families may support different subsets of these features, a selection of the main ones being as follows.

Comma detection

The 8B/10B-encoding scheme (and other schemes) includes special comma characters. These are null characters that may be transmitted to keep the line “alive” or to initiate a data transfer by indicating to the receiver that things are about to start happening and it needs to wake up and prepare itself for action.

Another point is that these high-speed serial interfaces are asynchronous in nature, which means that the clock is embedded in the data signal (see also the discussions on “Clock recovery, jitter, and eye diagrams” later in this chapter). So when a transceiver block is ready to initiate a transfer, it will send a whole series of comma characters (several hundred bits) to allow the receiver at the other end of the line to synchronize itself. (Comma characters are also employed when aligning multiple bitstreams as discussed in the previous section.)

The point is that some transceiver blocks allow the comma character that will be transmitted (and received) to be configured to be any 10-bit value, thereby allowing the transceiver to support a variety of communications protocols.

Output differential swing

Different standards support different differential output swings, which refers to the peak-to-peak difference in voltage between logic 0 and logic 1 values.
Thus, transceiver blocks typically allow the differential output voltage swing to be configured across a range of values so as to support compatibility with a variety of serial system voltage levels.

On-chip termination resistors

The data rates supported by high-speed serial interfaces mean that using external termination resistors can cause discontinuities in the signals, so it’s typically recommended to use the on-chip termination resistors provided in the FPGA. The values of these on-chip terminating resistors are typically configurable (they can usually be set to 50 ohms or 75 ohms) so as to support a variety of different interface standards and circuit board environments.

Pre-emphasis

As was noted in the discussion associated with Figure 21-6, signals traveling across a high-speed serial interface are severely distorted (attenuated) by the time they arrive at the receiver because the circuit board and its tracks absorb a lot of the high-frequency content of the signal, leaving only the lower-frequency (more slowly changing) portions of the signal.

One technique that may be used to mitigate this effect is pre-emphasis, in which the first 0 in a string of 0s and the first 1 in a string of 1s are each given a bit of a boost with a slightly higher voltage (in this context, we will consider “string” to refer to one or more bits). In a way, we can think of this as applying our own distortion in the opposite direction to the distortion coming from the circuit board (Figure 21-10).

Figure 21-10. Applying pre-emphasis.
Once again, this illustration shows the signal generated by the transmitting chip as being an ideal representation (with sharp edges), but in the real world such a signal would actually have strong analog characteristics.

The amount of pre-emphasis to be applied is typically configurable so as to accommodate different circuit board environments. The amount of pre-emphasis required for a given high-speed link is a function of the position of the FPGA in relation to other components (which equates to track lengths), a variety of board characteristics, and the high-speed standard being employed. The amount of pre-emphasis to use may be determined by simulation runs or by rule of thumb.

Equalization

This is somewhat related to pre-emphasis as discussed above, except that it takes place at the receiver end of the high-speed interface (Figure 21-11).

Figure 21-11. Applying equalization.

Equalization refers to a special amplification stage that boosts higher frequencies more than lower ones. As for pre-emphasis, we can think of this as applying our own distortion in the opposite direction to the distortion coming from the circuit board.

The amount of equalization to be applied is typically configurable to accommodate different circuit board environments. Depending on the particular design, we might wish to use pre-emphasis, equalization, or a mixture of both.

One point worth noting is that, in the case of really long high-speed interface tracks on the circuit board (say, around 40 inches and above), it may be desirable to disable the internal equalization and to use an external equalizer device because the quality of equalization is typically better in a dedicated analog device than in an FPGA.
Having said this, FPGAs are increasing in sophistication with regard to this sort of thing (the different vendors are constantly leapfrogging each other with regard to technology), and factors such as the quality of on-chip equalization may affect your device selection.

Clock recovery, jitter, and eye diagrams

Clock recovery

High-speed serial interfaces are asynchronous in nature, which means that the clock is embedded in the data signal. Thus, the receiver portion of the transceiver includes clock and data recovery (CDR) circuitry that keys off the rising and falling edges of the incoming signal and automatically derives a clock that is representative of the incoming data rate.

As you can imagine, this would not be a major feat if the incoming signal were toggling back and forth between logic 0 and logic 1 values, in which case the clock and the data would effectively be identical (Figure 21-12a). Things get a little trickier when the signal becomes more complex (Figure 21-12b). For example, if the incoming signal commenced with three 1s followed by three 0s, we couldn’t fault the clock recovery function for making an initial guess that the clock frequency was only one third of its true value. As more data (and more transitions) arrive, however, the clock recovery function will refine its assumptions until it has derived the correct frequency.
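The way an initial guess gets refined as more transitions arrive can be illustrated with a toy model. The Python below is a deliberately naive sketch and not a description of real CDR circuitry: it assumes a jitter-free signal, a line that idles at logic 0, and a rule that the bit period can be no longer than the shortest gap seen between two edges. The names `edge_times` and `refine_bit_period` are hypothetical.

```python
def edge_times(bits, bit_period=1.0, idle_level=0):
    """Return the times at which the signal changes level."""
    times = []
    level = idle_level
    for index, bit in enumerate(bits):
        if bit != level:
            times.append(index * bit_period)
            level = bit
    return times


def refine_bit_period(edges):
    """Yield the running bit-period estimate after each new edge.

    Naive rule (an assumption of this sketch): the bit period is the
    shortest interval observed so far between two consecutive edges.
    """
    estimate = None
    for earlier, later in zip(edges, edges[1:]):
        interval = later - earlier
        estimate = interval if estimate is None else min(estimate, interval)
        yield estimate


# Three 1s, three 0s, then a busier pattern, with a true bit period of 1.0.
bits = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1]
print(list(refine_bit_period(edge_times(bits))))
# [3.0, 3.0, 1.0, 1.0, 1.0, 1.0]
```

The first two estimates are three times too long, which corresponds to the initial guess of one third of the true clock frequency; the estimate settles on the true bit period as soon as a single-bit pulse arrives.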
Figure 21-12. Recovering the clock signal: (a) a simple signal (1 0 1 0 1 0 1 0 1 0 1 0) in which every bit boundary is a real edge; (b) a more complex signal (1 1 1 0 0 0 1 0 1 1 0 1) containing both real edges and derived edges.

Once the receiver has locked down the clock, it uses this information to sample the incoming data stream at the center point of each bit in order to determine whether that bit is a logic 0 or a logic 1 (Figure 21-13). This is why, as we discussed earlier, a data transmission will commence with several hundred bits of comma characters to allow the receiver to lock on the clock and prepare itself for action. The clock recovery function will continue to monitor edges and constantly tweak the clock value to accommodate slight back-and-forth drifts in the clock caused by environmental conditions such as temperature and voltage variations.

Figure 21-13. Sampling the incoming signal (the data sample times fall midway between the real and derived edges).

Jitter and eye diagrams

The term jitter refers to short-term variations of signal transitions from their ideal positions in time. For example, if we were to take an incoming signal that was oscillating between logic 0 and logic 1 values (Figure 21-14a, b) and overlay the data associated with each clock cycle on top of the preceding cycles, we would start to see some fuzziness appearing (Figure 21-14c–f). This fuzziness is caused by a variety of factors, including the clock wandering slightly in the transmitting device and also the ISI effects we noted earlier (see also the discussion associated with Figure 21-7). In fact, we can go one step further, which is conceptually to fold each clock cycle in half, thereby overlaying the positive 0–1–0 pulses from the first half of the cycle with the negative 1–0–1 pulses from the second half of the cycle (Figure 21-14g).
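To make the ideas of center-point sampling and overlaying concrete, here is a small Python sketch. It assumes we already know the bit period and that ideal bit boundaries fall on exact multiples of it. The edge timestamps and the 400-picosecond bit period are invented for the example, as are the function names.

```python
def sample_times(n_bits, bit_period):
    """Center point of each of the first n_bits bit cells."""
    return [k * bit_period + bit_period / 2 for k in range(n_bits)]


def fold_edges(edge_times, bit_period):
    """Offset of each edge from the nearest ideal bit boundary.

    Folding every edge into a single bit period is the numerical
    equivalent of overlaying successive cycles on top of each other.
    """
    offsets = []
    for t in edge_times:
        offset = t % bit_period
        if offset > bit_period / 2:
            offset -= bit_period  # closer to the next boundary than the last
        offsets.append(offset)
    return offsets


def peak_to_peak_jitter(edge_times, bit_period):
    """Width of the 'fuzz' around the ideal edge position."""
    offsets = fold_edges(edge_times, bit_period)
    return max(offsets) - min(offsets)


# Hypothetical measured edge times in picoseconds, 400 ps bit period.
edges_ps = [2, 398, 805, 1196, 2001, 2797]
offsets = fold_edges(edges_ps, 400)
print(offsets)                             # [2, -2, 5, -4, 1, -3]
print(peak_to_peak_jitter(edges_ps, 400))  # 9
print(sample_times(3, 400))                # [200.0, 600.0, 1000.0]
```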
Once again, the waveforms shown in Figure 21-14 are unrealistic because they feature razor-sharp edges. Real-world signals would have analog characteristics. If we were to look at a real waveform in its folded form, it would look something like that shown in Figure 21-15.

Figure 21-14. Jitter.

Figure 21-15. Eye diagram and eye mask.

The result is a diagram whose center looks something like a human eye, so, perhaps not surprisingly, it’s referred to as an eye diagram. As jitter, attenuation, and other distortions increase, the center of the eye closes more and more. Thus, a lot of specifications define a geometric shape called the eye mask. This mask, which may be rectangular or hexagonal as shown here, represents the data valid window. As long as all of the curves fall outside of the eye mask, the high-speed interface will work.

The point of all of this is that if you are planning on using one of these high-speed serial communications interfaces, then you need to make sure that you have access to SI analysis tools that have been augmented to support the concept of eye diagrams.
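As a rough illustration of what such a tool checks, the following Python sketch tests folded waveform samples against a rectangular eye mask. The mask dimensions and sample values are made up for the example; real masks are defined by the relevant interface standard, and real tools work on dense captured or simulated waveforms rather than a handful of points.

```python
def mask_violations(samples, mask):
    """Return the samples that fall inside a rectangular eye mask.

    samples: (time, voltage) pairs, with time already folded into one
             bit period and expressed as a fraction of that period.
    mask:    (t_low, t_high, v_low, v_high) bounds of the keep-out region.
    """
    t_low, t_high, v_low, v_high = mask
    return [
        (t, v)
        for t, v in samples
        if t_low <= t <= t_high and v_low <= v <= v_high
    ]


# Hypothetical mask: the middle 40 percent of the bit, +/- 0.2 volts.
mask = (0.3, 0.7, -0.2, 0.2)

open_eye = [(0.5, 0.45), (0.5, -0.4), (0.1, 0.0), (0.9, 0.05)]
closing_eye = open_eye + [(0.35, 0.15)]

print(mask_violations(open_eye, mask))     # [] -> the link should work
print(mask_violations(closing_eye, mask))  # [(0.35, 0.15)] -> mask hit
```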
Chapter 22

Reconfigurable Computing

Dynamically reconfigurable logic

The advent of SRAM-based FPGAs presented a new capability to the electronics fraternity: dynamically reconfigurable logic, which refers to designs that can be reconfigured on the fly while remaining resident in the system. Just to recap, FPGAs contain a large amount of programmable logic and registers, which can be connected together in different ways to realize different functions. SRAM-based variants allow the main system to download new configuration data into the device. Although all of the logic gates, registers, and SRAM cells forming the FPGA are created on the surface of a single piece of silicon substrate, it is sometimes useful to visualize the device as comprising two distinct strata: the logic gates/registers and the programmable SRAM configuration cells (Figure 22-1). The versatility of these devices opened the floodgates to a wealth of possibilities. For example, when the system is first powered up, the FPGAs can be configured to perform a variety of system-test (and even self-test) operations. Once the system checks out, the FPGAs can be reconfigured to perform their main function in life.

Dynamically reconfigurable interconnect

Although it’s great to be able to reconfigure the function of the individual devices on a circuit board, there are occasions when design engineers would like to create board-level systems that can be reconfigured to perform a variety of radically different functions.
Figure 22-1. Dynamically reconfigurable logic: SRAM-based FPGAs. (a) Unconfigured, with uninitialized SRAM cells; (b) configured, with the SRAM cells loaded with 0s and 1s from the configuration data stream.

The solution is to be able to configure the board-level connections between devices dynamically. A breed of devices offers just this capability: field-programmable interconnect devices (FPIDs), which may also be known as field-programmable interconnect chips (FPICs; FPIC is a trademark of Aptix Corporation, www.aptix.com). These devices, which are used to connect logic devices together, can be dynamically reconfigured in the same way as standard SRAM-based FPGAs. Because each FPID may have 1,000 or more pins, only a few such devices are typically required on a circuit board (Figure 22-2). One interesting point is that the concepts discussed here are not limited to board-level implementations. Any of the technologies discussed thus far may also potentially be implemented in hybrids, multichip modules (MCMs), and SoC devices.

Figure 22-2. Dynamically reconfigurable interconnect: SRAM-based FPIDs (a small number of FPIDs connect the FPGAs and other components on the board).

Reconfigurable computing

As with many things in electronics, the term reconfigurable computing (RC) can mean different things to different people. For some, it refers to special microprocessors whose instruction sets can be augmented or modified on the fly. For our purposes here, however, we understand RC to refer to a piece of general-purpose hardware, such as an FPGA (what a surprise), that can be configured to perform a specific task, but that can subsequently be reconfigured on demand to carry out other tasks.
One limitation with the majority of SRAM-based FPGAs is the time it takes to reconfigure them. This is because they are typically programmed using a serial data stream (or a parallel stream only 8 bits wide). When we start to talk about high-end devices with tens of millions of SRAM configuration cells, it can take up to a couple of seconds to reprogram these beasts. There have been some FPGAs that address this issue by using large numbers of general-purpose I/O pins to provide a wide configuration bus (say, 256 bits) before reverting to their main I/O functionality (see also Chapter 26). Also, some flavors of field-programmable node arrays (FPNAs) have dedicated wide programming buses (see also Chapter 23).

Another limitation with traditional FPGA architectures is that, when you wish to reconfigure any part of the device, you typically have to reprogram the entire device (some recent architectures do allow you to reconfigure them on a column-by-column basis, as discussed in Chapter 14, but this offers only a rather coarse level of granularity). Furthermore, it is usually necessary to halt the operation of the entire circuit board while these devices are being reconfigured. Additionally, the contents of any registers in the FPGAs are irretrievably lost during the process. In order to address these issues, an interesting flavor of FPGA was introduced by Atmel Corporation (www.atmel.com) circa 1994.
In addition to supporting the dynamic reconfiguration of selected portions of the internal logic, these devices also featured:

■ No disruption to the device’s inputs and outputs
■ No disruption to the system-level clocking
■ The continued operation of any portions of the device that are not undergoing reconfiguration
■ No disruption to the contents of internal registers during reconfiguration, even in the area being reconfigured

The latter point is of particular interest because it allows one instantiation of a function to hand over data to the next function. For example, a group of registers may initially be configured to act as a binary counter. Then, at some time determined by the main system, the same registers may be reconfigured to operate as a linear feedback shift register (LFSR; LFSRs are introduced in detail in Appendix C) whose seed value is determined by the final contents of the counter before it is reconfigured. Although these devices were evolutionary in terms of technology, they were revolutionary in terms of their potential. To reflect their new capabilities, appellations such as “virtual hardware” and “cache logic” (Cache Logic is a trademark of Atmel Corporation, San Jose, CA, USA) were quickly coined.

The term virtual hardware is derived from its software equivalent, virtual memory, and both are used to imply something that is not really there. In the case of virtual memory, a computer’s operating system pretends that it has access to more memory than is actually available. For example, a program running on the computer may require 500 megabytes to store its data, but the computer may have only 128 megabytes of memory available. To get around this problem, whenever the program attempts to access a memory location that does not physically exist, the operating system performs a sleight of hand and exchanges some of the contents in the memory with data on the hard disk.
Although this practice, known as swapping, tends to slow things down, it does allow the program to perform its tasks without having to wait while someone runs down to the store to buy some more memory chips.

Similarly, the term cache logic is derived from its similarity to the concept of cache memory, in which high-speed, expensive SRAM is used to store active data, while the bulk of the data resides in slower, lower-cost memory devices such as DRAM. (In this context, “active data” refers to data or instructions that a program is currently using or that the operating system believes the program will want to use in the immediate future.)

In fact, the concepts behind virtual hardware are actually quite easy to understand. Each large macrofunction in a device is usually formed by the combination of a number of smaller microfunctions, such as counters, shift registers, and multiplexers. Two things become apparent when a group of macrofunctions is divided into their respective microfunctions. First, functionality overlaps, and an element such as a counter may be used several times in different places. Second, there is a substantial amount of functional latency, which means that at any given time only a portion of the microfunctions are active. Put another way, relatively few microfunctions are in use during any given clock cycle. Thus, the ability to reconfigure individual portions of a virtual hardware device dynamically means that a relatively small amount of logic can be used to implement a number of different macrofunctions.
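The hand-over between microfunctions mentioned earlier (a group of registers that starts life as a binary counter and is then reconfigured as an LFSR seeded by the counter's final contents) can be modeled in software. The Python sketch below is purely illustrative: the 8-bit width, the LFSR tap positions, and the class and function names are all assumptions made for the example, and no real device's configuration mechanism is being modeled.

```python
WIDTH = 8
MASK = (1 << WIDTH) - 1


def counter_step(state):
    """Next state when the registers are configured as a binary counter."""
    return (state + 1) & MASK


def lfsr_step(state):
    """Next state when the same registers are configured as an LFSR.

    Bits 7, 5, 4, and 3 are XORed and shifted into bit 0; these taps
    are an arbitrary choice for illustration.
    """
    feedback = ((state >> 7) ^ (state >> 5) ^ (state >> 4) ^ (state >> 3)) & 1
    return ((state << 1) & MASK) | feedback


class RegisterBank:
    """A group of registers whose function can change, but whose contents persist."""

    def __init__(self):
        self.state = 0
        self.step = counter_step

    def reconfigure(self, step):
        self.step = step  # the register contents are deliberately left alone

    def clock(self):
        self.state = self.step(self.state)
        return self.state


bank = RegisterBank()
for _ in range(5):
    bank.clock()                  # counts 1, 2, 3, 4, 5
seed = bank.state                 # 5: final contents of the counter

bank.reconfigure(lfsr_step)       # same registers, new function
lfsr_values = [bank.clock(), bank.clock()]
print(seed, lfsr_values)          # 5 [10, 21]
```

The point of the sketch is the `reconfigure` method: it swaps the function without touching the stored value, so the counter's final count becomes the LFSR's seed.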
By tracking the occurrence and usage of each microfunction, then consolidating functionality and eliminating redundancy, virtual hardware devices can perform far more complex tasks than their available logic gates would suggest. For example, in a complex function requiring 100,000 equivalent gates, only 10,000 gates may be active at any one time. Thus, by storing, or caching, the functions implemented by the extra 90,000 gates, a small, inexpensive 10,000-gate device can be used to replace a larger, more expensive 100,000-gate component (Figure 22-3).

Figure 22-3. Virtual hardware (configuration data for functions A, B, and C is stored in a memory device; the inactive function B is overwritten with the new function C).

Theoretically, it would be possible to compile new design variations in real time, which may be thought of as dynamically creating subroutines in hardware! RC was a big buzz in the latter half of the 1990s, and there are still some who are waving the RC banner (and wearing the T-shirts). Sad to relate, however, nothing really came of this with the exception of highly specialized applications. The core problem is that traditional FPGA architectures are too fine grained, so reconfiguring them takes too long (in computer terms). In order to support true RC, one would need access to devices that could be reconfigured hundreds of thousands of times per second. The answer may be the coarser-grained architectures fielded by the FPNAs introduced in Chapter 23.
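A quick back-of-the-envelope calculation shows why configuration bandwidth is the sticking point. The Python sketch below assumes a device with 20 million configuration bits and a 50-megahertz configuration clock; both numbers are invented for illustration (real devices and their programming modes vary widely), and it ignores any protocol overhead.

```python
def reconfig_time_s(config_bits, bus_width_bits, clock_hz):
    """Time to load a full configuration, one bus-width word per clock."""
    cycles = -(-config_bits // bus_width_bits)  # ceiling division
    return cycles / clock_hz


CONFIG_BITS = 20_000_000   # assumed: "tens of millions" of SRAM cells
CLOCK_HZ = 50_000_000      # assumed configuration clock

# Time available per reconfiguration at 100,000 reconfigurations per second.
budget_s = 1 / 100_000

for width in (1, 8, 256):
    t = reconfig_time_s(CONFIG_BITS, width, CLOCK_HZ)
    print(f"{width:>3}-bit bus: {t * 1e3:8.4f} ms "
          f"({t / budget_s:,.0f}x the 10-microsecond budget)")
```

Under these assumptions, even the 256-bit bus takes more than a millisecond for a full reload, which is over a hundred times too slow, so wider buses alone do not get a fine-grained device to true RC.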
Chapter 23

Field-Programmable Node Arrays

Introduction

Before we throw ourselves into this topic with wild abandon, it’s probably only fair to note that the term field-programmable node array, or FPNA, was coined by the author and is not industry-standard terminology (yet).

Fine-, medium-, and coarse-grained architectures

When it comes to categorizing different IC architectures, ASICs are usually said to be fine grained, because design engineers can specify their functionality down to the level of individual logic gates. By comparison, the majority of today’s FPGAs may be classed as medium grained because they consist of small blocks (“islands”) of programmable logic (where each block represents a number of logic gates and registers) in a “sea” of programmable interconnect. (Even though today’s FPGA offerings typically include processor cores, blocks of memory, and embedded functions like multipliers, the main underlying architecture is as described above.) Truth to tell, many engineers would actually refer to FPGAs as being coarse grained, but classing them as medium grained makes much more sense when we start to bring FPNAs into the picture because these boast really coarse-grained architectures.

The underlying concept behind FPNAs is that they are formed from an array of nodes, each of which is a sophisticated processing element (Figure 23-1).

Figure 23-1. Generic representation of an FPNA (processing nodes linked by interconnect).

Of course, this is a very simplified representation of an FPNA, not the least because it omits any I/O. Furthermore, we’ve only shown relatively few processing nodes, but such a device can potentially contain hundreds or thousands of nodes.
Depending on the vendor, each node might be an arithmetic logic unit (ALU), a complete microprocessor CPU, or an algorithmic processing element (this latter case is discussed in more detail later in this chapter). At the time of this writing, 30 to 50 companies are seriously experimenting with different flavors of FPNAs; a representative sample of the more interesting ones is as follows:

Company                     Web site            Comment
Elixent                     www.elixent.com     ALU-based nodes
IPflex                      www.ipflex.com      Operation-based nodes
Motorola                    www.motorola.com    Processor-based nodes
PACT XPP Technologies AG    www.pactxpp.com     ALU-based nodes
picoChip Designs            www.picochip.com    Processor-based nodes
QuickSilver Technology      www.qstech.com      Algorithmic element nodes

For the purposes of these discussions, we shall concentrate on just two of these vendors, picoChip and QuickSilver, who are conceptually at opposite ends of the spectrum:

picoChip’s picoArray devices are formed from arrays of processors. Their key application area is large, fixed installations such as base stations for wireless networks in which power consumption is not a major consideration. Furthermore, these chips are intended to be reconfigured now and again (for example, every hour or so as cellular phone usage profiles change throughout the day).

By comparison, QuickSilver’s adaptive computing machine (ACM) devices are formed from clusters of algorithmic element nodes. Their key application area is small, low-power, handheld products like cameras and cell phones (although they are of interest for a wide variety of other applications). Furthermore, these chips can be reconfigured (QuickSilver prefers the term adapted) hundreds of thousands of times per second.

Algorithmic evaluation

FPNAs are mainly intended to execute sophisticated, compute-intensive algorithms. This means that before we go any further, we should spend a few moments ruminating on these algorithms to set the scene for what is to come.
At one end of the spectrum are word-oriented algorithms, such as the extremely compute-intensive time division multiple access (TDMA) algorithm used in digital wireless transmission. Any variants such as Sirius, XM Radio, EDGE, and so forth form a subset of this algorithmic class, so an architecture that can handle high-end TDMA should also be able to handle its less-sophisticated cousins (Figure 23-2).

At the other end of the continuum, we find bit-oriented algorithms, such as wideband code division multiple access (W-CDMA), and its subvariants, such as CDMA2000, IS-95A, and the like. (W-CDMA is used for the wideband digital radio communications of Internet, multimedia, video, and other capacity-demanding applications.) And then there are algorithms that exhibit different mixes of word-oriented and bit-oriented components, such as the various flavors of MPEG, voice and music compression, and so forth.

Figure 23-2. A simplified view of algorithm space (labels in the figure: word-oriented, bit-oriented, TDMA, Sirius, XM Radio, EDGE, GPRS, GSM, MPEG4, MPEG2, GPS, WLAN, music compression, voice compression, IS-95A, CDMA2000, and W-CDMA).

When one evaluates these various algorithms, it quickly becomes apparent that conventional RC approaches tend to attack the problem at inappropriate levels (RC concepts were introduced in Chapter 22). For example, some RC approaches engage problems at too micro a level, that is, at the level of individual gates or FPGA blocks. Coupled with hideously difficult application programming, this power-hungry approach results in relatively long reconfiguration times, thereby making it unsuitable for some applications.
By comparison, other approaches tackle the problem at too macro a level, that is, at the level of entire applications or algorithms, which results in inefficient use of resources. Perhaps not surprisingly, it soon becomes apparent that algorithms are heterogeneous in nature, which means that if you take a bunch of diverse algorithms, their constituent elements are wildly different. Based on this, the obvious solution is to use heterogeneous architectures that fully address the heterogeneous nature of the algorithms they are required to implement, but what might these little scamps look like?

picoChip’s picoArray technology

In order to address the processing requirements of the algorithms discussed above, picoChip came up with a device called a picoArray. The heterogeneous node-based architecture of the picoArray features a matrix of different flavors of reduced instruction set computing (RISC) processors. These 16-bit devices are optimized in a variety of different ways: for example, one processor type may have lots of memory, while another will support special algorithmic instructions that can perform operations like “spread” and “despread” from the CDMA wireless standard using a single clock cycle (as opposed to 40 cycles using a general-purpose processor). In the first incarnation of these devices, each processor node was approximately equivalent (in processing capability, not in architecture) to an ARM9 for control-style applications or a TI C54xx for DSP-style applications. When you take into account the fact that a single picoArray can contain hundreds of such nodes, the result is a truly ferocious amount of processing power. As one example, when I first became aware of the picoArray technology around December 2002, one of the absolute top-of-the-line dedicated DSP chips in the world at that time was the TMS320C6415 from Texas Instruments.
That bad boy could perform such a humongous number of calculations at such a breathtaking speed that it made your eyes water. However, picoChip claims that a single picoArray running at only 160 megahertz could deliver almost 20 times more processing power (measured in 16-bit ALU MOPS) than a TMS320C6415 running at 600 megahertz. Wow!

An ideal picoArray application: Wireless base stations

Cell phone companies spend billions and billions of dollars every year on wireless infrastructure, and a large portion of these funds is devoted to developing the digital baseband processing portions of wireless base stations. Depending on its location, each base station has to be capable of processing tens or hundreds of channels simultaneously. Not surprisingly, there is a huge drive to reduce the cost of implementing each channel. The fact that a single picoArray can replace a number of traditional ASICs, FPGAs, and DSPs offers a way of dramatically reducing the cost of each base station channel.

In fact, one of the problems with conventional solutions is that they require at least three design environments: ASIC and/or FPGA, DSP, and RISC (where the latter refers to some microprocessor-type functionality; RISC is pronounced to rhyme with “lobster bisque”). All of this complicates development and test and slows the base station’s time-to-market, which is not considered to be a good thing (Figure 23-3a). By comparison, a major advantage of a picoArray-based solution is that it largely consolidates everything into a single design environment (Figure 23-3b).

Figure 23-3. Conventional devices versus a picoArray approach.

Furthermore, in the case of conventional solutions, although ASICs can provide extremely high performance, they are very expensive to develop and they have long design cycles.
Even worse, algorithms implemented in ASICs are effectively “frozen in silicon.” This is a major problem because wireless standards are evolving so quickly that, by the time an ASIC design has actually been implemented, it may already be obsolete (honestly, this happens way more often than you might imagine). By comparison, in the case of the picoArray-based approach, the fact that every processor node on the device is fully programmable means that each channel can be easily reconfigured to adapt to hourly changes in usage profiles, to weekly enhancements and bug fixes, and to monthly evolutions in wireless protocols. Thus, a base station based on picoArray technology will have a much longer life in the field, thereby reducing operating costs.

The picoArray design environment

The underlying functionality to be mapped onto the processor nodes in a picoArray is captured in pure C code or in assembly language. As we discussed in Chapter 11, C is a sequential language, so we need some way to describe any parallel processing requirements. As opposed to using one of the augmented C/C++ techniques mentioned in Chapter 11, the folks at picoChip have taken another approach, which is to employ a VHDL framework to capture the structure of the design, including any parallel processing requirements, and to connect design modules together at the block level. C or assembly code is then used to implement the internals of each module.

Another interesting aspect of the picoChip solution is the fact that they provide a complete library of programming/configuration modules that can be hooked together to implement a fully functioning base station (users can also tweak individual modules to implement their own algorithm variations, thereby gaining a competitive advantage).
Around May 2003, picoChip announced that they had achieved a “world first” by using this library to implement a 3GPP-compliant carrier-class base station and to make a 3G call on that base station! Since that time, they have continued to progress in leaps and bounds, so you’ll have to visit their Web site at www.picochip.com to apprise yourself of the current state of play.

QuickSilver’s ACM technology

For several years now, the guys and gals at QuickSilver have been in “secret squirrel” mode working on their version of an FPNA (although I’m sure they are going to moan and groan about this appellation). Based on what I know (which is more than they think I know … at least I think it is), it’s fair to say that QuickSilver’s technology, which they call an adaptive computing machine, or ACM (pronounced by spelling it out as “A-C-M”), boasts a truly revolutionary heterogeneous node-based architecture and interconnect structure (Figure 23-4).

Figure 23-4. The ACM’s architecture (algorithmic element nodes and a matrix interconnect network, or MIN, form a Level 1 cluster; these are grouped into Level 2 and Level 3 clusters).

At the lowest level we have an algorithmic element node. Four of these nodes, forming a “quad,” are gathered together with a matrix interconnect network (MIN) to form what we might call a Level 1 cluster. Four of these Level 1 clusters can be grouped to form a Level 2 cluster, and so forth.

At the time of this writing, there are a variety of different types of algorithmic element nodes (we’ll talk about how these node types are mapped into the quads in a little while). We aren’t going to delve into the guts of every node here, but it’s important to understand that each such node performs tasks at the level of complete algorithmic elements.
For example, an arithmetic node can be used to implement different (variable-width) linear arithmetic functions such as a FIR filter, a discrete cosine transform (DCT), an FFT, and so forth. Such a node can also be used to implement (variable-width) nonlinear arithmetic functions such as ((1/sine A) (1/x)) to the 13th power. Similarly, a bit-manipulation node can be used to implement different (variable-width) bit-manipulation functions, such as a linear feedback shift register (LFSR), Walsh code generator, GOLD code generator, TCP/IP packet discriminator, and so forth.

Each node is surrounded by a wrapper, which makes all of the nodes appear to be identical to the outside world (that is, to the world outside the node). This wrapper is in charge of accepting packets of information (instructions, raw data, configuration data, etc.) from the outside world, unpacking this data, distributing it throughout the node, managing tasks, gathering and packing the results together, and presenting these results back to the outside world.

The concept of the wrapper isolating the node from the outside world and making all of the nodes appear to be identical is of especial interest when we come to realize that each node is “Turing complete.” This means that you can present any node with any problem (say, an arithmetic node with a bit-manipulation task) and that node will solve the problem, although less efficiently than would a more appropriate type of node. Furthermore, QuickSilver also allows you to create your own types of nodes, where you define the core of the node and surround it with QuickSilver’s wrapper.

Good grief! Trying to work out how best to wend our weary way through the complexities of all of this is making my brain ache. One key point is that any part of the device, from a few nodes all the way up to the full chip, can be adapted blazingly fast, in many cases within a single clock cycle. Also of interest is the fact that approximately 75 percent of each node
Also of interest is the fact that approximately 75 percent of each node ■ 1962: First commercial touch-tone phone system. 389 390 ■ The Design Warrior's Guide to FPGAs 1963: America. The LINC computer is designed at MIT. is in the form of local memory. This allows for a radical change in the way in which algorithms are implemented. As opposed to passing data from function to function, the data can remain resident in a node while the function of the node changes on a clock-by-clock basis. It also means that, unlike an ASIC implementation in which each algorithm requires its own dedicated silicon, the ACM’s ability to be adapted tens or hundreds of thousands of times per second means that only those portions of an algorithm that are actually being executed need to be resident in the device at any one time (see also the discussions on SATS later in this chapter). This provides for tremendous reductions in silicon area and power consumption. You define the mix of nodes I’m not quite sure where to squeeze this topic in, so we’ll give it a whirl here to see how well it flies. Just a little while ago, we noted that there are various types of algorithmic element nodes. We also noted that each cluster is formed from a quad of these nodes gathered together with a MIN. Based on this, I’m sure that you are wondering how the node types are assigned across multiple clusters. Well, the point is that the folks at QuickSilver don’t actually make and sell chips themselves (apart from proof-ofconcept and evaluation devices of course). Instead, they license their ACM technology to anyone who is interested in playing with it, thereby allowing you (the end user) to determine the optimum mix of node types for your particular application and then have chips fabricated to your custom specifications. The fact that their wrappers make each node appear identical to the outside world makes it easy to exchange one type of node for another! The system controller node, input/ output nodes, etc. 
In addition to the structure shown in Figure 23-4, each ACM also includes a gaggle of special-purpose nodes, such as system controller, external memory controller, internal memory controller, and I/O nodes. In the case of the latter, each I/O node can be used to implement I/O tasks in such forms as a UART or bus interfaces such as PCI, USB, Firewire, and the like (as for the algorithmic element nodes, the I/O nodes can be reconfigured on a clock-by-clock basis as required). Furthermore, the I/O nodes are also used to import configuration data, which means that each ACM can have as wide a configuration bus as the total number of input pins if required.

We will consider how applications are created for, and executed on, ACMs shortly. For the nonce, it is only important to note that almost everything that makes life difficult with other implementation technologies is handled transparently to the ACM design engineer. For example, each ACM has an on-chip operating system (OS), which is distributed across the system controller node and the wrappers associated with each of the algorithmic element nodes. The individual algorithmic element nodes take care of scheduling their tasks and any internode communications. This leaves the system controller node relatively unloaded because its primary responsibilities are limited to knowing which nodes are currently free and to allocating new tasks to those nodes.

From Figure 23-4, it is obvious that the core ACM architecture is extremely scalable. Things start to get really clever if you have multiple ACMs on a board: their operating systems link up and, to the rest of the system, they appear to function as a single device.
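To give a feel for how light the system controller's job is, here is a toy Python model of it tracking free nodes and handing out tasks. Everything here is hypothetical: the class, the node names, and in particular the policy of preferring a node whose type matches the task are assumptions made for the sketch, not a description of QuickSilver's actual operating system. The fallback to any free node reflects the point made earlier that any node can tackle any task, albeit less efficiently.

```python
class SystemController:
    """Toy model: knows which nodes are free and allocates tasks to them."""

    def __init__(self, node_types):
        # node_types maps a node name to its type, e.g. "arithmetic" or "bit"
        self.node_types = dict(node_types)
        self.free = set(node_types)

    def allocate(self, task_type):
        """Return a node for the task, or None if every node is busy.

        Assumed policy: prefer a free node of the matching type,
        otherwise fall back to any free node.
        """
        candidates = sorted(self.free)
        if not candidates:
            return None
        matching = [n for n in candidates if self.node_types[n] == task_type]
        node = (matching or candidates)[0]
        self.free.remove(node)
        return node

    def release(self, node):
        """Mark a node as free again once its task has finished."""
        self.free.add(node)


# One hypothetical quad: two arithmetic nodes and two bit-manipulation nodes.
quad = {"n0": "arithmetic", "n1": "bit", "n2": "arithmetic", "n3": "bit"}
controller = SystemController(quad)

allocations = [controller.allocate(t) for t in ("bit", "bit", "bit", "arithmetic")]
print(allocations)                        # ['n1', 'n3', 'n0', 'n2']
print(controller.allocate("arithmetic"))  # None: all four nodes are busy
```

Note that the third bit-manipulation task lands on an arithmetic node (`n0`) because both bit-manipulation nodes are already busy.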
OS is pronounced by spelling it out as “O-S.”

Spatial and temporal segmentation

One of the most important features of the ACM architecture is its ability to be reconfigured hundreds of thousands of times per second while consuming very little power. This allows ACMs to support the concept of spatial and temporal segmentation (SATS). In many cases, different algorithms, and even different portions of the same algorithm, can be performed at different times. SATS refers to the process of reconfiguring dynamic hardware resources to rapidly perform the various portions of the algorithm in different segments of time and in different locations (nodes) on the ACM.

SATS is pronounced to rhyme with “bats.”

1963: Philips introduces first audio cassette.

As a simple example, consider that some operations on a wireless phone are modal, which means they only need to be performed some of the time. The three main modes are acquisition, idle, and traffic. The acquisition mode refers to the cell phone locating the nearest base station. When in idle mode, the phone keeps track of the base station it’s hooked up to and monitors the paging channel, looking for a signal that says, “Wake up because a call is being initiated.” The traffic mode has two variations: receiving or transmitting. Although you may think you are talking and listening simultaneously, you actually are only doing one or the other at any particular time on a digital phone.

In the case of a wireless phone based on conventional IC technologies, each of these baseband processing functions requires its own silicon chip or some area on a common chip. This means that even when a function isn’t being used, it still occupies silicon real estate, which translates into higher cost and higher power consumption that drains your battery faster.
By comparison, a phone based on ACM technology would require only a single chip that can be adapted on the fly to perform each baseband function as required.

But this is only the beginning. In many cases, each of these major functions is composed of a suite of algorithms, which can themselves be performed at different times. For example, consider a highly simplified representation of a wireless phone receiving and processing a signal (figure 23-5). The incoming signal consists of a series of highly compressed blocks of data, each occupying a tiny segment of time. This data proceeds through a series of algorithms, each of which performs some processing on the data and downshifts it to a lower frequency.

Figure 23-5. Highly simplified representation of a wireless phone receiving and processing a signal. (Diagram labels: RF to IF and IF to baseband conversion; baseband processing, traffic (receive); Algorithm 1, Algorithm 2, Algorithm 3, and so on to Algorithm ‘n’, each occupying its own slice of time within one frame. RF = radio frequency; IF = intermediate frequency.)

1965: John Kemeny and Thomas Kurtz develop the BASIC computer language.

A key feature of this process is that each algorithmic stage occupies a different fragment of time. In traditional ASIC implementations, each function would occupy its own chip or its own area of silicon real estate on a common device. This results in a significant waste of available resources (space and power consumption) because only a limited number of functions are actually being exercised at any particular time.

Once again, the solution is the ACM, which can be adapted on the fly to perform each algorithm as required. This concept of on-demand hardware results in the most efficient use of hardware in terms of cost, size (silicon real estate), performance, and power consumption (ACMs are claimed to provide 10 to 100 times or more performance increase over comparable solutions at only 1/2 to 1/20 of the power consumption).
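To make the idea of data staying put while the function changes a little more concrete, here is a minimal Python sketch. It is only a toy software analogy, not QuickSilver code: the Node class, its memory list, and the three stage functions are all hypothetical names invented for this illustration.

```python
from typing import Callable, List


class Node:
    """Toy stand-in for one node: the data is resident, the function is swapped."""

    def __init__(self, data: List[int]) -> None:
        self.memory = list(data)      # local memory: the data never leaves the node
        self.reconfigurations = 0     # how many times the node's function changed

    def run(self, function: Callable[[List[int]], List[int]]) -> None:
        """'Reconfigure' the node with a new function and apply it to resident data."""
        self.reconfigurations += 1
        self.memory = function(self.memory)


# Three hypothetical algorithm stages, each needed in a different segment of time.
def scale(samples: List[int]) -> List[int]:
    return [2 * s for s in samples]


def offset(samples: List[int]) -> List[int]:
    return [s + 1 for s in samples]


def clip(samples: List[int]) -> List[int]:
    return [min(s, 6) for s in samples]


node = Node([1, 2, 3])
for stage in (scale, offset, clip):   # one time segment per stage, same node throughout
    node.run(stage)
```

Only one of the three functions is ever "resident" at a time, which is the essence of the silicon saving described above; a real ACM does this in hardware rather than by swapping Python callables.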
Creating and running applications on an ACM

Of course, the next big question is how would one go about creating applications for one of these little rapscallions? Well, QuickSilver’s design flow is built on a C-based system-design language called SilverC (this language is similar in concept to the augmented C/C++ languages introduced in chapter 11).

1967: America. Fairchild introduces an integrated circuit called the Micromosaic (the forerunner of the modern ASIC).

SilverC preserves traditional C syntax and control structures, which makes it easy for C programmers and DSP designers to use and simplifies legacy C code conversion. SilverC also includes special module, pipe, and process keywords/extensions that facilitate dataflow representations and support parallel programming. Furthermore, SilverC provides special extensions for DSP programming, such as circular pointers for efficient use of DAG resources, fixed-width integer and fixed-point data types, support for saturated and nonsaturated types, and so forth.

SilverC representations can be captured and simulated much faster than the equivalent HDL representations (Verilog and VHDL) used in traditional ASIC and FPGA design flows. Once a SilverC representation has been simulated and verified, it is compiled into an executable (binary) Silverware application. The ACM’s on-chip operating system loads only those portions of a Silverware application that are required at any particular time, and multiple Silverware applications can run concurrently on an ACM.

It’s important to note that when a Silverware application is created, it doesn’t need to know which type of ACM chip is being used (including the mix of node types, etc.) or, indeed, how many ACM chips are available on the board. The on-chip ACM operating system takes care of handling any pesky details of this sort.
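For readers who haven’t met saturated types or circular pointers before, the following Python sketch shows the behavior these DSP extensions are meant to provide. This is emphatically not SilverC syntax; the function and class names (saturating_add, wrapping_add, CircularBuffer) are hypothetical and exist only for this illustration.

```python
from typing import List


def saturating_add(a: int, b: int, bits: int = 16) -> int:
    """Signed add that clamps at the largest/smallest representable value."""
    lowest = -(1 << (bits - 1))
    highest = (1 << (bits - 1)) - 1
    return max(lowest, min(highest, a + b))


def wrapping_add(a: int, b: int, bits: int = 16) -> int:
    """Signed add that wraps around (two's complement), the nonsaturated case."""
    total = (a + b) & ((1 << bits) - 1)
    return total - (1 << bits) if total >= (1 << (bits - 1)) else total


class CircularBuffer:
    """Fixed-size delay line whose write pointer wraps back to the start."""

    def __init__(self, size: int) -> None:
        self.data: List[int] = [0] * size
        self.head = 0                       # next position to overwrite

    def push(self, value: int) -> None:
        self.data[self.head] = value
        self.head = (self.head + 1) % len(self.data)   # the "circular pointer"

    def oldest_to_newest(self) -> List[int]:
        return self.data[self.head:] + self.data[:self.head]


delay_line = CircularBuffer(4)
for sample in range(1, 7):                  # push 1..6 into a 4-entry buffer
    delay_line.push(sample)
```

With 16-bit values, adding 30000 and 10000 saturates at 32767 but wraps to a negative number in the nonsaturated case, which is why audio and filter code usually wants the former.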
But wait, there’s more

In our discussions on DSP-based design flows in chapter 12, we introduced the concept of system-level design and simulation environments such as Simulink from The MathWorks (www.mathworks.com). This tool, which has a wide base of users, encourages dataflow-oriented design and provides an excellent mapping to the ACM architecture.

Well, the lads and lasses at QuickSilver have been working furiously on integrating SilverC with Simulink. At the simplest level, you can use Simulink to describe the various blocks and the dataflow connections between them, and then automatically output a top-level framework of the design containing the module instantiations and the pipes connecting them together. In this case, you would then go into the framework to code the SilverC processes by hand.

Alternatively, QuickSilver has developed a library of SilverC modules that map onto existing Simulink blocks. This library includes widely used DSP components, filters, encoders, decoders, and bit and word manipulators. These SilverC modules can be used for functional and cycle-accurate simulation, and, on compilation into a Silverware executable, they can be mapped directly onto the ACM’s dynamic hardware resources.

It’s silicon, Jim, but not as we know it!

As you have probably surmised, I’m quite excited about the possibilities of FPNAs in general and QuickSilver’s offering in particular. So, does this mean the end of ASICs and FPGAs as we know them? Of course not! FPNAs are particularly well suited to a variety of application areas, but there is no such thing as an “all-singing, all-dancing, one-size-fits-all” chip architecture that can do everything well (and makes your teeth whiter as a by-product). In the real world, FPNAs are just one more weapon in the system architect’s arsenal.
On the other hand, based on everything that has gone before, it wouldn’t surprise me to see both ASICs and FPGAs with embedded FPNA cores appearing on the scene at some time in the not-so-distant future. Alternatively, as was noted earlier, QuickSilver allows you to create your own types of nodes, where you define the core of the node and surround it with QuickSilver’s wrapper. So, another alternative is to use the main ACM fabric as supplied by QuickSilver, but to include some nodes implemented as FPGA fabric. And if and when any of this comes to pass, you can bet your little cotton socks that I’ll be there, gesticulating furiously and shouting, “I told you so!”

1967: Dolby eliminates audio hiss.

Chapter 24

Independent Design Tools

Introduction

When it comes to design tools such as logic simulators, synthesis technology, and so forth, we mostly look to the big, full-line EDA companies, to smaller EDA companies who are focused on a particular aspect of the design flow, or to the FPGA vendors themselves. However, we shouldn’t forget the guys and gals working in the open-source arena (see also Chapter 25). Furthermore, small FPGA design consultancy firms often spend considerable time and effort creating niche tools to help with their internal development projects. Occasionally, these tools are so useful that they end up being productized and become available to the outside world. In this chapter, we briefly introduce a brace of such tools.

ParaCore Architect

Dillon Engineering (www.dilloneng.com) offers a variety of custom design services, with particular emphasis on FPGA-based DSP algorithms and high-bandwidth, real-time digital signal and image processing applications. Toward the end of the 1990s, their engineers became conscious that they were constantly reinventing and reimplementing things like floating-point libraries, convolution kernels, and FFT processors.
Thus, in order to make their lives easier, they developed a tool called ParaCore Architect™, which facilitates the design of IP cores. The process begins by creating a source file containing a highly parameterized description of the design at an extremely high level of abstraction using a Python-based language (Python is introduced in more detail in chapter 25). ParaCore Architect takes this description, combines it with parameter values specified by the user, and generates an equivalent HDL representation, a cycle-accurate C/C++ model to speed up simulation-based verification, and an associated testbench (Figure 24-1).

Figure 24-1. ParaCore Architect generates RTL, C/C++, and an associated testbench. (Diagram labels: highly parameterized description of design, which is non-implementation-specific, easy to create, and easy to modify; user-specified parameters; ParaCore Architect; VHDL RTL; C/C++ model; testbench; synthesis; LUT/CLB-level netlist; logic simulation; simulation results.)

The ensuing HDL is guaranteed suitable for use with any simulation and synthesis environment, so it isn’t necessary to run any form of HDL rule-checking program. The beauty of this type of highly parameterized representation is that it’s extremely easy to target it toward a new application or an alternative device.

Generating floating-point processing functions

FPU is pronounced by spelling it out as “F-P-U.”

As one simple example of the use of ParaCore Architect, a number of FPGA vendors now supply devices containing embedded microprocessor cores. Sad to relate, these typically do not come equipped with an associated floating-point unit (FPU). This means that, should the designers wish to perform floating-point operations, they either have to do this in software (which is horrendously time-consuming) or they have to do it in hardware.
In the latter case, this will take a lot of effort that could be better spent creating the fun part of the design.

For this reason, one of the ParaCore Architect design descriptions can be used to generate corresponding floating-point cores. Different parameters can be used to define whatever exponent and mantissa precisions are required, how many pipeline stages to use, whether or not to handle IEEE floating-point special cases like infinity (some applications don’t require these special cases), the type of microprocessor core being used (so as to create an appropriate interface block), and so forth.

Generating FFT functions

FFT is pronounced by spelling it out as “F-F-T.”

A good example of the power of ParaCore Architect is demonstrated by the design description used to generate FFT cores. The smallest computational element used to generate an FFT is called a butterfly, which consists of a complex multiplication, a complex addition, and a complex subtraction (Figure 24-2).

Figure 24-2. The butterfly is the smallest computational element in an FFT. (Diagram labels: butterfly inputs; twiddle factor generator; complex multiply; complex addition; complex subtraction; butterfly outputs.)

In turn, the complex multiplication requires four simple multiplications and two simple additions, while the complex addition and complex subtraction each require two simple additions. Thus, each butterfly requires a total of four simple multiplications and six simple additions.

1967: America. First handheld electronic calculator invented by Jack Kilby of Texas Instruments.

One real-world image-processing application for this core involved generating a two-dimensional 2k × 2k–point FFT that could handle 120 frames per second (fps).
Processing a single 2,048 (2k) pixel row requires a total of 11,256 butterflies organized in eleven ranks, where the outputs from the butterflies forming the first rank are used to drive the butterflies forming the second rank, and so forth. Thus, processing a single row requires 45,024 simple multiplications and 67,536 simple additions (four and six per butterfly, respectively). In order to generate the FFT for an entire 2k × 2k frame, this process has to be repeated for each of the 2,048 (2k) rows forming the frame. This means that in order to achieve a frame rate of 120 fps, the processing associated with each row must be completed within 4 microseconds. (If the operations were performed one after another, this would lead to a time budget of roughly 90 picoseconds per simple multiplication and 60 picoseconds per simple addition.)

Let’s consider the 11,256 butterfly operations required to implement a 2k-point FFT. If execution time were not a major factor, it would be necessary to use only a relatively small FPGA device with four multiplier blocks, such as a Xilinx Virtex-II XC2V40, to create a single butterfly structure (four simple multipliers and six simple adders) and to cycle all of the butterfly operations through this function. The resulting structure would take 90 microseconds to generate each 2k-point FFT. Although this is extremely respectable, it falls well short of the 4-microsecond time budget required by the image-processing application discussed above.

The easiest way to increase the speed of this algorithm is to increase the number of butterfly structures instantiated in hardware and to perform more of the processing in parallel. In the case of Xilinx XC2V6000 devices with six million system gates, 144 × 18-bit multipliers, and 144 × 18-kilobit RAM blocks, it’s possible to perform an entire 2k × 2k–point FFT fast enough to achieve a system that can process 120 fps.
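The operation counts and time budget quoted above are easy to check with a few lines of Python. The sketch below is an illustration only: it assumes a standard radix-2, decimation-in-time butterfly in which the second input is multiplied by the twiddle factor, and the function names are hypothetical. Note that the textbook count of (N/2) × log2(N) butterflies gives 11,264 for N = 2,048, which is slightly more than the 11,256 quoted in the text; the difference doesn’t alter the conclusions.

```python
from typing import Tuple

MULTS_PER_BUTTERFLY = 4   # four real multiplies inside the complex multiply
ADDS_PER_BUTTERFLY = 6    # two in the complex multiply, two each for add and subtract


def butterfly(a: complex, b: complex, w: complex) -> Tuple[complex, complex]:
    """One radix-2 butterfly written out in terms of real operations."""
    # Complex multiply b * w: 4 real multiplications and 2 real additions.
    t_re = b.real * w.real - b.imag * w.imag
    t_im = b.real * w.imag + b.imag * w.real
    # Complex addition and complex subtraction: 2 real additions each.
    top = complex(a.real + t_re, a.imag + t_im)
    bottom = complex(a.real - t_re, a.imag - t_im)
    return top, bottom


def radix2_butterflies(n_points: int) -> Tuple[int, int]:
    """Return (total butterflies, number of ranks) for a power-of-two FFT."""
    ranks = n_points.bit_length() - 1          # log2(n_points) for powers of two
    return (n_points // 2) * ranks, ranks


def row_budget_seconds(rows_per_frame: int, frames_per_second: int) -> float:
    """Time available per row if every row of every frame gets an equal share."""
    return 1.0 / (rows_per_frame * frames_per_second)


total, ranks = radix2_butterflies(2048)
budget = row_budget_seconds(2048, 120)
mults = total * MULTS_PER_BUTTERFLY
adds = total * ADDS_PER_BUTTERFLY
print(f"{total} butterflies in {ranks} ranks, {budget * 1e6:.2f} us per row")
print(f"{budget / mults * 1e12:.0f} ps per multiply, {budget / adds * 1e12:.0f} ps per add")
```

Either way you count the butterflies, the per-operation budget lands in the tens of picoseconds, which is why the only realistic route is to instantiate many butterflies and run them in parallel.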
The point is that targeting these different devices requires setting only a single ParaCore Architect parameter to specify the number of butterfly structures required to be instantiated in hardware.

As another example, if one were to decide to change the length of the FFT from 2k to 1k points, setting a single parameter takes care of all of the details, including resizing the RAMs used to store any internal results. Similarly, another parameter can be used to select between fixed-point and floating-point math formats (in the latter case, two further parameters are used to specify the size of the exponent and the mantissa).

In early 2002, the folks at Dillon Engineering used ParaCore Architect to create what was possibly the world’s fastest FFT processor at that time. This processor subsequently found use in a variety of environments, such as the SETI project, where it is used to process huge amounts of data from radio telescopes in the search for extraterrestrial intelligence!

A Web-based interface

What is really cool is that Dillon Engineering has made ParaCore Architect available for its clients to use over the Internet. When you’re creating something like an FFT, you often want to experiment with different trade-offs, such as how many bits to store for each point. Now Dillon Engineering clients can visit the www.dilloneng.com Web site, select the type of core they’re interested in, specify a set of parameters, and press the “Go” button to generate the equivalent HDL, C/C++ model, and testbench.

The Confluence system design language

Like most design engineers, I quake when faced with yet another software programming or hardware design language, but Launchbird Design Systems (www.launchbird.com) has come up with a system design language called Confluence, along with an associated Confluence Compiler, that is well worth looking at.

It’s hard to wrap your brain around the many facets of Confluence, but we’ll give it a try.
First of all, Confluence is an incredibly compact language that can be used to create representations of both hardware and embedded software. In the case of hardware, the Confluence Compiler then takes these descriptions and generates the corresponding RTL in VHDL or Verilog (Figure 24-3).

Figure 24-3. A highly simplified representation of the outputs from the Confluence Compiler. (Diagram labels: Confluence design description, which is non-implementation-specific, easy to create, and easy to modify; Confluence Compiler; VHDL RTL; Verilog RTL; synthesis; LUT/CLB-level netlist.)

1969: First radio signal transmitted by “man on the moon.”
1970: America. Ethernet developed at Palo Alto Research Center by Bob Metcalfe and David Boggs.

One way to think about this is that you use an HDL (like VHDL or Verilog) to describe a specific circuit, but you use Confluence to describe an algorithm that can generate an entire class of circuits. The point is that you can express more in Confluence using far fewer lines of code (you can reduce your source code by 3 to 10 times, which makes designs quicker to produce, easier to manage, and faster to verify). Also, the result is “guaranteed clean” RTL, which prevents common errors and bad design practices.

In programming terms, Confluence offers recursion, higher-order data types, lexical scoping, and referential transparency (more than enough to make any system designer’s toes curl up in excitement).

A simple example

As a simple hardware example, consider a Confluence component that cascades any single-input–single-output element for any number of stages:

    component Cascade +Stages +SisoComp +Input -Output is
      if Stages <= 0
        Output <- Input
      else
        Output <- {Cascade (Stages - 1) SisoComp {SisoComp Input $} $}
      end
    end

Although nonprogrammers may initially regard the above as being a tad scary, it’s really not all that bad.
The first line declares a new component we’ve decided to call Cascade, which has four parameters associated with it: Stages (the number of stages you require), SisoComp (the name of some subcomponent you wish to cascade), Input (the name of the input signal, or signals in the case of a bus), and Output (the name of the output signal, or signals in the case of a bus). Note that the only language keywords in this line are “component” and “is”; by comparison, Stages, SisoComp, Input, and Output are all user-defined variable names. (The “+” and “-” characters in this line indicate whether the associated user-defined variables are to be regarded as input or output ports, respectively.)

Furthermore, when we said that this component cascades any single-input–single-output element, both the Input and the Output variables could actually be multibit buses. In fact, these signals don’t even have to be bit vectors; they could be lists of bit vectors or lists of lists of bit vectors (or any data type for that matter).

As a simple example of the use of our new Cascade component, let’s assume that for some wild reason we wish to string 1,024 NOT gates together (don’t ask me why) such that the output from the first drives the input to the second, the output from the second drives the input to the third, and so forth. In this case, we could do this with a single line that calls our Cascade component and passes in the appropriate parameters:

    {Cascade 1024 ('~') Input Output}

In this case, the Confluence Compiler understands “~” to be a primitive logical inversion (NOT) function.

1970: America. Fairchild introduced the first 256-bit static RAM called the 4100.

As a slightly more interesting example, let’s assume that we wish to cascade sixteen 8-bit registers such that the outputs from the first register drive the inputs to the second, the outputs from the second drive the inputs to the third, and so forth. In this case, we would first need to declare a component
called something like Reg8 to represent the 8-bit register, and then use our Cascade component to replicate this 16 times:

    component Reg8 +A -X is
      {VectorReg 8 A X}
    end

    {Cascade 16 Reg8 Input Output}

1970: America. Intel announced the first 1024-bit dynamic RAM called the 1103.

Pretty cool, huh? But it gets better! How about squaring a signal’s values four times with a pipeline register between each stage? We can quickly and easily represent this as follows:

    component RegisteredPowerOfTwo +A -X is
      {Delay 1 (A '*' A) X}
    end

    {Cascade 4 RegisteredPowerOfTwo Input Output}

As we see, our Cascade component provides a perfect illustration of recursion and the use of higher-order datatypes, the two main characteristics of functional programming that provide higher levels of abstraction and increased design reuse.

And things get better and better because there’s no restriction that our subcomponent variable SisoComp is obliged to have input and output ports of the same width. In fact, this variable can be associated with any user-defined function; it can even input a component and then output a component, or it could input a system (an instantiated component) and then output another system. Similarly, there is no restriction that SisoComp can operate only on bit vectors; it can just as well operate on integers, floats, lists, components, systems, or any other Confluence datatype.

As one final example, SisoComp could be used to concatenate a bit vector onto itself, thereby doubling the number of bits. In order to illustrate this, let’s assume that we create a new component called SelfConcat:

    component SelfConcat +A -X is
      X = A '++' A
    end

where “++” is the concatenation operator. When SelfConcat is used in conjunction with Cascade, the bit vector grows by a factor of two at each stage.
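If Confluence syntax is new to you, the recursion behind Cascade may be easier to see in a mainstream language. The following Python sketch mirrors the structure of the component (apply the subcomponent once, then cascade the remaining stages); it is a software analogy using hypothetical helper names, with bit vectors modeled as strings of 0s and 1s, and makes no claim about what the Confluence Compiler actually generates.

```python
from typing import Any, Callable


def cascade(stages: int, siso: Callable[[Any], Any], value: Any) -> Any:
    """Apply the single-input, single-output function `siso` to `value` `stages` times."""
    if stages <= 0:
        return value                              # zero stages: output is the input
    # Apply one stage now, then recurse for the remaining (stages - 1) stages.
    return cascade(stages - 1, siso, siso(value))


def invert(bits: str) -> str:
    """Bitwise NOT on a bit vector held as a string."""
    return "".join("1" if bit == "0" else "0" for bit in bits)


def self_concat(bits: str) -> str:
    """Concatenate a bit vector onto itself, doubling its width."""
    return bits + bits


def square(x: int) -> int:
    return x * x


# Python limits recursion depth (1,000 frames by default), so a 1,024-stage
# cascade is better written as a loop here; small stage counts suffice to
# show the idea.
inverted = cascade(3, invert, "01")
widened = cascade(4, self_concat, "01")
squared = cascade(4, square, 2)
```

Because the function being cascaded is just another argument, the same three-line definition serves for inverters, registers, squarers, and width-doubling concatenation alike, which is the point the text makes about higher-order datatypes.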
For example, assume that we start with a 2-bit vector set to 01 and pass SelfConcat into Cascade:

    {Cascade 4 SelfConcat '01' Output}

In this case, the output will be a 32-bit vector with a value of 01010101010101010101010101010101.

Of course, VHDL has always had a generate statement, and Verilog was augmented with this capability in the 2K1 release, but Confluence blows these statements away.

But wait, there’s more

As I said earlier, it’s hard to wrap your brain around the many facets of Confluence. Perhaps the best way to summarize things is by means of an illustration (Figure 24-4).

Figure 24-4. A more accurate representation of the outputs from the Confluence Compiler. (Diagram labels: Confluence digital logic design description; Confluence embedded software design description; Confluence Compiler; VHDL and/or Verilog RTL, giving RTL descriptions used for ASIC and/or FPGA synthesis; C and/or Python and/or Java, giving hardware models or software code; Promela and/or NuSMV, used for formal verification.)

1970: First floppy disk (8 inch) is used for storing computer data.
1970: Researchers at Corning Glass develop first commercial/feasible optical fiber.

On the input side, you can use the Confluence language to create a representation of a piece of hardware or a chunk of embedded software. In the case of a hardware description, you can instruct the Confluence Compiler to generate VHDL or Verilog RTL for use with simulation and synthesis tools.

You can also use the Confluence Compiler to output ANSI C or Python or Java representations (again, the Python language is introduced in more detail in Chapter 25). If your input source represented hardware, then these outputs may be considered to be cycle-accurate and bit-accurate high-performance simulation models, which can be linked into your custom verification environment.
Alternatively, if your input source represented software, then these outputs may be considered to be executable code for use in your hardware/software coverification environment.

Last, but not least, the Confluence Compiler can be instructed to generate representations in the PROMELA or NuSMV languages for formal verification purposes using the open-source SPIN model checker and NuSMV symbolic model checker, respectively (formal verification is discussed in chapter 19, while PROMELA, SPIN, and NuSMV are introduced in more detail in Chapter 25).

Free evaluation copy

If you visit the Launchbird Web site at www.launchbird.com, you’ll find a lot of Confluence source code examples. One really “cool beans” idea is that anyone can download and use a single unlimited license for free. Subsequent licenses for commercial purposes will cost you (academic usage is free), but prices are always subject to change, so you’ll have to get the latest info from Launchbird on this.

What is really cool is that you own everything you develop with your free license (that is, any Confluence source code models and any ensuing VHDL, Verilog, C, etc. representations), and you can do with them what you wish, including sell them, which has to be a good deal, whichever way you look at it!

Do you have a tool?

Should you run into a useful tool from a small design house on your travels, or if you have created a tool of this type, please feel free to contact me at max@techbites.com for possible inclusion in the next edition of this tome or maybe an article in my bimonthly “Max Bytes” column at www.eedesign.com.

Chapter 25

Creating an Open-Source-Based Design Flow

How to start an FPGA design shop for next to nothing

Something you don’t see a lot of is the small two-guys-in-a-garage-type design house focused on developing ASICs. This isn’t particularly surprising because the design tools required to develop this class of device tend to be horrendously expensive at $100,000 and up on a good day.
(Of course, the fact that it costs millions of dollars to actually have a chip fabricated is also a bit of a showstopper.) By comparison, the combination of modern FPGAs and recent developments in open-source EDA and IP technology has brought the cost of starting an FPGA design outfit down to practically zero. This has paved the way for folks ranging from college graduates to full-blown professionals to set up shop in their basements.

In addition to actually knowing what you are doing with regard to creating digital logic designs, starting a successful FPGA design house requires a few fundamental pieces:

■ A development platform
■ A verification environment
■ Formal verification (optional)
■ Access to common IP components
■ Synthesis and implementation tools
■ FPGA development boards (optional)

Note that it is not my purpose to recommend the use of less well-supported tools. Low-cost FPGA vendor-supplied tools are preferred for cost-sensitive setups, while more powerful tools from the larger and/or specialist EDA vendors are preferred as designs increase in size and complexity. However, if you are trying to create an FPGA design “shop” at home on a limited (or nonexistent) budget, the open-source tools presented here may well be of interest.

The development platform: Linux

Linux is either pronounced “lee-nuks” (“lee” to rhyme with “see”) or “li-nuks” (“li” to rhyme with the “li” in “lit”, but NOT the “li” in “light”).

Created by the Finnish engineer Linus Torvalds (and friends) starting in 1991, Linux is quickly becoming the predominant platform for ASIC and FPGA development. Even though the majority of FPGA synthesis and implementation tools originated on Microsoft Windows®, most are starting to be, or already have been, ported to Linux.

Linux and GNU provide many invaluable tools for hardware and software development.
GNU is pronounced “G-noo” by taking the guttural ‘g’ sound from “great” and following it with “noo” to rhyme with “boo” or “pooh.”

Some common Linux tools (in no particular order, excepting one that pleased the author) include the following:

■ gcc: C remains the fastest modeling language around for simulation and verification. If your designs are so large that they choke your HDL (Verilog or VHDL) simulation capability, you might consider creating a cycle-accurate C model and compiling it using the open-source GNU C compiler (gcc).
■ make: The make utility is used to automate your build process. In the context of hardware, “build” can refer to anything from simulation, HDL-code generation, and logic synthesis to place-and-route. In order to tell make which files you wish to process and which files depend on other files, you have to define these files and their relationships in a file called a makefile.
■ gvim: Derived from “visual interface,” VI is the classic UNIX text editor. The vim utility is an enhanced version of VI, and gvim is a graphical user interface (GUI) version of vim. The gvim utility extends VI with syntax highlighting features and all sorts of other cool macros. With built-in support for both Verilog and VHDL, gvim is an ultrafast, never-take-your-fingers-off-the-keyboard design-entry tool.
■ EMACS: Considered by many hackers to be the ultimate editor, EMACS (from “Editing MACroS”) is a programmable text editor with an entire LISP interpreter system inside it. (LISP officially stands for List Processor, although its detractors say it really means “Lots of Irritating, Superfluous Parentheses.”) More powerful and more complex than VI, EMACS now has modules available for use in developing Verilog and VHDL-based representations.
■ cvs: The Concurrent Versions System (CVS) is the dominant open-source, network-transparent, version-control system and is applicable to everyone from individual developers to large, distributed teams. CVS supports branching, multiple users, and remote collaboration. It maintains a history of all changes made to the directory (folder) tree and files it is instructed to manage. Using this history, CVS can recreate past states of the tree and show you when, why, and by whom a given change was made. So, if you accidentally mess up your RTL code or decide you want to resynthesize a version of your design from three months ago, no problem; CVS will help you deal with this type of thing.
■ PERL: Scripting languages are often used for one-off programming jobs and for prototyping. In the context of electronic designs, they are also used to tie a number of tools in the flow together by controlling the ways in which the tools work and by organizing how data is passed between them. The Practical Extraction and Report Language (PERL) is historically one of the more widely used scripting languages. Developed by Larry Wall, PERL has jokingly been described as “The Swiss Army chainsaw” of UNIX (and Linux) programming, and many hardware design flows are still glued together using PERL scripts.
■ Python: Arguably more powerful than PERL, the Python language is an “all-singing-all-dancing” scripting language that has evolved into a full-fledged programming language. Invented by Guido van Rossum in 1990 and named after Monty Python due to Guido’s love of the Flying Circus, Python can be used for anything from gluing together the design flow, to high-level modeling and verification, to creating custom EDA tools (see also the additional discussions on Python later in this chapter).
■ diff: A relatively simple, but incredibly useful, utility, diff is used to quickly compare source files and detect and report differences between them.
■ grep: Standing for globally search for a regular expression and print the lines containing matches to it (phew!), grep is used to quickly search a file or group of files to locate and report on instances of a particular text string or pattern.
■ OpenSSL: Whether you are a large or small company, it pays to ensure the security of your IP. One aspect of this comes when you wish to transmit your IP over a network or over the Internet to your collaborators or customers. In this case, you really should consider encrypting the IP before waving it a fond farewell. One solution is the open-source OpenSSL project, which features a commercial-grade, full-featured toolkit implementing the Secure Sockets Layer (SSL) and Transport Layer Security (TLS) protocols, as well as an industrial-strength general-purpose cryptography library.
■ OpenSSH: Is your design team spread across the planet? The Secure SHell (ssh) utility is a program for logging into a remote machine and for executing commands on a remote machine while providing secure encrypted communications between two untrusted hosts over an insecure network. An open-source version of the ssh suite, OpenSSH encrypts all traffic (including passwords) to effectively eliminate eavesdropping, connection hijacking, and other network-level attacks. OpenSSH also provides a variety of secure tunneling capabilities and authentication methods.
■ tar, gzip, bzip2: These are different utilities that can be used to compress and archive your work.

1971: America. The Datapoint 2200 computer is introduced by CTC.

1971: America. Ted Hoff designs (and Intel releases) the first computer-on-a-chip, the 4004 microprocessor.

Obtaining Linux

Until recently, the leading distributors of Linux have been Red Hat (www.redhat.com) and MandrakeSoft (www.mandrakesoft.com). However, Gentoo Linux™ (www.gentoo.org) is rapidly becoming a favorite among developers.
Gentoo has a unique package distribution system that automatically downloads, compiles, and installs packages to your Linux machines. Want Icarus Verilog? Just type

$ emerge iverilog

and in a few minutes you’ll find that Icarus has been installed on your system and is ready to rock and roll!

The verification environment

You can argue about this back and forth, but many would say that the verification environment is the most critical part of the design flow. Anyone can bang away on the keyboard and produce HDL, but it’s the verification tools that provide designers with feedback to steer the design toward a correct implementation.

Icarus Verilog

The predominant open-source verification tool is a Verilog compiler known as Icarus (http://icarus.com/eda/verilog). In its basic form, Icarus compiles a Verilog design into an executable that can be run as a simulation. Truth to tell, Icarus is primarily used as an event-based simulator, but it can also handle basic logic synthesis for Xilinx FPGAs. Verilog is a complex language, and Icarus’s author, Stephen Williams, has done an excellent job with his Verilog implementation. In fact, Icarus Verilog’s language coverage and performance exceed those of some commercial simulators.

1971: CTC’s Kenbak-1 computer is introduced.

Dinotrace and GTKWave

Icarus Verilog, discussed above, is strictly a command-line tool. (Command-line tools are preferred in UNIX and Linux environments because they are easy to glue together with makefiles.) Icarus does not provide a GUI to display simulation results. Rather, it can produce industry-standard value change dump (VCD) files that can be used downstream in the design flow by stand-alone waveform viewing applications. Enter Dinotrace and GTKWave, which are GUI utilities that can be used to display simulation results in VCD format.
Both of these waveform viewers can scroll through a simulation, add trace lines, and search for patterns. Dinotrace (www.veripool.com/dinotrace) is a solid tool, but with limited functionality. By comparison, GTKWave (www.cs.man.ac.uk/apt/tools/gtkwave) started out a little rough around the edges, but has seen modest development in recent months.

Covered code coverage

When verifying a design, access to functional coverage metrics is important to ensure that your test vectors are hitting the corner cases in your design. Covered (http://covered.sourceforge.net) is a Verilog code-coverage utility that produces the code-coverage metrics associated with a simulation. More specifically, Covered analyzes Verilog source and the VCD data produced from an Icarus Verilog simulation to determine the level of functional coverage. Covered currently handles four types of coverage metrics: line coverage, toggle coverage, combinational coverage, and finite-state-machine coverage.

Verilator

The hot design issue these days is how to handle SoC designs, which require the integration of hardware and embedded software on a single chip. Many FPGAs host embedded hard processor cores or have access to soft processor cores (see also Chapter 13). The real trick in an SoC design involves verifying the hardware and software integration. Enter Verilator (www.veripool.com/verilator.html), which converts Verilog into cycle-accurate C++ models. The ability to autogenerate C/C++ models from RTL source code is a powerful verification tool. This allows the software to integrate directly with the C/C++ version of the RTL for simulation purposes.

Another useful tool is VTOC from Tenison EDA (www.tenison.com). This tool generates C++ or SystemC models from RTL source code.
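To make the phrase “cycle-accurate model” a little more concrete, here is a minimal sketch written in Python rather than in the C++ that Verilator actually emits. The 4-bit counter, the class name Counter4, and the step() method are hypothetical illustrations and are not taken from Verilator's output or interface; the only point is that the model is evaluated once per clock cycle, with no event queue to manage.

```python
class Counter4:
    """Hypothetical cycle-accurate model of a 4-bit up-counter
    with synchronous reset and enable (illustration only)."""

    def __init__(self):
        self.count = 0  # the counter's single 4-bit register

    def step(self, reset, enable):
        """Advance the model by exactly one clock cycle and
        return the register value after the clock edge."""
        if reset:
            self.count = 0
        elif enable:
            self.count = (self.count + 1) & 0xF  # wrap at 4 bits
        return self.count


def run(cycles):
    """Drive the model for a number of cycles: reset on cycle 0,
    then count with enable held high. Returns the value per cycle."""
    dut = Counter4()
    trace = []
    for cycle in range(cycles):
        trace.append(dut.step(reset=(cycle == 0), enable=True))
    return trace


if __name__ == "__main__":
    print(run(20))
```

Because each call to step() costs only a handful of ordinary operations, embedded software (or a testbench) can call straight into a model of this kind, which is the integration style described above.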
In addition to hardware-software coverification, Verilator can also be used for general-purpose Verilog simulation because simulating with cycle-accurate C gives much faster run times than can be obtained with an event-based HDL simulator. All you have to do is compile the output C code using gcc (see “The development platform: Linux” section above) and run.1

1 Note that Icarus also has C code-generation capabilities.

Python

Python (www.python.org) is a very useful high-level scripting and programming language becoming world renowned for its rapid implementation capabilities. Not surprisingly, Python is shaping up as a power tool for digital design and verification engineers, particularly for tasks such as system modeling, testbench construction, and general design management. In fact, many design firms are starting to discover that it’s easier and faster to begin by creating Python models rather than Verilog or VHDL representations. Once these Python models have been verified via simulation, the design team can undertake the RTL coding process constantly referencing their “golden” Python models.

MyHDL (www.jandecaluwe.com/Tools/MyHDL/Overview.html) is a Python framework for high-level system modeling. It uses recent feature additions to the Python language (generators) to mimic concurrent operations. MyHDL also has the ability to connect to Icarus Verilog for mixed Python/Verilog simulation.

Formal verification

As the Dutch mathematician and computer pioneer Edsger Wybe Dijkstra once said, “Program testing can be used to show the presence of bugs, but never to show their absence.” Although hardware simulation remains the predominant means for system testing, one can only ensure a system is correct by means of formal verification (see also Chapter 19).

1971: First direct telephone dialing between the USA and Europe.
1971: Niklaus Wirth develops the PASCAL computer language (named after Blaise Pascal).

Unlike simulation, formal verification mathematically proves that a system’s implementation meets some form of specification. The two main types of formal verification are model checking and automated reasoning. Model checking is a technique that explores the state space of a system to ensure that certain properties, typically specified as “assertions,” are true. A subdiscipline called equivalence checking, which compares two representations of a system (for example, RTL and a gate-level netlist) to determine whether or not they have the same input-to-output functionality, is a form of model checking. By comparison, automated reasoning uses logic to prove (much like a formal mathematical proof) that an implementation meets an associated specification.

Open-source model checking

The predominant open-source model checker is SPIN (http://spinroot.com), which has been under development for almost 20 years by Dr. Gerard J. Holzmann at Bell Labs. A rather cunning beast, SPIN recently received the Software and System Award from the Association for Computing Machinery (ACM). This is no small honor as previous award recipients have been UNIX, SmallTalk, TCP/IP, and the World Wide Web.

SPIN accepts an input specification with an integrated system model using a language called PROMELA. By means of this language, users can create complex assertions in the form of never-claims, which define a series of events that should never occur in the system. Given a model and a specification, SPIN exhaustively searches the state space for violations.

The main drawback with SPIN is that it’s primarily intended for asynchronous software verification and, thus, employs a technique called explicit verification. Although explicit verification is ideal for verifying software protocols, the technique tends to be inefficient for large hardware-based designs.
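The idea of exhaustively searching a state space can be sketched in a few lines of Python. This is emphatically not SPIN or PROMELA; it is a toy explicit-state search under stated assumptions. The function names (check_safety, make_successors, both_critical) and the two-process lock model are hypothetical, and the “bad state” predicate stands in for a far simpler property than a real never-claim can express.

```python
from collections import deque


def check_safety(initial, successors, is_bad):
    """Breadth-first search over every reachable state.
    Returns a shortest path (list of states) to a bad state,
    or None if no bad state is reachable."""
    parent = {initial: None}          # visited set plus back-pointers
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if is_bad(state):
            path = []
            while state is not None:  # walk back to the initial state
                path.append(state)
                state = parent[state]
            return path[::-1]
        for nxt in successors(state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None


def make_successors(atomic):
    """Toy model: two processes sharing one lock bit.
    State is (pc0, pc1, lock); pc 0 = idle, 1 = saw lock free, 2 = critical.
    If atomic is False, testing and setting the lock are separate steps."""
    def successors(state):
        pcs, lock = list(state[:2]), state[2]
        for i in (0, 1):
            new_pcs = pcs.copy()
            if pcs[i] == 0 and lock == 0:
                if atomic:            # test-and-set in one step
                    new_pcs[i] = 2
                    yield (new_pcs[0], new_pcs[1], 1)
                else:                 # only observe that the lock is free
                    new_pcs[i] = 1
                    yield (new_pcs[0], new_pcs[1], lock)
            elif pcs[i] == 1:         # take the lock and enter
                new_pcs[i] = 2
                yield (new_pcs[0], new_pcs[1], 1)
            elif pcs[i] == 2:         # leave and release the lock
                new_pcs[i] = 0
                yield (new_pcs[0], new_pcs[1], 0)
    return successors


def both_critical(state):
    """The property that must never hold: both processes in the critical section."""
    return state[0] == 2 and state[1] == 2


if __name__ == "__main__":
    print(check_safety((0, 0, 0), make_successors(atomic=False), both_critical))
    print(check_safety((0, 0, 0), make_successors(atomic=True), both_critical))
```

The search stores every visited state individually, which is what makes the explicit approach costly: the table of visited states grows with the number of reachable states, and that number tends to grow very quickly with the number of state bits in a hardware design.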
For moderately sized hardware designs, a symbolic model checker is the way to go. Unlike explicit verification, symbolic model checking uses binary decision diagrams (BDDs) and propositional satisfiability algorithms (SATs)2 to contain the problem and, if possible, avoid state-space explosion. Fortunately, there is a high-quality open-source symbolic model checker called NuSMV (http://nusmv.irst.itc.it).

2 The abbreviation “SAT” comes from the first three letters of “satisfiability.”

Open-source automated reasoning

The advantage of the model-checking approach discussed in the previous section is that it’s an automated process: click the button, then wait for the result. The drawback is that you may have to wait for a very long time. Even though the symbolic representation used by NuSMV provides a leg up on explicit model checkers, state-space explosion is still an imminent threat. It doesn’t take long before a system’s size grows beyond the practical limitations of a model checker. Another problem associated with model checking is that it’s limited in expression to the extent that some complex assertions simply can’t be specified in a model-checking environment.

Enter automated reasoning, otherwise known as automated theorem proving. Automated reasoning does not share the limitations of model checking. For example, system size is not as relevant because automated reasoning does not search the state space. More importantly, automated reasoning supports a much higher level of expression for accurately modeling complex and intricate specifications.

Unfortunately, what is gained in some areas is lost in others. Despite its name, automated reasoning is not a fully automatic process. In the real world, the verification engineer conducts the proofing process with the assistance of the tools. Furthermore, in order to use the tools effectively, the verification engineer needs to be well versed in proof strategies, mathematical logic, and the tools themselves. This is a nontrivial learning curve, but if you’re willing to invest the time and effort, automated reasoning is arguably the most powerful form of verification.

1972: America. Intel introduces the 8008 microprocessor.

1973: America. Scelbi Computer Consulting Company introduces the Scelbi-8H microcomputer-based do-it-yourself computer kit.

Unlike model checking, where open-source tools struggle to compete with commercial applications, the open-source tools for automated reasoning are at the world’s leading edge. Three of the most popular are HOL (http://hol.sourceforge.net), TPS (http://gtps.math.cmu.edu/tps.html), and MetaPRL (http://cvs.metaprl.org:12000/metaprl/default.html).

What actually is the problem?

Like any tool, formal verification is only as good as the engineers using it. Even on a good day, formal verification can only answer the question, Does my implementation meet the specification? But the critical question remains: Is my specification correct?

Evaluations of real-world designs show that most system failures are not due to a faulty implementation per se. Even without the use of formal verification, designs tend to implement the requirements correctly more often than not. The root causes of most failures are usually the requirements themselves. Open communication and collaboration are the best ways to ensure a correct specification, and, at the time of this writing, the only known tool that can tackle this problem is the cerebral cortex.

Access to common IP components

A useful rule of thumb if you are a small design house (or even a large design house) is to avoid reinventing the wheel. Over time, every design firm acquires a library of frequently used components that it can pull from to speed up the design process. In fact, a design firm’s capabilities are sometimes judged by its IP portfolio.
OpenCores

Fortunately for aspiring designers, they already have access to a vast IP library in the form of OpenCores (www.opencores.org). As the industry’s premier open-source hardware IP repository, OpenCores collects projects with cores ranging across arithmetic units, communication controllers, coprocessors, cryptography, DSP, forward error correction coding, and embedded microprocessors. Furthermore, OpenCores also stewards Wishbone, which is a standardized bus protocol for use in SoC projects.

OVL

Designers can spend as much as 70 percent of a design’s total development time in the verification portion of the flow. This has created the need for access to libraries of verification IP. For this reason, Accellera (www.accellera.org) started the Open Verification Library, or OVL, to address the need for common IP verification components.

Synthesis and implementation tools

Synthesis (both logic synthesis and physically aware synthesis) is one major step in the FPGA design flow not completely addressed by open-source technology. Unfortunately, this situation is unlikely to change in the immediate future due to the complexity of the FPGA synthesis problem.

At the time of this writing, Icarus (see “The verification environment” section above) is the only open-source tool known to synthesize HDL to FPGA primitives. The only other low-cost options are the synthesis and implementation tools from the FPGA vendors themselves (these should be the primary choice for a low-cost setup).

When a design approaches the capacity of a top-of-the-line device, however, even FPGA-vendor-provided synthesis tools start to become inadequate for the task. This means that in the case of large, bleeding-edge designs, you may have no choice but to fork out the cash for a high-end synthesis tool.

1973: America. Xerox Alto computer is introduced.

1973: May, France. 8008-based Micral microcomputer is introduced.

FPGA development boards

If a design firm decides to get involved with physical hardware, FPGA development boards are a must. OpenCores (see the “Access to common IP components” section above) does offer a few FPGA development board projects, but most designers would be better served by purchasing professional development boards.

On the bright side, money spent on boards can be saved in other areas. For example, a clever engineer can turn a small FPGA evaluation board into a highly capable logic analyzer (hmmm, this sounds like a potential OpenCores project!).

Miscellaneous stuff

Some other odds and sods that might be of interest are as follows:

■ www.easics.be: Click the “WebTools” link to find a CRC utility that allows you to select standard or custom polynomials and generate associated Verilog or VHDL modules
■ www.linuxeda.com: EDA tools for Linux
■ http://geda.seul.org: A collection of open-source EDA tools
■ www.veripool.com: A collection of Verilog-based tools (this is the home of Dinotrace and Verilator)
■ http://ghdl.free.fr: An open-source VHDL front end to gcc
■ http://asics.ws: Some more open-source IP cores

While surfing the Web, one can meander into a lot of other open-source projects related to EDA and FPGAs. Unfortunately, most are dormant or have been abandoned without achieving a useful level of functionality. Having said this, should you run into something useful, or if you have created something useful, please feel free to contact me at max@techbites.com for possible inclusion in the next edition of this tome.

Chapter 26
Future FPGA Developments

Be afraid, be very afraid

This is the scary bit, because past experience has shown that whatever I thought was coming down the pike was but a pale imitation of what actually ended up sneaking up behind me and leaping out with gusto and abandon when I was least expecting it.
You have to remember that when I started my career designing CPUs for mainframe computers back in 1980 (which really isn’t all that long ago when you come to think about it), we didn’t have access to any of the technologies and tools that are around today. We didn’t have schematic capture packages, so we used a pencil and paper to draw gate-level circuit diagrams. We didn’t have logic simulators (early versions were available, but we didn’t have one), so we verified our designs by peer review, which boils down to other engineers looking at your schematics and saying, “That looks OK to me.”

Sophisticated HDLs like Verilog and VHDL were a long way off in the future, and the possibility that tools like logic synthesis might one day exist simply never occurred to us. When it came to logic optimization and minimization, we had a Chinese engineer on our team who was incredible at this sort of thing; we gave him our designs and he returned optimized versions a day or so later. In the case of timing analysis, once again we were back to pencil and paper, calculating delay paths by hand (no one I knew could afford even the most rudimentary of electronic calculators).

In those days, we were working with multimicron ASIC technologies containing only a few thousand logic gates (FPGAs had not yet been invented). If you had told me that by 2003 we’d be designing ASICs and SoCs at the 90-nanometer technology node containing tens to hundreds of millions of logic gates and that we’d have reconfigurable devices like today’s SRAM-based FPGAs, I would have laughed my socks off.
Similarly, if you’d told me that I’d one day have a personal computer on my desktop with hundreds of megabytes of RAM, a clock running at 2 or more gigahertz, and a hard disk with a capacity of 60 gigabytes and that I’d have access to the EDA tools that are around today, I’d have calmly smiled while furtively looking for the nearest exit.1 The point is that electronics is going so fast that any predictions we might make are probably going to be of interest only for the purposes of saying, “Well, we didn’t see that coming, did we?” But what the heck, I’m game for a laugh, so let’s throw the dice and see how well we do.

1 The first IBM PC wouldn’t see the light of day until 1981.

Next-generation architectures and technologies

In Britain, the term “billion” traditionally used to mean “a million million” (10^12). For reasons unknown, however, the Americans decided that “billion” should mean “a thousand million” (10^9). In order to avoid the confusion that would otherwise ensue, most countries in the world (including Britain) have decided to go along with the Americans on this one.

Billion-transistor devices

One thing I feel very confident in predicting is that the next generation of FPGAs will contain a billion or more transistors (the reason I’m so self-assured on this point is that Xilinx recently announced devices of this ilk). These chips will be fabricated at the 90-nanometer technology node in late 2003 or early 2004, followed by even larger devices created at the 65- to 70-nanometer node in 2004 or 2005.

Super-fast I/O

When it comes to the gigabit transceivers discussed in Chapter 21, today’s high-end FPGA chips typically sport one or more of these transceiver blocks, each of which has multiple channels. Each channel can carry 2.5 Gbps of real data; so four channels have to be combined to achieve 10 Gbps. Furthermore, an external device has to be employed to convert an incoming optical signal into the four channels of electrical data that are passed to the FPGA. Conversely, this device will accept four channels of electrical data from the FPGA and convert them into a single outgoing optical signal. At the time of this writing, some FPGAs are coming online that can accept and generate these 10 Gbps optical signals internally.

Another technology that may come our way at some stage in the future is FPGA-to-FPGA and FPGA-to-ASIC wireless or wireless-like interchip communications. With regard to my use of the term wireless-like, I’m referring to techniques such as the experimental work currently being performed by Sun Microsystems on interchip communication based on extremely fast, low-powered capacitive coupling. This requires the affected chips to be mounted very (VERY) close to each other on the circuit board, but should offer interchip signal speeds 60 times higher than the fastest board-level interconnect technologies available today.

Super-fast configuration

The vast majority of today’s FPGAs are configured using a serial bit-stream or a parallel stream only 8 bits wide. This severely limits the way in which these devices can be used in reconfigurable computing-type applications. Quite some time ago (somewhere around the mid-1990s), a team at Pilkington Microelectronics (PMEL) in the United Kingdom came up with a novel FPGA architecture in which the device’s primary I/O pins were also used to load the configuration data. This provided a superwide bus (256 or more pins/bits) that could program the device in a jiffy.2

2 The official definition of “jiffy” is “a short space of time,” “a moment,” or “an instant.” Engineers may use “jiffy” to refer to the duration of one tick of a computer’s system clock. This is often based on one cycle of the mains power supply, which is 1/60 of a second in the U.S. and Canada and 1/50 of a second in England and most other places. More recently, equating a jiffy to 1/100 of a second has started to become common.
Just to add to the fun, physicists sometimes use “jiffy” to refer to the time required for light to travel one foot in a vacuum (this is close to one nanosecond).

Founded in 1826, Pilkington is one of the world’s largest manufacturers of glass products. Pilkington is widely recognized as the world’s technological leader in glass. For example, in 1952, Sir Alastair Pilkington invented the float process in which molten glass, at approximately 1000°C, is poured continuously from a furnace onto one end of a shallow bath of molten tin. The glass floats on the tin, which gives it an incredibly smooth surface. The glass cools and solidifies as it progresses across the bath, and is pulled off the far end in a continuous sheet. Having said all of this, I have no idea why Pilkington became involved in microelectronics.

1973: June, the term microcomputer first appears in print in reference to the 8008-based Micral microcomputer.

As an example of where this sort of architecture might be applicable, consider the fact that there are a wide variety of compressor/decompressor (CODEC) algorithms that can be used to compress and decompress audio and video data. If you have a system that needs to decompress different files that were compressed using different algorithms, then you are going to need to support a variety of different CODECs.

Assuming that you wished to perform this decompression in hardware using an FPGA, then with traditional devices you would either have to implement each CODEC in its own device or as a separate area in a larger device. You wouldn’t wish to reprogram the FPGA to perform the different algorithms on the fly because this would take from 1 to 2.5 seconds with a large component, which is too long for an end user to wait (we demand instant gratification these days).
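To see why the width of the configuration bus matters so much, a back-of-the-envelope sketch may help. It assumes one bus word is loaded per clock cycle; the bitstream size (20 million bits) and configuration clock (10 MHz) are made-up round numbers chosen purely for illustration, and the function name config_time_seconds is likewise hypothetical. Only the bus widths (serial, 8 bits, 256 bits) come from the discussion above.

```python
def config_time_seconds(bitstream_bits, bus_width_bits, clock_hz):
    """Time to load a configuration, assuming one bus word per clock cycle."""
    words = -(-bitstream_bits // bus_width_bits)  # ceiling division
    return words / clock_hz


# Hypothetical figures, for illustration only (not from any datasheet).
BITSTREAM_BITS = 20_000_000   # size of the configuration file in bits
CLOCK_HZ = 10_000_000         # configuration clock frequency

if __name__ == "__main__":
    for width in (1, 8, 256):  # serial, byte-wide, and superwide buses
        t = config_time_seconds(BITSTREAM_BITS, width, CLOCK_HZ)
        print(f"{width:>3}-bit bus: {t * 1000:10.3f} ms")
```

With these assumed numbers, the serial load takes 2 seconds, the 8-bit load takes a quarter of a second, and the 256-bit load takes under 8 milliseconds. Real devices will differ, but the scaling with bus width is the point.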
By comparison, in the case of the PMEL architecture, the reconfiguration data could be appended to the front of the file to be processed (Figure 26-1).

[Diagram labels: files containing configuration data for different CODEC algorithms; PMEL FPGA; audio and video files compressed using different CODEC algorithms.]
Figure 26-1. A wide configuration bus.

The idea was that the configuration data would flood through the wide bus, program the device in a fraction of a second, and be immediately followed by the main audio or video data file to be decompressed. If the next file to be processed required a different CODEC, then the appropriate configuration file could be used to reprogram the device.

This concept was applicable to a wide variety of applications. Unfortunately, the original incarnation of this technology fell by the wayside, but it’s not beyond the bounds of possibility that something like this could reappear in the not-so-distant future.3

More hard IP

In the case of technology nodes of 90 nanometers and below, it’s possible to squeeze so many transistors onto a chip that we are almost certainly going to see an increased amount of hard IP blocks for such things as communications functions, special-purpose processing functions, microprocessor peripherals, and the like.

Analog and mixed-signal devices

Traditional digital FPGA vendors have a burning desire to grab as many of the functions on a circuit board as possible and to suck these functions into their devices. In the short term, this might mean that FPGAs start to include hard IP blocks with analog content such as analog-to-digital (A/D) and digital-to-analog (D/A) converters. Such blocks would be programmable with regard to such things as the number of quanta (width) and the dynamic range of the analog signals they support. They might also include amplification and some filtering and signal conditioning functions.
Furthermore, over the years a number of companies have promoted different flavors of field-programmable analog arrays (FPAAs).4 Thus, there is more than a chance that predominantly digital FPGAs will start to include areas of truly programmable analog functionality similar to that provided in pure FPAA devices.

3 A wide-bus configuration scheme is used by some of the field programmable node array (FPNA) devices introduced in Chapter 23.

4 For example, Anadigm (www.anadigm.com) have some interesting devices.

1974: America. Intel introduces the 8080 microprocessor, the first true general-purpose device.

ASMBL and other architectures

ASMBL is pronounced like the word “assemble.”

Just as I started penning the words for this chapter, Xilinx formally announced their forthcoming Application Specific Modular BLock (ASMBL™) architecture. The idea here is that you have an underlying column-based architecture, where the folks at Xilinx have put a lot of effort into designing different flavors of columns for such things as

■ General-purpose programmable logic
■ Memory
■ DSP-centric functions
■ Processing functions
■ High-speed I/O functions
■ Hard IP functions
■ Mixed-signal functions

Xilinx will provide a selection of off-the-shelf devices, each with different mixes of column types targeted toward different application domains (Figure 26-2).

Figure 26-2. Using the underlying ASMBL architecture to create a variety of off-the-shelf devices with domain-specific functionality.

Of course, the other FPGA vendors are doubtless working on their own next-generation offerings, and we can expect to see a flurry of new architectures over the coming years.

Different granularity

As we discussed in Chapter 4, FPGA vendors and university students have spent a lot of time researching the relative merits of 3-, 4-, 5-, and even 6-input LUTs.
In the past, some devices were created using a mixture of different LUT sizes, such as 3-input and 4-input LUTs, because this offered the promise of optimal device utilization. For a variety of reasons, the vast majority of today’s FPGAs contain only 4-input LUTs, but it’s not beyond the range of possibility that future offerings will sport a mixture of different LUT sizes.

Embedding FPGA cores in ASIC fabric

The cost of developing a modern ASIC at the 90-nanometer technology node is horrendous. This problem is compounded by the fact that once you’ve completed a design and built the chip, your algorithms and functions are effectively “frozen in silicon.” This means that if you have to make any changes in the future, you’re going to have to regenerate the design, create a new set of photo-masks (costing around $1 million), and build a completely new chip.

In order to address these issues, some users are interested in creating ASICs with FPGA cores embedded into the fabric. Apart from anything else, this means that you can use the same design for multiple end applications without having to create new mask sets. At the time of this writing, the latest incarnation of this technology is the XBlue architecture announced by IBM and Xilinx. Created using the 90-nanometer technology node, these devices are expected to start shipping in 2004.

I also think that we are going to see increased deployment of structured ASICs and that these will lend themselves to sporting embedded FPGA cores because their design styles and tools will exhibit a lot of commonality.

1974: America. Motorola introduces the 6800 microcomputer.

1974: America. Radio Electronic Magazine publishes an article by Jonathon (Jon) Titus on building an 8008-based microcomputer called the Mark-8.
Embedding FPNA cores in ASIC and FPGA fabric and vice versa

In Chapter 23, we discussed the concept of embedding FPNA cores in FPGA and ASIC fabric or embedding FPGA-based nodes in FPNA fabric. Should this come to pass, it’s not beyond the bounds of possibility that one day we’ll be designing an ASIC with an embedded FPGA core, which itself has an embedded FPNA core, which, in turn, contains FPGA-based nodes. The mind boggles!

MRAM-based devices

In Chapter 2, we introduced the concept of MRAM. MRAM cells have the potential to combine the high speed of SRAM, the storage capacity of DRAM, and the nonvolatility of FLASH, all while consuming a minuscule amount of power.

MRAM-based memory chips are predicted to become available circa 2005. Once these memory chips do reach the market, other devices, such as MRAM-based FPGAs, will probably start to appear shortly thereafter.

Don’t forget the design tools

As we discussed above, the next generation of FPGAs will contain 1 billion transistors or more. Existing HDL-based design flows in which designs are captured at the RTL level of abstraction are already starting to falter with the current generation of devices, and it won’t be long before they essentially grind to a halt.

One useful step up the ladder will be increasing the level of design abstraction by using the pure C/C++-based flows introduced in Chapter 11. What is really required, however, are true system-level design environments that help users explore the design space at an extremely high level of abstraction. In addition to algorithmic modeling and verification, these environments will aid in partitioning the design into its hardware and software components.
These system-level environments will also need to provide performance analysis capabilities to aid users in evaluating which blocks are too slow when realized in software and, thus, need to be implemented in hardware, and which blocks realized in hardware should really be implemented in software so as to optimize the use of the chip’s resources.

People have been talking about this sort of thing for ages, and various available environments and tools go some way toward addressing these issues. In reality, however, such applications have a long way to go with regard to their capabilities and ease of use.

Expect the unexpected

That’s it, the end of this chapter and the end of this book. Phew! But before closing, I’d just like to reiterate that anything you or I might guess at for the future is likely to be a shallow reflection of what actually comes to pass. There are device technologies and design tools that have yet to be conceived, and when they eventually appear on the stage (and based on past experience, this will be sooner than we think), we are all going to say, “WOW! What a cool idea!” and “Why didn’t I think of that?” Good grief, I LOVE electronics!

1975: America. Microcomputer in kit form reaches U.S. home market.

Appendix A
Signal Integrity 101

Before we start

Before leaping into this topic, it’s important to note that signal integrity (SI) is an incredibly complicated and convoluted subject that can quickly make your brain ache and your eyes water if you’re not careful. For this reason, the discussions in this appendix are intended only to introduce some of the more significant SI concepts. If you are interested in learning more, you could do a lot worse than reading Signal Integrity - Simplified by SI expert Dr. Eric Bogatin, ISBN: 0130669466, and High Speed Signal Propagation: Advanced Black Magic by Howard W. Johnson, ISBN: 013084408X.
SI encompasses a wide range of different aspects, including the way in which the “shape” of a signal degrades as it passes through a wire, and also the way signals can effectively “bounce back” off the end of a wire that is incorrectly terminated (like a ball thrown down a corridor bouncing off the wall at the end). For our purposes here, however, we shall concentrate on those SI effects that are gathered together under the umbrella appellation of crosstalk.

Crosstalk-induced noise (glitches) and delays are dominated by different issues inside silicon chips from those seen at the circuit board level. For this reason, we shall commence by introducing the root causes of these effects and then consider their chip-level and board-level manifestations independently.

SI is pronounced by spelling it out as “S-I.”

The amount by which a material impedes the flow of electric current is referred to as resistance (R), which is measured in units of ohms. The term “ohm” (represented by the Greek letter omega “Ω”) is named after the German physicist Georg Simon Ohm, who defined the relationship between voltage, current, and resistance in 1827.

The property of an electric conductor that characterizes its ability to store an electric charge is referred to as capacitance (C), which is measured in units of Farads (F). The term “Farad” is named after the British scientist Michael Faraday, who constructed the first electric motor in 1821.

Capacitive and inductive coupling (crosstalk)

Consider two signal wires called Wire1 and Wire2, each of which is driven by a single gate and drives a single load. In an ideal (and somewhat simplified) world, both wires would be perfectly straight with no awkward bends or discontinuities, and each could be represented by a single series resistance, series inductance, and capacitance (Figure A-1).

Figure A-1. Two signal wires in an ideal (simplified) world.

For the purposes of this minimalist example, the capacitances CW1 and CW2 are considered with respect to a ground plane.

In its simplest form, a capacitor consists of two metal plates separated by an insulating layer called the dielectric. This means that if our two signal wires run in close proximity to each other, then from the perspective of an outside observer they would actually appear to form a rudimentary capacitor. This may be represented by adding a symbol CM to reflect this mutual capacitance into our circuit diagram (Figure A-2).

Figure A-2. Two wires in close proximity are coupled by a mutual capacitance.

When one of the signal wires is in the process of transitioning between logic values, the coupling capacitance between the wires causes a transfer of charge into the other wire, which may result in noise (glitch) and delay effects, as discussed in the following sections.

As was previously noted, each wire also has some amount of inductance associated with it. In its simplest terms, inductance is the property associated with conductors by which changes in the current flowing through a conductor create a magnetic field surrounding that conductor. Correspondingly, any changes in the magnetic field surrounding a conductor induce a response in that conductor.

In 1831, the British scientist Michael Faraday discovered that a changing electromagnetic field induced a current in a nearby conductor. This effect subsequently became known as inductance (L).

This means that when one of our signal wires is in the process of transitioning between logic values, the change in current flowing through the wire combined with the inductance associated with that wire causes a magnetic field to build up around the wire.
As it expands, this field interacts with the inductances associated with any wires in close proximity, which, once again, may result in noise and delay effects as discussed in the following sections. This mutual inductance is indicated by adding a dot to each of the inductor symbols (Figure A-3).

Figure A-3. Two wires in close proximity are coupled by a mutual inductance.

The symbol for inductance is the capital letter L in honor of the Russian physicist Heinrich Lenz, who discovered the relationships between the forces, voltages, and currents associated with electromagnetic induction in 1833.

Inductance is measured in units of Henries (H). The term “Henry” is named after the American scientist Joseph Henry, who independently discovered inductance around the same time as Faraday.

Chip-level effects

Chip-level effects are RC (resistance-capacitance) dominated

RC is pronounced by spelling it out as “R-C.” RLC is pronounced by spelling it out as “R-L-C.”

Early ICs had tracks that were formed from aluminum (chemical symbol Al), which has a relatively high resistance.

Pronounced “al-oo-minum” in America, aluminum is spelled “aluminium” (and pronounced “al-u-minium”) in the UK. Aluminium was also the accepted spelling in America until 1925. At that time, the American Chemical Society officially decided to use the name aluminum in their publications.

Dating back more than 10,000 years, copper is the oldest metal worked by man.

Most creatures on earth have blood whose red color is caused by the iron-based pigment hemoglobin. However, some creatures have bluish, copper-based blood, whose pigment is called hemocyanin.

As device feature sizes continued to shrink with each new technology node, the resistance associated with the aluminum tracks started to increase to unacceptable levels.
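To see why shrinking feature sizes push track resistance up, it can help to run the standard formula for a conductor of rectangular cross section, R = ρL / (W × H). The following is only a rough sketch: the resistivity is the usual room-temperature bulk figure for aluminum, and the track length and cross sections are hypothetical values chosen purely for illustration (real thin-film tracks behave somewhat worse than bulk metal).

```python
# Bulk resistivity of aluminum at room temperature, in ohm-metres.
# (Assumed textbook value; thin on-chip films are typically somewhat worse.)
RHO_ALUMINUM = 2.65e-8


def track_resistance(rho, length_m, width_m, height_m):
    """Resistance of a track with rectangular cross section: R = rho * L / (W * H)."""
    return rho * length_m / (width_m * height_m)


# Hypothetical 1 mm long track at two illustrative geometries.
r_old = track_resistance(RHO_ALUMINUM, 1e-3, 1.0e-6, 0.5e-6)    # wide, shallow track
r_new = track_resistance(RHO_ALUMINUM, 1e-3, 0.13e-6, 0.26e-6)  # narrow, tall track

print(f"1.0 micron wide track : {r_old:7.1f} ohms")
print(f"0.13 micron wide track: {r_new:7.1f} ohms")
```

Because the cross-sectional area appears in the denominator, shrinking both the width and the height of a track raises its resistance much faster than either dimension alone would suggest.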
IC manufacturers had long wanted to use copper tracks (chemical symbol Cu) because copper is one of the best conductors known to man, especially for high-frequency applications. However, copper also has some awkward properties, not the least of which is that it can easily diffuse into the silicon chip, thereby rendering the device useless. It was not until the late 1990s that IBM solved this problem by the inclusion of special barrier layers.

Even though copper has a much lower resistance than aluminum, signal tracks on ICs are so fine that their resistance is still extremely significant. The result is that, thus far, delay effects associated with signals propagating through IC tracks have tended to be dominated by their resistive and capacitive (RC) characteristics.

At this time, inductive (L) effects are typically ignored in signal tracks and are only considered with respect to the power grid. This grid employs wider tracks with correspondingly lower resistance, such that resistance, inductance, and capacitance (RLC) characteristics all need to be accounted for.

Increased sidewall capacitive coupling

In the case of early IC implementation technologies, the aspect ratio of tracks was such that their width was significantly greater than their height (Figure A-4a). As feature sizes continue to shrink, however, the processes used to create these devices result in track aspect ratios in which height predominates over width (Figure A-4b).

Figure A-4. Sidewall capacitance effects increase with shrinking feature sizes (not to scale; illustrates relative aspect ratios only). Cross-sectional views of the interconnect above the IC substrate, showing the sidewall coupling capacitance (CSIDE): (a) 1.0 micron circa 1990 (small CSIDE values); (b) 0.13 micron circa 2003 (large CSIDE values).
The result is a dramatic increase in coupling capacitance (CSIDE) between the sidewalls of adjacent tracks relative to the substrate capacitances CAREA (track base to substrate) and CFRINGE (sidewall to substrate). Furthermore, the high integration densities associated with today’s devices, which can support eight or more metalization layers, result in significant capacitive coupling between adjacent layers. This is represented by CCROSSOVER (Figure A-5).

Figure A-5. Capacitance effects associated with the interconnect (CSIDE between adjacent tracks on the same metal layer, CCROSSOVER between Metal 2 and Metal 1, and CAREA and CFRINGE between a track and the substrate).

The combination of these factors leads to a tremendous increase in the complexity of crosstalk noise and timing effects, as discussed below.

Crosstalk-induced glitches

When signals in neighboring wires transition between logic values, the coupling capacitance between the wires causes a transfer of charge. Depending on the slew of the signals (the speed of switching in terms of rise and fall times) and the amount of mutual crosstalk capacitance (CM), there can be significant crosstalk-induced glitches (Figure A-6).

Figure A-6. A crosstalk-induced glitch.

The term “glitch” possibly comes from the Yiddish word glitsh, meaning “a slip or lapse.”

1975: America. MOS Technology introduces the 6502-based KIM-1 microcomputer.

In this example, a transition on the fast aggressor net causes a glitch to be presented to the input of the receiver (load) of an adjacent victim net. Of course, this illustration presents a very simplistic view. In reality, each track may be formed from multiple segments occupying multiple levels of metalization. Thus, the resistances (RW1 and RW2) and capacitances (CW1 and CW2) will each consist of multiple elements associated with the different segments. Similarly, the mutual coupling crosstalk capacitance (CM) may consist of multiple elements.
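To get a rough feel for how large such a glitch might be, one simple (and deliberately pessimistic) approach is to treat the mutual capacitance and the victim’s capacitance to ground as a capacitive divider. The sketch below makes several assumptions that are not part of the discussion above: the victim is treated as undriven, the aggressor edge is treated as instantaneous, and the function names, voltage swing, and capacitance values are all hypothetical. A real victim net is held by its driver, so the actual glitch depends on slew and drive strength and will generally be smaller.

```python
def glitch_peak(v_swing, cm_segments, cw_segments):
    """Pessimistic glitch estimate on an undriven victim (capacitive divider).

    cm_segments: mutual (coupling) capacitance of each track segment, in farads.
    cw_segments: victim-to-ground capacitance of each track segment, in farads.
    Capacitances between the same pair of nodes are in parallel, so they add.
    """
    c_m = sum(cm_segments)
    c_w = sum(cw_segments)
    return v_swing * c_m / (c_m + c_w)


def may_cause_logic_error(glitch_v, v_swing, threshold_fraction=0.5):
    """True if the glitch reaches the receiver's input switching threshold."""
    return glitch_v >= threshold_fraction * v_swing


# Hypothetical numbers: 1.2 V swing, three segments on each net.
v_glitch = glitch_peak(1.2, [10e-15, 12e-15, 8e-15], [30e-15, 25e-15, 15e-15])
print(f"estimated glitch: {v_glitch:.2f} V")
print("possible logic error:", may_cause_logic_error(v_glitch, 1.2))
```

The point of the exercise is only the trend: the larger the mutual capacitance is relative to the victim’s capacitance to ground, the closer the glitch gets to the receiver’s switching threshold.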
The example glitch illustrated in Figure A-6 represents only one of four generic possibilities based on the fact that a rising or falling transition on the aggressor net may be coupled with a logic 0 or logic 1 on the victim net (Figure A-7).

Figure A-7. Types of crosstalk-induced glitches.

If the ensuing low-noise or high-noise glitches on the victim net cross the input switching threshold of its receiver, a functional (logic) error may occur. In some cases this error may manifest itself as an incorrect data value that is subsequently loaded into a register or latch. In other cases, the error may cause a latch to perform an unintended load, set, or reset.

The low-undershoot and high-overshoot glitches on the victim net pose a different problem because they can cause undesirable charge carriers to be trapped in the transistors forming the logic gates, which can degrade circuit performance. Although these effects, commonly known as hot electron effects, are not a major threat in the context of current IC implementation technologies, they will become increasingly significant as device geometries progress further into the deep-submicron (DSM) and ultra-deep-submicron (UDSM) realms.

Any IC implementation technology below 0.5 µm is referred to as being deep submicron (DSM). DSM is pronounced by spelling it out as “D-S-M.”

At some point that isn’t particularly well defined (or is defined differently depending on whom you are talking to), we move into the UDSM realm. UDSM is pronounced by spelling it out as “U-D-S-M” or by saying “ultra-D-S-M.”

Crosstalk-induced delay effects

The situation becomes even more complex when simultaneous switching occurs on both the aggressor and victim nets. For example, in the case of opposing transitions, the signal on the victim net may be slowed down (Figure A-8).

Figure A-8. Crosstalk-induced signal delay.

1975: America. Sphere Corporation introduces the 6800-based Sphere 1 microcomputer.

If the signal on the victim net were transitioning in isolation, it would take a certain amount of time to cross its receiver’s switching threshold (which, for the purposes of these discussions, may be assumed to be 50 percent of the value between a logic 0 and a logic 1). However, the glitch caused by a simultaneous transition on the aggressor net holds the victim’s signal above the receiver’s switching threshold for an additional amount of time. This can result in a downstream setup violation.

An alternative scenario occurs when a transition on the victim is complemented by a simultaneous transition on the aggressor in the same direction, in which case the signal on the victim may speed up (Figure A-9).

Figure A-9. Crosstalk-induced signal speed up.

In this case, the glitch caused by a simultaneous transition on the aggressor net causes the victim’s signal to cross the load/receiver’s switching threshold earlier than expected. This can result in a downstream hold violation.

1975: America. Bill Gates and Paul Allen found Microsoft.

Multiaggressor scenarios

In reality, the examples shown above are extremely simplistic. In the case of real-world designs, each victim net may be affected by multiple aggressors (Figure A-10).

Figure A-10. Multiaggressor scenario (a victim net flanked by aggressors A, B, and C).

Accurate analysis of today’s designs requires that each aggressor’s contribution be individually analyzed and accounted for.

And let’s not forget the Miller effect

In the context of an electronic circuit, the term “bus” (sometimes “buss”) refers to a set of signals performing a common function and carrying similar data.

The Miller effect, which is of particular significance at the chip level, states that the simultaneous switching of both terminals of a capacitor will modify the effective capacitance between the terminals.
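A common first-order way of putting numbers on the Miller effect is a “switching factor” model, in which each neighbor’s coupling capacitance is counted zero times if the neighbor switches in the same direction, once if it is quiet, and twice if it switches in the opposite direction. The sketch below assumes equal and exactly simultaneous transitions; the factor table, function name, and capacitance values are illustrative assumptions rather than anything taken from a particular tool.

```python
# Switching factor applied to the coupling capacitance of each neighbour,
# relative to the signal of interest (assumed first-order model).
MILLER_FACTOR = {"same": 0.0, "quiet": 1.0, "opposite": 2.0}


def effective_capacitance(c_ground, c_couple, neighbours):
    """Effective load seen by a signal, given what each neighbour is doing.

    c_ground:   capacitance from the signal to ground, in farads.
    c_couple:   coupling capacitance to each neighbour, in farads.
    neighbours: iterable of "same", "quiet", or "opposite".
    """
    return c_ground + sum(MILLER_FACTOR[n] * c_couple for n in neighbours)


# Hypothetical signal in the middle of a bus, with one neighbour on each side.
for activity in (["same", "same"], ["quiet", "quiet"], ["opposite", "opposite"]):
    c_eff = effective_capacitance(50e-15, 20e-15, activity)
    print(f"{activity}: {c_eff * 1e15:.0f} fF")
```

Since an RC-dominated delay scales with the capacitance being charged, the spread between the smallest and largest of these values gives a feel for how much the delay of a single bus signal can move depending on what its neighbors happen to be doing.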
What this means in real terms becomes apparent when we consider one of the signals in the middle of a bus, for example. If one or more of the surrounding signals in the bus are switching with the same polarity (in the same direction) as the signal of interest, then the capacitance associated with this signal will appear to be reduced, and its propagation delay will decrease (this is in addition to the crosstalk-induced delay effects introduced earlier). By comparison, if one or more of the surrounding signals in the bus are switching with the opposite polarity to the signal of interest, then the capacitance associated with this signal will appear to be larger and its propagation delay will increase.

The reason we commenced with the chip-level effects introduced above is that these provide a familiar starting point for IC design engineers. In reality, however, on-chip SI effects (excluding packaging considerations) are of little interest to engineers using FPGAs because these effects are handled behind the scenes by the device vendor. By comparison, board-level SI effects are extremely pertinent when it comes to integrating FPGAs into a circuit board environment.

Board-level effects

Board-level effects are LC (inductance-capacitance) dominated

LC is pronounced by spelling it out as “L-C.” PCB is pronounced by spelling it out as “P-C-B.”

When it comes to PCBs, the resistance of their copper tracks is almost negligible in the context of coupling effects. This is because at around 125 microns wide and 18 microns thick, board-level tracks have a huge cross-sectional area compared to their chip-level counterparts (the larger the cross section of a conductor, the lower its resistance). By comparison, both inductive and capacitive coupling effects are significant, so circuit board signal tracks are predominantly considered to be LC-coupled.
A different way of thinking about things

In the case of today’s high-speed, high-performance PCBs, the tracks almost invariably act like transmission lines. This means we have to visualize a signal edge as a moving wave propagating down the wire through time (Figure A-11).

Figure A-11. A signal edge moving through time.

With regard to this transmission line view, in which the delay down the wire is long in comparison to the signal’s transition times, the only place any capacitive or inductive coupling occurs is at the current location of the moving edge. This means that we have to consider the track in terms of a series of small RLC segments (which are not shown in these figures for reasons of simplicity).

An alternative name for PCB is printed wire board (PWB), which is pronounced by spelling it out as “P-W-B.”

The most commonly used board material is FR4, which is pronounced by spelling it out as “F-R-4” (the “FR” stands for “flame retardant”).

1975: England. First liquid crystal displays (LCDs) are used for pocket calculators and digital clocks.

Capacitive and inductive coupling effects

Things really start to get interesting when we consider two of these board-level tracks in close proximity to each other. Let’s assume that we are looking at a moving edge that is in the process of propagating down an aggressor track that is inductively and capacitively coupled to a neighboring victim track (Figure A-12).

Figure A-12. Capacitive and inductive coupling effects.

In the case of the capacitive coupling effect, the moving edge on the aggressor net induces positive-going current pulses on the victim net in both the forward and reverse directions. By comparison, in the case of the inductive coupling effect, the moving edge on the aggressor net induces a negative-going current pulse on the victim net in the forward direction and a positive-going current pulse on the victim net in the reverse direction.
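The sign bookkeeping just described can be written down in a few lines. This is only a sketch: the function name is hypothetical, and the two arguments are unit-less relative magnitudes rather than real currents. The signs are taken directly from the description above, with the capacitive contribution positive in both directions and the inductive contribution negative going forward and positive going in reverse.

```python
def coupled_pulses(capacitive, inductive):
    """Return (reverse, forward) relative pulse amplitudes on the victim.

    capacitive: relative magnitude of the capacitively coupled pulse.
    inductive:  relative magnitude of the inductively coupled pulse.
    """
    cap_forward, cap_reverse = +capacitive, +capacitive
    ind_forward, ind_reverse = -inductive, +inductive
    reverse = cap_reverse + ind_reverse   # travels back toward the driver end
    forward = cap_forward + ind_forward   # travels on toward the receiver end
    return reverse, forward


# Equal capacitive and inductive coupling.
print(coupled_pulses(1.0, 1.0))

# Inductive coupling three times the capacitive coupling.
print(coupled_pulses(1.0, 3.0))
```

With equal magnitudes the forward-travelling contributions sum to zero while the reverse-travelling ones add; once the inductive term is the larger of the two, a net negative pulse travels forward.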
This means that the capacitive and inductive coupling effects tend to augment each other when it comes to near-end noise (noise as seen at the driver end of the track). However, they tend to cancel each other out when it comes to far-end noise (noise as seen at the receiver end of the track).

This means that the best-case scenario one can ever hope for is when the capacitive and inductive coupling effects are of comparable magnitudes, because they will cancel each other out at the far end. Unfortunately, this will only ever happen if the dielectric (insulating) layer around the signals is homogeneous, such as with a stripline stackup. In the real world, the dielectric around signals is typically inhomogeneous, as with surface traces, which have air above and FR4 below. In this case, the inductive coupling does not change, but the capacitive coupling decreases. This increases the relative amount of inductive coupling and gives rise to the generation of far-end noise at the receiver.

In a typical circuit board environment, the inductive noise can be as much as two to four times the capacitive noise. If anything occurs to degrade the return path, the inductive coupling can increase dramatically to as much as ten to thirty times the capacitive. In such a regime, where the crosstalk is dominated by inductive coupling, we call the ensuing noise switching noise. In the case where multiple signal paths share the same return path, the switching noise we get across the return (ground) connection is called ground bounce.

The anti-Miller effect

In our chip-level discussions, we introduced the concept of the Miller effect, which says that if one or more signals are switching with the same polarity (in the same direction) in close proximity to a signal of interest, then the capacitance associated with this signal will appear to be reduced, and its propagation delay will decrease.
As was previously noted, however, the propagation delays of chip-level signals are predominantly RC dependent, while the propagation delays of board-level signals are predominantly LC dependent. This means that if one or more board-level signals are switching with the same polarity in close proximity to a signal of interest, then the inductance associated with this signal will appear to be larger. In an inhomogeneous dielectric stackup, the relative inductive coupling is larger than the capacitive coupling, and the increased inductance of the signal trace causes the propagation delay to increase.

By comparison, if one or more board-level signals are switching with the opposite polarity to the signal of interest, then the inductance associated with this signal will appear to be reduced and its propagation delay will decrease.

Transmission line effects

In addition to the effects presented above, there are, of course, classical transmission line effects with associated termination considerations such as using series termination on outputs and parallel termination on inputs, but this sort of thing is beaten into the ground in standard textbooks, so we will skip over it here.

1975: America. Ed Roberts and his company, MITS, introduce the 8080-based Altair 8800 microcomputer.

Things you can do to make life easier

I/O is pronounced by spelling it out as “I-O.”

Unfortunately, 70 to 80 percent of the SI problems associated with connecting an FPGA to a circuit board are not related to the board per se, but rather to the FPGA’s package. Ideally, the package should have as large a number of power-ground pad pairs as possible, and these pad pairs should be uniformly distributed across the base of the package so as to provide the I/O pads with plenty of adjacent return paths. In reality, the power and ground pads tend to be clustered together, leaving groups of I/O pads to do the best they can with the return paths available to them.
You can make life easier by adopting a simple rule: if you have the option to use differential output pairs for your I/O, especially in the case of buses and high-speed interconnections, do so. Of course, this doubles the number of pins you use for the affected I/Os, but it’s well worth your time if you can afford the overhead in pins.

Another point to consider relates to the internal, programmable termination resistors provided in some FPGAs. The use of these is optional in that you can either use discrete components at the board level or enable these internal equivalents as required. These internal terminations are predominantly considered in the context of easing routing congestion at the board level, but they also have SI implications. The rule of thumb is that for any signals with rise/fall times of 500 picoseconds or less, external termination resistors cause discontinuities in the signal, so you should always use their on-chip counterparts.

Appendix B
Deep-Submicron Delay Effects 101

Introduction

When one is designing ASICs and ASSPs, the timing effects one needs to account for are extremely complex. As each new technology process node comes online, these effects become ever more horrendous. At some point, which isn’t particularly well defined (or which is defined differently by different people) but which we will take to be somewhere around the 0.5-micron (500-nanometer) node, we start to move into an area rife with what are known as deep-submicron (DSM) delay effects.

The great thing about working with FPGAs, of course, is that the folks who create these devices handle the bulk of the problems associated with DSM delay effects, leaving them largely transparent to the end users (design engineers). On this basis, it’s fair to say that we really don’t need to discuss DSM timing issues here.
On the other hand, this is the sort of thing you tend to hear about all the time, but I’ve never run across an introduction to these effects that is comprehensible to anyone sporting anything less than a size-16 brain with go-faster stripes! It is for this reason that the following overview is presented for your delectation and delight.

The contents of this appendix are abstracted from my book Designus Maximus Unleashed (ISBN 0-7506-9089-5) with the kind permission of the publisher.

The evolution of delay specifications

Way back in the mists of time, sometime after the Jurassic period when dinosaurs ruled the earth (say, around the late 1970s and early 1980s), the lives of ASIC design engineers were somewhat simpler than they are today. Delay specifications for the early (multimicron) technologies were rudimentary at best. Consider the case of a simple 2-input AND gate, for which input-to-output databook delays were originally specified as being identical for all of the inputs and for both rising and falling transitions at the output (Figure B-1i).

Figure B-1. Delay specifications become increasingly complex over time (from the late 1970s to the early 2000s), shown for a 2-input AND gate with inputs a and b and output y:

(i) a,b -> y (LH, HL) = ?ns + ?ns/pF
(ii) a,b -> y (LH) = ?ns + ?ns/pF
     a,b -> y (HL) = ?ns + ?ns/pF
(iii) a -> y (LH) = ?ns + ?ns/pF
      a -> y (HL) = ?ns + ?ns/pF
      b -> y (LH) = ?ns + ?ns/pF
      b -> y (HL) = ?ns + ?ns/pF
(iv) ? -> ? = ? + ?

1975: America. MOS Technology introduces the 6502 microprocessor.

As device geometries shrank, however, delay specifications became increasingly complex. The next step was to differentiate delays for rising and falling output transitions (Figure B-1ii), and this was followed by associating different delays with each input-to-output path (Figure B-1iii). All of these early delays were typically specified in the form ?ns + ?ns/pF.
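As a concrete illustration of the ?ns + ?ns/pF form, the sketch below evaluates a delay as a fixed term plus a load-dependent term, using a small table keyed by input pin and output edge in the style of Figure B-1(iii). The table name, function name, and every number in it are hypothetical; they are not taken from any real databook.

```python
# Hypothetical databook entries in the style of Figure B-1(iii):
# (input pin, output edge) -> (fixed delay in ns, load-dependent delay in ns/pF)
DATABOOK = {
    ("a", "LH"): (1.0, 0.50),
    ("a", "HL"): (1.5, 0.25),
    ("b", "LH"): (1.25, 0.50),
    ("b", "HL"): (1.75, 0.25),
}


def gate_delay_ns(input_pin, edge, load_pf):
    """Delay in ns = fixed delay (ns) + ns-per-pF coefficient * load (pF)."""
    fixed_ns, ns_per_pf = DATABOOK[(input_pin, edge)]
    return fixed_ns + ns_per_pf * load_pf


# Delay of each path when the output drives a 2 pF load.
for (pin, edge) in DATABOOK:
    print(f"{pin} -> y ({edge}): {gate_delay_ns(pin, edge, 2.0):.2f} ns")
```

Collapsing the table to a single entry gives the style of Figure B-1(i), and keeping one entry per output edge gives the style of Figure B-1(ii).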
The first portion (?ns) indicates a fixed delay specified in nanoseconds associated with the gate itself. (Today’s devices are much faster, so their delays would be measured in picoseconds.) This is combined with some additional delay specified as nanoseconds per picofarad (?ns/pF) caused by capacitive loading. (The basic unit of capacitance is the Farad. This was named after the British scientist Michael Faraday, who constructed the first electric motor in 1821.) As we will see, this form of specification simply cannot handle the delay effects characteristic of DSM technologies, not the least in the area of RLC interconnect delays, as discussed below.

A potpourri of definitions

Before plunging headfirst into the mire of DSM delays, it is first necessary to introduce a number of definitions as follows.

Signal slopes

The slope (or slew) of a signal is its rate of change when transitioning from a logic 0 value to a logic 1, or vice versa. An instantaneous transition, which cannot be achieved in the real world, would be considered to represent the maximum possible slope value (Figure B-2).

Figure B-2. The slope of a signal is the time taken to transition between logic values.

The slope of the signal is a function of the output characteristics of the driving gate combined with the characteristics of the interconnect (track) and the input characteristics of any load gate(s).

Input switching thresholds

An input switching threshold is the point at which an input to a load gate first sees a transition as occurring; that is, the point at which the signal presented to the input crosses some threshold value, at which point the downstream gate deigns to notice that something is happening. Input switching thresholds are usually specified as a percentage of the value (voltage differential) between a logic 0 and a logic 1, and each input may have different switching thresholds for rising and falling transitions (Figure B-3).

Figure B-3. Input switching thresholds may differ for rising and falling transitions.

1975: America. Microsoft releases BASIC 2.0 for the Altair 8800 microcomputer.

1976: America. Zilog introduces the Z80 microprocessor.

Intrinsic versus extrinsic delays

The term intrinsic refers to any delay effects that are internal to a logic function, while the term extrinsic refers to any delay effects that are associated with the interconnect (Figure B-4).

Figure B-4. Intrinsic versus extrinsic delays. The intrinsic portion is the gate delay and the extrinsic portion is the interconnect delay (including fan-in). As a share of the total delay (100%): (i) 2.0 micron: intrinsic 66%, extrinsic 34%; (ii) 1.0 micron: intrinsic 34%, extrinsic 66%.

In the early multimicron technologies, intrinsic delays dominated over their extrinsic counterparts. In the case of devices with 2.0-micron geometries, for example, the intrinsic delay typically accounted for approximately two-thirds of the total delay (Figure B-4i). But extrinsic delays became increasingly important with shrinking geometries. By the time that devices with 1.0-micron geometries became available, the relative domination of the intrinsic and extrinsic delays had effectively reversed (Figure B-4ii).

This trend is destined to continue because the geometry of the interconnect is not shrinking at the same rate as the transistors and logic gates. In the case of today’s DSM technologies, extrinsic delays can account for 80 percent or more of the total path delays.

Pn-Pn and Pt-Pt delays

To a large extent, pin-to-pin (Pn-Pn) and point-to-point (Pt-Pt) delays are more modern terms for intrinsic and extrinsic delays, respectively.
A Pn-Pn delay is measured between a transition occurring at the input to a gate and a corresponding transition occurring at the output from that gate, while a Pt-Pt delay is measured between the output from a driving gate to the input of a load gate (Figure B-5).

Figure B-5. Pn-Pn versus Pt-Pt delays.

(It should be noted that circuit board layout designers don’t tend to worry too much about what happens inside devices, which they usually consider to be “black boxes.” The reason for mentioning this is that the board designers may use the term “pin-to-pin” to refer to track delays at the board level.)

1976: America. Steve Wozniak and Steve Jobs introduce the 6502-based Apple 1 microcomputer.

1976: America. Steve Wozniak and Steve Jobs form the Apple Computer Company (on April 1st).

To be more precise, a Pn-Pn delay is the time from a signal on a gate’s input reaching that input’s switching threshold to a corresponding response beginning at its output, while a Pt-Pt delay is the time from the output of a driving gate beginning its transition to a corresponding load gate perceiving that transition as crossing its input switching threshold. Good Grief!

There are a number of reasons why we’re emphasizing the fact that we consider the time when the output begins to respond as marking the end of the Pn-Pn delay and the start of the Pt-Pt delay. In the past, these delays were measured from the time when the output reached 50 percent of the value between a logic 0 and a logic 1. This was considered to be acceptable, because load gates were all assumed to have input switching thresholds of 50 percent. But consider a rising transition on the output and assume that the load gate’s input switching threshold for a rising transition is 30 percent.
If we’re assuming that delays are measured from the time the output crosses its 50 percent value, then it’s entirely possible that the load gate will see the transition before we consider the output to have changed. Also, when we come to consider mixed-signal (analog and digital) simulation, then the only meaningful time to pass an event from a gate’s output transitioning in the digital realm into the analog domain is the point at which the gate’s output begins its transition.

State and slope dependency

Any attribute associated with an input to a gate (including a Pn-Pn delay) that is a function of the logic values on other inputs to that gate is said to be state dependent. Similarly, any attribute associated with an input to a gate (including a Pn-Pn delay) that is a function of the slope of the signal presented to that input is said to be slope dependent. These state- and slope-dependency definitions might not appear to make much sense at the moment, but they’ll come to the fore in the not-so-distant future as we progress through this chapter.

Alternative interconnect models

As the geometries of structures on the silicon shrink and the number of gates in a device increases, interconnect delays assume a greater significance, and increasingly sophisticated algorithms are required to accurately represent the effects associated with the interconnect as follows.

The lumped-load model

As was noted earlier, the Pn-Pn gate delays in early multimicron technologies dominated over Pt-Pt interconnect delays. Additionally, the rise and fall times associated with signals were typically greater than the time taken for the signals to propagate through the interconnect. In these cases, a representation of the interconnect known as the lumped-load model was usually sufficient (Figure B-6).

Figure B-6. The lumped-load interconnect model (gate g1 driving load gates g2 and g3, with the track and load inputs represented by a single equivalent capacitance).
The idea here is that all of the capacitances associated with the track and with the inputs to the load gates are added together to give a single, equivalent capacitance. This capacitance is then multiplied by the drive capability of the driving gate (specified in terms of nanoseconds per picofarad, or equivalent) to give a resulting Pt-Pt delay. The lumped-load model is characterized by the fact that all of the nodes on the track are considered to commence transitioning at the same time and with the same slope. This model may also be referred to as a pure RC model.

The distributed RC model

The shrinking device geometries of the mid-1980s began to mandate a more accurate representation of the interconnect than was provided by the lumped-load model. Thus, the distributed RC model was born (where R and C represent resistance and capacitance, respectively) (Figure B-7).

Figure B-7. The distributed RC interconnect model.

In the distributed RC model, each segment of the track is treated as an RC network. The distributed RC model is characterized by the fact that all of the nodes on the track are considered to commence transitioning at the same time but with different slopes. Another way to view this is that the signal’s edge is collapsing (or deteriorating) as it propagates down the track.

The pure LC model

At the circuit board level, high-speed interconnects start to take on the characteristics of transmission lines. This pure LC model (where L and C represent inductance and capacitance, respectively) can be represented as a sharp transition propagating down the track as a wavefront (Figure B-8).

Figure B-8. The pure LC interconnect model.
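Stepping back to the lumped-load model for a moment, its arithmetic is simple enough to sketch in a few lines of Python. This is only an illustration of the "add the capacitances, multiply by the drive capability" rule described above; the function name and the capacitance and drive values are hypothetical and are not taken from any real cell library.

```python
def lumped_load_delay_ns(track_cap_pf, load_input_caps_pf, drive_ns_per_pf):
    """Pt-Pt delay under the lumped-load model.

    All capacitances on the net (the track plus every load-gate input) are
    summed into one equivalent capacitance, which is then multiplied by the
    driving gate's drive capability (here in nanoseconds per picofarad).
    """
    equivalent_cap_pf = track_cap_pf + sum(load_input_caps_pf)
    return equivalent_cap_pf * drive_ns_per_pf


# Hypothetical net: 0.5 pF of track driving two load inputs of 0.25 pF each,
# from a gate whose drive capability is 2.0 ns/pF.
delay_ns = lumped_load_delay_ns(0.5, [0.25, 0.25], 2.0)
print(f"Lumped-load Pt-Pt delay: {delay_ns} ns")
```

Note that the model yields a single delay for the whole net, which is the code-level counterpart of every node on the track being considered to start transitioning at the same time and with the same slope.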
Pure transmission line effects do not occur inside silicon chips, but large submicron devices do begin to exhibit certain aspects of these delay effects, as discussed below.

The RLC model

In the case of large devices with DSM geometries, the speed of the signals coupled with relatively long traces results in the interconnect exhibiting some transmission-line-type effects. However, the resistive nature of on-chip interconnect does not support pure LC effects; instead, these traces may be described as exhibiting RLC effects (Figure B-9). The RLC model is characterized as a combination of a discrete wavefront, supplied by the interconnect’s LC constituents, and a collapsing (or deteriorating) signal edge caused by the interconnect’s RC components.

Figure B-9. The RLC interconnect model.

DSM delay effects

Path-specific Pn-Pn delays

Each input-to-output path typically has its own Pn-Pn delay. In the case of a 2-input OR gate, for example, a change on input a causing a transition on output y (Figure B-10a) would have a different delay from that of a change on input b causing a similar transition on output y (Figure B-10b).

Figure B-10. Path-specific Pn-Pn delays: (i) input a to output y; (ii) input b to output y.

Similarly, each rising and falling transition at the output typically has its own Pn-Pn delay. In the case of our OR gate, for example, a change on input a causing a rising transition on output y would have a different delay from that of a change on input a causing a falling transition on output y.
Note that this example assumes input switching thresholds of 50 percent, and remember that Pn-Pn delays are measured from the time when a signal presented to an input crosses that input’s switching threshold to the time when the output first begins to respond.

Path- and transition-specific Pn-Pn delays are not limited to DSM technologies, and they should come as no surprise, but they are presented here to prepare the stage for the horrors that are to come.

Threshold-dependent Pn-Pn delays

Pn-Pn delays depend on the switching thresholds associated with inputs, at least to the extent that the delay through the gate doesn’t actually commence until the signal presented to the input crosses the threshold. For example, if the input switching threshold for a rising transition on input a were 30 percent of the value between the logic 0 and logic 1 levels (Figure B-11a), then the input would see the transition earlier than it would if its input switching threshold were 70 percent (Figure B-11b).

Additionally, the slope of a signal being presented to an input affects the time that signal crosses the input switching threshold. For the purposes of presenting a simple example, let’s assume that input a has a switching threshold of 50 percent. If a signal with a steep slope is presented to input a (Figure B-12a), then the input will see the signal as occurring earlier than it would if the slope of the signal were decreased (Figure B-12b). Although this affects the time at which the Pn-Pn delay commences, it is NOT the same as the slope-dependent Pn-Pn delays presented in the next section.
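The two effects just described (a different threshold, or a different slope, moving the instant at which the input "sees" the transition) can be sketched with a little arithmetic. The following Python fragment assumes an idealized linear ramp from logic 0 to logic 1, which is a simplification of my own; the function name and the numeric values are purely illustrative.

```python
def crossing_time(t_start, rise_time, threshold):
    """Time at which a linear 0-to-1 ramp crosses an input switching threshold.

    t_start   -- time at which the ramp begins to rise
    rise_time -- time the ramp takes to go from logic 0 to logic 1
    threshold -- switching threshold as a fraction (0.3 means 30 percent)
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be a fraction between 0 and 1")
    return t_start + threshold * rise_time


# Same edge, different thresholds: the 30 percent input sees the
# transition before the 70 percent input does.
early = crossing_time(0.0, 4.0, 0.3)
late = crossing_time(0.0, 4.0, 0.7)
print("30% threshold crossed before 70% threshold:", early < late)

# Same 50 percent threshold, different slopes: the steeper edge is
# seen sooner than the slower one.
fast = crossing_time(0.0, 1.0, 0.5)
slow = crossing_time(0.0, 4.0, 0.5)
print("Fast edge seen before slow edge:", fast < slow)
```

In both cases only the starting point of the Pn-Pn delay moves; the delay through the gate itself is untouched in this sketch.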
Figure B-11. Threshold-dependent Pn-Pn delays: (i) 30% switching threshold on input a; (ii) 70% switching threshold on input a.

Figure B-12. The slope of an incoming signal affects the time at which the input sees that signal (50% switching threshold in both cases): (i) fast transition presented to input a; (ii) slower transition presented to input a.

Slope-dependent Pn-Pn delays

Speaking of which … the previous example was somewhat simplistic in that it showed two Pn-Pn delays as being identical, irrespective of the slope of the incoming signal. Some vendors of computer-aided design tools refer to the previous case as “slope dependency,” but this is not a correct usage of the term. As it happens, a variety of delay effects in DSM technologies may be truly slope dependent, which means that they may be directly modified by the slope of an incoming signal.

Let’s consider what happens from the point at which the signal presented to an input crosses that input’s switching threshold. The Pn-Pn delay from this point may be a function of the rate of change of the incoming signal. For example, a fast slope presented to the input may result in a short Pn-Pn delay (Figure B-13a), while a slower slope may result in a longer delay (Figure B-13b).

Figure B-13. Slope-dependent Pn-Pn delays (50% switching threshold in both cases): (i) fast transition presented to input a; (ii) slower transition presented to input a.

Actually, the effect illustrated in Figure B-13, in which a decreasing slope causes an increasing Pn-Pn delay, is only one possible scenario. This particular case applies to gates or technologies where the predominant effect is that the switching speeds of the transistors forming the gate are directly related to the rate of change of charge applied to their inputs.
By comparison, in the case of certain technologies, a decreasing slope actually results in faster Pn-Pn delays (as measured from the switching threshold of the input). This latter case results from the fact that a sufficiently long slope permits internal transistors to become precharged almost to the point of switching. Thus, when the input signal actually crosses the input’s switching threshold, the gate is poised at the starting blocks and appears to switch faster than it would if a sharp edge had been applied to the input.

To further increase your pleasure and double your fun, both effects may be present simultaneously. In this case, applying a sharp edge to the input may result in a certain Pn-Pn delay, and gradually decreasing the slope of the applied signal could cause a gradual increase in the Pn-Pn delay. At some point, however, further decreasing the slope of the applied input will cause a reduction in the Pn-Pn delay, possibly to the point where it becomes smaller than the Pn-Pn delay associated with our original sharp edge!⁴

State-dependent Pn-Pn delays

In addition to being slope dependent, Pn-Pn delays are often state dependent, which means that they depend on the logic values of other inputs (Figure B-14).

Figure B-14. State-dependent Pn-Pn delays (full adder with inputs a, b, and ci and output co): (i) b = 1, ci = 0; (ii) b = 0, ci = 1.

This example illustrates two cases in which a signal presented to the a input causes an identical response (in terms of logic values) at the co output. However, even assuming that the slopes of the signals presented to a and the switching thresholds on a are identical in both cases, the Pn-Pn delays may be different due to the logic values present on inputs b and ci.
4 And there are those who would say that electronics is dull and boring; go figure!

Path-dependent drive capability

This is where life really starts to get interesting (trust me, have I ever lied to you before?).⁵ Up to this point, we have only considered effects that impact Pn-Pn delays through a gate, but many of these effects also influence the gate’s ability to drive signals at its output(s). For example, the driving capability of a gate may be path dependent (Figure B-15).

5 Don’t answer that!

Figure B-15. Path-dependent drive capability: (i) input a causes fast transition on y; (ii) input b causes slower transition on y.

In this case, in addition to the fact that inputs a and b have different Pn-Pn delays, the driving capability of the gate (and hence the slope of the output signal) is dependent on which input caused the output transition to occur. This phenomenon was originally associated only with MOS technologies and was not generally linked to bipolar technologies such as TTL. As the plunge into DSM continues, however, many of the more esoteric delay effects are beginning to manifest themselves across technologies with little regard for traditional boundaries.

Slope-dependent drive capability

In addition to being dependent on which input causes an output transition to occur (as discussed in the previous point), the driving capability of the gate (and hence the slope of the output signal) may also be dependent on the slope of the signal presented to the input. For example, a fast transition on input a may cause a fast slope at the output (Figure B-16a), while a slower transition on the same input may impact the gate’s driving capability and cause the slope of the output signal to decrease (Figure B-16b). Are we having fun yet?
Figure B-16. Slope-dependent drive capability (50% switching threshold in both cases): (i) fast transition on a gives higher drive on y; (ii) slow transition on a gives lower drive on y.

State-dependent drive capability

Yet another factor that can influence the drive capability of an output is the logic values present on inputs other than the one actually causing the output transition to occur. This effect is known as state-dependent drive capability (Figure B-17).

Figure B-17. State-dependent drive capability (full adder with inputs a, b, and ci and output co): (i) b = 1, ci = 0; (ii) b = 0, ci = 1.

Figure B-17 illustrates two cases in which a signal presented to the a input causes an identical response (in terms of logic values) at the co output. However, even assuming that the slopes of the signals presented to a and the switching thresholds on a are identical in both cases, the driving capability of the gate (and hence the slope of the output signal) may be different due to the logic values present on inputs b and ci.

State-dependent switching thresholds

As you doubtless observed, the previous point on state-dependent drive capability included the phrase “assuming that the input switching thresholds on input a are identical in both cases.” If this caused a few alarm bells to start ringing in your mind, then, if nothing else, at least these discussions are serving to hone your abilities to survive the dire and dismal depths of the DSM domain. The point is that by some strange quirk of fate, an input’s switching threshold may be state dependent; that is, it may depend on the logic values present on other inputs (Figure B-18).

Figure B-18. State-dependent input switching thresholds: (i) b = 0, ci = 1, switching threshold on input a = 30%; (ii) b = 1, ci = 0, switching threshold on input a = 70%.
In this example, the switching threshold of input a (the point at which this input sees a transition as occurring) depends on the logic values presented to inputs b and ci.

State-dependent terminal parasitics

In addition to an input’s switching threshold being state dependent, further characteristics associated with that input (such as its parasitic values) may also depend on the logic values presented to other inputs. For example, consider a 2-input OR gate (Figure B-19).

Figure B-19. State-dependent terminal parasitics (gate g1 driving input a of 2-input OR gate g2).

The terminal capacitance of input g2.a (as seen by the driving output g1.y) may depend on the logic value presented to input g2.b. If input g2.b is a logic 0, a transition on input g2.a will cause the output of the OR gate to switch. In this case, g1.y (the output of the gate driving g2.a) will see a relatively high capacitance. However, if input g2.b is a logic 1, a transition on input g2.a will not cause the output of the OR gate to switch. In this case, g1.y will see a relatively small capacitance.

At this point you may be asking, “In the case where the OR gate isn’t going to switch, do we really care if the parasitic capacitance on input a is different? Can’t we just set the value of the capacitance to be that for when the OR gate will switch?” In fact, this would be okay if the output g1.y were only driving input g2.a, but problems obviously arise if we modify the circuit such that g1.y starts to drive two or more load gates.

This particular effect first manifested itself in ECL technologies. In fact, as far back as the late 1980s, I was made aware of one ECL gate-array technology in which the terminal capacitance of a load gate (as perceived by the driving gate) varied by close to 100 percent due to this form of state dependency.
But this effect is no longer confined to ECL; once again, delay effects are beginning to manifest themselves across technologies with scant regard for traditional boundaries as we sink further into the DSM domain.

The effect of multi-input transitions on Pn-Pn delays

Prior to this point, we have only considered cases in which a signal presented to a single input causes an output response. Not surprisingly, the picture becomes more complex when multi-input transitions are considered. For example, take the case of a 2-input OR gate (Figure B-20).

Figure B-20. The effect of multi-input transitions on Pn-Pn delays: (i) input a transitions in isolation; (ii) inputs a and b transition simultaneously.

For the sake of simplicity, we will assume that both the a and b inputs are fully symmetrical; that is, both have identical input switching thresholds and Pn-Pn delays.

First, consider the case where a transition applied to a single input (for example, input a) causes a response at the output (Figure B-20a). The resulting Pn-Pn delay is the one that is usually specified in the databook for this cell. However, if both inputs transition simultaneously (Figure B-20b), the resulting Pn-Pn delay may be reduced to close to 50 percent of the value specified in the databook.

These two cases (a single input transition occurring in isolation versus multi-input transitions occurring simultaneously) provide us with worst-case endpoints. However, it is also necessary to consider those cases where the inputs don’t transition simultaneously, but do transition close together. For example, take the OR gate shown in Figure B-20 and assume that both inputs are initially at logic 0.
Now assume that input a is presented with a rising transition, which initiates the standard databook Pn-Pn delay, but before this delay has fully completed, input b is also presented with a rising transition. The result is that the actual Pn-Pn delay could occur anywhere between the two worst-case endpoints.

The effect of multi-input transitions on drive capability

In addition to modifying Pn-Pn delays, multi-input transitions may also affect the driving capability of the gate, and hence the slope of the output signal (Figure B-21).

Figure B-21. The effect of multi-input transitions on drive capability: (i) input a transitions in isolation; (ii) inputs a and b transition simultaneously.

All of these multi-input transition effects can be estimated with simple linear approximations. Unfortunately, today’s verification tools (such as STA and digital logic simulation) are not well equipped to perform on-the-fly calculations of this type.

Reflected parasitics

With the technologies of yesteryear, it was fairly safe to assume that parasitic effects had limited scope and were generally only visible to logic gates in their immediate vicinity. For example, consider the three gates shown in Figure B-22.

Figure B-22. Reflected parasitics (gate g1 drives gate g2 through wire w1, and g2 drives gate g3 through wire w2).

Traditionally, it was safe to assume that gate g2 would buffer the output of g1 from wire w2 and gate g3. Thus, the output g1.y would only see any parasitics such as the capacitances associated with wire w1 and gate terminal g2.a.

These assumptions become less valid in the DSM domain. Returning to the three gates shown in Figure B-22, it is now possible for some proportion of the parasitics associated with wire w2 and gate terminal g3.a to be reflected back through gate g2 and made visible to output g1.y.
Additionally, if gate g2 were a multi-input gate such as a 2-input XOR, then the proportion of these parasitics reflected back through g2 might well be state dependent; that is, they might vary depending on the logic value presented to the other input of g2.

At the time of this writing, reflected parasitics remain relatively low-order effects in the grander scheme of things. If history has taught us anything, however, it is to be afraid (very afraid), because it’s not beyond the bounds of possibility that these effects will assume a much greater significance as we continue to meander our way through new technology nodes.

Summary

The majority of the delay effects introduced in this chapter have always been present, even in the case of multimicron technologies, but many of these effects have traditionally been fourth or third order and were therefore considered to be relatively insignificant. As device geometries plunged through the 0.5-micron barrier to 0.35 microns, some of these effects began to assume second- and even first-order status, and their significance continues to increase with new technology nodes operating at lower voltage levels.

Unfortunately, many design verification tools are not keeping pace with silicon technology. Unless these tools are enhanced to account fully for DSM effects, designers will be forced to use restrictive design rules to ensure that their designs actually function. Thus, design engineers may find it impossible to fully realize the potential of the new and exciting technology developments that are becoming available.
Appendix C

Linear Feedback Shift Registers 101

The Ouroboros

The Ouroboros, a symbol of a serpent or dragon devouring its own tail and thereby forming a circle, has been employed by a variety of ancient cultures around the world to depict eternity or renewal.¹ The equivalent of the Ouroboros in the electronics world would be the linear feedback shift register (LFSR), in which outputs from a standard shift register are cunningly manipulated and fed back into its input in such a way as to cause the function to cycle endlessly through a sequence of patterns.

1 Not to be confused with the Amphisbaena, a serpent in classical mythology having a head at each end and being capable of moving in either direction.

The contents of this appendix are abstracted from my book Bebop to the Boolean Boogie (An Unconventional Guide to Electronics), Edition 2 (ISBN 0-7506-7543-8), with the kind permission of the publisher.

LFSR is pronounced by spelling it out as “L-F-S-R.”

Many-to-one implementations

LFSRs are simple to construct and are useful for a wide variety of applications. One of the more common forms of LFSR is formed from a simple shift register with feedback from two or more points, called taps, in the register chain (Figure C-1).

Figure C-1. LFSR with XOR feedback path: (a) symbol; (b) implementation (registers dff0, dff1, and dff2 with outputs q0, q1, and q2, a common clock, and an XOR gate driving the d input of dff0).

The taps in this example are at bit 0 and bit 2, and an easy way to represent this is to use the notation [0,2]. All of the register elements share a common clock input, which is omitted from the symbol for reasons of clarity. The data input to the LFSR is generated by XOR-ing or XNOR-ing the tap bits, while the remaining bits function as a standard shift register.

The sequence of values generated by an LFSR is determined by its feedback function (XOR versus XNOR) and tap selection.
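For readers who like to see such things in software, here is a minimal Python sketch of a many-to-one LFSR. It assumes the arrangement suggested by Figure C-1(b): bit 0 is the register that receives the feedback, and data shifts from bit 0 toward the highest-numbered bit. The function names, and the choice to treat a multi-input XNOR as the complement of the XOR, are my own assumptions rather than anything specified in the text.

```python
def lfsr_step(state, taps, n_bits, xnor=False):
    """Advance a many-to-one LFSR by one clock.

    state  -- current register contents as an integer (bit i = register i)
    taps   -- bit positions feeding the XOR (or XNOR) gate, e.g. [0, 2]
    n_bits -- number of register bits
    """
    mask = (1 << n_bits) - 1
    feedback = 0
    for t in taps:
        feedback ^= (state >> t) & 1      # XOR the tap bits together
    if xnor:
        feedback ^= 1                     # XNOR taken as complement of XOR
    # Shift every bit up one place and load the feedback into bit 0.
    return ((state << 1) & mask) | feedback


def lfsr_sequence(seed, taps, n_bits, xnor=False):
    """Return the list of states visited, starting at seed, until it repeats."""
    states = [seed]
    state = lfsr_step(seed, taps, n_bits, xnor)
    while state != seed and len(states) < (1 << n_bits):
        states.append(state)
        state = lfsr_step(state, taps, n_bits, xnor)
    return states


if __name__ == "__main__":
    for taps in ([0, 2], [1, 2]):
        seq = lfsr_sequence(0b001, taps, 3)
        print(taps, [format(s, "03b") for s in seq])
```

Under these assumptions, both 3-bit tap selections cycle through seven distinct nonzero values, but in different orders, and an XOR-based register seeded with all 0s simply stays there.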
For example, consider two 3-bit LFSRs using an XOR feedback function, the first with taps at [0,2] and the second with taps at [1,2] (Figure C-2).

Figure C-2. Comparison of alternative tap selections.

Both LFSRs start with the same initial value, but due to the different taps, their sequences rapidly diverge as clock pulses are applied. In some cases, an LFSR will end up cycling round a loop comprising a limited number of values. However, both of the LFSRs shown in Figure C-2 are said to be of maximal length because they sequence through every possible value (excluding all of the bits being 0) before returning to their initial values.

A binary field with n bits can assume 2^n unique values, but a maximal-length LFSR with n register bits will only sequence through (2^n - 1) values. For example, a 3-bit field can support 2^3 = 8 values, but the 3-bit LFSRs in Figure C-2 sequence through only (2^3 - 1) = 7 values. This is because LFSRs with XOR feedback paths will not sequence through the “forbidden” value where all the bits are 0, while their XNOR equivalents will not sequence through the value where all the bits are 1 (Figure C-3).²

Figure C-3. Comparison of XOR versus XNOR feedback paths.

2 If an LFSR somehow finds itself containing its “forbidden value,” it will lock up in this value until some external event occurs to extract it from its predicament.

More taps than you know what to do with

Each LFSR supports a number of tap combinations that will generate maximal-length sequences. The problem is weeding out the ones that do from the ones that don’t, because badly chosen taps can result in the register entering a loop comprising only a limited number of states.
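One brute-force way to do the weeding is simply to clock each candidate LFSR until it returns to its starting value and count the steps. The Python sketch below does exactly that for XOR feedback; it is not the author's program, the function names are hypothetical, and it assumes the same bit ordering as Figure C-1 (feedback into bit 0, with the most significant bit always used as a tap).

```python
from itertools import combinations


def period(taps, n_bits, seed=1):
    """Number of clocks before an XOR-feedback LFSR returns to its seed.

    Returns None if the seed is not revisited within 2**n_bits clocks.
    """
    mask = (1 << n_bits) - 1
    tap_mask = sum(1 << t for t in taps)
    state = seed
    for count in range(1, mask + 2):
        # The parity of the tapped bits is the XOR of those bits.
        feedback = bin(state & tap_mask).count("1") & 1
        state = ((state << 1) & mask) | feedback
        if state == seed:
            return count
    return None


def maximal_taps(n_bits, n_taps):
    """All n_taps-tap selections (including the top bit) of maximal length."""
    top = n_bits - 1
    found = []
    for lower in combinations(range(top), n_taps - 1):
        taps = lower + (top,)
        if period(taps, n_bits) == (1 << n_bits) - 1:
            found.append(taps)
    return found


if __name__ == "__main__":
    print("3-bit, 2 taps:", maximal_taps(3, 2))
    print("10-bit, 2 taps:", maximal_taps(10, 2))
```

This exhaustive approach is perfectly adequate for small registers, but the run time grows with both the number of tap combinations and the loop length, so it becomes slow well before 32 bits.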
Purely for my own amusement, I created a simple C program to determine the taps for maximal-length LFSRs with 2 to 32 bits. These values are presented for your delectation and delight in Figure C-4 (the * annotation indicates a sequence whose length is a prime number).

# Bits  Loop Length        Taps
 2      3 *                [0,1]
 3      7 *                [0,2]
 4      15                 [0,3]
 5      31 *               [1,4]
 6      63                 [0,5]
 7      127 *              [0,6]
 8      255                [1,2,3,7]
 9      511                [3,8]
10      1,023              [2,9]
11      2,047              [1,10]
12      4,095              [0,3,5,11]
13      8,191 *            [0,2,3,12]
14      16,383             [0,2,4,13]
15      32,767             [0,14]
16      65,535             [1,2,4,15]
17      131,071 *          [2,16]
18      262,143            [6,17]
19      524,287 *          [0,1,4,18]
20      1,048,575          [2,19]
21      2,097,151          [1,20]
22      4,194,303          [0,21]
23      8,388,607          [4,22]
24      16,777,215         [0,2,3,23]
25      33,554,431         [2,24]
26      67,108,863         [0,1,5,25]
27      134,217,727        [0,1,4,26]
28      268,435,455        [2,27]
29      536,870,911        [1,28]
30      1,073,741,823      [0,3,5,29]
31      2,147,483,647 *    [2,30]
32      4,294,967,295      [1,5,6,31]

Figure C-4. Taps for maximal-length LFSRs with 2 to 32 bits.

The taps are identical for both XOR-based and XNOR-based LFSRs, although the resulting sequences will, of course, differ. As was previously noted, alternative tap combinations may also yield maximal-length LFSRs, although once again the resulting sequences will vary. For example, in the case of a 10-bit LFSR, there are two 2-tap combinations that result in a maximal-length sequence: [2,9] and [6,9]. There are also twenty 4-tap combinations, twenty-eight 6-tap combinations, and ten 8-tap combinations that satisfy the maximal-length criteria.³

VIP! It’s important to note that the taps shown in Figure C-4 may not be the best ones for the task you have in mind with regard to attributes such as being primitive polynomials and having their sequences evenly distributed in “random” space; they just happened to be the ones I chose out of the results I generated. If you are using LFSRs for real-world tasks, one of the best sources for determining optimum tap points is the book Error-Correcting Codes by W.
Wesley Peterson and E. J. Weldon Jr. (published by MIT Press). Also, the CRC utility referenced under the “Miscellaneous Stuff” section at the end of chapter 25 might be of some interest.

3 A much longer table (covering LFSRs with up to 168 bits) is presented in application note XAPP052 from Xilinx.

One-to-many implementations

Consider the case of an 8-bit LFSR, for which the minimum number of taps that will generate a maximal-length sequence is four. In the real world, XOR gates only have two inputs, so a 4-input XOR function has to be created using three XOR gates arranged as two levels of logic. Even in those cases where an LFSR does support a minimum of two taps, there may be special reasons for you to use a greater number such as eight (which would result in three levels of XOR logic).

However, increasing the levels of logic in the combinational feedback path can negatively impact the maximum clocking frequency of the function. One solution is to transpose the many-to-one implementations discussed above into their one-to-many counterparts (Figure C-5).

Figure C-5. Many-to-one versus one-to-many implementations (8-bit registers, bits 0 to 7): (a) many-to-one implementation; (b) one-to-many implementation.

The traditional many-to-one implementation for the eight-bit LFSR has taps at [1,2,3,7]. To convert this into its one-to-many counterpart, the most significant tap, which is always the most significant bit (bit 7 in this case), is fed back directly into the least significant bit. This bit is also individually XOR’d with the other original taps (bits [1,2,3] in this example).

Although both of these approaches result in maximal-length LFSRs, the actual sequences of values will differ between them.
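The two arrangements can be compared side by side in software. The following Python sketch models the 8-bit example with taps at [1,2,3,7]; it assumes that, in the one-to-many form, each XOR gate sits at the output of a tapped bit and feeds the next register along, which is my reading of Figure C-5(b) rather than something the text states explicitly. All names are hypothetical.

```python
N_BITS = 8
MASK = (1 << N_BITS) - 1
TAPS = (1, 2, 3, 7)


def many_to_one_step(state):
    """All tap bits are XORed together and the result is loaded into bit 0."""
    feedback = 0
    for t in TAPS:
        feedback ^= (state >> t) & 1
    return ((state << 1) & MASK) | feedback


def one_to_many_step(state):
    """Bit 7 feeds bit 0 directly and is XORed into the outputs of bits 1-3."""
    msb = (state >> (N_BITS - 1)) & 1
    shifted = (state << 1) & MASK
    if msb:
        shifted |= 1                      # bit 7 goes straight into bit 0
        for t in TAPS[:-1]:
            shifted ^= 1 << (t + 1)       # output of bit t, XOR bit 7, into bit t+1
    return shifted


def period(step, seed=1):
    """Count clocks until the register returns to its seed value."""
    state, count = step(seed), 1
    while state != seed and count <= MASK:
        state, count = step(state), count + 1
    return count


if __name__ == "__main__":
    print("many-to-one period:", period(many_to_one_step))
    print("one-to-many period:", period(one_to_many_step))
```

In the one-to-many function, each new register value depends on at most one 2-input XOR, whereas the many-to-one function has to combine four bits before the result can be loaded into bit 0.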
But the main point is that using the one-to-many technique means that there is never more than one level of combinational logic in the feedback path, irrespective of the number of taps being employed. Of course, FPGAs have the additional consideration that a 4-input LUT will have the same delay for 2-, 3-, and 4-input XOR trees. In this case, the one-to-many approach only starts to offer advantages when you are dealing with an LFSR that requires more than four taps.

Seeding an LFSR

One quirk with XOR-based LFSRs is that, if one happens to find itself in the all-0s value, it will happily continue to shift all 0s indefinitely (similarly for XNOR-based LFSRs and the all-1s value). This is of particular concern when power is first applied to the circuit. Each register bit can randomly initialize containing either a logic 0 or a logic 1, and the LFSR can therefore “wake up” containing its “forbidden” value. For this reason, it is necessary to initialize LFSRs with a seed value.

An interesting aspect of an LFSR based on an XNOR feedback path is that it does allow an all-0s value. This means that a common clear signal to all of the LFSR’s registers can be used to provide an XNOR LFSR with a seed value of all 0s.

One method for loading a specific seed value is to use registers with reset or set inputs. A single control signal can be connected to the reset inputs on some of the registers and the set inputs on others. When this control signal is placed in its active state, the LFSR will load with a hard-wired seed value. With regard to certain applications, however, it is desirable to be able to vary the seed value. One technique for achieving this is to include a multiplexer at the input to the LFSR (Figure C-6).

Figure C-6. Circuit for loading alternative seed values (a multiplexer, controlled by a select signal, chooses between the seed-data input and the XOR feedback path).
When the multiplexer’s seed-data input is selected, the device functions as a standard shift register, and any desired seed value can be loaded. After loading the seed value, the feedback path is selected and the device returns to its LFSR mode of operation.

FIFO applications

The fact that an LFSR generates an unusual sequence of values is irrelevant in many applications. For example, let’s consider a 4-bit-wide, 16-word-deep FIFO memory function (Figure C-7).

Figure C-7. A 16-word FIFO function. (Blocks shown: a read pointer and a write pointer, each driving a 4:16 decoder; a 16-word RAM array fed by Data-In[3:0]; an output register driving Data-Out[3:0]; and control logic with write, read, reset, full, and empty signals.)

In addition to some control logic and an output register, the FIFO contains a read pointer and a write pointer. These pointers are 4-bit registers whose outputs are processed by 4:16 decoders to select one of the 16 words in the memory array. The read and write pointers chase each other around the memory array in an endless loop. An active edge on the write input causes any data on the input bus to be written into the word pointed to by the write pointer; the write pointer is then incremented to point to the next empty word. Similarly, an active edge on the read input causes the data in the word pointed to by the read pointer to be copied into the output register; the read pointer is then incremented to point to the next word containing data.⁴ (There would also be some logic to detect when the FIFO is full or empty, but this is irrelevant to our discussions here.)

⁴ These discussions assume write-and-increment and read-and-increment techniques; however, some FIFOs employ an increment-and-write and increment-and-read approach.

The write and read pointers for a 16-word FIFO are often implemented using 4-bit binary counters.
However, a moment’s reflection reveals that there is no intrinsic advantage to a binary sequence for this particular application, and the sequence generated by a 4-bit LFSR will serve equally well. In fact, the two functions operate in a very similar manner as is illustrated by their block diagrams (Figure C-8).

Figure C-8. Binary counter versus LFSR. (a) 4-bit binary counter; (b) 4-bit LFSR. (In each diagram, feedback logic generates the next value from the current value held in clocked registers.)

It doesn’t take more than a few seconds before we realize that the only difference between these two diagrams is their names. The point is that the combinational feedback logic for the 4-bit binary counter requires a number of AND and OR gates, while the feedback logic for the 4-bit LFSR consists of a single, 2-input XOR gate. This means that the LFSR requires fewer tracks and is more efficient in terms of silicon real estate.

Additionally, the LFSR’s feedback only passes through a single level of logic, while the binary counter’s feedback passes through multiple levels of logic. This means that the new data value is available sooner for the LFSR, which can therefore be clocked at a higher frequency. These differentiations become even more pronounced for FIFOs with more words requiring pointers with more bits. Thus, LFSRs provide an interesting option for the discerning designer of FIFOs.⁵

Modifying LFSRs to sequence 2ⁿ values

The sole downside to using 4-bit LFSRs in the FIFO scenario above is that they will sequence through only 15 values (2⁴ – 1), as compared to the binary counter’s sequence of 16 values (2⁴). Depending on the application, the design engineers may not regard this as a major problem, especially in the case of larger FIFOs. However, if it is required for an LFSR to sequence through every possible value, then there is a simple solution (Figure C-9).

Figure C-9. LFSR modified to sequence 2ⁿ values.
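The modification in Figure C-9 can be sketched in software as follows. This is only an illustrative model: it assumes a 4-bit XOR LFSR with taps at bits 0 and 3 (one maximal-length choice) that shifts toward bit 3, plus a NOR of every bit except the MSB whose output is XOR'd into the normal feedback. The function names are hypothetical.

```python
WIDTH = 4
MASK = (1 << WIDTH) - 1
TAPS = (0, 3)  # assumed maximal-length taps for a 4-bit XOR LFSR


def step(state, modified=True):
    """Advance the register by one clock, shifting toward the MSB (bit 3)."""
    feedback = 0
    for tap in TAPS:
        feedback ^= (state >> tap) & 1
    if modified:
        lower_bits = state & (MASK >> 1)  # every bit except the MSB
        nor = 1 if lower_bits == 0 else 0
        feedback ^= nor  # the NOR output inverts the usual XOR feedback
    return ((state << 1) & MASK) | feedback


def cycle(seed, modified):
    """Collect the values visited before the register returns to its seed."""
    visited = [seed]
    state = step(seed, modified)
    while state != seed:
        visited.append(state)
        state = step(state, modified)
    return visited
```

In this model, the unmodified register visits 15 values from a seed of 1 and never reaches 0. With the NOR term in place all 16 values appear: the value 8 (only the MSB set) is followed by 0, which is then followed by 1, the value that would normally have come straight after 8.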
For the value where all of the bits are 0 to appear, the preceding value must have comprised a logic 1 in the most significant bit (MSB)⁶ and logic 0s in the remaining bit positions. In an unmodified LFSR, the next clock would result in a logic 1 in the least significant bit (LSB) and logic 0s in the remaining bit positions. However, in the modified LFSR shown in Figure C-9, the output from the NOR is a logic 0 for every case but two: the value preceding the one where all the bits are 0 and the value where all the bits are 0. These two values force the NOR’s output to a logic 1, which inverts the usual output from the XOR. This in turn causes the sequence first to enter the all-0s value and then to resume its normal course. (In the case of LFSRs with XNOR feedback paths, the NOR can be replaced with an AND, which causes the sequence to cycle through the value where all of the bits are 1.)

⁵ So do Gray Counters, but that will have to be a topic for another time.

Accessing the previous value

In some applications, it is required to make use of a register’s previous value. For example, in certain FIFO implementations, the “full” condition is detected when the write pointer is pointing to the location preceding the location pointed to by the read pointer.⁷ This implies that a comparator must be used to compare the current value in the write pointer with the previous value in the read pointer. Similarly, the “empty” condition may be detected when the read pointer is pointing to the location preceding the location pointed to by the write pointer. This implies that a second comparator must be used to compare the current value in the read pointer with the previous value in the write pointer.

In the case of binary counters (assuming that, for some reason, we decided to use them for a FIFO application), there are two techniques by which the previous value in the sequence may be accessed.
The first requires the provision of an additional set of shadow registers. Every time the counter is incremented, its current contents are first copied into the shadow registers. Alternatively, a block of combinational logic can be used to decode the previous value from the current value. Unfortunately, both of these techniques involve a substantial overhead in terms of additional logic. By comparison, LFSRs inherently remember their previous value. All that is required is the addition of a single register bit appended to the MSB (Figure C-10).

⁶ As is often the case with any form of shift register, the MSB in these examples is taken to be on the right-hand side of the register and the LSB is taken to be on the left-hand side (this is opposite to the way we usually do things).

⁷ Try saying that quickly!

MSB and LSB are pronounced by spelling them out as “M-S-B” and “L-S-B”, respectively.

Figure C-10. Accessing an LFSR’s previous value. (Labels: XOR feedback; bits 0 to 3 of the main LFSR; additional register bit appended to the MSB of the main LFSR; current value; previous value.)

Encryption and decryption applications

The unusual sequence of values generated by an LFSR can be gainfully employed in the encryption (scrambling) and decryption (unscrambling) of data. A stream of data bits can be encrypted by XOR-ing them with the output from an LFSR (Figure C-11).

Figure C-11. Data encryption using an LFSR.

The stream of encrypted data bits seen by a receiver can be decrypted by XOR-ing them with the output of an identical LFSR. This is obviously a very trivial form of encryption that isn’t very secure, but it’s cheap and cheerful, and it may be useful in certain applications.

Cyclic redundancy check applications

A traditional application for LFSRs is in cyclic redundancy check (CRC) calculations, which can be used to detect errors in data communications.
The stream of data bits being transmitted is used to modify the values being fed back into an LFSR (Figure C-12).

CRC is pronounced by spelling it out as “C-R-C.”

Figure C-12. CRC calculations.

The final CRC value stored in the LFSR, known as a checksum, is dependent on every bit in the data stream. After all of the data bits have been transmitted, the transmitter sends its checksum value to the receiver. The receiver contains an identical CRC calculator and generates its own checksum value from the incoming data. Once all of the data bits have arrived, the receiver compares its internally generated checksum value with the checksum sent by the transmitter to determine whether any corruption occurred during the course of the transmission.

This form of error detection is very efficient in terms of the small number of bits that have to be transmitted in addition to the data. However, the downside is that you don’t know if there was an error until the end of the transmission (and if there was an error, you have to repeat the entire transmission).

In the real world, a 4-bit CRC calculator would not be considered to provide sufficient confidence in the integrity of the transmitted data because it can only represent (2⁴ – 1) = 15 unique values. This leads to a problem called aliasing, in which the final CRC value is the same as was expected, but this value was actually caused by multiple errors canceling each other out. As the number of bits in a CRC calculator increases, however, the probability that multiple errors will cause identical checksum values approaches zero. For this reason, CRC calculators typically use 16 bits (which can accommodate 65,535 unique values) or more.

There are a variety of standard communications protocols, each of which specifies the number of bits employed in its CRC calculations and the taps to be used.
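To make the transmitter and receiver roles concrete, here is a minimal bit-serial sketch of a 16-bit CRC calculator. It assumes the widely used polynomial x^16 + x^12 + x^5 + 1 (0x1021), an initial register value of 0, and data fed most significant bit first; none of these choices comes from the text, and the function and variable names are hypothetical.

```python
POLY = 0x1021  # assumed taps: x^16 + x^12 + x^5 + 1


def crc16(data, crc=0):
    """Feed each data bit into the feedback path of a 16-bit shift register."""
    for byte in data:
        for i in range(7, -1, -1):  # most significant bit first
            bit = (byte >> i) & 1
            feedback = ((crc >> 15) & 1) ^ bit  # data bit modifies the feedback
            crc = (crc << 1) & 0xFFFF
            if feedback:
                crc ^= POLY
    return crc


# Transmitter side: compute the checksum and send it after the data.
message = b"123456789"
sent_checksum = crc16(message)

# Receiver side: recompute the checksum over whatever actually arrived.
corrupted = bytes([message[0] ^ 0x04]) + message[1:]  # one flipped bit
intact_ok = crc16(message) == sent_checksum
corrupt_ok = crc16(corrupted) == sent_checksum
```

Here `intact_ok` is true and `corrupt_ok` is false, because the single flipped bit changes the final register contents. A real protocol would fix the polynomial, initial value, and bit ordering in its specification, so both ends must agree on all three.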
The taps are selected such that an error in a single data bit will cause the maximum possible disruption to the resulting checksum value. Thus, in addition to being referred to as maximal length, these LFSRs may also be qualified as maximal displacement.

In addition to checking data integrity in communications systems, CRCs find a wide variety of other uses, for example, the detection of computer viruses. For the purposes of this discussion, a computer virus may be defined as a self-replicating program released into a computer system for a number of purposes. These purposes range from the simply mischievous, such as displaying humorous or annoying messages, to the downright nefarious, such as corrupting data or destroying (or subverting) the operating system.

One mechanism by which a computer virus may both hide and propagate itself is to attach itself to an existing program. Whenever that program is executed, it first triggers the virus to replicate itself, yet a cursory check of the system shows only the expected files to be present. In order to combat this form of attack, a unique checksum can be generated for each program on the system, where the value of each checksum is based on the binary instructions forming the program with which it is associated. At some later date, an antivirus program can be used to recalculate the checksum values for each program and to compare them to the original values. A difference in the two values associated with a program may indicate that a virus has attached itself to that program.⁸

Data compression applications

The CRC calculators discussed above can also be used in a data compression role. One such application is found in the circuit board test strategy known as functional test. The board, which may contain thousands of components and tracks, is plugged into a functional tester by means of its edge connector, which may contain hundreds of pins.
The tester applies a pattern of signals to the board’s inputs, allows sufficient time for any effects to propagate around the board, and then compares the actual values seen on the outputs with a set of expected values stored in the system. This process is repeated for a series of input patterns, which may number in the tens or hundreds of thousands.

If the board fails the preliminary tests, a more sophisticated form of analysis known as guided probe may be employed to identify the cause of the failure. In this case, the tester instructs the operator to place the probe at a particular location on the board, and then the entire sequence of test patterns is rerun. The tester compares the actual sequence of values seen by the probe with a sequence of expected values that are stored in the system. This process (placing the probe and running the tests) is repeated until the tester has isolated the faulty component or track.

⁸ Unfortunately, the creators of computer viruses are quite sophisticated, and some viruses are armed with the ability to perform their own CRC calculations. When a virus of this type attaches itself to a program, it can pad itself with dummy binary values, which are selected so as to cause an antivirus program to return a checksum value identical to the original.

A major consideration when supporting a guided probe strategy is the amount of expected data that must be stored. Consider a test sequence comprising 10,000 patterns driving a board containing 10,000 tracks. If the data were not compressed, the system would have to store 10,000 bits of expected data per track, which amounts to 100 million bits of data for the board. Additionally, for each application of the guided probe, the tester would have to compare the 10,000 data bits observed by the probe with the 10,000 bits of expected data stored in the system.
Thus, using data in an uncompressed form is an expensive option in terms of storage and processing requirements.

One solution to these problems is to employ LFSR-based CRC calculators. The sequence of expected values for each track can be passed through a 16-bit CRC calculator implemented in software. Similarly, the sequence of actual values seen by the guided probe can be passed through an identical CRC calculator implemented in hardware. In this case, the calculated checksum values are also known as signatures, and a guided probe process based on this technique is known as signature analysis. Irrespective of the number of test patterns used, the system has to store only two bytes of data for each track. Additionally, for each application of the guided probe, the tester has to compare only the two bytes of data gathered by the probe with two bytes of expected data stored in the system. Thus, compressing the data results in storage requirements that are orders of magnitude smaller and comparison times that are orders of magnitude faster than the uncompressed data approach.

Built-in self-test applications

One test strategy that may be employed in complex ICs is that of built-in self-test (BIST). Devices using BIST contain special test-generation and result-gathering circuits, both of which may be implemented using LFSRs (Figure C-13).

Figure C-13. BIST. (The test generator is an LFSR built from flip-flops Dff-TG0 to Dff-TG3 with XOR feedback; its outputs drive the logic being tested, whose outputs feed, through XOR gates, a results gatherer built from flip-flops Dff-RG0 to Dff-RG3. Clock signals to the flip-flops are omitted for purposes of simplicity.)

The LFSR forming the test generator is used to create a sequence of test patterns, while the LFSR forming the results gatherer is used to capture the results. Observe that the results-gathering LFSR features modifications that allow it to accept parallel data.
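A parallel-input results gatherer of this kind can be sketched in software as below. The sketch assumes a 4-bit register with XOR feedback taps at bits 0 and 3 and simply XORs each response word into the shifted register contents. The tap choice, the sample response values, and the function name are all assumptions for illustration rather than details taken from Figure C-13.

```python
WIDTH = 4
MASK = (1 << WIDTH) - 1
TAPS = (0, 3)  # assumed feedback taps for the 4-bit results gatherer


def gather(responses, signature=0):
    """Fold a sequence of parallel response words into one signature."""
    for word in responses:
        feedback = 0
        for tap in TAPS:
            feedback ^= (signature >> tap) & 1
        shifted = ((signature << 1) & MASK) | feedback
        # Each flip-flop input is XOR'd with one output bit of the logic under test.
        signature = shifted ^ (word & MASK)
    return signature


# Hypothetical responses from the logic being tested, one word per clock.
good = [0b1010, 0b0111, 0b0001, 0b1100, 0b0011]
faulty = list(good)
faulty[2] ^= 0b0100  # a single wrong bit in one response word

expected_signature = gather(good)
observed_signature = gather(faulty)
```

In this model a single erroneous response bit always yields a different final signature, because the error is never shifted out unobserved: the feedback includes the bit leaving the register. Multiple errors can still alias to the expected signature, which is why practical signature registers use more than four bits.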
Additional circuitry would be required to provide a way to load new seed values into the test generator and to access the final values in the results gatherer. This logic is not shown here for purposes of simplicity. Note that the two LFSRs are not obliged to contain the same number of bits because the number of inputs to the logic being tested may be different to the number of outputs coming from that logic.

Also note that all of the flip-flops in the test generator would share a common clock. Similarly, all of the flip-flops in the results gatherer would also share a common clock. These two clocks might be common or they might be distinct (in the latter case they would be synchronized in some way). The clock signals are not shown in Figure C-13 so as to keep things simple.

Pseudorandom-number-generation applications

Many computer programs rely on an element of randomness. Computer games such as Space Invaders employ random events to increase the player’s enjoyment. Graphics programs may exploit random numbers to generate intricate patterns. All forms of computer simulation may utilize random numbers to represent the real world more accurately. For example, digital logic simulations (see also Chapter 19) may benefit from the portrayal of random stimulus such as external interrupts. Random stimulus can result in more realistic design verification, which can uncover problems that may not be revealed by more structured tests.

Random-number generators can be constructed in both hardware and software. The majority of these generators are not truly random, but they give the appearance of being random and are therefore said to be pseudorandom. In reality, pseudorandom numbers have an advantage over truly random numbers because the majority of computer applications typically require repeatability. For example, a designer repeating a digital simulation would expect to receive identical answers to those from the previous run.
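Repeatability of this kind is easy to see with a small LFSR-based generator. The sketch below assumes a 16-bit XOR LFSR with taps at bits 3, 12, 14, and 15; these tap positions are chosen purely for illustration, and a published maximal-length set should be substituted for real use. The function name and seed values are hypothetical.

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1
TAPS = (3, 12, 14, 15)  # assumed tap positions, for illustration only


def lfsr_random(seed, count):
    """Return `count` successive register values starting from `seed`."""
    if seed & MASK == 0:
        raise ValueError("an XOR-based LFSR must not be seeded with all 0s")
    state = seed & MASK
    values = []
    for _ in range(count):
        feedback = 0
        for tap in TAPS:
            feedback ^= (state >> tap) & 1
        state = ((state << 1) & MASK) | feedback
        values.append(state)
    return values


run_1 = lfsr_random(0xACE1, 8)
run_2 = lfsr_random(0xACE1, 8)  # same seed: identical sequence
run_3 = lfsr_random(0x1234, 8)  # different seed: different sequence
```

Using whole register values as the random numbers keeps the sketch short, but successive values are strongly correlated, since each is just the previous one shifted by one place. Taking one bit per clock, or clocking several times between samples, is a common way around that.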
However, designers also need the ability to modify the seed value of the pseudorandom-number generator so as to spawn different sequences of values as required. There are a variety of methods available for generating pseudorandom numbers, one of which is to use an LFSR whose tap values have been selected so as to provide a reasonably good pseudorandom source.

Last but not least

LFSRs are simple to construct and are useful for a wide variety of applications, but be warned that choosing the optimal polynomial (which ultimately boils down to selecting the tap points) for a particular application is a task that is usually reserved for a master of the mystic arts, not to mention that the maths can be hairy enough to make a grown man break down and cry (and don’t even get me started on the subject of cyclotomic polynomials,⁹ which are key to the tap-selection process).

⁹ Mainly because I haven’t got the faintest clue what a cyclotomic polynomial is!

Glossary

ACM (adaptive computing machine)—A revolutionary new form of digital integrated circuit (IC) featuring a coarse-grained algorithmic element node-based architecture that can be reconfigured (adapted) hundreds of thousands of times a second.

Adaptive computing machine—see ACM

Address bus—A unidirectional set of signals used by a processor (or similar device) to point to memory locations in which it is interested.

A/D (analog to digital)—The process of converting an analog value into its digital equivalent.

Analog—A continuous value that most closely resembles the real world and can be as precise as the measuring technique allows.

Analog circuit—A collection of components used to process or generate analog signals.

Analog to digital—see A/D

Analogue—The way they spell “analog” in England.

Antifuse technology—A technology used to create programmable integrated circuits (ICs) whose programmable elements are based on conductive links called antifuses.
When an engineer purchases a programmable device based on antifuses, none of the links is initially intact. Individual links can be selectively “grown” by applying pulses of relatively high voltage and current to the device’s inputs.

Application-specific integrated circuit—see ASIC

Application-specific standard part—see ASSP

ASIC (application-specific integrated circuit)—A custom-built integrated circuit (IC) designed to address a specific application. Such a device can contain hundreds of millions of logic gates and can be used to create incredibly large and complex functions. Similar to an ASSP, except that an ASIC is designed and built to order for use by a specific company.

ASIC cell—A logic function in the cell library defined by the manufacturer of an ASIC.

Assertions/properties—The term property comes from the model-checking domain and refers to a specific functional behavior of the design that you want to (formally) verify (e.g., “after a request, we expect a grant within 10 clock cycles”). By comparison, the term assertion stems from the simulation domain and refers to a specific functional behavior of the design that you want to monitor during simulation (and flag violations if that assertion “fires”). Today, with the use of formal tools and simulation tools in unified environments and methodologies, the terms property and assertion tend to be used interchangeably.

ASSP (application-specific standard part)—A custom-built integrated circuit (IC) designed to address a specific application. Such a device can contain hundreds of millions of logic gates and can be used to create incredibly large and complex functions. Similar to an application-specific integrated circuit (ASIC), except that an ASSP is marketed to multiple customers for inclusion in their products.

Asynchronous—A signal whose data is acknowledged or acted upon immediately and does not depend on a clock signal.
Ball grid array—see BGA

Bare die—An unpackaged integrated circuit (IC).

Basic cell—A predefined group of unconnected transistors and resistors. This group is replicated across the surface of a gate-array form of ASIC.

Bebop—A form of music characterized by fast tempos and agitated rhythms that became highly popular in the decade following World War II.

BGA (ball grid array)—A packaging technology similar to a pad grid array (PGA), in which a device’s external connections are arranged as an array of conducting pads on the base of the package. In the case of a ball grid array, however, small balls of solder are attached to the conducting pads.

BiCMOS (bipolar-CMOS)—(1) A technology in which the logical function of each logic gate is implemented using low-power CMOS, while the output stage of each logic gate is implemented using high-drive bipolar transistors. (2) A device whose internal logic gates are implemented using low-power CMOS, but whose output pins are driven by high-drive bipolar transistors.

Binary digit—A numeral in the binary scale of notation. A binary digit (typically abbreviated to “bit”) can adopt one of two values: 0 or 1.

Binary encoding—A form of state assignment for state machines that requires the minimum number of state variables.

Binary logic—Digital logic gates based on two distinct voltage levels. The two voltages are used to represent the binary values 0 and 1 along with their logical equivalents False and True.

Bipolar junction transistor—see BJT

BIST (built-in self-test)—A test strategy in which additional logic is built into a component, thereby allowing it to test itself.

Bit—Abbreviation of binary digit. A binary digit can adopt one of two values: 0 or 1.

Bit file—see Configuration file

BJTs (bipolar junction transistors)—A family of transistors.
Bobble—A small circle used on the inputs to a logic-gate symbol to indicate an active low input or control or on the outputs to indicate a negation (inversion) or complementary signal. Some engineers prefer to use the term bubble.

Boolean algebra—A mathematical way of representing logical expressions.

Built-in self-test—see BIST

Bus—A set of signals performing a common function and carrying similar data. Typically represented using vector notation: for example, an 8-bit data bus might be named data[7:0].

Byte—A group of eight binary digits, or bits.

Cache memory—A small, high-speed memory (usually implemented in SRAM) used to buffer the central processing unit from any slower, lower-cost memory devices such as DRAM. The high-speed cache memory is used to store the active instructions and data¹ associated with a program, while the bulk of the instructions and data resides in the slower memory.

¹ In this context, “active” refers to data or instructions that a program is currently using, or which the operating system believes that the program will want to use in the immediate future.

Capacitance—A measure of the ability of two adjacent conductors separated by an insulator to hold a charge when a voltage differential is applied between them. Capacitance is measured in units of farads.

Cell—see ASIC cell, Basic cell, Cell library, and Memory cell

Cell library—The collective name for the set of logic functions defined by the manufacturer of an application-specific integrated circuit (ASIC). The designer decides which types of cells should be realized and connected together to make the device perform its desired function.

Central processing unit—see CPU

Ceramic—An inorganic, nonmetallic material, such as alumina, beryllia, steatite, or forsterite, which is fired at a high temperature and is often used in electronics as a substrate (base layer) or to create component packages.
CGA (column grid array)—A packaging technology similar to a pad grid array (PGA), in which a device’s external connections are arranged as an array of conducting pads on the base of the package. In the case of a column grid array, however, small columns of solder are attached to the conducting pads.

Channel—(1) The area between two arrays of basic cells in a channeled gate array. (2) The gap between the source and drain regions in a MOSFET transistor.

Channeled gate array—An application-specific integrated circuit (ASIC) organized as arrays of basic cells. The areas between the arrays are known as channels.

Channelless gate array—An application-specific integrated circuit (ASIC) organized as a single large array of basic cells. May also be referred to as a “sea-of-cells” or a “sea-of-gates” device.

Checksum—The final cyclic-redundancy check (CRC) value stored in a linear feedback shift register (LFSR) (or software equivalent). Also known as a “signature” in the guided-probe variant of a functional test.

Chemical mechanical polishing—see CMP

Chip—Popular name for an integrated circuit (IC).

Chip scale package—see CSP

Circuit board—The generic name for a wide variety of interconnection techniques, which include rigid, flexible, and rigid-flex boards in single-sided, double-sided, multilayer, and discrete wired configurations.

CLB (configurable logic block)—The Xilinx term for the next logical partition/entity above a slice. Some Xilinx FPGAs have two slices in each CLB, while others have four. See also LAB, LC, LE, and Slice.

Clock tree—This refers to the way in which a clock signal is routed throughout a chip. This is called a “clock tree” because the main clock signal branches again and again (register elements like flip-flops can be considered the “leaves” on the end of the branches). This structure is used to ensure that all of the flip-flops see the clock signal as close together as possible.
CMOS (complementary metal oxide semiconductor)—Logic gates constructed using a mixture of NMOS and PMOS transistors connected together in a complementary manner.

CMP (chemical mechanical polishing)—A process used to replanarize a wafer—smoothing and flattening the surface by polishing out the “bumps” caused by adding a metalization (tracking) layer.

Column grid array—see CGA

Combinatorial logic—see Combinational logic

Combinational logic—A digital logic function formed from a collection of primitive logic gates (AND, OR, NAND, NOR, etc.), where any output values from the function are directly related to the current combination of values on its inputs. That is, any changes to the signals being applied to the inputs to the function will immediately start to propagate (ripple) through the gates forming the function until their effects appear at the outputs from the function. Some folks prefer to say “combinatorial logic.” See also Sequential logic.

Complementary output—Refers to a function with two outputs carrying complementary (opposite) logical values. One output is referred to as the true output and the other as the complementary output.

Complex programmable logic device—see CPLD

Conditioning—see Signal conditioning

Configurable logic block—see CLB

Configuration commands—Instructions in a configuration file that tell the device what to do with the associated configuration data. See also Configuration data and Configuration file.

Configuration data—Bits in a configuration file that are used to define the state of programmable logic elements directly. See also Configuration commands and Configuration file.

Configuration file—A file containing the information that will be uploaded into the FPGA in order to program (configure) it to perform a specific function. In the case of SRAM-based FPGAs, the configuration file contains a mixture of configuration data and configuration commands.
When the configuration file is in the process of being loaded into the device, the information being transferred is referred to as the configuration bitstream. See also Configuration commands and Configuration data.

Constraints—In the context of formal verification, the term constraint derives from the model-checking space. Formal model checkers consider all possible allowed input combinations when performing their magic and working on a proof. Thus, there is often a need to constrain the inputs to their legal behavior; otherwise, the tool would report false negatives, which are property violations that would not normally occur in the actual design.

Core—see Hard core and Soft core

Corner condition—see Corner case

Corner case—A hard-to-exercise or hard-to-reach functional condition associated with the design.

CPLD (complex PLD)—A device that contains a number of SPLD (typically PAL) functions sharing a common programmable interconnection matrix.

CPU (central processing unit)—The brain of a computer where all of the decision making and number crunching are performed.

CRC (cyclic redundancy check)—A calculation used to detect errors in data communications, typically performed using a linear feedback shift register (LFSR). Similar calculations may be used for a variety of other purposes such as data compression.

CSP (chip scale package)—An integrated circuit (IC) packaging technique in which the package is only fractionally larger than the silicon die.

Cyclic redundancy check—see CRC

D/A (digital to analog)—The process of converting a digital value into its analog equivalent.

Data bus—A bidirectional set of signals used by a computer to convey information from a memory location to the central processing unit and vice versa. More generally, a set of signals used to convey data between digital functions.

Data-path function—A well-defined function such as an adder, counter, or multiplier used to process digital data.
DCM (digital clock manager)—Some FPGA clock managers are based on phase-locked loops (PLLs), while others are based on digital delay-locked loops (DLLs). The term DCM is used by Xilinx to refer to an advanced clock manager that is a superset of a DLL. See also DLL and PLL.

Declarative—In the context of formal verification, the term declarative refers to an assertion/property/event/constraint that exists within the structural context of the design and is evaluated along with all of the other structural elements in the design (for example, a module that takes the form of a structural instantiation). Another way to view this is that a declarative assertion/property is always “on/active,” unlike its procedural counterpart that is only “on/active” when a specific path is taken/executed through the HDL code.

Deep submicron—see DSM

Delay-locked loop—see DLL

DeMorgan transformation—The transformation of a Boolean expression into an alternate, and often more convenient, form.

Die—An unpackaged integrated circuit (IC). In this case, the plural of die is also die (in much the same way that “a shoal of herring” is the plural of “herring”).

Digital—A value represented as being in one of a finite number of discrete states called quanta. The accuracy of a digital value is dependent on the number of quanta used to represent it.

Digital circuit—A collection of logic gates used to process or generate digital signals.

Digital clock manager—see DCM

Digital delay-locked loop—see DLL

Digital signal processing/processor—see DSP

Digital to analog—see D/A

Diode—A two-terminal device that conducts electricity in only one direction; in the other direction it behaves like an open switch. These days the term diode is almost invariably taken to refer to a semiconductor device, although alternative implementations such as vacuum tubes are available.
Discrete device—Typically taken to refer to an electronic component such as a resistor, capacitor, diode, or transistor that is presented in an individual package. More rarely, the term may be used in connection with a simple integrated circuit (IC) containing a small number of primitive logic gates.

DLL (digital delay-locked loop)—Some FPGA clock managers are based on phase-locked loops (PLLs), while others are based on digital delay-locked loops (DLLs). DLLs are, by definition, digital in nature. The proponents of DLLs say that they offer advantages in terms of precision, stability, power management, noise insensitivity, and jitter performance. I have no clue as to why these aren’t called DDLLs. See also PLL.

DSM (deep submicron)—Typically taken to refer to integrated circuits (ICs) containing structures that are smaller than 0.5 microns (one half of one millionth of a meter).

DSP (1) (digital signal processing)—The branch of electronics concerned with the representation and manipulation of signals in digital form. This form of processing includes compression, decompression, modulation, error correction, filtering, and otherwise manipulating audio (voice, music, etc.), video, image, and other such data for applications such as telecommunications, radar, and image processing (including medical imaging). (2) (digital signal processor)—A special form of microprocessor that has been designed to perform a specific processing task on a specific type of digital data much faster and more efficiently than can be achieved using a general-purpose microprocessor.

Dynamic formal verification—Some portions of a design are going to be difficult to verify via simulation because they are deeply buried in the design, making them difficult to control from the primary inputs.
In order to address this, some verification solutions use simulation to reach a corner case and then automatically pause the simulator and invoke a static formal verification engine to evaluate that corner case exhaustively. This combination of simulation and traditional static formal verification is referred to as dynamic formal verification. See also Corner case, Formal verification, and Static formal verification.

Dynamic RAM—see DRAM

ECL (emitter-coupled logic)—Logic gates implemented using particular configurations of bipolar junction transistors (BJTs).

Edge sensitive—An input to a logic function that only affects the function when it transitions from one logic value to another.

EEPROM or E2PROM (electrically erasable programmable read-only memory)—A memory integrated circuit (IC) whose contents can be electrically programmed by the designer. Additionally, the contents can be electrically erased, allowing the device to be reprogrammed.

Electrically erasable programmable read-only memory—see EEPROM

Emitter-coupled logic—see ECL

EPROM (erasable programmable read-only memory)—A memory integrated circuit (IC) whose contents can be electrically programmed by the designer. Additionally, the contents can be erased by exposing the die to ultraviolet (UV) light through a quartz window mounted in the top of the component’s package.

Equivalency checking—see Formal verification

Equivalent gate—An ASIC-based concept in which each type of logic function is assigned an equivalent gate value for the purposes of comparing functions and devices. However, the definition of an equivalent gate varies depending on whom you’re talking to.

Erasable programmable read-only memory—see EPROM

Event—In the context of formal verification, an event is similar to an assertion/property, and in general events may be considered a subset of assertions/properties.
However, while assertions/properties are typically used to trap undesirable behavior, events may be used to specify desirable behavior for the purposes of functional coverage analysis.

Falling edge—see Negative edge

FET (field-effect transistor)—A transistor whose control (or “gate”) signal is used to create an electromagnetic field that turns the transistor on or off.

Field-effect transistor—see FET

Field-programmable gate array—see FPGA

Field-programmable interconnect chip—see FPIC

Field-programmable interconnect device—see FPID

FIFO (first in first out)—A special memory device or function in which data is read out in the same order that it was written in.

Finite state machine—see FSM

Firm IP—In the context of an FPGA, the term firm IP refers to a library of high-level functions. Unlike their soft IP equivalents, however, these functions have already been optimally mapped, placed, and routed into a group of programmable logic blocks (possibly combined with some hard IP blocks like multipliers, etc.). One or more copies of each predefined firm IP block can be instantiated (called up) into the design as required. See also Hard IP and Soft IP.

Firmware—Refers to programs or sequences of instructions that are loaded into nonvolatile memory devices.

First in first out—see FIFO

FLASH memory—An evolutionary technology that combines the best features of the EPROM and E2PROM technologies. The name FLASH is derived from the technology’s fast reprogramming time compared to EPROM.

Formal verification—In the not-so-distant past, the term formal verification was considered synonymous with equivalency checking for the majority of design engineers. In this context, an equivalency checker is a tool that uses formal (rigorous mathematical) techniques to compare two different representations of a design—say an RTL description with a gate-level netlist—to determine whether or not they have the same input-to-output functionality.
In fact, equivalency checking may be considered to be a subclass of formal verification called model checking, which refers to techniques used to explore the state space of a system to test whether or not certain properties, typically specified as “assertions,” are true. See also Static formal verification and Dynamic formal verification.

FPGA (field-programmable gate array)—A type of digital integrated circuit (IC) that contains configurable (programmable) blocks of logic along with configurable interconnect between these blocks. Such a device can be configured (programmed) by design engineers to perform a tremendous variety of different tasks.

FPIC (field-programmable interconnect chip)—An alternate, proprietary name for a field-programmable interconnect device (FPID). (FPIC is a trademark of Aptix Corporation.)

FPID (field-programmable interconnect device)—A device used to connect logic devices together that can be dynamically reconfigured in the same way as standard SRAM-based FPGAs. Because each FPID may have around 1,000 pins, only a few such devices are typically required on a circuit board.

FR4—The most commonly used insulating base material for circuit boards. FR4 is made from woven glass fibers that are bonded together with an epoxy. The board is cured using a combination of temperature and pressure, which causes the glass fibers to melt and bond together, thereby giving the board strength and rigidity. The first two characters stand for “flame retardant,” and you can count the number of people who know what the “4” stands for on the fingers of one hand. FR4 is technically a form of fiberglass, and some people do refer to these composites as fiberglass boards or fiberglass substrates, but not often.

Full custom—An application-specific integrated circuit (ASIC) in which the design engineers have complete control over
every mask layer used to fabricate the device. The ASIC vendor does not provide a cell library or prefabricate any components on the substrate.

Functional latency—Refers to the fact that, at any given time, only a portion of the logic functions in a device or system are typically active (doing anything useful).

Fuse—see Fusible-link technology and Antifuse technology

Fusible-link technology—A technology used to create programmable integrated circuits (ICs) whose programmable elements are based on microscopically small fusible links. When an engineer purchases a programmable device based on fusible links, all of the fuses are initially intact. Individual fuses can be selectively removed by applying pulses of relatively high voltage and current to the device’s inputs.

FSM (finite state machine)—The actual implementation (in hardware or software) of a function that can be considered to consist of a finite set of states through which it sequences.

GAL (generic array logic)—A variation on a PAL device from a company called Lattice Semiconductor Corporation. (GAL is a registered trademark of Lattice Semiconductor Corporation.)

Garbage in garbage out—see GIGO

Gate array—An application-specific integrated circuit (ASIC) in which the manufacturer prefabricates devices containing arrays of unconnected components (transistors and resistors) organized in groups called basic cells. The designer specifies the function of the device in terms of cells from the cell library and the connections between them, and the manufacturer then generates the masks used to create the metalization layers.

Generic array logic—see GAL

Geometry—Refers to the size of structures created on an integrated circuit (IC). The structures typically referenced are the width of the tracks and the length of the transistor’s channels; the dimensions of other features are derived as ratios of these structures.
Giga—Unit qualifier (symbol = G) representing one thousand million, or 10^9. For example, 3 GHz stands for 3 × 10^9 hertz.

GIGO (garbage in garbage out)—An electronic engineer’s joke, also familiar to the writers of computer programs.

Glue logic—The relatively small amounts of simple logic that are used to connect (“glue”) together—and interface between—larger logical blocks, functions, or devices.

Gray code—A sequence of binary values in which each pair of adjacent values differs by only a single bit: for example, 00, 01, 11, 10.

Ground plane—A conducting layer in, or on, a substrate providing a grounding, or reference, point for components. There may be several ground planes separated by insulating layers.

Guard condition—A Boolean expression associated with a transition between two states in a state machine. Such an expression must be satisfied for that transition to be executed.

Guided probe—A form of functional test in which the operator is guided in the probing of a circuit board to isolate a faulty component or track.

Hard core—In the context of digital electronics, the term core is typically used to refer to a relatively large, general-purpose logic function that may be used as a building block forming a portion of a much larger chip design. For example, if an ASIC contains an embedded microprocessor, that microprocessor would be referred to as a “microprocessor core.” Other functions that might fall into this category are microcontroller cores, digital signal processor (DSP) cores, communication function cores (e.g., a UART), and so forth. Such cores may be developed internally by the design team, but they are typically purchased from third-party intellectual property (IP) vendors. There is some difference in how the term hard core is perceived depending on the target implementation technology: ASIC or FPGA.
In the case of an ASIC, the hard core will be presented as a block of logic gates whose physical locations (relative to each other) and interconnections have already been defined (that is, hard-wired and set in stone). This block will be treated as a black box by the place-and-route software that is used to process the rest of the design; that is, the location of the block as a whole may be determined by the place-and-route software, but its internal contents are completely locked down. The output from the place-and-route software will subsequently be used to generate the photo-masks that will in turn be used to fabricate the silicon chip. By comparison, in the case of an FPGA, any hard cores have already been physically implemented as hard-wired blocks that are embedded into the FPGA’s fabric. A design may comprise one or more hard cores combined with one or more soft cores along with other blocks of user-defined logic. See also Soft core.

Hardware—Generally understood to refer to any of the physical portions constituting an electronic system, including components, circuit boards, power supplies, cabinets, and monitors.

Hard IP—In the context of an FPGA, the term hard IP refers to preimplemented blocks, such as microprocessor cores, gigabit interfaces, multipliers, adders, MAC functions, and the like. These blocks are designed to be as efficient as possible in terms of power consumption, silicon real estate requirements, and performance. Each FPGA family will feature different combinations of such blocks together with various quantities of programmable logic blocks. See also Soft IP and Firm IP.

Hardware description language—see HDL

HDL (hardware description language)—Today’s digital integrated circuits (ICs) can end up containing hundreds of millions of logic gates, and it simply isn’t possible to capture and manage designs of this complexity at the schematic (circuit-diagram) level.
Thus, as opposed to using schematics, the functionality of a high-end IC is now captured in textual form using an HDL. Popular HDLs are Verilog, SystemVerilog, VHDL, and SystemC.

HDL synthesis—A more recent name for logic synthesis. See also Logic synthesis and Physically aware synthesis.

Hertz—see Hz

High-impedance state—The state associated with a signal that is not currently being driven by anything. A high-impedance state is typically indicated by means of the “Z” character.

Hz (hertz)—Unit of frequency. One hertz equals one cycle, or one oscillation, per second.

IC (integrated circuit)—A device in which components such as resistors, diodes, and transistors are formed on the surface of a single piece of semiconducting material.

ICR (in-circuit reconfigurable)—An SRAM-based or similar component that can be dynamically reprogrammed on the fly while remaining resident in the system.

Impedance—The resistance to the flow of current caused by resistive, capacitive, and/or inductive devices (or undesired parasitic elements) in a circuit.

Implementation-based verification coverage—This measures verification activity with respect to microarchitecture details of the actual implementation. This refers to design decisions that are embedded in the RTL that result in implementation-specific corner cases, for example, the depth of a FIFO buffer and the corner cases for its “high-water mark” and “full” conditions. Such implementation details are rarely visible at the specification level. See also Macroarchitecture definition, Microarchitecture definition, and Specification-level coverage.

In-circuit reconfigurable—see ICR

Inductance—A property of a conductor that allows it to store energy in a magnetic field which is induced by a current flowing through it. The base unit of inductance is the henry.
In-system programmable—see ISP

Integrated circuit—see IC

Intellectual property—see IP

IP (intellectual property)—When a team of electronics engineers is tasked with designing a complex integrated circuit (IC), rather than reinvent the wheel, they may decide to purchase the plans for one or more functional blocks that have already been created by someone else. The plans for these functional blocks are known as intellectual property, or IP. IP blocks can range all the way up to sophisticated communications functions and microprocessors. The more complex functions, like microprocessors, may be referred to as “cores.” See also Hard IP, Soft IP, and Firm IP.

ISP (in-system programmable)—An E2-based, FLASH-based, SRAM-based, or similar integrated circuit (IC) that can be reprogrammed while remaining resident on the circuit board.

JEDEC (Joint Electronic Device Engineering Council)—A council that creates, approves, arbitrates, and oversees industry standards for electronic devices. In programmable logic, the term JEDEC refers to a textual file containing information used to program a device. The file format is a JEDEC-approved standard and is commonly referred to as a “JEDEC file.”

Jelly-bean logic—Small integrated circuits (ICs) containing a few simple, fixed logical functions, for example, four 2-input AND gates.

Joint Electronic Device Engineering Council—see JEDEC

Kilo—Unit qualifier (symbol = K) representing one thousand, or 10^3. For example, 3 KHz stands for 3 × 10^3 hertz.

LAB (logic array block)—The Altera name for a programmable logic block containing a number of logic elements (LEs). See also CLB, LC, LE, and Slice.

LC (logic cell)—The core building block in a modern FPGA from Xilinx is called a logic cell (LC). Among other things, an LC comprises a 4-input LUT, a multiplexer, and a register. See also CLB, LAB, LE, and Slice.

LE (logic element)—The core building block in a modern FPGA from Altera is called a logic element (LE).
Among other things, an LE comprises a 4-input LUT, a multiplexer, and a register. See also CLB, LAB, LC, and Slice.

Least-significant bit—see LSB

Least-significant byte—see LSB

Level sensitive—An input to a logic function whose effect on the function depends only on its current logic value or level and is not directly related to its transitioning from one logic value to another.

LFSR (linear feedback shift register)—A shift register whose data input is generated as an XOR or XNOR of two or more elements in the register chain.

Linear feedback shift register—see LFSR

Literal—A variable (either true or inverted) in a Boolean equation.

Logic function—A mathematical function that performs a digital operation on digital data and returns a digital value.

Logic array block—see LAB

Logic cell—see LC

Logic element—see LE

Logic gate—The physical implementation of a simple or primitive logic function.

Logic synthesis—A process in which a program is used to automatically convert a high-level textual representation of a design (specified using a hardware description language (HDL) at the register transfer level (RTL) of abstraction) into equivalent registers and Boolean equations. The synthesis tool automatically performs simplifications and minimizations and eventually outputs a gate-level netlist. See also HDL synthesis and Physically aware synthesis.

Lookup table—see LUT

LSB—(1) (least-significant bit) The binary digit, or bit, in a binary number that represents the least-significant value (typically the right-hand bit). (2) (least-significant byte) The byte in a multibyte word that represents the least-significant values (typically the right-hand byte).

LUT (lookup table)—There are two fundamental incarnations of the programmable logic blocks used to form the medium-grained architectures featured in FPGAs: MUX (multiplexer) based and LUT (lookup table) based.
In the case of a LUT, a group of input signals is used as an index (pointer) into a lookup table. See also CLB, LAB, LC, LE, and Slice.

Macroarchitecture definition—A design commences with an original concept, whose high-level definition is determined by system architects and system designers. It is at this stage that macroarchitecture decisions are made, such as partitioning the design into hardware and software components, selecting a particular microprocessor core and bus structure, and so forth. The resulting specification is then handed over to the hardware design engineers, who commence their portion of the development process by performing microarchitecture definition tasks. See also Microarchitecture definition.

Magnetic random-access memory—see MRAM

Magnetic tunnel junction—see MTJ

Mask—see Photo-mask

Mask programmable—A device such as a read-only memory (ROM) that is programmed during its construction using a unique set of photo-masks.

Maximal displacement—A linear feedback shift register (LFSR) whose taps are selected such that changing a single bit in the input data stream will cause the maximum possible disruption to the register’s contents.

Maximal length—A linear feedback shift register (LFSR) with n bits that sequences through 2^n – 1 states before returning to its original value.

Maxterm—The logical OR of the inverted variables associated with an input combination to a logical function.

MCM (multichip module)—A generic name for a group of advanced interconnection and packaging technologies featuring unpackaged integrated circuits (ICs) mounted directly onto a common substrate.

Mega—Unit qualifier (symbol = M) representing one million, or 10^6. For example, 3 MHz stands for 3 × 10^6 hertz.

Memory cell—A unit of memory used to store a single binary digit, or bit, of data.

Memory word—A number of memory cells logically and physically grouped together. All the cells in a word are typically written to, or read from, at the same time.
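To make the Maximal length entry above concrete, here is a small Python sketch that steps a software model of an LFSR and counts how many states it visits before returning to its starting value. Everything named here is an assumption for illustration: the function `lfsr_period`, the shift-left register arrangement, and the tap positions, which are given as bit indices with bit 0 as the right-hand bit.

```python
def lfsr_period(n_bits: int, taps: tuple, seed: int = 1) -> int:
    """Count the steps an n-bit XOR-feedback LFSR takes to return to its seed."""
    mask = (1 << n_bits) - 1
    start = seed & mask
    state = start
    for steps in range(1, (1 << n_bits) + 1):
        # The new input bit is the XOR of the tapped register bits.
        feedback = 0
        for tap in taps:
            feedback ^= (state >> tap) & 1
        # Shift left by one and insert the feedback bit on the right.
        state = ((state << 1) | feedback) & mask
        if state == start:
            return steps
    raise ValueError("register never returned to its seed with these taps")


# Taps on bits 3 and 2 of a 4-bit register give a maximal-length sequence.
maximal = lfsr_period(4, (3, 2))      # 2**4 - 1 states
# Taps on bits 3 and 1 give a much shorter sequence.
non_maximal = lfsr_period(4, (3, 1))
```

The all-zeros state is never part of an XOR-based sequence, because a register holding only 0s feeds back a 0 forever; that is why the maximum is 2^n – 1 rather than 2^n.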
Metalization layer—A layer of conducting material on an integrated circuit (IC) that is selectively deposited or etched to form connections between logic gates. There may be several metalization layers separated by dielectric (insulating) layers.

Metal-oxide semiconductor field-effect transistor—see MOSFET

Microarchitecture definition—A design commences with an original concept, whose high-level definition is determined by system architects and system designers. The resulting specification is then handed over to the hardware design engineers, who commence their portion of the development process by performing microarchitecture definition tasks such as detailing control structures, bus structures, and primary datapath elements. A simple example would be an element such as a FIFO, to which one would assign attributes like width and depth and characteristics like blocking write, nonblocking read, and how to behave when empty or full. Microarchitecture definitions, which are often performed in brainstorming sessions on a whiteboard, may include performing certain operations in parallel versus sequentially, pipelining portions of the design versus nonpipelining, sharing common resources—for example, two operations sharing a single multiplier—versus using dedicated resources, and so forth.

Micro—Unit qualifier (symbol = µ) representing one millionth, or 10^-6. For example, 3 µs stands for 3 × 10^-6 seconds.

Microcontroller—see µC

Microprocessor—see µP

Milli—Unit qualifier (symbol = m) representing one thousandth, or 10^-3. For example, 3 ms stands for 3 × 10^-3 seconds.

Minimization—The process of reducing the complexity of a Boolean expression.

Minterm—The logical AND of the variables associated with an input combination to a logical function.

Mixed signal—Typically refers to an integrated circuit (IC) that contains both analog and digital elements.
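The Minterm entry above can be illustrated by listing the minterms of a small function directly from its truth table. In this sketch the helper name `minterms`, the `&` and `~` notation for AND and NOT, and the two-input XOR used as the example are all assumptions chosen for illustration.

```python
from itertools import product


def minterms(func, names):
    """Return one AND term per truth-table row for which func outputs 1."""
    terms = []
    for values in product((0, 1), repeat=len(names)):
        if func(*values):
            # Each variable appears true where the row holds a 1 and
            # inverted where the row holds a 0.
            literals = [n if v else "~" + n for n, v in zip(names, values)]
            terms.append(" & ".join(literals))
    return terms


xor_terms = minterms(lambda a, b: a ^ b, ["a", "b"])
# ORing these terms together gives the sum-of-products form of XOR.
sum_of_products = " | ".join("(" + t + ")" for t in xor_terms)
```

A function with k rows of 1s in its truth table always yields k minterms, so this form is rarely minimal; it is the starting point that minimization then reduces.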
Model checking—see Formal verification

Moore’s law—In 1965, Gordon Moore (who was to cofound Intel Corporation in 1968) noted that new generations of memory devices were released approximately every 18 months and that each new generation of devices contained roughly twice the capacity of its predecessor. This observation subsequently became known as Moore’s Law, and it has been applied to a wide variety of electronics trends.

MOSFET (metal-oxide semiconductor field-effect transistor)—A family of transistors.

Most-significant bit—see MSB

Most-significant byte—see MSB

MRAM (magnetic RAM)—A form of memory expected to come online circa 2005 that has the potential to combine the high speed of SRAM, the storage capacity of DRAM, and the nonvolatility of FLASH, while consuming very little power.

MSB—(1) (most-significant bit) The binary digit, or bit, in a binary number that represents the most-significant value (typically the left-hand bit). (2) (most-significant byte) The byte in a multibyte word that represents the most-significant values (typically the left-hand byte).

MTJ (magnetic tunnel junction)—A sandwich of two ferromagnetic layers separated by a thin insulating layer. An MRAM memory cell is created by the intersection of two wires (say, a “row” line and a “column” line) with an MTJ sandwiched between them.

Multichip module—see MCM

Multiplexer (digital)—A logic function that uses a binary value, or address, to select between a number of inputs and conveys the data from the selected input to the output.

Nano—Unit qualifier (symbol = n) representing one thousandth of one millionth, or 10^-9. For example, 3 ns stands for 3 × 10^-9 seconds.

Negative edge—A signal transition from a logic 1 to a logic 0.

Nibble—see Nybble

NMOS (N-channel MOS)—Refers to the order in which the semiconductor is doped in a MOSFET device, that is, which structures are constructed as N-type versus P-type material.
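The Multiplexer (digital) entry above describes a binary address selecting one of several inputs. A behavioral sketch of that idea follows; the function name `mux` and the convention that the select bits are listed most-significant bit first are assumptions made for this example.

```python
def mux(inputs, select_bits):
    """Return the input chosen by the binary address in select_bits (MSB first)."""
    if len(inputs) != 1 << len(select_bits):
        raise ValueError("need 2**k inputs for k select bits")
    # Assemble the address one bit at a time, most-significant bit first.
    address = 0
    for bit in select_bits:
        address = (address << 1) | (bit & 1)
    return inputs[address]


# A 4:1 multiplexer: select = 10 (binary) conveys input number 2 to the output.
selected = mux(["d0", "d1", "d2", "d3"], [1, 0])
```

Loading the inputs of such a multiplexer with constants is one way to view a lookup table: the select bits become the function's inputs and the constants become its truth table.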
Noise—The miscellaneous rubbish that gets added to an electronic signal on its journey through a circuit. Noise can be caused by capacitive or inductive coupling or by externally generated electromagnetic interference.

Nonrecurring engineering—see NRE

Nonvolatile—A memory device that does not lose its data when power is removed from the system.

NPN (N-type–P-type–N-type)—Refers to the order in which the semiconductor is doped in a bipolar junction transistor (BJT).

NRE (nonrecurring engineering)—In the context of this book, this refers to the costs associated with developing an ASIC, ASSP, or FPGA design.

N-type—A piece of semiconductor doped with impurities that make it amenable to donating electrons.

Nybble—A group of four binary digits, or bits.

Ohm—Unit of resistance. The Greek letter omega, Ω, is often used to represent ohms; for example, 1 MΩ indicates one million ohms.

One-hot encoding—A form of state assignment for state machines in which each state is represented by an individual state variable, and only one such variable may be “on/active” (“hot”) at any particular time.

One-time programmable—see OTP

OpenVera Assertions—see OVA

Open Verification Library—see OVL

Operating system—The collective name for the set of master programs that control the core operation and the base-level user interface of a computer.

OTP (one-time programmable)—A programmable device, such as an SPLD, CPLD, or FPGA, that can be configured (programmed) only a single time.

OVA (OpenVera Assertions)—A formal verification language that has been specially constructed for the purpose of specifying assertions/properties with maximum efficiency. OVA is very powerful in creating complex regular and temporal expressions, and it allows complex behavior to be specified with very little code.
This language was donated to Accellera’s SystemVerilog committee, which is controlled by the Accellera organization (www.accellera.org), and is based on IBM’s Sugar language. See also PSL, Sugar, and SVA.

OVL (Open Verification Library)—A library of assertion/property models available in both VHDL and Verilog 2K1 that is managed under the auspices of the Accellera organization (www.accellera.com).

Pad grid array—see PGA

PAL (programmable array logic)—A programmable logic device in which the AND array is programmable, but the OR array is predefined (see also PLA, PLD, and PROM). (PAL is a registered trademark of Monolithic Memories.)

Parasitic effects—The effects caused by undesired resistive, capacitive, or inductive elements inherent in the material or topology of a track or component.

PCB (printed circuit board)—A type of circuit board that has conducting tracks superimposed, or “printed,” on one or both sides and may also contain internal signal layers and power and ground planes. An alternative name—printed wire board (PWB)—is commonly used in America.

Peta—Unit qualifier (symbol = P) representing one thousand million million, or 10^15. For example, 3 PHz stands for 3 × 10^15 hertz.

PGA (1) (pad grid array)—A packaging technology in which a device’s external connections are arranged as an array of conducting pads on the base of the package. (2) (pin grid array)—A packaging technology in which a device’s external connections are arranged as an array of conducting leads, or pins, on the base of the package.

Phase-locked loop—see PLL

Physically aware synthesis—For most folks, physically aware synthesis means taking actual placement information associated with the various logical elements in the design, using this information to estimate accurate track delays, and using these delays to fine-tune the placement and perform other optimizations.
Interestingly enough, physically aware synthesis commences with a first-pass run using a relatively traditional logic/HDL synthesis engine. See also Logic synthesis.

Photo-mask—A sheet of material carrying patterns that are either transparent or opaque to the ultraviolet (UV) light used to create structures on the surface of an integrated circuit (IC).

Pico—Unit qualifier (symbol = p) representing one millionth of one millionth, or 10^-12. For example, 3 ps stands for 3 × 10^-12 seconds.

Pin grid array—see PGA

PLA (programmable logic array)—The most user configurable of the traditional programmable logic devices because both the AND and OR arrays are programmable (see also PAL, PLD, and PROM).

PLD (programmable logic device)—An integrated circuit (IC) whose internal architecture is predetermined by the manufacturer, but which is created in such a way that it can be configured (programmed) by engineers in the field to perform a variety of different functions. For the purpose of this book, the term PLD is assumed to encompass both simple PLDs (SPLDs) and complex PLDs (CPLDs). In comparison to an FPGA, these devices contain a relatively limited number of logic gates, and the functions they can be used to implement are much smaller and simpler.

PLI (programming-language interface)—One very cool concept that accompanied Verilog (the language) and Verilog-XL (the simulator) was the Verilog programming-language interface, or PLI. The more generic name for this sort of thing is application programming interface (API). An API is a library of software functions that allow external software programs to pass data into an application and access data from that application. Thus, the Verilog PLI is an API that allows users to extend the functionality of the Verilog language and simulator.

PLL (phase-locked loop)—Some FPGA clock managers are based on phase-locked loops (PLLs).
PLLs have been used since the 1940s in analog implementations, but recent emphasis on digital methods has made it desirable to process signals digitally. Today’s PLLs can be implemented using either analog or digital techniques. See also DLL.

PMOS (P-channel MOS)—Refers to the order in which the semiconductor is doped in a MOSFET device, that is, which structures are constructed as P-type versus N-type material.

PNP (P-type–N-type–P-type)—Refers to the order in which the semiconductor is doped in a bipolar junction transistor (BJT).

Positive edge—A signal transition from a logic 0 to a logic 1.

Power plane—A conducting layer in or on the substrate providing power to the components. There may be several power planes separated by insulating layers.

Pragma—An abbreviation for “pragmatic information” that refers to special pseudocomment directives inserted in source code (including C/C++ and HDL code) that can be interpreted and used by parsers/compilers and other tools. (Note that this is a general-purpose term, and pragma-based techniques are used by a variety of tools in addition to formal verification technology.)

Primitives—Simple logic functions such as BUF, NOT, AND, NAND, OR, NOR, XOR, and XNOR. These may also be referred to as primitive logic gates.

Printed circuit board—see PCB

Printed wire board—see PWB

Procedural—In the context of formal verification, the term procedural refers to an assertion/property/event/constraint that is described within the context of an executing process or set of sequential statements such as a VHDL process or a Verilog “always” block (thus, these are sometimes called “in-context” assertions/properties). In this case, the assertion/property is built into the logic of the design and will be evaluated based on the path taken through a set of sequential statements.
Product-of-sums—A Boolean equation in which all of the maxterms corresponding to the lines in the truth table for which the output is a logic 0 are combined using AND operators.
Product term—A set of literals linked by an AND operator.
Programmable array logic—see PAL
Programmable logic array—see PLA
Programmable logic device—see PLD
Programmable read-only memory—see PROM
Programming-language interface—see PLI
PROM (programmable read-only memory)—A programmable logic device in which the OR array is programmable, but the AND array is predefined. Usually considered to be a memory device whose contents can be electrically programmed (once) by the designer (see also PAL, PLA, and PLD).
Properties/assertions—see Assertions/properties
Property-specification language—see PSL
Pseudorandom—An artificial sequence of values that gives the appearance of being random, but which is also repeatable.
PSL (property-specification language)—A formal verification language that has been specially constructed for the purpose of specifying assertions/properties with maximum efficiency. PSL is very powerful in creating complex regular and temporal expressions, and it allows complex behavior to be specified with very little code. This industry standard language, which is controlled by the Accellera organization (www.accellera.org), is based on IBM’s Sugar language. See also OVA, Sugar, and SVA.
P-type—A piece of semiconductor doped with impurities that make it amenable to accepting electrons.
PWB (printed wire board)—A type of circuit board that has conducting tracks superimposed, or “printed,” on one or both sides and may also contain internal signal layers and power and ground planes. An alternative name—printed circuit board (PCB)—is predominantly used in Europe and Asia.
QFP (quad flat pack)—The most commonly used package in surface mount technology to achieve a high lead count in a small area. Leads are presented on all four sides of a thin square package.
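As an illustration of the Product-of-sums entry above, the following Python sketch derives a product-of-sums expression from a truth table. The function name `product_of_sums`, the two-input XOR example, and the use of `&`, `|`, and `!` for AND, OR, and NOT in the printed expression are assumptions made for this example only.

```python
from itertools import product


def product_of_sums(truth_table, names):
    """Return (expression text, evaluator) for the product-of-sums form.

    truth_table maps each tuple of input values to an output of 0 or 1.
    """
    # One maxterm per truth-table line whose output is a logic 0.
    maxterms = [inputs for inputs, out in sorted(truth_table.items()) if out == 0]

    def literal(name, value):
        # Inside a maxterm, an input that is 1 on that line appears inverted.
        return f"!{name}" if value else name

    text = " & ".join(
        "(" + " | ".join(literal(n, v) for n, v in zip(names, row)) + ")"
        for row in maxterms
    ) or "1"

    def evaluate(*inputs):
        # Each maxterm is 0 only on its own line; the maxterms are ANDed.
        return int(all(any(bit != v for bit, v in zip(inputs, row))
                       for row in maxterms))

    return text, evaluate


# Hypothetical example: a two-input XOR function.
xor_table = {(a, b): a ^ b for a, b in product((0, 1), repeat=2)}
pos_text, pos_eval = product_of_sums(xor_table, ["a", "b"])
print(pos_text)
```

For the XOR table, the output is 0 on the lines (0, 0) and (1, 1), so the expression contains exactly two maxterms, and evaluating it reproduces the original truth table.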
Quad flat pack—see QFP
Quantization—(1) Part of the process by which an analog signal is converted into a series of digital values. First of all, the analog signal is sampled at specific times. For each sample, the complete range of values that the analog signal can assume is divided into a set of discrete bands or quanta. Quantization refers to the process of determining which band the current sample falls into. See also Sampling. (2) The process of changing floating-point representations into their fixed-point equivalents.
RAM (random-access memory)—A data-storage device from which data can be read out and into which new data can be written. Unless otherwise indicated, the term RAM is typically taken to refer to a semiconductor device in the form of an integrated circuit (IC).
Random-access memory—see RAM
Read-only memory—see ROM
Read-write memory—see RWM
Real estate—Refers to the amount of area available on a substrate.
Register transfer level—see RTL
Rising edge—see Positive edge
ROM (read-only memory)—A data storage device from which data can be read out, but into which new data cannot be written. Unless otherwise indicated, the term ROM is typically taken to refer to a semiconductor device in the form of an integrated circuit (IC).
RTL (register transfer level)—A hardware description language (HDL) is a special language that is used to capture (describe) the functionality of an electronic circuit. In the case of an HDL intended to represent digital circuits, such a language may be used to describe the functionality of the circuit at a variety of different levels of abstraction. The simplest level of abstraction is that of a gate-level netlist, in which the functionality of the digital circuit is described as a collection of primitive logic gates (AND, OR, NAND, NOR, etc.) and the connections between them. A more sophisticated (higher) level of abstraction is referred to as register transfer level (RTL).
In this case, the circuit is described as a collection of storage elements (registers), Boolean equations, control logic such as if-then-else statements, and complex sequences of events (e.g., “If the clock signal goes from 0 to 1, then load register A with the contents of register B plus register C”). The most popular languages used for capturing designs in RTL are VHDL and Verilog (with SystemVerilog starting to gain a larger following).
RWM (read-write memory)—An alternative (and possibly more appropriate) name for a random-access memory (RAM).
Sampling—Part of the process by which an analog signal is converted into a series of digital values. Sampling refers to observing the value of the analog signal at specific times. See also Quantization.
Schematic—Common name for a circuit diagram.
Sea of cells—Popular name for a channelless gate array.
Sea of gates—Popular name for a channelless gate array.
Seed value—An initial value loaded into a linear feedback shift register (LFSR) or random-number generator.
Semiconductor—A special class of material that can exhibit both conducting and insulating properties.
Sequential logic—A digital function whose output values depend not only on its current input values, but also on previous input values. That is, the output value depends on a “sequence” of input values. See also Combinational logic.
Signal conditioning—Amplifying, filtering, or otherwise processing a (typically analog) signal.
Signature—Refers to the checksum value from a cyclic redundancy check (CRC) when used in the guided-probe form of functional test.
Signature analysis—A guided-probe functional-test technique based on signatures.
Silicon chip—Although a variety of semiconductor materials are available, the most commonly used is silicon, and integrated circuits (ICs) are popularly known as “silicon chips,” or simply “chips.”
Simple PLD—see SPLD
Single sided—A printed circuit board (PCB) with tracks on one side only.
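The Sampling and Quantization entries in this glossary describe two steps of analog-to-digital conversion. The Python sketch below walks through both steps under stated assumptions: the 1 Hz sine wave, the signal range of -1.0 to +1.0, the eight sample times, and the eight bands are all hypothetical values chosen for illustration.

```python
import math


def sample(signal, sample_times):
    """Sampling: observe the value of the analog signal at specific times."""
    return [signal(t) for t in sample_times]


def quantize(value, lo, hi, num_bands):
    """Quantization: return the index of the band the sample falls into."""
    if not lo <= value <= hi:
        raise ValueError("value outside the signal range")
    width = (hi - lo) / num_bands
    # The very top of the range is assigned to the last band.
    return min(int((value - lo) / width), num_bands - 1)


def analog(t):
    # Hypothetical 1 Hz sine wave swinging between -1.0 and +1.0.
    return math.sin(2 * math.pi * t)


# Eight sample times spread across one period of the signal.
times = [(n + 0.5) / 8 for n in range(8)]
samples = sample(analog, times)

# Eight bands, so each sample becomes a value from 0 to 7 (a 3-bit code).
codes = [quantize(v, -1.0, 1.0, 8) for v in samples]
print(codes)
```

Using more bands gives a finer approximation of each sample at the cost of more bits per digital value.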
Skin effect—The phenomenon where, in the case of high-frequency signals, electrons only propagate on the outer surface (the “skin”) of a conductor.
Slice—The Xilinx term for an intermediate logical partition/entity between a logic cell (LC) and a configurable logic block (CLB). Why “slice”? Well, they had to call it something, and—whichever way you look at it—the term slice is “something.” At the time of this writing, a slice contains two LCs. See also CLB, LAB, LC, and LE.
SoC (system on chip)—As a general rule of thumb, a SoC is considered to refer to an integrated circuit (IC) that contains both hardware and embedded software elements. In the not-so-distant past, an electronic system was typically composed of a number of ICs, each with its own particular function (say a microprocessor, a communications function, some memory devices, etc.). For many of today’s high-end applications, however, all of these functions may be combined on a single device, such as an ASIC or FPGA, which may therefore be referred to as a system on chip.
Soft core—In the context of digital electronics, the term core is typically used to refer to a relatively large, general-purpose logic function that may be used as a building block forming a portion of a much larger chip design. For example, if an ASIC contains an embedded microprocessor, that microprocessor would be referred to as a “microprocessor core.” Other functions that might fall into this category are microcontroller cores, digital signal processor (DSP) cores, communication function cores (e.g., a UART), and so forth. Such cores may be developed internally by the design team, but they are often purchased from third-party intellectual property (IP) vendors. In the case of a soft core, the logical functionality of the core is often provided as RTL VHDL/Verilog. In this case, the core will be synthesized and then placed-and-routed along with the other blocks forming the design.
(In some cases the core might be provided in the form of a gate-level netlist or as a schematic, but these options are rare and extremely rare, respectively.) One advantage of a soft core is that it may be customizable by the end user; for example, it may be possible to remove or modify certain subfunctions if required.
There is some difference in how the term soft core is perceived, depending on the target implementation technology: ASIC or FPGA. In the case of an ASIC, and assuming that the soft core is provided in RTL, the core is synthesized into a gate-level netlist along with the other RTL associated with the design. The logic gates forming the resulting gate-level netlist are then placed-and-routed, the results being used to generate the photo-masks that will, in turn, be used to fabricate the silicon chip. This means that the ultimate physical realization of the core will be in the form of hard-wired logic gates (themselves formed from transistors) and the connections between them. By comparison, in the case of an FPGA, the resulting netlist will be used to generate a configuration file that will be used to program the lookup tables and configurable logic blocks inside the device. A design may comprise one or more soft cores combined with one or more hard cores, along with other blocks of user-defined logic. See also Hard core.
Soft IP—In the context of an FPGA, the term soft IP refers to a source-level library of high-level functions that can be included in users’ designs. These functions are typically represented using a hardware description language (HDL) such as Verilog or VHDL at the register transfer level (RTL) of abstraction. Any soft IP functions the design engineers decide to use are incorporated into the main body of the design, which is also specified in RTL, and subsequently synthesized down into a group of programmable logic blocks (possibly combined with some hard IP blocks like multipliers, etc.). See also Hard IP and Firm IP.
Software—Refers to programs, or sequences of instructions, that are executed by hardware.
Solder—An alloy of tin and lead with a comparatively low melting point used to join less fusible metals. Typical solder contains 60 percent tin and 40 percent lead; increasing the proportion of lead results in a softer solder with a lower melting point, while decreasing the proportion of lead results in a harder solder with a higher melting point.
Specification-based verification coverage—This measures verification activity with respect to items in the high-level functional or macroarchitecture definition. This includes the I/O behaviors of the design, the types of transactions that can be processed (including the relationships of different transaction types to each other), and the data transformations that must occur. See also Macroarchitecture definition, Microarchitecture definition, and Implementation-level coverage.
SPLD (simple PLD)—Originally all PLDs contained a modest number of equivalent logic gates and were fairly simple. These devices include PALs, PLAs, PROMs, and GALs. As more complex PLDs (CPLDs) arrived on the scene, however, it became common to refer to their simpler cousins as simple PLDs (SPLDs).
SRAM (static RAM)—A memory device in which the core of each cell is formed from four or six transistors configured as a latch or a flip-flop. The term static is used because, once a value has been loaded into an SRAM cell, it will remain unchanged until it is explicitly altered or until power is removed from the device.
Standard cell—A form of application-specific integrated circuit (ASIC), which, unlike a gate array, does not use the concept of a basic cell and does not have any prefabricated components. The ASIC vendor creates custom photo-masks for every stage of the device’s fabrication, allowing each logic function to be created using the minimum number of transistors.
State diagram—A graphical representation of the operation of a state machine.
State machine—see FSM
State variable—One of a set of registers whose values represent the current state occupied by a state machine.
Static formal verification—Formal verification tools that examine 100 percent of the state space without having to simulate anything. Their disadvantage is that they can typically be used for small portions of the design only because the state space increases exponentially with complex properties and one can quickly run into “state space explosion” problems. See also Formal verification and Dynamic formal verification.
Static RAM—see SRAM
Structured ASIC—A form of application-specific integrated circuit (ASIC) in which an array of identical modules (or tiles) is prefabricated across the surface of the device. These modules may contain a mixture of generic logic (implemented as gates, multiplexers, or lookup tables), one or more registers, and possibly a little local RAM. Due to the level of sophistication of the modules, the majority of the metallization layers are also predefined. Thus, many structured ASIC architectures require the customization of only two or three metallization layers (in one case, it is necessary to customize only a single via layer). This dramatically reduces the time and cost associated with creating the remaining photo-masks used to complete the device.
Sum-of-products—A Boolean equation in which all of the minterms corresponding to the lines in the truth table for which the output is a logic 1 are combined using OR operators.
SVA (SystemVerilog Assertions)—The original Verilog did not include an assert statement, but SystemVerilog has been augmented to include this capability. Furthermore, in 2002, Synopsys donated its OpenVera Assertions (OVA) to the Accellera committee in charge of SystemVerilog. The SystemVerilog folks are taking what they want from OVA and mangling the syntax and semantics a tad.
The result of this activity may be referred to as SystemVerilog Assertions, or SVA.
Synchronous—(1) A signal whose data is not acknowledged or acted upon until the next active edge of a clock signal. (2) A system whose operation is synchronized by a clock signal.
Synthesis—see Logic synthesis and Physically aware synthesis.
Synthesizable subset—When hardware description languages (HDLs) such as Verilog and VHDL were first conceived, it was with tasks like simulation and documentation in mind. One slight glitch was that logic simulators could work with designs specified at high levels of abstraction that included behavioral constructs, but early synthesis tools could only accept functional representations up to the level of RTL. Thus, design engineers are obliged to work with a synthesizable subset of their HDL of choice. See also HDL and RTL.
System gate—One of the problems FPGA vendors run into occurs when they are trying to establish a basis for comparison between their devices and ASICs. For example, if someone has an existing ASIC design that contains 500,000 equivalent gates, and they wish to migrate this design into an FPGA implementation, how can they tell if their design will “fit” into a particular FPGA? In order to address this issue, FPGA vendors started talking about “system gates” in the early 1990s. Some folks say that this was a noble attempt to use terminology that ASIC designers could relate to, while others say that it was purely a marketing ploy that doesn’t do anyone any favors.
System on chip—see SoC
SystemVerilog—A hardware description language (HDL) that, at the time of this writing, is an open standard managed by the Accellera organization (www.accellera.org).
SystemVerilog Assertions—see SVA
Tap—A register output used to generate the next data input to a linear feedback shift register (LFSR).
Tera—Unit qualifier (symbol = T) representing one million million, or 10¹².
For example, 3 THz stands for 3 × 10¹² hertz.
Tertiary—Base-3 numbering system.
Tertiary digit—A numeral in the tertiary scale of notation. Often abbreviated to “trit,” a tertiary digit can adopt one of three states: 0, 1, or 2.
Tertiary logic—An experimental technology in which logic gates are based on three distinct voltage levels. The three voltages are used to represent the tertiary digits 0, 1, and 2 and their logical equivalents False, True, and Maybe.
Time of flight—The time taken for a signal to propagate from one logic gate, integrated circuit (IC), or optoelectronic component to another.
Toggle—Refers to the contents or outputs of a logic function switching to the inverse of their previous logic values.
Trace—see Track
Track—A conducting connection between electronic components. May also be called a trace or a signal. In the case of integrated circuits (ICs), such interconnections are often referred to collectively as metallization.
Transistor—A three-terminal semiconductor device that, in the digital world, can be considered to operate like a switch.
Tri-state function—A function whose output can adopt three states: 0, 1, and Z (high impedance). The function does not drive any value in the Z state and, when in this state, the function may be considered to be disconnected from the rest of the circuit.
Trit—Abbreviation of tertiary digit. A tertiary digit can adopt one of three values: 0, 1, or 2.
Truth table—A convenient way to represent the operation of a digital circuit as columns of input values and their corresponding output responses.
TTL (transistor-transistor logic)—Logic gates implemented using particular configurations of bipolar junction transistors (BJTs).
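The Tri-state function and Truth table entries above can be combined in one small Python sketch that tabulates a tri-state buffer. The active-high enable input and the names `tri_state_buffer` and `truth_table` are assumptions made for this illustration; real devices may use an active-low enable instead.

```python
from itertools import product


def tri_state_buffer(enable, data):
    """Model an active-high tri-state buffer (an assumed polarity).

    When enabled, the buffer drives its data input (0 or 1); when
    disabled, it drives nothing, which is shown here as 'Z'.
    """
    return data if enable else "Z"


def truth_table(func, num_inputs):
    """List (inputs, output) rows for every combination of 0/1 inputs."""
    return [(row, func(*row)) for row in product((0, 1), repeat=num_inputs)]


rows = truth_table(tri_state_buffer, 2)
print("enable data | out")
for (enable, data), out in rows:
    print(f"{enable:^6} {data:^4} | {out}")
```

Both rows with enable at 0 show Z regardless of the data input, which matches the description of the function being disconnected from the rest of the circuit in that state.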
Transistor-transistor logic—see TTL
UDL/I—In the case of the popular HDLs, Verilog was originally designed with simulation in mind, while VHDL was created as a design documentation and specification language with simulation being taken into account. The end result is that one can use both of these languages to describe constructs that can be simulated, but not synthesized. In order to address these problems, the Japan Electronic Industry Development Association (JEIDA) introduced its own HDL called the Unified Design Language for Integrated Circuits (UDL/I) in 1990. The key advantage of UDL/I was that it was designed from the ground up with both simulation and synthesis in mind. The UDL/I environment includes a simulator and a synthesis tool and is available for free (including the source code). However, by the time UDL/I arrived on the scene, Verilog and VHDL already held the high ground, and this language never really managed to attract much interest outside of Japan.
µC (microcontroller)—A microprocessor augmented with special-purpose inputs, outputs, and control logic like counter timers.
µP (microprocessor)—A general-purpose computer implemented on a single integrated circuit (IC) (or sometimes on a group of related chips called a chipset).
ULA (uncommitted logic array)—One of the original names used to refer to gate-array devices. This term has largely fallen into disuse.
Uncommitted logic array—see ULA
Vaporware—Refers to either hardware or software that exists only in the minds of the people who are trying to sell it to you.
Verilog—A hardware description language (HDL) that was originally proprietary, but which has evolved into an open standard under the auspices of the IEEE.
VHDL—A hardware description language (HDL) that came out of the American Department of Defense (DoD) and has evolved into an open standard. VHDL is an acronym for VHSIC HDL (where VHSIC is itself an acronym for “very high-speed integrated circuit”).
Via—A hole filled or lined with a conducting material, which is used to link two or more conducting layers in a substrate.
VITAL—The VHDL language is great at modeling digital circuits at a high level of abstraction, but it has insufficient timing accuracy to be used in sign-off simulation. For this reason, the VITAL initiative was launched at the Design Automation Conference (DAC) in 1992. Standing for VHDL Initiative toward ASIC Libraries, VITAL was an effort to enhance VHDL’s abilities for modeling timing in ASIC and FPGA design environments. The end result encompassed both a library of ASIC/FPGA primitive functions and an associated method for back-annotating delay information into these library models.
Volatile—Refers to a memory device that loses any data it contains when power is removed from the system, for example, random-access memory in the form of SRAM or DRAM.
Word—A group of signals or logic functions performing a common task and carrying or storing similar data; for example, a value on a computer’s data bus can be referred to as a “data word” or “a word of data.”

About the Author

Clive “Max” Maxfield is 6'1" tall, outrageously handsome, English, and proud of it. In addition to being a hero, trendsetter, and leader of fashion, he is widely regarded as an expert in all aspects of electronics (at least by his mother). After receiving his B.Sc. in control engineering in 1980 from Sheffield Polytechnic (now Sheffield Hallam University), England, Max began his career as a designer of central processing units for mainframe computers. To cut a long story short, Max now finds himself president of TechBites Interactive (www.techbites.com). A marketing consultancy, TechBites specializes in communicating the value of technical products and services to nontechnical audiences through such mediums as Web sites, advertising, technical documents, brochures, collaterals, books, and multimedia.
In his spare time (Ha!), Max is coeditor and copublisher of the Web-delivered electronics and computing hobbyist magazine EPE Online (www.epemag.com) and a contributing editor to www.eedesign.com. In addition to writing numerous technical articles and papers that have appeared in magazines and at conferences around the world, Max is also the author of Bebop to the Boolean Boogie (An Unconventional Guide to Electronics) and Designus Maximus Unleashed (Banned in Alabama) and coauthor of Bebop BYTES Back (An Unconventional Guide to Computers) and EDA: Where Electronics Begins. On the off-chance that you’re still not impressed, Max was once referred to as an “industry notable” and a “semiconductor design expert” by someone famous, who wasn’t prompted, coerced, or remunerated in any way!
115 Analog-to-digital 217 Antifuse(s) 12 -based FPGAs 61, 101 Anti-Miller Effect 441 API 164 Application Programming interface—see API -specific integrated circuit—see ASIC standard part—see ASSP Applicon 140 Architectural definition Macroarchitecture 193 Microarchitecture 193 Architecturally-aware design flow Architectures (FPGA) 57 ARM 241 ARM9 385 ASIC 2, 42 -based SVP 180 gate-level 180, 181 cluster-level 183 RTL-level 184 cell 45 design flow HDL-based 157 159 528 ■ The Design Warrior's Guide to FPGAs ASIC (continued) schematic-based 141 -FPGA hybrids 53 full custom 42 gate arrays 44 channeled 44 channel-less 44 sea-of-cells 45 sea-of-gates 45 standard cell devices 46 structured ASIC 47 -to-FPGA migration 296 versus FPGA design styles 121 ASMBL 424 Assertion-based verification—see ABV Assertion/property coverage 340 Assertions versus properties 330 Assisted Technology 41 ASSP 2 design starts 3 Asynchronous structures 126 Atmel Corp. 115, 376 ATPG 131 Augmented C/C++ based design flow 205 Automatic test pattern generation—see ATPG Automobile (pipelining example) 122 Auto-skew correction 88 Axis Systems xvi, 257 B Ball grid array—see BGA Bardeen, John 26 Bard, The 73 Basic cell 44 Baud rate 362 BDD 415 Bell Labs 26 BFM 323 BGA 269 Bigit 15 Billion 420 BIM 256 Binary Decision diagrams—see BDD digit 14 Binit 15 BIRD75 273 Birkner, John 41 BIST 131, 480 Bit 14 file 99 Bitstream configuration bitstream 99 encryption 61 BJT 26 Block -based design 262 (embedded) RAMs 78 BoardLink Pro 270 Bob Sproull 182 Bogatin, Dr. Eric xvi, 429 Boolean Algebra 154 Boole. George 154 Boundry scan 112 Branch coverage 339 Brattain, Walter 26 Built-in self-test—see BIST Bus functional model—see BFM interface model—see BIM C C54xx 385 C/C++ -based design flows 193 augmented C/C++ based 205 pure C/C++ based 209 SystemC-based 198 model of CPU 253 Cache logic 376 CAD 44, 141 Cadence Design Systems xvi, 117, 165, 257 CAE 140 Calma 140 Capt. 
Edward Murphy 169 Car (pipelining example) 122 Carbon Design Systems Inc. xvi, 338 Index Carol Lewis xv Carry chains, fast 77 Cell—see ASIC cell and Basic cell Cell library 45 Celoxica Ltd. Xvi, 118, 206 Certify 294 Channeled ASICs 44 Channel-less ASICs 44 CheckerWare Library 334, 336 Chemical mechanical polishing—see CMP Chipscope 281 Chrysalis Symbolic Design Inc. 327 CIDs 360 CLAM 281 Claude Shannon 154 CLB 76 Clock balancing 127 domains 127 enabling 128 gating 128 managers 85 recovery 367 trees 84 Cluster-level SVP 183 clusters/clustering 183 CMOS 26 CMP 320 Coarse-grained 55, 66, 381 CODEC 218, 422 Code coverage 339, 412 assertion/property coverage 340 branch coverage 339 condition coverage 339 Covered (utility) 412 expression coverage 339 functional coverage 340 implementation-level coverage 340 property/assertion coverage 340 specification-level coverage 340 state coverage 339 Co-Design Automation 170 Combinational logic 31, 71 529 loops 126 Comma characters/detection 364 ComputerVision 140 Condition coverage 339 Constraints (formal verification) 330 Combinatorial logic—see combinational logic Commented directives 205 Complementary metal-oxide semiconductor— see CMOS Complex PLD—see CPLD Computer-aided design—see CAD engineering—see CAE Configurable I/O 90 impedances 91, 273 logic analyzer module—see CLAM logic block—see CLB stuff 364 Configuration bitstream 99 cells 99 commands 99 data 99 file 99 modes 105, 106, 113 port 102, 105 Configuring/programming FPGAs 99 bit file 99 configuration bitstream 99 cells 99 commands 99 data 99 file 99 modes 105, 106, 113 port 102, 105 JTAG port 111 parallel load (FPGA as master) 108 (FPGA as slave) 110 serial load (FPGA as master) 106 (FPGA as slave) 111 530 ■ The Design Warrior's Guide to FPGAs Configuring/programming FPGAs (continued) via embedded processor 113 Confluence 401 Consecutive identical digits—see CIDs Constants (using wisely) 174 Core 46 generators 290 hard cores 81, 241 ARM 241 MIPS 241 PowerPC 241 soft 
cores 83, 243 MicroBlaze 244 Nios 244 PicoBlaze 244 Q90C1xx 244 voltage 91 CoreConnect 241 Covered (utility) 412 CoWare 219, 243 CPLD 2, 28, 37 first CPLD 37 CRC 477 Crosstalk 430 induced delay effects 435 glitches 433 CUPL 41, 156 Cuproglobin 432 CVS 409 Cycle-based simulation 311 Cyclic redundancy check—see CRC D Daisy 141 Dark ages 40 Data I/O 41 David Harris 182 Daya Nadamuni xv DCM 85 Debussy 313, 326 Deck (of cards) 134 Declarative 332 Deep submicron—see DSM 58 DEF 186 Delay chains 127 formats/models 306 3-band delays 310 inertial delays 309 transport delays 309 -locked loop—see DLL Design capture/entry (graphical) 161 Compiler FPGA 294 exchange format—see DEF flows architecturally-aware 159 C/C++ based 193 augmented C/C++ based 205 pure C/C++ based 209 SystemC-based 198 DSP-based 218 embedded processor-based 239 HDL/RTL-based 154 ASIC (early) 157 FPGA (early) 158 schematic-based 134 ASIC (early) 141 FPGA (early) 143 (today) 151 inSIGHT 327 starts ASIC 3 FPGA 3 under test—see DUT VERIFYer 327 DesignPlayer 338 Device selection (FPGA) 343 diff 409 Differential pairs 354 Digital clock manager—see DCM delay-locked loop—see DLL signal processing/processor—see DSP -to-analog 218 Index Dijkstra, Edsger Wybe 413 Dillon Engineering Inc. 118, 397 Tom 351 Dinotrace 412 Distributed RAM 72 RC model 450 DLL 88, 128 Domain-specific language—see DSL DRAM 21, 28 first DRAM 28 Dr. Eric Bogatin xvi, 429 Dr. Gerard Holzmann 414 DSL 226 DSM 58, 435, 443 delay effects 443 DSP -based design flows 217 hardware implementation 221 software implementation 219 DTA 321 Dual-port RAM 77 Dummer, G.W.A 27 DUT 322 Dynamic formal 329, 335 RAM—see DRAM timing analysis—see DTA Dynamically reconfigurable interconnect 373 logic 373 E e (verification language/environment) Eagles (and jet engines) 99 ECL 26, 309 EDGE 383 EDIF 194, 289 Edsger Wybe Dijkstra 413 Edward Murphy, Capt. 169 EEPLD 20, 29 EEPROM 19 -based FPGAs 64 325 531 EETimes xv Elanix Inc. 
xvi, 118, 219, 232 Electrically erasable PLD—see EEPLD programmable read-only memory— see EEPROM Electronic system level—see ESL EMACS 162, 408 Embedded adders 79 MACs 79 multipliers 79 processor -based design flow 239 cores 80 hard cores 81 soft cores 83 RAMs 78 Emitter-coupled logic—see ECL Encoding schemes 64-bit/66-bit (64b/66b) 360 8-bit/10-bit (8b/10b) 358 SONET Scrambling 360 Encryption 476 EPLD 19, 20 EPROM 17 -based FPGAs 64 Equalization 366 Equivalency checking 327 Equivalent gates 95 Erasable PLD—see EPLD programmable read-only memory—see EPROM Error-Correcting Codes (book) 469 ESL 246 Event -driven simulation 299 wheel 300 Events (formal verification) 331 Exilent Ltd. 116, 382 Expression coverage 339 Eye diagrams 369 mask 370 532 ■ The Design Warrior's Guide to FPGAs F Fabric 57 Faggin, Frederica 28 Fairchild Semiconductor 27, 43 Fast -and-dirty synthesis 180 carry chains 77 Fourier Transform—see FFT Signal Database—see FSDB FET—see MOSFET FFT 68, 389, 399 Fibre Channel 357 Field -effect transistor—see MOSFET programmable analog array—see FPAA gate array—see FPGA interconnect chips—see FPIC devices—see FPID node array—see FPNA FIFO 335 LFSR applications 472 Fine -grained 54, 66, 381 -tooth comb 297 Fintronic USA Inc. 118 First CPLD 37 DRAM 28 FPGA 25 -in first-out—see FIFO Integrated circuit 27 Microprocessor 28 PLD 28 Silicon Solutions Inc. 
118, 281 SRAM 28 Transistor 26 Fixed-point representations 229 FLASH -based FPGAs 64 memory 20 PLD 29 Flat schematics 148 Floating gate 17 -point representations 228 unit—see FPU Flows, design architecturally-aware 159 C/C++ based 193 augmented C/C++ based 205 pure C/C++ based 209 SystemC-based 198 DSP-based 218 embedded processor-based 239 HDL/RTL-based 154 ASIC (early) 157 FPGA (early) 158 schematic-based 134 ASIC (early) 141 FPGA (early) 143 (today) 151 Flying Circus 409 Formal verification 326, 413 assertions versus properties 330 constraints 330 declarative 332 dynamic formal 329, 335 equivalency checking 327 events 331 model checking 327 procedural 331 properties versus assertions 330 special languages 332 OVA 336 PSL 337 Sugar 336 static formal 329, 334 FORTRAN 41, 228 FPAA 115, 423 FPGA 1, 49 antifuse-based 61, 101 applications 4 architectures 57 Index -ASIC hybrids 53 -based SVP 187 bitstream encryption 61 CLB 76 clock managers 85 trees 84 configurable I/O 90 impedances 91, 273 configuring 99 bit file 99 configuration bitstream 99 cells 99 commands 99 data 99 file 99 modes 105, 106, 113 port 102, 105 JTAG port 111 parallel load (FPGA as master) 108 (FPGA as slave) 110 serial load (FPGA as master) 106 (FPGA as slave) 111 via embedded processor 113 DCM 85 design flow HDL-based 158 schematic-based 143, 151 device selection 343 EEPROM-based 64 EPROM-based 64 Exchange 271 first FPGAs 25 FLASH-based 64 future developments 420 general-purpose I/O 90 gigabit transceivers 92, 354 hard cores 81 Hybrid FLASH-SRAM-based 65 I/O 90 LAB 76 LC 74 LE 75 LUT 69, 101 -based 69 mux-based 68 origin of FPGAs 25 platform FPGAs 53 programming—see configuring rad hard 62 security issues 60 slice 75 soft cores 83 speed grades 350 SRAM-based 59, 102 -to-ASIC migration 294 -to-FPGA migration 293 versus ASIC design styles 121 years 98 FPIC 374 FPID 374 FPNA 116, 381 ACM 388 PicoArray 384 FPU 397 FR4 439 Frederica Faggin 28 Fred-in-the-shed 3 Fredric Heiman 26 Frequency synthesis 86 
FSDB 304 Full custom ASICs 42 Functional coverage 340 representations 155 verification 133 Fusible links 10 Future Design Automation 205
G
Gain-based synthesis 181 GAL 36 Gartner DataQuest xv Gary Smith xv Gated clocks 128 Gate Array ASICs 44 -level abstraction 154 netlist 134 SVP 180, 181 Gates equivalent gates 95 system gates 95 Gateway Design Automation 163 gcc 408 General-purpose I/O 90 Generic array logic—see GAL GenToo 119 Linux 410 Geometry 58 George Boole 154 Germanium 26 GHDL 303 Gigabit transceivers 92, 354 clock recovery 367 comma characters/detection 364 configurable stuff 364 differential pairs 354 encoding schemes 64-bit/66-bit (64b/66b) 360 8-bit/10-bit (8b/10b) 358 SONET Scrambling 360 equalization 366 eye diagrams 369 ganging multiple blocks 362 jitter 369 pre-emphasis 365 standards 357 10-gigabit Ethernet 357 Fibre Channel 357 InfiniBand 357 PCI Express 357 RapidIO 357 SkyRail 357 Giga Test Labs xvi Gilbert Hyatt 28 Glitch 433 Global reset/initialization 129 Glue logic 4 GNU 408 Goering, Richard xv GOLD code generator 389 Graphical design entry 161 Granularity coarse-grained 55, 66, 381 fine-grained 54, 66, 381 medium-grained 55, 381 Green Hills Software Inc. 118 grep 410 Groat 119 GTKWave 412 Guided probe 479 Guido Van Rossum 409 gvim 408 G.W.A. Dummer 27
H
Handel-C 206 Hard cores 81, 241 ARM 241 MIPS 241 PowerPC 241 Hardware description language—see HDL modeler 254 verification language—see HVL Harris, David 182 Harris Semiconductor 15 Hawkins, Tom xvi HDL 153 RTL 155, 303 Superlog 170 SystemC 171, 198 SystemVerilog 170 assert statement 336 UDL/I 169 Verilog 163 VHDL 165, 167 VITAL 167 wars 169 HDL/logic synthesis 160, 314 HDL/RTL-based design flow 154 ASIC (early) 157 FPGA (early) 158 Heiman, Fredric 26 Heinrich Rudolf Hertz 86 Hemoglobin 432 Hertz 86 Heinrich Rudolf 86 Hierarchical schematics 149 Hier Design Inc. 
xvi, 118, 188, 265 High-impedance 304 HILO logic simulator 163 Hoerni, Jean 27 Hoff, Marcian “Ted” 28 Hofstein, Steven 26 HOL 416 Holzmann, Dr. Gerard 414 Hot (high energy) electron injection 18 HVL 325 Hyatt, Gilbert 28 Hybrid FLASH-SRAM-based FPGAs 65
I
IBIS (versus SPICE) 272 IC 27 first IC 27 Icarus 119 Verilog 411 IDE 244 IEEE 1076 167 IEEE 1364 166 Implementation-level coverage 340 Incisive 257 Incremental design 263 place-and-route 190 Inertial delay model 309 InfiniBand 357 In-place optimization—see IPO Instruction set simulator—see ISS In-system programmable—see ISP Integrated circuit—see IC development environment—see IDE Intel 17, 28 Intellectual property—see IP International Research Corporation 28 Inter-symbol interference—see ISI InTime Software xvi, 185 I/O 90 IP 46, 287 core generators 290 ParaCore Architect 397 System Generator 235, 291 firm IP 94 hard IP 93 open source IP 417 soft IP 94 sources of IP 287 IPflex Inc. 116, 382 IPO 185 ISI 360 ISP 1 ISS 254 Italian Renaissance 40 Ivan Sutherland 182
J
Jack Kilby 27 Japan Electronic Industry Development Association—see JEIDA Jean Hoerni 27 JEDEC 41 JEIDA 169 Jelly-bean devices 27 logic 1 Jiffy 421 Jitter 86, 369 John Bardeen 26 Birkner 41 Wilder Tukey 14, 15 JTAG 132, 251 port 111 Jurassic 443
K
Kilby, Jack 27
L
LAB 76 Language reference manual—see LRM Latches 129 Latch inference 174 Latency 125 Lattice Semiconductor Corp. 115 Launchbird Design Systems Inc. xvi, 118, 401 LC 74 LE 75 LEF 186 Leopard Logic Inc. 
115 Levels of logic 125 Lewis, Carol xv LFSR 389, 465 BIST applications 480 CRC applications 477 encryption applications 476 many-to-one 465 maximal length 467 one-to-many 469 previous value 475 pseudo-random numbers 482 seeding 470 taps 465 Library cell library 45 symbol library 141 Linear feedback shift register—see LFSR Linus Torvalds 407 Linux 407 LISP 408 Literal 33 Logic analyzers (virtual) 280 array block—see LAB cell—see LC element—see LE levels 125 simulation 134 cycle-based 311 event-driven 299 HILO 163 Verilog-XL 163 synthesis 160, 314 Logical effort (the book) 182 exchange format—see LEF Logic/HDL synthesis 160, 314 Lookup table—see LUT Loops, combinational 126 LRM 166 Lumped load model 449 LUT 50, 69, 101 3, 4, 5, or 6-input 71 as distributed RAM 72 as shift register 73 -based FPGAs 69
M
MAC 80 Macroarchitecture definition 193 Magma Design Automation xvi, 182 Magnetic RAM—see MRAM tunnel junction—see MJT make (utility) 408 MandrakeSoft 410 Many-to-one LFSRs 465 Mapping 144 Marcian “Ted” Hoff 28 Mask-programmed devices 14 Mask—see photo-mask MATLAB 219, 226 M-code 226 M-files 226 Maximal length LFSRs 467 Mazor, Stan 28 MCM 82, 241 M-code 226 Medium-grained 55, 381 MegaPAL 37 Memory devices 14 Mentor Graphics Corp. xv, 117, 141, 209, 257 Metalization layers 14, 134 Metal-oxide semiconductor field-effect transistor—see MOSFET MetaPRL 416 M-files 226 Micromatrix 43 Micromosaic 43 Microarchitecture definition/exploration 193, 223 MicroBlaze 244 Microprocessor 28 first microprocessor 28 Micros 1 Miller Effect 438 MIPS 241 Mixed-language designs 169 environments/simulation 214, 236, 305 MJT 23 Model checking 327 ModelSim 215, 306 Modes, configuration 105, 106, 113 Modular design 262 Monolithic Memories Inc. 36, 37 Monty Python 409 Moorby, Phil 163 MOSFET 26 Motorola 116, 382 MPEG 383 MRAM 22, 63, 426 Multichip module—see MCM Multipliers, embedded 79 Multiply-and-accumulate—see MAC Murphy, Capt. 
Edward 169 Murphy’s Law 169 Mux-based FPGAs 68
N
Nadamuni, Daya xv Nano 58 Negative slack 317 Netlist, gate-level 134 Nexar 257 Nibble—see nybble Nios 244 NMOS 26 Nobel Peace Prize 98 Nonrecurring engineering—see NRE volatile 14 Novas Software Inc. 118, 304, 313, 326 Noyce, Robert 27 NRE 3 Nibble 108 NuSMV 406, 415
O
OCI 280 OEM 116 On-chip instrumentation—see OCI One -hot encoding 131, 334 -time programmable—see OTP -to-many LFSRs 469 OpenCores 417 Open Source IP 417 tools 407 SystemC Initiative—see OSCI Vera Assertions—see OVA Verification Library—see OVL Verilog International—see OVI OpenSSH 410 OpenSSL 410 Original equipment manufacturer—see OEM Origin of FPGAs 25 OSCI 198 OTP 1, 12 Ouroboros 465 OVA 336 OVI 166 OVL 337, 417
P
Packing 145 PACT XPP Technologies AG 116, 382 PAL 36 MegaPAL 37 PALASM 41, 156 ParaCore Architect 397 Parallel load (FPGA as master) 108 (FPGA as slave) 110 Patent (EP0437491-B1) 296 PCB 239, 267 PCI 94 Express 357 Performance analysis 340 PERL 409 PGA 267 Phase -locked loop—see PLL shifting 87 Phil Moorby 163 Physically-aware synthesis 161, 314 Photo-mask 14 PHY 357 Physical layer—see PHY PicoBlaze 244 PicoArray 384 PicoChip Designs Ltd. 
xvi, 116, 382, 384 Pilkington 421 Alastair 421 Microelectronics—see PMEL Pin grid array—see PGA Pipelining 122, 123 wave pipelining 124 PLA 33 Place-and-route 146 incremental 190 Platform FPGAs 53 PLD 2 GAL 36 PAL 36 PLA 33 PROM 15, 30 PLI 164 PLL 88, 128 PMEL 421 PMOS 26 Point-contact transistor 26 Positive slack 317 PowerPC 241 Pragma 205, 332 Pragmatic information—see pragma Precision C 209 Pre-emphasis 365 Printed circuit board—see PCB Procedural 331 Processor cores, embedded 80 hard cores 81 soft cores 83 Process (technology) node 58 Product term 33 sharing 35 Programmable array logic—see PAL logic array—see PLA device—see PLD read-only memory—see PROM Programming FPGAs—see configuring programming language interface—see PLI PROMELA 404, 414 Property/assertion coverage 340 Properties versus assertions 330 Property specification language—see PSL Pseudo-random numbers 482 PSL 337 Pure C/C++ based design flow 209 LC model 450 Python 405, 409, 413
Q
Q90C1xx 244 QoR 159 Quagmire (system gates) 97 Quality-of-Results—see QoR Quantization 229 Quartz window 19 QuickLogic Corp. 71, 115 QuickSilver Technology Inc. 
xvi, 116, 382, 388
R
Rad-hard 62 Radiation 62 RAM 14 block (embedded) RAM 78 dual-port RAM 77 embedded (block) RAM 78 single-port RAM 77 Random access memory—see RAM (also DRAM, MRAM, and SRAM) RapidIO 357 RC 5, 374 cache logic 376 dynamically reconfigurable interconnect 373 logic 373 virtual hardware 376 RCA 26, 27 Read-only memory—see ROM Real-time operating system—see RTOS Reconfigurable computing—see RC Red Hat 410 Register transfer level—see RTL Reincarnation (accidental) 73 Renaissance 40 Replication 316 Resource sharing 130, 175, 222 Resynthesis 316 Retiming 316 Reverberating notchet tattles 121 Richard Goering xv RLC model 451 Robert Noyce 27 ROM 14 Rossum, Guido Van 409 RTL 155, 303 -level SVP 184 RTOS 196, 246
S
SATS 391 Schematic(s) -based design flow 134 ASIC (early) 141 FPGA (early) 143 (today) 151 flat 148 hierarchical 149 SDF 147, 164, 304 Seamless 257 Sea-of-cells/gates 45 Security issues 60 Secret squirrel mode 388 Seeding LFSRs 470 Serial load (FPGA as master) 106 (FPGA as slave) 111 Shadow registers 476 Shannon, Claude 154 Shockley, William 26 SI 272, 429 Signal integrity—see SI SignalTap 281 Signatures 480 Signetics 41 Silicon 26 Explorer II 280 virtual prototype—see SVP SilverC 394 Silverware 394 Simple PLD—see SPLD Simpod Inc. 
254 Simucad Inc. 118 Simulation cycle-based 311 event-driven 299 primitives 301 Simulink 219, 394 Single-port RAM 77 Sirius 383 SkyRail 357 Slack 182, 317 Slice 75 Smith, Gary xv SoC 4 Soft cores 83, 243 MicroBlaze 244 Nios 244 PicoBlaze 244 Q90C1xx 244 Software 15 SONET Scrambling 360 SPARK C-to-VHDL 209 Spatial and temporal segmentation—see SATS Special formal verification languages 332 OVA 336 PSL 337 Sugar 336 Specification-level coverage 340 Specman Elite 326 SPEEDCompiler 338 Speed grades (FPGAs) 350 SPICE (versus IBIS) 272 SPIN (model checker) 406, 414 SPLD 2, 28 Sproull, Bob 182 SRAM 21, 28 -based FPGAs 59, 102 first SRAM 28 SSTA 319 STA 147, 306, 319 Standard cell ASICs 46 delay format—see SDF Stan Mazor 28 State coverage 339 machine encoding 131 one-hot 131 Static formal 329, 334 RAM—see SRAM timing analysis—see STA Statistical static timing analysis—see SSTA Stephen Williams 411 Steven Hofstein 26 Stripe, The 81 Structural representations 155 Structured ASICs 47 Sugar 336 Sum-of-products 33 Superlog 170 Sutherland, Ivan 182 SVP ASIC-based 180 gate-level 180, 181 cluster-level 183 RTL-level 184 FPGA-based 187 SWIFT interface/models 253 Switch-level 154 Symbol library 141 Symbols (in data transmission) 360 Synopsys Inc. xvi, 117, 294 Synplicity Inc. xvi, 118, 294, 297 Synthesis fast-and-dirty 180 gain-based 181 HDL/logic 160, 314 logic/HDL 160, 314 physically-aware 161, 314 replication 316 resynthesis 316 retiming 316 Synthesizable subset 166 System gates 95 Generator 235, 291 HILO 310 -level design environments 227 representations 156 -on-Chip—see SoC SystemC 171, 198 -based design flow 198 model of CPU 253 SystemVerilog 170 assert statement 336 Systolic algorithms 67
T
Tap-dancers 122 Taps 465 TDM 130 TDMA 383 Technology node 58 Tenison Technology Ltd. 338, 412 Tertiary logic 304, 325 Testbench 235 Texas Instruments 27 The Mathworks Inc. 
xvi, 118, 219 Three-letter acronym—see TLA Throw a wobbly 29 Timed C domain 214 Time-division multiple access—see TDMA multiplexing—see TDM Timing analysis/verification 133 dynamic timing analysis—see DTA static timing analysis—see STA TLA 6 Tom Dillon 351 Hawkins xvi Torvalds, Linus 407 TPS 416 TransEDA PLC 118, 323 Transistor 26 bipolar junction transistor—see BJT field-effect transistor—see MOSFET -transistor logic—see TTL Transmission line effects 441 Transport delay model 309 Triple redundancy design 62 Tri-state buffers 176 Trit 304 TTL 26, 309 Tukey, John Wilder 14, 15 Turing Alan 221 -complete 389 Machine 221
U
UDL/I 169 UDSM 58, 435, 443 delay effects 443 ULA 44 Ultradeep submicron—see UDSM Ultraviolet—see UV Uncommitted logic array—see ULA Untimed C domain 214 UV 19
V
Valid 141 Value change dump—see VCD Variety halls 122 VCD 304, 326, 411 Vera 336 Verdi 313, 326 Verification environments 324 e 325 OpenVera 336 Vera 336 formal—see formal verification functional 133 IP 322 Reuse 329 timing 133 Verilator 412 Verilog Icarus Verilog 411 OVI 166 the language 163 the simulator 163 Verilog 2001 (2K1) 167 Verilog 2005 167 Verilog 95 167 Verilog-XL 163 Verisity Design Inc. xvi, 118, 325 VHDL 165, 167 International 170 VITAL 167 VHSIC 167 VI 161, 408 Virtual hardware 376 logic analyzers 280 Machine Works 282 VirtualWires 282 Visibility into the design 250, 277 multiplexing 278 special circuitry 280 virtual logic analyzers 280 VirtualWires 282 Visual interface—see VI VITAL 167 Volatile 14 VTOC 338, 412
W
Walsh code generator 389 Walter Brattain 26 Wave pipelining 124 W-CDMA 383 Weasels (and jet engines) 99 Wideband code division multiple access—see W-CDMA William Shockley 26 Williams, Stephen 411 Wind River Systems Inc. 118 Work functions 186 Wortsel Grinder Mark 4 121 Wrapper (node) 389
X
X (unknown) 304 XAUI 363 Xblue architecture 425 Xilinx Inc. 
xv, 25, 115, 119, 235, 424 CLB 76 DCM 85 LC 74 slice 75 XM Radio 383 XoC 257
Y
Years, FPGA years 98
Z
Z (high-impedance) 304
ELSEVIER SCIENCE CD-ROM LICENSE AGREEMENT
PLEASE READ THE FOLLOWING AGREEMENT CAREFULLY BEFORE USING THIS CD-ROM PRODUCT. THIS CD-ROM PRODUCT IS LICENSED UNDER THE TERMS CONTAINED IN THIS CD-ROM LICENSE AGREEMENT (“Agreement”). BY USING THIS CD-ROM PRODUCT, YOU, AN INDIVIDUAL OR ENTITY INCLUDING EMPLOYEES, AGENTS AND REPRESENTATIVES (“You” or “Your”), ACKNOWLEDGE THAT YOU HAVE READ THIS AGREEMENT, THAT YOU UNDERSTAND IT, AND THAT YOU AGREE TO BE BOUND BY THE TERMS AND CONDITIONS OF THIS AGREEMENT. ELSEVIER SCIENCE INC. (“Elsevier Science”) EXPRESSLY DOES NOT AGREE TO LICENSE THIS CD-ROM PRODUCT TO YOU UNLESS YOU ASSENT TO THIS AGREEMENT. IF YOU DO NOT AGREE WITH ANY OF THE FOLLOWING TERMS, YOU MAY, WITHIN THIRTY (30) DAYS AFTER YOUR RECEIPT OF THIS CD-ROM PRODUCT RETURN THE UNUSED CD-ROM PRODUCT AND ALL ACCOMPANYING DOCUMENTATION TO ELSEVIER SCIENCE FOR A FULL REFUND.
DEFINITIONS
As used in this Agreement, these terms shall have the following meanings: “Proprietary Material” means the valuable and proprietary information content of this CD-ROM Product including all indexes and graphic materials and software used to access, index, search and retrieve the information content from this CD-ROM Product developed or licensed by Elsevier Science and/or its affiliates, suppliers and licensors. “CD-ROM Product” means the copy of the Proprietary Material and any other material delivered on CD-ROM and any other human-readable or machine-readable materials enclosed with this Agreement, including without limitation documentation relating to the same.
OWNERSHIP
This CD-ROM Product has been supplied by and is proprietary to Elsevier Science and/or its affiliates, suppliers and licensors. 
The copyright in the CD-ROM Product belongs to Elsevier Science and/or its affiliates, suppliers and licensors and is protected by the national and state copyright, trademark, trade secret and other intellectual property laws of the United States and international treaty provisions, including without limitation the Universal Copyright Convention and the Berne Copyright Convention. You have no ownership rights in this CD-ROM Product. Except as expressly set forth herein, no part of this CD-ROM Product, including without limitation the Proprietary Material, may be modified, copied or distributed in hardcopy or machine-readable form without prior written consent from Elsevier Science. All rights not expressly granted to You herein are expressly reserved. Any other use of this CD-ROM Product by any person or entity is strictly prohibited and a violation of this Agreement.
SCOPE OF RIGHTS LICENSED (PERMITTED USES)
Elsevier Science is granting to You a limited, non-exclusive, non-transferable license to use this CD-ROM Product in accordance with the terms of this Agreement. You may use or provide access to this CD-ROM Product on a single computer or terminal physically located at Your premises and in a secure network or move this CD-ROM Product to and use it on another single computer or terminal at the same location for personal use only, but under no circumstances may You use or provide access to any part or parts of this CD-ROM Product on more than one computer or terminal simultaneously. 
You shall not (a) copy, download, or otherwise reproduce the CD-ROM Product in any medium, including, without limitation, online transmissions, local area networks, wide area networks, intranets, extranets and the Internet, or in any way, in whole or in part, except that You may print or download limited portions of the Proprietary Material that are the results of discrete searches; (b) alter, modify, or adapt the CD-ROM Product, including but not limited to decompiling, disassembling, reverse engineering, or creating derivative works, without the prior written approval of Elsevier Science; (c) sell, license or otherwise distribute to third parties the CD-ROM Product or any part or parts thereof; or (d) alter, remove, obscure or obstruct the display of any copyright, trademark or other proprietary notice on or in the CD-ROM Product or on any printout or download of portions of the Proprietary Materials.
RESTRICTIONS ON TRANSFER
This License is personal to You, and neither Your rights hereunder nor the tangible embodiments of this CD-ROM Product, including without limitation the Proprietary Material, may be sold, assigned, transferred or sub-licensed to any other person, including without limitation by operation of law, without the prior written consent of Elsevier Science. Any purported sale, assignment, transfer or sublicense without the prior written consent of Elsevier Science will be void and will automatically terminate the License granted hereunder.
TERM
This Agreement will remain in effect until terminated pursuant to the terms of this Agreement. You may terminate this Agreement at any time by removing from Your system and destroying the CD-ROM Product. Unauthorized copying of the CD-ROM Product, including without limitation, the Proprietary Material and documentation, or otherwise failing to comply with the terms and conditions of this Agreement shall result in automatic termination of this license and will make available to Elsevier Science legal remedies. 
Upon termination of this Agreement, the license granted herein will terminate and You must immediately destroy the CD-ROM Product and accompanying documentation. All provisions relating to proprietary rights shall survive termination of this Agreement.
LIMITED WARRANTY AND LIMITATION OF LIABILITY
NEITHER ELSEVIER SCIENCE NOR ITS LICENSORS REPRESENT OR WARRANT THAT THE INFORMATION CONTAINED IN THE PROPRIETARY MATERIALS IS COMPLETE OR FREE FROM ERROR, AND NEITHER ASSUMES, AND BOTH EXPRESSLY DISCLAIM, ANY LIABILITY TO ANY PERSON FOR ANY LOSS OR DAMAGE CAUSED BY ERRORS OR OMISSIONS IN THE PROPRIETARY MATERIAL, WHETHER SUCH ERRORS OR OMISSIONS RESULT FROM NEGLIGENCE, ACCIDENT, OR ANY OTHER CAUSE. IN ADDITION, NEITHER ELSEVIER SCIENCE NOR ITS LICENSORS MAKE ANY REPRESENTATIONS OR WARRANTIES, EITHER EXPRESS OR IMPLIED, REGARDING THE PERFORMANCE OF YOUR NETWORK OR COMPUTER SYSTEM WHEN USED IN CONJUNCTION WITH THE CD-ROM PRODUCT. If this CD-ROM Product is defective, Elsevier Science will replace it at no charge if the defective CD-ROM Product is returned to Elsevier Science within sixty (60) days (or the greatest period allowable by applicable law) from the date of shipment. Elsevier Science warrants that the software embodied in this CD-ROM Product will perform in substantial compliance with the documentation supplied in this CD-ROM Product. If You report a significant defect in performance in writing to Elsevier Science, and Elsevier Science is not able to correct same within sixty (60) days after its receipt of Your notification, You may return this CD-ROM Product, including all copies and documentation, to Elsevier Science and Elsevier Science will refund Your money. 
YOU UNDERSTAND THAT, EXCEPT FOR THE 60-DAY LIMITED WARRANTY RECITED ABOVE, ELSEVIER SCIENCE, ITS AFFILIATES, LICENSORS, SUPPLIERS AND AGENTS, MAKE NO WARRANTIES, EXPRESSED OR IMPLIED, WITH RESPECT TO THE CD-ROM PRODUCT, INCLUDING, WITHOUT LIMITATION THE PROPRIETARY MATERIAL, AND SPECIFICALLY DISCLAIM ANY WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. If the information provided on this CD-ROM contains medical or health sciences information, it is intended for professional use within the medical field. Information about medical treatment or drug dosages is intended strictly for professional use, and because of rapid advances in the medical sciences, independent verification of diagnosis and drug dosages should be made. IN NO EVENT WILL ELSEVIER SCIENCE, ITS AFFILIATES, LICENSORS, SUPPLIERS OR AGENTS, BE LIABLE TO YOU FOR ANY DAMAGES, INCLUDING, WITHOUT LIMITATION, ANY LOST PROFITS, LOST SAVINGS OR OTHER INCIDENTAL OR CONSEQUENTIAL DAMAGES, ARISING OUT OF YOUR USE OR INABILITY TO USE THE CD-ROM PRODUCT REGARDLESS OF WHETHER SUCH DAMAGES ARE FORESEEABLE OR WHETHER SUCH DAMAGES ARE DEEMED TO RESULT FROM THE FAILURE OR INADEQUACY OF ANY EXCLUSIVE OR OTHER REMEDY.
U.S. GOVERNMENT RESTRICTED RIGHTS
The CD-ROM Product and documentation are provided with restricted rights. Use, duplication or disclosure by the U.S. Government is subject to restrictions as set forth in subparagraphs (a) through (d) of the Commercial Computer Restricted Rights clause at FAR 52.227-19 or in subparagraph (c)(1)(ii) of the Rights in Technical Data and Computer Software clause at DFARS 252.227-7013, or at 252.211-7015, as applicable. Contractor/Manufacturer is Elsevier Science Inc., 655 Avenue of the Americas, New York, NY 10010-5107 USA.
GOVERNING LAW
This Agreement shall be governed by the laws of the State of New York, USA. 
In any dispute arising out of this Agreement, you and Elsevier Science each consent to the exclusive personal jurisdiction and venue in the state and federal courts within New York County, New York, USA.
Source Exif Data:
File Type : PDF
File Type Extension : pdf
MIME Type : application/pdf
PDF Version : 1.4
Linearized : No
Encryption : Standard V1.2 (40-bit)
User Access : Fill forms, Extract, Assemble, Print high-res
Page Count : 560
About : uuid:4c7a6cae-156c-4b3e-ac20-1195493094f8
Mod Date : 2004:03:18 04:48:29-06:00
Creation Date : 2004:03:16 16:46:41-05:00
Subject : Devices, Tools and Flows
Author : Clive "Max" Maxfield
Modify Date : 2004:03:18 04:48:29-06:00
Create Date : 2004:03:16 16:46:41-05:00
Metadata Date : 2004:03:18 04:48:29-06:00
Document ID : uuid:4ca0b5f8-bd89-4f7a-a824-579d4debf0c0
Format : application/pdf
Title : The Design Warrior's Guide to FPGAs
Description : Devices, Tools and Flows
Creator : Clive "Max" Maxfield
Page Mode : UseOutlines
Has XFA : No
Producer : Acrobat Distiller 4.05 for Windows