BURROUGHS SCIENTIFIC PROCESSOR

PARALLELISM - THE DESIGN STRATEGY FOR THE BSP

CONTENTS

AN OVERVIEW
    BSP Objective
    BSP System
    BSP Key Features
    BSP Organization
    BSP Characteristics
        Parallel Processor
        Conflict-free Memory Access
        Vector Performance
        Performance Optimization
        File Memory
        Vectorizing FORTRAN Compiler
    BSP Design
    BSP Superiority

IN PERSPECTIVE
    The BSP - A New Approach
    Linear Vectors
    A Different Kind of Supercomputer
        System Manager
        Overlapped Instruction Mode
        Linear Vector Approach to Parallelism
        Scalar Operations
        BSP Approach to Scalars
    The BSP Design
        I/O Subsystem
        Computational Envelope
        File Memory
    Summary

PARALLEL ARCHITECTURE
    Parallelism
    Templates
    Arithmetic Elements
    Conflict-free Memory Access
    Parallel Processor Control Unit
    Scalar Processing Unit
    BSP Software


PARALLEL ARCHITECTURE

PARALLELISM

The capability of the Burroughs Scientific Processor (BSP) to sustain high processing rates is achieved via unique parallel designs. The BSP comprises multiple processors arranged to operate in parallel, so that their combined potential is brought to bear on large computational applications.

Figure 3 illustrates the overall architecture of the BSP. Four types of parallelism are featured within this architecture; that is, four different classes of computation occur simultaneously. They are:

1. The arithmetic performed by the 16 arithmetic elements (AE's),
2. Memory fetches and stores, and the transmission of data between memory and the AE's,
3. Indexing, vector length, and loop control computations in the parallel processor control unit,
4. The generation of linear vector operation descriptions, which takes place in the scalar processor unit (SPU).

The BSP is analogous to an efficiently operated business. The SPU and its control memory are the executive suite. The executive's instructions are passed to the administrative clerks in the parallel processor control unit. This unit then does the bookkeeping and keeps all the major components of the business as busy and as efficient as possible.

Figure 3. BSP Block Diagram. (The diagram shows the file memory - a file storage unit of 4-64 M words behind a file memory controller - linked to the system manager by data and program file transfers at 1.5 M bytes/sec and to parallel memory at 75 M bytes/sec; the control unit, comprising the parallel processor control unit, the scalar processor, a control memory of 256 K words, and the control and maintenance unit carrying MCP and maintenance communications; and the parallel memory of 0.5-8 M words, connected through the alignment network at 100 M words/sec in each direction to the 16 parallel arithmetic elements, which deliver 50 million floating-point operations per second.)

A fallout from the use of CCD's is excellent reliability. While disc errors are likely to be multiple-bit errors, CCD errors, with proper partitioning, are typically single bits and, therefore, easily corrected and bypassed using Hamming codes. The BSP file memory features single-error correction, double-error detection (SEC/DED) on all storage and data paths.
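The correction step can be made concrete with a small example. The following is a minimal sketch of a Hamming (7,4) code - illustrative only, and much narrower than the words the BSP actually protects, which also carry a parity bit for double-error detection - showing how a syndrome both detects and locates a single failed bit. The data values are invented.

C     MINIMAL SKETCH OF HAMMING (7,4) SINGLE-ERROR CORRECTION.
C     ILLUSTRATIVE ONLY; THE BSP'S ACTUAL CODE WORDS ARE WIDER
C     AND ADD A PARITY BIT FOR DOUBLE-ERROR DETECTION.
      PROGRAM SECDEM
      INTEGER C(7), S, P, I, K
C     DATA BITS OCCUPY POSITIONS 3, 5, 6, 7.
      C(3) = 1
      C(5) = 0
      C(6) = 1
      C(7) = 1
C     EVEN-PARITY CHECK BITS OCCUPY POSITIONS 1, 2, 4.
      C(1) = IEOR(IEOR(C(3), C(5)), C(7))
      C(2) = IEOR(IEOR(C(3), C(6)), C(7))
      C(4) = IEOR(IEOR(C(5), C(6)), C(7))
C     FLIP ONE BIT, AS A FAILING STORAGE CELL WOULD.
      C(6) = 1 - C(6)
C     THE SYNDROME IS THE POSITION OF THE FAILED BIT.
      S = 0
      DO 20 K = 0, 2
         P = 0
         DO 10 I = 1, 7
            IF (IAND(I, 2**K) .NE. 0) P = IEOR(P, C(I))
   10    CONTINUE
         S = S + P * 2**K
   20 CONTINUE
      IF (S .NE. 0) C(S) = 1 - C(S)
      WRITE (*, *) 'SYNDROME =', S, '  CORRECTED WORD:', C
      END

Run as written, the syndrome comes out 6, naming exactly the bit that was flipped.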
The maximum file memory available on the BSP is 67,108,864 words (nominally 64 million words, where a "million" is 2**20). The smallest file memory size is 4 million words. In certain circumstances, some files may overflow file memory. For this reason, an additional file attribute is provided, allowing the user to specify that a file is to be "chaptered", with only one chapter resident in file memory at any given time. The operating system automatically transfers chapters between the file memory and the discs on the system manager when the user "releases" a given chapter. The operating system assumes that such files are sequential and double-buffers the chapters, unless the user asks for a chapter out of sequence.

SUMMARY

Figure 1 shows the BSP connected to a B 7800 or a B 7700 system manager and illustrates that the BSP is the realization of the computational envelope (Figure 2). The high-speed I/O transfers occur inside the BSP, between main memory and file memory. New jobs are staged to the file memory, and output from finished jobs is staged to the system manager from the file memory. Figure 1 also shows some specialized communication paths between the BSP and the system manager. These are used for operating system communications, for performance logging, for hardware error logging, and for maintenance and diagnostic purposes.

The connection to the B 7700 or B 7800 is through a standard I/O port. Hence, if a B 7700 owner wished to attach a BSP, he would install the BSP, connect the cables to a B 7700 I/O processor, recompile the operating system with the BSP option set, and go.

It is evident from the way in which the BSP is connected to the system manager, and from the arguments upon which the computational envelope approach is based, that normal job flow through the BSP is first-in/first-out. However, priority overrides are provided. These are primarily for job debugging, because the system manager will be providing the text editing, compiling, file management, etc., that constitute the usual time-sharing load on a scientific processing system.

The file memory controller is the key to fast file memory response, user control, low operating system overhead, and file security. On a file-open command by a user, the operating system in the BSP is invoked. This system determines the user's access rights to the file and then sets status bits in the file memory controller to correspond with these access rights. Subsequent references to the file by the user are done with in-line code in user mode, since the file memory controller will not respond to an improper request. There are two potential "users": the current job running on the BSP, and the system manager. Both are treated in essentially the same way, although, in the case of dealings with the system manager, the BSP operating system must also manage file memory space allocation before it responds to a system manager request, and space deallocation after the system manager has removed a file from file memory.

The file memory is paged, and file references are logical addresses, which the file memory controller translates to physical addresses. Hence, a file need not occupy contiguous pages in file memory.

Figure 2. Scientific Problem I/O Characteristics. (The diagram shows the computational envelope: the scientific processor and its main memory joined to secondary storage - the file memory - by high-speed I/O under envelope control, with moderate-speed I/O crossing the envelope boundary to and from backing storage.)
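A minimal sketch of the translation step, with an invented page table and page size (the document does not give the BSP's actual page parameters): the controller splits a logical word address into a page number and an offset, and only the page number is mapped.

C     LOGICAL-TO-PHYSICAL FILE ADDRESS TRANSLATION (ILLUSTRATIVE:
C     THE PAGE SIZE AND TABLE CONTENTS HERE ARE INVENTED).
      PROGRAM PAGMAP
      INTEGER PTAB(4), PSIZE, LADDR, PHYS
C     LOGICAL PAGES 0..3 LIVE IN SCATTERED PHYSICAL PAGES.
      DATA PTAB /7, 2, 5, 0/
      PSIZE = 1024
      LADDR = 2500
      PHYS = PTAB(LADDR/PSIZE + 1)*PSIZE + MOD(LADDR, PSIZE)
      WRITE (*, *) 'LOGICAL', LADDR, ' MAPS TO PHYSICAL', PHYS
      END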
THE BSP DESIGN

The major BSP design elements include the system manager, the I/O subsystem, the parallel main memory, the arithmetic elements, the scalar processor, the parallel processor control, and the control and maintenance processor. Also included are BSP software, job flow, and the various user interfaces.

I/O Subsystem

In scientific computations, the dominant I/O patterns differ radically from those in the business data processing arena. In business data processing, small numbers of operations are performed on a very large data base. Also, the amount of main memory required to process a business job efficiently is relatively small. Hence, in business data processing, I/O becomes a bottleneck because of the limited number of operations performed on data while it resides in main memory. But, short of placing the entire data base in main memory, a given job does not demand too much memory to execute with adequate efficiency. This is an ideal environment for fostering multiprogramming. Many problems may reside in main memory at once. A few will be in an active state; the rest will be waiting on I/O.

The situation is quite different in the case of scientific computations. A given job usually requires a large amount of memory before it can execute efficiently. With present processor speeds and memory sizes, the larger bread-and-butter jobs execute best if each one has main memory to itself. In the case of many scientific jobs, some of the data on secondary storage is best regarded as an overflow of main memory - this data is what would not fit in main memory, but the programmer really wishes it were there. Hence, this overflow data is quite tightly coupled to the processing of data in main memory, and the programmer may want to exercise a great degree of control over the I/O process.

Compare such a situation with business data processing. In business data processing, the programmer is delighted to have sophisticated operating systems doing his I/O for him. And he is not concerned if the operating system is trying to optimize I/O for all the jobs in the mix. The scientific programmer resents such a situation. He wants as much memory as he can get, and then he wants direct control over I/O whenever feasible. For this reason, and due to details of particular hardware systems, many scientific programmers have reported spending the bulk of their programming effort trying to optimize I/O.

Such a state of affairs is unfortunate, because the overall flow of the high-speed I/O in most scientific problems is very simple. If the scientific programmer were not simultaneously battling an operating system, as well as often inadequate I/O devices, he could describe his I/O needs with a few simple statements.

In contrast with these difficulties, the scientific programmer has certain advantages which are irrelevant to the commercial programmer. For example, his file sizes are relatively small. Of course, immense file sizes may be a consideration in both cases for large problems. In general, however, scientific problems require much smaller files. Also, the scientific program performs more operations on each data word it retrieves from secondary storage. Further, the scientific programmer can typically state the flow of I/O in advance; that is, the high-speed I/O flow is not usually data-dependent. In other words, efficient double-buffering I/O schemes are normally applicable.
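The double-buffering pattern is simple enough to sketch. In the outline below - illustrative only, with invented routines READS and CRUNCH standing for whatever fills a buffer from secondary storage and whatever consumes it - the program computes on one buffer while the next one is being filled, then swaps the two. On the BSP the fill would proceed concurrently with the computation; here a serial call stands in for it.

C     SKETCH OF A DOUBLE-BUFFERED I/O LOOP. "READS" AND "CRUNCH"
C     ARE INVENTED STAND-INS; ON THE BSP THE TRANSFER WOULD
C     PROCEED WHILE THE AE'S COMPUTE ON THE OTHER BUFFER.
      PROGRAM DBLBUF
      INTEGER NB
      PARAMETER (NB = 1024)
      INTEGER NBLK, K, CUR, NXT, T
      REAL BUF(NB, 2)
      NBLK = 8
      CUR = 1
      NXT = 2
      CALL READS(BUF(1, CUR), NB)
      DO 10 K = 1, NBLK
C        START FILLING THE OTHER BUFFER, THEN COMPUTE ON THIS ONE.
         IF (K .LT. NBLK) CALL READS(BUF(1, NXT), NB)
         CALL CRUNCH(BUF(1, CUR), NB)
         T = CUR
         CUR = NXT
         NXT = T
   10 CONTINUE
      END

      SUBROUTINE READS(B, N)
      INTEGER N, I
      REAL B(N)
      DO 10 I = 1, N
   10 B(I) = I
      END

      SUBROUTINE CRUNCH(B, N)
      INTEGER N, I
      REAL B(N), S
      S = 0.0
      DO 10 I = 1, N
   10 S = S + B(I)
      WRITE (*, *) 'BLOCK SUM =', S
      END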
Computational Envelope

How did all this affect BSP I/O design? The BSP design is based on the premise that the high-speed I/O and storage requirements be specified in what is called the computational envelope. The performance of the secondary I/O subsystem is designed to be sufficient to support the processor and main memory. This performance is completely under user control. Finally, for simplicity, a single I/O device, rather than a hierarchy, is used for this secondary storage system. (See Figure 2.)

Although the scientific program makes more extensive use of I/O data than does the business data program, the speed of present supercomputers is so great that no conventional I/O device can support them. Also, the access times associated with conventional devices are much too long. Because access time is a discontinuity that must be smoothed in I/O operations, slow access times imply large I/O buffers. If, at the same time, the transfer rate must be increased, then the buffers must be still larger. For many problems simulated in designing the BSP, cache buffer sizes would have approached half a million words if disc technology were used for the secondary storage.

File Memory

Hence, the BSP secondary storage, called file memory, is based on semiconductor technology - 64K-bit charge-coupled device (CCD) technology, to be specific. The results are average latencies well under one millisecond and sustainable transfer rates over 60 megabytes per second. Buffer sizes are reasonable, and optimum performance is attained with simple, standard FORTRAN statements. In other words, straightforward programming gets all the performance there is to get, and this performance is adequate to the task.

The BSP's memory system handles problem 2 (problems 1 through 6 here are the design issues enumerated below, under Linear Vector Approach to Parallelism). The solution to problem 3 may be inferred from the reference already made to the very high-level instruction set in the BSP. This same instruction set is part of the solution to problem 4. The high system utilization rate implied by problem 1 is gained in part by the parallel processor control unit, which is described later. And the BSP does take advantage of the emerging science of parallelism to help it gain unusual speed on linear recurrences.

Due to what has become known as the "scalar problem", there is a substantial difficulty implicit in the simultaneous solution of problems 1, 4, and 6. The problem is easily described, but difficult to resolve. For example, imagine a linear vector processor that could process vectors at infinite speed, but could process scalars no faster than one operation every microsecond. Then, if the total problem comprised 90% vector and 10% scalar processing, the vectors would be done in no time at all, but the scalars would be done at one operation per microsecond. Because only 10% of the problem would be scalars, one operation per microsecond would be divided by 0.1 to obtain an overall speed of 10 operations per microsecond on the example problem. This is not especially fast, because users now want at least 20 floating-point operations per microsecond. Yet the example is not unreasonable, because many vector machines with maximum speeds over 50 floating-point operations per microsecond have a difficult time sustaining 10 floating-point operations per microsecond.
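This arithmetic generalizes into what is now usually called Amdahl's law: the sustained rate is the harmonic mean of the vector and scalar rates, weighted by their fractions of the work. A small sketch of the computation (the rates and fractions are the example's, not measured BSP figures):

C     EFFECTIVE SPEED OF A VECTOR MACHINE WITH SLOW SCALARS.
C     FIGURES ARE THE TEXT'S EXAMPLE, NOT BSP MEASUREMENTS.
      PROGRAM AMDAHL
      REAL FVEC, RVEC, RSCAL, RATE
      FVEC  = 0.90
C     VECTOR RATE: TAKEN AS EFFECTIVELY INFINITE (1E30 OPS/US).
      RVEC  = 1.0E30
      RSCAL = 1.0
      RATE  = 1.0 / (FVEC/RVEC + (1.0 - FVEC)/RSCAL)
      WRITE (*, *) 'OVERALL RATE =', RATE, ' OPS PER MICROSECOND'
      END

The printed rate is 10 operations per microsecond: the slow tenth of the work dominates no matter how fast the vectors run.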
Scalar Operations

Before discussing potential solutions to the problem of how to do scalars fast, it is helpful first to explain what a scalar operation entails. This, however, is no simple task. First of all, some apparent scalars are not evident. For example, the memory indexing hardware on most vector computers fetches an entire linear vector based only on some simple information, such as the start of the vector, the address difference between vector elements, and the length of the vector. Similarly, the execution of a vector operation is the same as executing an inner loop of a program. This means that many indexing operations, and much of the loop overhead present in an ordinary scalar machine, are eliminated as a result of the basic idea of the linear vector processor.

But certainly, some work remains - for example, generation of the simple vector descriptors referred to previously. Is this a sequence of scalar operations? Perhaps it is. On some vector machines, nothing else can happen while a vector is processed. The instruction processor can be busy retrieving a description of the next vector operation while the present vector operation is executing. On the BSP, the SPU can describe and queue a sequence of vector operations while a given vector operation executes. Vector setup operations are countable scalars on some machines, while on other machines, setups are counted only if they cannot be overlapped with a vector operation already executing.

There are other situations in which machines are difficult to compare directly. For example, on the BSP the DO loop:

      DO I = 2, N
         A(I) = C(I) * A(I-1) + B(I)
      END DO

is executed in parallel, with a maximum speed well over 10 operations per microsecond. On other vector machines, this common construct - a linear recurrence - must be executed as a scalar sequence. And, if it is to execute rapidly, the vector machine must also contain a very fast scalar processor.
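How can such a recurrence run in parallel at all? One standard technique - a sketch of the general method; the document does not specify which algorithm the BSP hardware actually uses - is recursive doubling: each pass combines coefficient pairs at a growing distance, so N terms are folded together in about log2(N) passes, and each pass is itself a vector operation.

C     RECURSIVE DOUBLING FOR A(I) = C(I)*A(I-1) + B(I).
C     A SKETCH OF THE GENERAL METHOD, NOT THE BSP'S MICROCODE.
C     EACH "DO 20" PASS IS VECTORIZABLE; LOG2(N) PASSES SUFFICE.
      PROGRAM RECUR
      INTEGER N, I, D
      PARAMETER (N = 8)
      REAL B(N), C(N)
C     WITH C(1) = 0, B(I) BECOMES A(I) WHEN THE PASSES FINISH.
      DO 10 I = 1, N
         B(I) = I
         C(I) = 0.5
   10 CONTINUE
      C(1) = 0.0
      D = 1
   15 IF (D .GE. N) GO TO 30
C     DESCENDING I, SO EACH UPDATE SEES THE OLD VALUES AT I-D.
      DO 20 I = N, D + 1, -1
         B(I) = B(I) + C(I)*B(I-D)
         C(I) = C(I)*C(I-D)
   20 CONTINUE
      D = 2*D
      GO TO 15
   30 WRITE (*, *) 'A(1..N) =', (B(I), I = 1, N)
      END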
BSP Approach to Scalars

This is where the BSP parts company with the other recent vector machines. To solve this recurrence and some other problems, conventional wisdom at present says a fast scalar processor must be included in the design. But there are three major problems with this viewpoint. The first is that the fast scalar processor may be a high-cost item. The second problem is more insidious, but probably more severe. To the extent that the compiler must choose between the use of the scalar hardware and the vector hardware, the compiler has the job of compiling to two machines. This is probably sufficiently difficult that the compiler will be unable to generate object code for all the parallelism it has found. For example, if the scalar code is intimately dependent on the vector code or vice versa, either the hardware must have extremely clever synchronizing mechanisms to tie the processors together, or the compiler must decide that some mixed code will arbitrarily be categorized as all being of one type. The third problem is also insidious, and possibly the most costly: the arbitrary inclusion of a fast scalar processor, to solve a problem in an ad hoc way, almost guarantees that a successor machine from the same manufacturer will require a substantial reconversion effort. The successor machine is not likely to retain the structure of its predecessor.

For these reasons, although the BSP FORTRAN compiler will use the SPU for selected scalar operations, the BSP compiler is likely to treat a floating-point scalar as a vector of length one - or to treat a sequence of floating-point scalars as a nonlinear vector operation sequence. This enables the BSP to forgo the mixed blessing of the ultra-fast scalar unit. It allows the compiler to concentrate on capitalizing on detected parallelism. And it guarantees upward compatibility with a successor machine, recompilation being the maximum conversion penalty.

This approach also permits a smooth initial conversion to the BSP. In the beginning, a conversion may leave an undesirable amount of scalars. But, with uniform treatment of operands, a scalar does not have to be made part of a vector of length 100 to be processed efficiently. If it becomes part of a vector of length 3, then it is processed three times as fast as before. Vectors of length on the order of 10 are processed with reasonable efficiency. The idea is that a conversion may be done in manageable stages, with useful effect for one's efforts at each stage.

In summary, the BSP approach was to design a more general vector processor and to forgo the very fast scalar hardware. Is the science of parallelism too young for such a design? No one can say for sure. But the next few years should be revealing.

A DIFFERENT KIND OF SUPERCOMPUTER

So far, this section has attempted to explain the basic rationale behind the current crop of supercomputers, namely, the linear vector. Because of this shared rationale, the use of parallel arithmetic elements in the BSP and in the ILLIAC IV does not make them fundamentally very different from the pipeline-based supercomputers. However, one important difference has been identified: from the beginning, the BSP was intended to be paired with another processor, namely, the Burroughs B 7700/B 7800.

System Manager

In this respect, the BSP is somewhat akin to the IBM 3838 signal data processor. The IBM 3838, however, only executes functions or subroutines passed to it by its manager, whereas the BSP executes either entire programs or substantial portions of programs. Thus, the prime motivation for attaching an IBM 3838 to its manager is to enhance the power of the manager by off-loading. The basic motivation for attaching the BSP to a system manager, on the other hand, is to free the BSP to concentrate on processing large computational problems. A second motivation is to create a total system that features application and throughput capabilities not economically feasible with a single specialized processor.

Overlapped Instruction Mode

The BSP differs from its supercomputer alternatives in another important respect: its instruction processor is loosely coupled to the parallel arithmetic elements. The approach is a generalization of the overlapped instruction execution mode in ILLIAC IV, which runs more than twice as fast in overlapped mode as in nonoverlapped mode. To achieve this overlap, the BSP has a queue between the instruction processor and the unit that drives the arithmetic elements. The queue is comparable to the ILLIAC IV implementation. In contrast, however, it contains hardware features that check for out-of-bound array references and optimize the choice between inner and outer FORTRAN DO loops. The latter feature facilitates such functions as the Fast Fourier Transform (FFT), which has an inner loop whose length is decreasing while the next outer loop's length is increasing. In the BSP, this loop length optimization keeps a 256-point (or larger) FFT running at over 75% of maximum machine speed. This is because all vectors will be of length 16 or more (and hence efficient on 16 AE's), even though the programmer wrote a structure that implied vector lengths of 8, 4, 2, and 1 in the final FFT stages.
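The effect is easy to tabulate. In a radix-2 FFT of 256 points, one loop's trip count halves at each stage while the other's doubles, so whichever loop the hardware chooses as the vector loop can always have length 16 or more. The sketch below uses the textbook radix-2 loop bounds as an assumption; the BSP's internal FFT organization is not given in this document.

C     LOOP LENGTHS PER STAGE OF A 256-POINT RADIX-2 FFT.
C     ILLUSTRATES THE INNER/OUTER LOOP CHOICE; NOT BSP MICROCODE.
      PROGRAM FFTLEN
      INTEGER N, D, INNER, OUTER
      N = 256
      D = 1
   10 IF (D .GE. N) GO TO 20
      INNER = D
      OUTER = N / (2*D)
      WRITE (*, *) 'INNER', INNER, ' OUTER', OUTER,
     1             ' BEST VECTOR', MAX(INNER, OUTER)
      D = 2*D
      GO TO 10
   20 END

At every stage, the larger of the two loop lengths is at least 16, which is what lets the hardware keep all 16 AE's busy to the last stage.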
The BSP's ability to run fully overlapped surpasses that of the ILLIAC IV. Whereas the ILLIAC IV's instruction processor must call on the parallel section for instruction storage and for some arithmetic operations, the BSP's instruction processor, called the scalar processing unit (SPU), has full arithmetic capability. The SPU is also equipped with a local memory, called control memory (CM), which is used for storage of indexing parameters and vector descriptors. In total, these features extend the overlap between vector instruction processing and vector instruction execution introduced with the ILLIAC IV.

Linear Vector Approach to Parallelism

The last basic difference between the BSP and its supercomputer alternatives is perhaps the most controversial. It stems from the BSP's timing in the evolution of linear vector-based supercomputers. By the time the BSP was designed, some experience had been accumulated with the ways in which the linear vector approach to parallelism could be applied to real-world problems. In this respect, it is not unreasonable to assert that the BSP is the forerunner of a second generation of supercomputers.

What substantiation is there for this rather strong assertion? The following is a list of some ideas or problems that were understood when the BSP design started:

1. Maximum speed is not nearly as important as sustainable speed.

2. A one-dimensional memory - one that is efficient only for linear vectors whose elements are packed adjacent to one another - is not sufficiently general.

3. Assembly language level programming is almost incompatible with linear vector programming. Even the set of FORTRAN primitives cannot directly express many simple linear vector constructs. If the programmer is to think effectively about his problem at the linear vector level, he must be insulated from concern with machine details.

4. It is possible to construct FORTRAN program analyzers which find a large percentage of the intrinsic parallelism in programs. However, if the target machine structure is not simple and general at a high level, an analyzer cannot create useful object code from the parallelism it has found.

5. Although the use of parallelism still has many vestiges of black-art practice, a science is beginning to emerge. In particular, linear recurrence relations are now known to be susceptible to parallelism.

6. Conversion to a linear vector machine should be accomplished once. Any new design should consider the future, so the user will not confront difficulties converting to a successor machine.

(These are the six problems referred to earlier, in the discussion of the scalar problem.)

BURROUGHS SCIENTIFIC PROCESSOR IN PERSPECTIVE

THE BSP - A NEW APPROACH

Early in 1973, Burroughs assembled a select team to design a commercial supercomputer. In 1977, the team's efforts resulted in the Burroughs Scientific Processor (BSP) - a product that presents a new approach to large-scale computational processing. This section places the BSP in perspective, discusses its more interesting features, and describes various design trade-offs.
The BSP uses a large-scale Burroughs B 7700/B 7800 general-purpose computer as a system manager. As a result, the global structure of the BSP itself is simple. It consists of an instruction processor, a set of parallel arithmetic elements, a main memory, an instruction or control memory, and a single I/O device (Figure 1). This I/O device is called file memory. It is controlled by the BSP's instruction processor. It functions as a high-speed I/O device for programs running on the BSP and as a staging point for lower-speed I/O going to or coming from the system manager.

The BSP parallel processor has 16 parallel arithmetic elements (AE's) driven in lock-step by a single instruction stream. Hence, the BSP is a single-instruction, multiple-data stream (SIMD) architecture. In this respect, it is comparable with other large pipeline or parallel scientific processors.

LINEAR VECTORS

Single-instruction, multiple-data stream (SIMD) machines were designed to process "linear vectors". A vector is an operand consisting of a series of numbers or values. A linear vector is a vector that is mapped into storage in a linear fashion; the addresses of its constituents differ by a constant. Such vectors are the most elementary vectors that can be formed by looping structures in programming languages (DO loops, etc.). Linear vectors are naturally generated when programming language array element index expressions are linear functions of loop parameters.

It is this latter fact that has caused the SIMD architecture to emerge as the front-runner in attempts to gain increased scientific processing speed through parallelism. That is, once parallelism is selected, it must be noted that the bulk of scientific processing involves processing contained within looping structures. The simplest array element describable in looping structures is a single quantity, a scalar. However, parallelism requires operations on more than one object at once. This being so, the simplest data type susceptible to parallelism is the linear vector.

The linear vector has two significant advantages relative to other parallel operands. First, it is straightforward to design hardware that will efficiently fetch linear vectors from memory under the control of a simple vector descriptor. The second advantage is that, inside a loop structure, the same operation is specified between all the consecutive element pairs of a pair of vector operands. Together, these two advantages imply that, while operations between linear vectors can be done using parallel hardware, the control of such operations can come from a single instruction using a simple data descriptor. Consequently, the relatively simple SIMD architecture provides sufficient control capability to exploit this particular kind of parallelism.

The SIMD architecture has previously appeared in several forms: 64 processing elements, with their private memories, driven by a single instruction processor in the ILLIAC IV; sets of pipelines, each performing a part of an arithmetic operation, as in the CDC STAR, TI ASC, and CRAY-1. Regardless of the nature and method of implementation, however, all of these machines, including the BSP, have been designed to function optimally with linear vectors as operands. Hence, it is reasonable to categorize all of them as linear vector machines or, more commonly, vector machines.
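The "simple vector descriptor" amounts to three numbers: a start address, a constant skip distance, and a length. A sketch of a gather under such a descriptor (the descriptor layout here is illustrative, not the BSP's actual format), pulling a row out of an array stored columnwise:

C     FETCHING A LINEAR VECTOR UNDER A (START, SKIP, LENGTH)
C     DESCRIPTOR. THE DESCRIPTOR FORM IS ILLUSTRATIVE ONLY.
      PROGRAM LINVEC
      INTEGER N
      PARAMETER (N = 20)
      INTEGER START, SKIP, LEN, K
      REAL MEM(0:N-1), V(5)
C     A 4 BY 5 ARRAY STORED COLUMNWISE IN WORDS 0..19.
      DO 10 K = 0, N - 1
   10 MEM(K) = K
C     ROW 2 OF THE ARRAY: START 1, SKIP 4, LENGTH 5.
      START = 1
      SKIP  = 4
      LEN   = 5
      DO 20 K = 1, LEN
   20 V(K) = MEM(START + (K-1)*SKIP)
      WRITE (*, *) 'ROW 2:', (V(K), K = 1, LEN)
      END

Columns of the same array would use skip 1, and forward diagonals skip 5; only the three descriptor numbers change, not the fetch mechanism.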
Because the linear vector is the basic programming entity, the BSP's instruction set is designed around the concept of linear vectors of arbitrary length. The granularity in vector operations - caused by the fact that 16 arithmetic elements do 16 identical things at once, as well as the need to properly manipulate vectors whose lengths are not integer multiples of 16 - is handled automatically by the control hardware. The BSP FORTRAN compiler is unaware that there are 16 AE's. The compiler simply deals with vectors and issues vector instructions.

BSP DESIGN . . . meets the specific requirements of large-scale scientific processing

What are these requirements? First, the performance level of supercomputers requires some type of concurrent computational capability. Second, the bulk of operations characterizing scientific computation are floating-point numerical operations, indexing, and branching. Third, many large codes have execution times best measured in terms of hours; some require days. Fourth, a key characteristic of scientific programs (and one that distinguishes them from commercial business codes) is that they generate and regenerate their own data bases, often in a very regular way. This feature confines high-speed I/O to an envelope containing the floating-point processor and a fast secondary store. Fifth, the scientific marketplace is FORTRAN-dominated, with codes involving man-years of preparation and tens of thousands of source statements. The BSP has been designed to meet all these requirements.

BSP SUPERIORITY . . . is based on several significant advantages over other systems in the supercomputer class

Clearly, the BSP is a superior performer. It is competitively priced. The machine derives its performance capabilities from a number of considerations. The BSP is a total system, combining a most advanced general-purpose processor with a floating-point system of exceptional speed. Its design philosophy is such that extensibility is an integral part of it.

Another significant feature of the BSP is its reliability. The system has been constructed from standard BCML circuits and packages. All paths to and from memory are equipped with error-correcting capability (SECDED). In addition, there is residue checking on all arithmetic operations, instruction retry, and an extensive set of on-line system device diagnostics. Because of these features, Burroughs can expand its market potential and extend its competitive range.

Figure 1. BSP System Block Diagram. (The diagram shows the B 7800/B 7700 system manager - central processor, input/output processor, peripherals, and networks - exchanging data and code files with the BSP's charge-coupled device (CCD) file memory of 4-64 M words at 1.5 M bytes/second; the file memory feeds the parallel processor at 75 M bytes/second; the scientific processor comprises the instruction processor with its 256 K-word instruction (control) memory, the parallel processor's arithmetic elements, and a main memory of 0.5-8 M words.)

Vector Performance

The parallel architecture equips the BSP with an outstanding performance capability. A commonly used figure of merit in scientific computations is the number of millions of floating-point operations per second (MOPS). For vector lengths greater than 16, the system has a performance potential of 50 MOPS.
Performance Optimization

The BSP has three particular hardware and software design features that influence performance. First, the BSP is equipped with the capability of handling recurrences. The system can detect and transform expressions of the form:

      A(I) = A(I-1) * B(I)

This is a particularly useful capability, because such expressions appear to be scalar in nature.

Second, the indexing hardware on the system is able to reorder DO loops. This is important because long vector computations are processed more efficiently than short vector computations. For example, the loop nest

      DO 4 I = 1, 70
      DO 4 J = 1, 5
    4 A(I, J) = B(I, J) * C(I, J)

as it appears here consists of 70 computations on vectors of length 5. But there is no reason (no data dependencies or special sequencing) why these loops could not be inverted to:

      DO 4 J = 1, 5
      DO 4 I = 1, 70
    4 A(I, J) = B(I, J) * C(I, J)

so that there are now five computations on vectors of length 70.

Finally, the system has the capability to handle conditional statements in parallel by using "bit" vectors. These are sequences of ones and zeros that can be used to mask out unwanted results.

FILE MEMORY

One of the truly significant innovations in the BSP is the file memory. It serves as the secondary storage system for the parallel processor memory, and it is important because of the greatly enhanced performance capability it gives the BSP. On most systems (even supercomputers), secondary storage is provided by discs. In supercomputers this can be a problem, because the rate at which information can be transferred from secondary storage to main memory is simply not matched to the tremendous processing capability of the CPU. In fact, for several classes of problems where the program and data spill out of main memory onto discs, overall system performance can be very seriously degraded.

The most important feature of the file memory for BSP performance is that it sustains a transfer rate to parallel memory of 10 M words/second, complementing the processing capability of the AE's well and providing system balance.

VECTORIZING FORTRAN COMPILER

One of the very strongest assets of the BSP is its software. The BSP is the first supercomputer developed as a total system, and that concept extends to BSP software. The BSP is provided with a mature operating system (the MCP) and a vectorizing FORTRAN compiler. What does vectorizing mean? It is merely the recognition of computational sequences that can be done simultaneously. On a serial or scalar processor, the sequence of computations

      DO 10 I = 1, 100
   10 A(I) = B(I) + C(I) * D(I)

would be done one at a time. In examining a code, the vectorizing compiler recognizes that such sequences can be done simultaneously. It is, therefore, a means of converting scalar or sequential programs into parallel programs.

Users will also be able to program in FORTRAN exclusively. No assembly language programming will be necessary to achieve the performance of the BSP. For new program development, the language will also be equipped with vector extensions that allow the introduction of parallel computing concepts from the beginning.
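As an illustration of the bit-vector idea (schematic only; this document does not define the BSP's vector-extension syntax or mask hardware), a conditional loop can be run as one full-length vector operation plus a mask of ones and zeros that discards the unwanted elements:

C     CONDITIONAL ASSIGNMENT VIA A MASK ("BIT") VECTOR.
C     SCHEMATIC ONLY; THE BSP'S ACTUAL MASK MECHANISM DIFFERS.
      PROGRAM BITVEC
      INTEGER N
      PARAMETER (N = 8)
      INTEGER I, MASK(N)
      REAL A(N), B(N)
      DO 10 I = 1, N
         B(I) = I - 4.5
         A(I) = 0.0
C        THE MASK IS 1 WHERE THE CONDITION B(I) > 0 HOLDS.
         MASK(I) = 0
         IF (B(I) .GT. 0.0) MASK(I) = 1
   10 CONTINUE
C     COMPUTE EVERYWHERE; THE MASK SUPPRESSES UNWANTED RESULTS.
      DO 20 I = 1, N
         A(I) = MASK(I)*(2.0*B(I)) + (1 - MASK(I))*A(I)
   20 CONTINUE
      WRITE (*, *) 'A =', (A(I), I = 1, N)
      END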
BSP KEY FEATURES . . . include a system manager, the BSP elements, and a vectorizing FORTRAN compiler

The system manager is responsible for overall BSP job scheduling and control. Through it, program preparation and data input and output are accomplished. It serves as the device for interactive program preparation and debugging, and it provides archival storage.

The control processor portion of the BSP is a high-speed, asynchronous unit that controls the parallel processor and performs scheduling, file allocation, and I/O management. It is characterized by an 80-nanosecond cycle time (12.5-megahertz clock) and is equipped with 262K words of 4K MOS memory with an access time of 145 nanoseconds. The control processor also serves to interface the BSP with maintenance and diagnostic devices.

Programs to be run on the BSP are compiled on the system manager using a vectorizing FORTRAN compiler, which is a significant part of the system software. It is used to maximize the efficiency of the BSP across a wide spectrum of scientific applications.

BSP ORGANIZATION . . . consists of three basic units: control unit, parallel processor, file memory

The control unit is made up of a "scalar" processor unit that handles setup of vector operations for the parallel processor, 262K words of memory in which the program to be executed is stored, a parallel processor control unit that sets up vector calculations, and a control and maintenance unit that is used to interface the maintenance features of the system manager to the BSP.

The parallel processor is made up of 16 arithmetic elements (AE's) connected to a parallel processor memory by means of an alignment network. The network is a crossbar switch that connects the parallel memory banks to the AE's and is used to guarantee conflict-free memory access. The BSP is completed by the file memory, which consists of charge-coupled device (CCD) storage media and a file memory control unit.

BSP CHARACTERISTICS . . . include the parallel processor, file memory, and vectorizing FORTRAN compiler

PARALLEL PROCESSOR

The parallel processor portion of the BSP is designed to perform "vector" oriented computations at a very high rate. The BSP itself is a single-instruction stream/multiple-data stream computing device. The high execution rate is achieved by partitioning computations onto the 16 arithmetic elements of the parallel processor. Consider the following FORTRAN loop:

      DO 10 I = 1, 1000
   10 A(I) = B(I) + C(I) * D(I)

The sequence of computations performed is:

      A(1) = B(1) + C(1) * D(1)
      A(2) = B(2) + C(2) * D(2)
      . . .
      A(N) = B(N) + C(N) * D(N)

Quite obviously, there is no dependence in these expressions of A(N) on A(N-1). That is, the computations are independent of one another. There is, therefore, no reason not to perform them simultaneously. That is, if there were an ensemble of arithmetic elements (AE1, AE2, AE3, etc.), then at the same time that A(1) was being computed in AE1, A(2) could be computed in AE2, A(N) in AEN, and so forth. This is the basic idea behind the computational philosophy of the BSP. What makes the philosophy truly usable is that large classes of scientific problems exhibit this type of computational concurrency.

Conflict-free Memory Access

One of the key reasons the BSP is able to sustain such tremendous computation rates is conflict-free memory access. The system is designed so that the number of memory banks is relatively prime to the number of processing elements. With this design decision, it is possible to map data into the parallel memory so that rows, columns, diagonals - in fact, any sequence of elements in a two-dimensional array that are regularly spaced - can be accessed at full memory bandwidth. This situation contrasts with other supercomputers, in which only rows or columns or diagonals can be accessed at the full rate.
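A sketch of the partitioning follows. It is schematic: the real machine drives the AE's in lock-step from one instruction stream rather than looping, and the element-to-AE assignment rule shown is illustrative. Element I lands on AE number MOD(I-1, 16) + 1, so a vector of length 1000 is consumed in 63 sweeps of 16.

C     HOW A 1000-ELEMENT VECTOR OPERATION PARTITIONS ONTO 16
C     AE'S. SCHEMATIC: A SERIAL LOOP STANDS IN FOR LOCK-STEP
C     HARDWARE, AND THE ASSIGNMENT RULE IS ILLUSTRATIVE.
      PROGRAM PARTIT
      INTEGER N, NAE
      PARAMETER (N = 1000, NAE = 16)
      INTEGER I, AE, SWEEPS
      REAL A(N), B(N), C(N), D(N)
      DO 10 I = 1, N
         B(I) = 1.0
         C(I) = 2.0
         D(I) = 3.0
   10 CONTINUE
      DO 20 I = 1, N
C        AE NUMBER THAT WOULD HANDLE ELEMENT I.
         AE = MOD(I - 1, NAE) + 1
         A(I) = B(I) + C(I)*D(I)
   20 CONTINUE
      SWEEPS = (N + NAE - 1) / NAE
      WRITE (*, *) 'SWEEPS OF 16 NEEDED:', SWEEPS
      WRITE (*, *) 'LAST ELEMENT HANDLED BY AE', AE
      END

The final sweep has only 1000 - 62*16 = 8 useful elements; how the hardware pads such a partial sweep is described later, under the discussion of "null" operands.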
BURROUGHS SCIENTIFIC PROCESSOR: AN OVERVIEW

BSP OBJECTIVE . . . to extend Burroughs product excellence into the domain of very high-speed scientific computing

Traditionally, Burroughs has been very strong in covering the entire spectrum of general-purpose data processing, from the B 80 to the B 7800. With the BSP, Burroughs is complementing these general-purpose capabilities with a most innovative and powerful scientific system.

The demands of large-scale scientific processing impose rigorous requirements on the machine that supports it. The BSP has been designed to meet and surpass these requirements. In particular, the BSP is a very high-performance system. It is in the "supercomputer" class of machines and will deliver floating-point results at up to 50 x 10**6 operations per second. In contrast with other currently available supercomputers, the BSP is a total system. It combines the general-purpose processing capability of a Burroughs large-scale or very large-scale system with exceptional floating-point capability.

Burroughs has chosen to build the scientific processor from a new circuit family, BCML (Burroughs Current Mode Logic). As a result, the BSP enjoys high reliability, availability, and maintainability, and it is exceptionally cost-effective. Finally, there is a large degree of extensibility inherent in the BSP design. The system has an impressive potential for modular growth.

BSP SYSTEM . . . consists of the system manager and the scientific processor

The system manager (typically, a B 7800) schedules and allocates tasks to the scientific processor. It supports networking and data communications and has a complete set of peripherals (printers, disks, tapes) that can be extended to include archival storage. The scientific processor consists of the control processor, the parallel processor, and the file memory.

There are three basic models of the BSP. For the user who already has a B 7700 or B 7800, there is the basic BSP. For other users, the basic configurations are the BSP/7811 and the BSP/7821.

• Basic BSP - 16 arithmetic elements, 524K words of parallel processor memory, a control processor with 262K words of memory, and 4 M words of file memory.

• BSP/7811 - a BSP with a B 7811 system manager. This B 7800 has one central processor, one I/O processor, one maintenance processor, and an operator display console with dual displays.

• BSP/7821 - a BSP with a B 7821 system manager. The B 7821 has two central processors, two I/O processors, one maintenance processor, and an operator display console with dual displays. Dual BSP interface adapters provide a connection between the BSP and both B 7800 I/O processors. However, only one interface adapter is active at any one time; the other is used for backup or system reconfiguration.
TEMPLATES

The problem of keeping the AE's, the alignment network, and the memory simultaneously busy is interesting. Collectively, these global elements form a pipeline. That is, data is first fetched from memory, transmitted to the AE's over the alignment network, processed by the AE's, transmitted back to memory, and stored. In total, this is a five-step process, executed with four major elements (memory, input alignment network, AE's, and output alignment network). The parallel processor control unit keeps these system elements as busy as possible by relying on precoded microinstructions called "templates".

A template is a description of the entire sequence of operations that a group of associated sets of 16 numbers follows. For example, such a group of sets of numbers could be the addition of 16 elements of "A" to 16 elements of "B" to form 16 elements of "C". In other words, one template can be used to control 16 arithmetic operations which can be done simultaneously by the 16 AE's, plus all the associated memory and data transmission operations.

The problem that the parallel processor control unit must solve is the interleaving of one such set of operations, or template, with the next set. In general, one template will not be identical to the one that follows it. For example, a succeeding template may be generating the first 16 elements of Z = Q * R + P, while the forerunner template is doing the last 16 (or fewer) elements of C = A + B. The reason for wanting to interleave dissimilar types of templates is that, if it is not done, then the pipeline formed by memory, the alignment networks, and the AE's must be emptied between the completion of one vector operation and the start of the next. If this were to happen, then the BSP would suffer from the same vector startup problem that has plagued other pipeline machines. The manifestation of this problem to the user is that the machine is inefficient for operations on short vectors, because of the startup idle time.

Given that a template is a microsequence specified at system design time, the job of the parallel processor control unit is substantially simplified. Instead of having to efficiently interleave the control of several dissimilar units, the parallel processor control unit simply has to select the next stored sequence.

BSP templates can be characterized satisfactorily by two numbers. One number specifies the clock count between a template's last fetch from memory and its last store into memory; in other words, the template leaves this many memory clocks available for the next template. The other number is the number of memory clocks the template needs at its beginning. This number must be less than or equal to the number of clocks left by the preceding template. For example, let a template be T1(2, 3). This means the template needs two contiguous memory clocks to start up, and it leaves three contiguous memory clocks between its last fetch from memory and its last store into memory. If another template is T2(3, 5), then the sequence T1(2, 3) followed by T2(3, 5) would have complete overlap between T1 and T2, with no wasted cycles. If one used the sequence T1(2, 3) followed by another T1(2, 3), there would be one clock lost in the interface between the two templates. And, of course, if a T1(2, 3) is followed by a T3(4, 2), there are four wasted clocks, because 4 is not less than or equal to 3. In the BSP, an adequate number of template choices exists, so that the average time lost between two dissimilar templates is small.
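The three examples suggest a simple cost rule, sketched below: if the next template needs no more starting clocks than its predecessor left, the waste is the leftover difference; if it needs more, the pipeline must drain and the full starting need is lost. This reading of the rule is inferred from the examples in the text, not taken from the BSP microcode.

C     CLOCKS LOST BETWEEN TEMPLATES T(NEED, LEAVE), USING A COST
C     RULE INFERRED FROM THE TEXT'S THREE EXAMPLES.
      PROGRAM TMPLAT
      INTEGER WASTE
      WRITE (*, *) 'T1 THEN T2:', WASTE(3, 3), ' CLOCKS LOST'
      WRITE (*, *) 'T1 THEN T1:', WASTE(3, 2), ' CLOCKS LOST'
      WRITE (*, *) 'T1 THEN T3:', WASTE(3, 4), ' CLOCKS LOST'
      END

      INTEGER FUNCTION WASTE(LEFT, NEED)
      INTEGER LEFT, NEED
      IF (NEED .LE. LEFT) THEN
         WASTE = LEFT - NEED
      ELSE
         WASTE = NEED
      END IF
      END

The three printed results - 0, 1, and 4 - reproduce the text's examples.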
Template control entails the selection among templates already in control storage. The criterion of the selection is optimized efficiency of the system. Clearly, the power of the processor required to make this selection is minuscule compared with the power required to attempt to optimize system utilization dynamically.

There is an extra bonus attributable to using templates in the parallel processor control: the ability to implement retry on vector operations. Upon detection of a noncorrectable error, the control lets the last successful template finish, resets any partially started templates back to the start point, and retries the template which failed. The BSP is the only supercomputer that features vector operation retry.

A problem can occur in a system that has this much overlap. The problem is called a vector "hazard". For example, if A = B + C is followed by Q = A * R, then the elements of A must be stored by the first statement before they are used in the second. If the vector operations are long, no problem exists. If they are short, it may be necessary to hold up the second operation until the first has finished, even though this costs some lost cycles. The parallel processor control unit in the BSP automatically detects and resolves this situation. The template control processor adopts a different strategy in this case: instead of maximizing overlap, it selects templates which minimize the time lost between operations.

ARITHMETIC ELEMENTS

All 16 arithmetic elements (AE's) in the parallel processor are identical. The set of 16 is driven from a single microsequence, in keeping with the SIMD nature of the machine. Each arithmetic element is quite soft, in the sense that only the most primitive operators are hardwired. The control word is over 100 bits wide. In part, this large control word width is due to direct access to primitive functions; it is also large because the arithmetic element has an unusual range of processing capability. That is, besides being a floating-point machine, the AE has substantial nonnumeric capability. A comprehensive set of field manipulation and editing operators is available. Additionally, a special set of operators is available specifically for FORTRAN format conversion. While the BSP is marketed as a floating-point number processor, in actuality, with its charge-coupled device (CCD) file memory, exceptionally large main memory, and versatile AE's, it may well represent the largest available nonnumeric processor.

Floating-point add, subtract, and multiply each take two memory clocks in an AE. The use of two clocks enables memory bandwidth to be balanced exactly with AE bandwidth for triadic operations. A triadic operation is defined as having three operands and one result. Evidently, this does result in balance, because the four memory clocks required to handle the operands and result are matched by the four clocks required by the two arithmetic operations, which convert three operands into one result.

The BSP memory system cycle time is 160 nanoseconds. The AE cycle time is the same. In this day of cycle times much shorter than 160 nanoseconds, it is reasonable to wonder at such a long cycle time. A shorter AE clock would have allowed more effective utilization of AE parts, because they would have been used more often per operation. However, offsetting factors were the high level of integration in the ECL circuits, and the desire for ease of manufacturing and straightforward maintenance through absence of any need to fine-tune the clock signal. To some extent, accumulated gate delays at the required level of integration dictated a long clock (although certainly not as long as 160 nanoseconds), but primarily the level of integration made the long clock affordable. The extra ICs did not substantially impact total system size.

Floating-point divide is implemented by generating the reciprocal in a Newton-Raphson iteration. The square root is performed in the hardware as well; it, too, is implemented via the Newton-Raphson algorithm. ROMs exist in each AE to give the first approximations for the divide and square root iterations. One advantage of using parallel AE's, instead of a pipeline implementation, is the relative ease with which a full-length divide and square root can be generated.
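The iteration itself is short enough to show. For a reciprocal, each step is x := x*(2 - a*x), which roughly doubles the number of correct bits, so a handful of steps suffices once a ROM lookup supplies the first guess. The sketch below substitutes a crude seed for the ROM; the BSP's actual table contents and step counts are not given in this document.

C     NEWTON-RAPHSON RECIPROCAL, AS USED FOR DIVIDE. THE SEED
C     HERE IS A CRUDE STAND-IN FOR THE BSP'S ROM FIRST GUESS.
      PROGRAM RECIP
      REAL A, X
      INTEGER K
      A = 7.0
C     SEED: ANY X WITH 0 < X < 2/A CONVERGES; A ROM DOES BETTER.
      X = 0.1
      DO 10 K = 1, 5
         X = X*(2.0 - A*X)
         WRITE (*, *) 'STEP', K, '  X =', X
   10 CONTINUE
      WRITE (*, *) '1/A =', X, '  CHECK A*X =', A*X
      END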
The single-precision floating-point format is 48 bits long. It has 36 bits of significand and 10 bits of binary exponent. This gives a range of 10**±307 and about 11 decimal digits of precision. The floating-point arithmetic is done using guard bits and efficient rounding algorithms. Even for large problems, it is rare for more precision to be needed. Double precision is available, however, should added precision be required. The AE has double-length accumulators and double-length registers in key places. This permits direct implementation of double-precision operators in the hardware. The AE also permits software implementations of triple precision, etc.

Note that with 16 AE's, each generating an add, subtract, or multiply in 320 nanoseconds, and with parallel-processor control fully overlapped, the maximum speed of the BSP is 16 operations per 320 nanoseconds, or 50 million floating-point operations per second. While sustained operation at this maximum speed will be infrequent, it should be evident that the overall design philosophy has been to produce a machine which can sustain a reasonable fraction of its maximum operation rate.

Figure 4. Standard Array Memory Mapping Features. (The figure shows a 4 x 5 array, a11 through a45, mapped columnwise, in standard FORTRAN fashion, into memory addresses 0 through 19. Linear vector components are separated by a constant address distance d: columns have d = 1, rows have d = n, forward diagonals have d = n + 1, etc., where n is the column length.)

Figure 5. The BSP Memory Mapping Algorithm. (Let M be the number of memory banks and N the number of arithmetic elements; choose M to be a prime number and N <= M. Then, for address a, the memory module number is mu = mod(a, M), and the offset in the module is i = a/N, truncated. For example, with M = 7 and N = 6, the figure shows the twenty addresses of the 4 x 5 array distributed over the seven banks.)

CONFLICT-FREE MEMORY ACCESS

A unique feature of the BSP is its memory system, which delivers a useful operand to each AE on each memory cycle. The distance in memory between elements of a vector need not be unity. Therefore, DO loops may contain nonunity increments, and a program may access rows, columns, or diagonals of matrices without penalty. This kind of memory capability has not been available before with memory parts of modest speed.
Supercomputer designers have elected either to use memories with severe access restrictions, or to use expensive, fast memory parts to attain a degree of conflict-free access through bandwidth overkill. The hardware techniques used in the BSP to ensure conflict-free access are a prime number of memory ports, full crossbar switches between the memory ports and the AE's, and special memory index generation along with crossbar switch tag generation.

The index and tag generators compute the proper addresses for a particular address pattern. This address pattern is the one used by orthodox serial computers; that is, each higher memory address refers to the "next" word in memory. With this pattern, the parallel memory is completely compatible with all the constructs of present programming languages. In particular, FORTRAN EQUIVALENCE, COMMON, and array parameter passing can be implemented in the same way as on a conventional computer.

As an example, consider Figure 4, which shows a 4 by 5 matrix mapped columnwise into the memory of a serial machine. For purposes of illustration, assume a 6 AE, 7 memory bank parallel machine. (The BSP has 17 memory banks.) The index and tag equations are shown in Figure 5. The index is the floor of the integer quotient of the address a divided by the number of AE's. Thus, the index will remain constant for a cycle equal to the number of AE's; then it will jump by one value. On the other hand, the tag (or memory bank number in which the value associated with address a is stored) will be a modulo the number of memory banks. Hence, the tags will repeat in cycles of the same values, with no value repeating within one cycle, and with the length of the cycle equal to the number of memory banks. As long as the number of AE's is less than or equal to the number of memory banks, the sequence of tag values will cause a different memory bank to be connected to each AE. Thus, each AE may receive or send a unique value.

The particular storage pattern produced in this 6 AE, 7 memory bank system for the 4 by 5 example array is shown in Figure 6. Figure 7 shows the index and tag calculations for the second row of the array. Note that the equations yield an AE-centric vantage point; that is, the first logical element of the vector goes to the first AE, etc. There is nothing special about this approach beyond a certain comfort in thinking. The important point is the following: as long as the same set of equations is always applied to the data, from the first time it comes in as I/O onward, the storage pattern is completely invisible to the user. This applies to program dumps, etc., as well, because the hardware always obeys the same rules.

Figure 6. The Physical Mapping of the Example Case. (The figure distributes addresses 0 through 19 - array elements a11 through a45 - over the seven banks by bank number mu = mod(a, 7) and offset i = a/6, leaving some cells in each bank unused. Note that only when mod(d, 7) = 0 do all the addresses of a linear vector fall in the same memory bank; otherwise, there is no conflict at all for a linear vector.)

Figure 7. Index and Tag Calculations Used to Fetch Row 2 in the Example. (Row 2 has start address 1 and skip distance d = 4, giving addresses 1, 5, 9, 13, 17. The tags mod(a, 7) are 1, 5, 2, 6, 3, and the indices a/6 are 0, 0, 1, 2, 2; referring to Figure 4, AE's 0 through 4 receive a21, a22, a23, a24, a25, respectively, each from a different bank.)
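The Figure 7 arithmetic can be reproduced directly. The sketch below prints the bank (tag) and offset (index) for each element of row 2 under the example's M = 7, N = 6 machine, confirming that the five fetches touch five distinct banks:

C     TAG AND INDEX GENERATION FOR THE FIGURE 7 EXAMPLE:
C     M = 7 BANKS, N = 6 AE'S, ROW 2 AT START 1 WITH SKIP 4.
      PROGRAM TAGIDX
      INTEGER M, N, START, SKIP, LEN, K, A, MU, IDX
      M = 7
      N = 6
      START = 1
      SKIP = 4
      LEN = 5
      DO 10 K = 0, LEN - 1
         A = START + K*SKIP
C        BANK NUMBER (TAG) AND OFFSET IN THE BANK (INDEX).
         MU = MOD(A, M)
         IDX = A / N
         WRITE (*, *) 'AE', K, ' ADDR', A, ' BANK', MU,
     1                ' OFFSET', IDX
   10 CONTINUE
      END

Changing SKIP to a multiple of 7 makes every MU identical - the conflict case discussed next.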
The unused memory cells are the result of having one less AE than there are memory banks. For the example case, this is not a useful occurrence. For the real BSP, with 16 AE's and 17 memory banks, division by 16 is much simpler and faster than division by 17. However, one then pays the penalty of supplying some extra memory to reach a given usable size.

Note that conflict does occur if the addresses are separated by an integer multiple of the number of memory banks. In this case, all the values one wants are in the same memory bank. For the BSP, this means that skip distances of 17, 34, 51, etc., should be avoided. In practice, 51 is a likely problem skip, because it is the skip of a forward diagonal of a matrix with column length 50. If conflict occurs in the BSP, the arithmetic is performed correctly, but at 1/16 the normal speed. The system logs the occurrence of conflicts and their impact on the total running time. This information is given to the programmer for corrective action, if the impact was significant.

BSP memory reliability has been mentioned. Diagnosability is also an important feature. Instead of placing the Hamming code generators, detectors, and correctors on the memories, as is usual, they are placed at the AE's. This way the entire loop, from the AE's to the memory and back again, is Hamming code corrected. A side benefit of the conflict-free access hardware is that control information can be buried in the Hamming code in such a way that failing control elements can easily be detected and identified. Hence, not only are memory elements and data paths checked in the BSP design, but control logic is checked as well.

PARALLEL PROCESSOR CONTROL UNIT

Many of the functions of the parallel processor control unit have been mentioned. The unit is essentially a pipe, with each stage concerned with translating or checking input instructions into accurate control signals for the parallel processor. The first stage in this pipeline is called the vector input and validation unit (VIVU). The vector and array instructions, assembled by the SPU, are inserted at this point. The VIVU assembles a sequence of such instructions into a single global description of an operation defined over a set of program arrays. It also establishes the relationship between this operation and the one before, to guard against vector hazards.

The form of instruction inserted by the SPU into the VIVU is interesting. In keeping with Burroughs policy of designing standard product computers as language processors, it was determined early in the BSP design cycle that the machine would process assignment statements. Arithmetic assignment statements are the basic ingredient of most numerical processing. The templates are memory-to-memory entities, because assignment statements are memory-to-memory statements. In the case of Burroughs stack machines, such as the B 7800, the source language translates almost directly into object language, with little need for run-time processing to aid in the description of the object language. Hence, in the B 7800, this run-time processing is automatically invoked in the hardware.

It was not possible to take the same approach with the BSP. The most general BSP parallel processor construct is a sequence of vectors, essentially a nested pair of DO loops. Because, in the general case, all of the parameters used to describe an operation could be run-time variables, so much language processing is involved that it makes no sense to try to do it in automatic hardware. For example, parametrized descriptors of operations on sets of vectors, where the parameters are computed at run time, involve a great deal of run-time processing to convert the parametrized source language into object code. This general consideration defines the need for the SPU, as well as the point in the processing sequence at which the SPU passes instructions to the parallel processor control unit. This point is where run-time source language processing ceases, and all subsequent vector operation processing can be done in a uniform way.

From a language point of view, this point is the time at which actual parameters can be inserted into a formal assignment statement. That is, consider the triadic formal assignment statement

      RBV, Z = A OP1 B OP2 C, OBV

where Z, A, B, and C are vector descriptors, OP1 and OP2 are operators, and OBV and RBV are the optional operand and result bit vector descriptors. In executing this operation, the SPU issues the following sequence to the parallel processor control unit:

      VFORM     OBV RBV
      VOPERAND  A
      VOPERAND  B
      VOPERAND  C
      VRESULT   Z

The information in the VFORM instruction names the actual operators to be used, designates the level of loop nesting, indicates the presence of bit vectors, etc. The OBV and RBV descriptors give a start address and a length. Their starting addresses are not referenced to an array, because bit vectors are not FORTRAN concepts. The vector operand and result descriptors, however, give the start of the vector relative to an array location, the location of the array, the volume of the array, the skip distance in memory between vector elements, and the optional skip distance between the starts of vectors in a nested loop pair. Some consideration should convince the reader that this is the point where run-time language processing has ceased. The remaining processing - for example, array bounds checking - will be constant for all operations. Hence, it is seen that the BSP is a FORTRAN language processor.
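As a concrete - and entirely hypothetical, since the real VFORM encoding is not given in this document - rendering of that sequence, the statement Z = A + B * C over vectors of length 100 might be assembled as one form word and four descriptors, each descriptor carrying a base address, a skip distance, and a length:

C     HYPOTHETICAL ASSEMBLY OF Z = A + B * C AS A VECTOR FORM
C     PLUS DESCRIPTORS. FIELD LAYOUTS ARE INVENTED FOR THE
C     SKETCH; ONLY THE OVERALL SEQUENCE FOLLOWS THE TEXT.
      PROGRAM VFORMS
C     DESCRIPTOR = (BASE ADDRESS, SKIP DISTANCE, LENGTH).
      INTEGER DESC(3, 4), K, K2
      DATA DESC /1000, 1, 100,
     1           2000, 1, 100,
     2           3000, 1, 100,
     3           4000, 1, 100/
      WRITE (*, *) 'VFORM     ADD MULT  NESTING 1  NO BIT VECTORS'
      DO 10 K = 1, 3
   10 WRITE (*, *) 'VOPERAND ', (DESC(K2, K), K2 = 1, 3)
      WRITE (*, *) 'VRESULT  ', (DESC(K2, 4), K2 = 1, 3)
      END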
Because, in the general case, all of the parameters used to describe an operation could be run-time variables, so much language processing is involved that it makes no sense to try to do it in automatic hardware. For example, parametrized descriptors of operations on sets of vectors, where the parameters are computed at run time, involve a great deal of run-time processing to convert the parametrized source language into object code. This general consideration defines the need for the SPU, as well as the point in the processing sequence at which the SPU passes instructions to the parallel processor control unit. This point is where run-time source language processing ceases and all subsequent vector operation processing can be done in a uniform way. From a language point of view, it is the point at which actual parameters can be inserted into a formal assignment statement. That is, consider the triadic formal assignment statement

RBV, Z = A OP1 B OP2 C, OBV;

where Z, A, B, and C are vector descriptors, OP1 and OP2 are operators, and OBV/RBV are the optional operand and result bit vector descriptors. In executing this operation, the SPU issues the following sequence to the parallel processor control unit:

VFORM
OBV
RBV
VOPERAND A
VOPERAND B
VOPERAND C
VRESULT Z

The information in the VFORM instruction names the actual operators to be used, designates the level of loop nesting, indicates the presence of bit vectors, and so on. The OBV and RBV descriptors give a start address and a length; their starting addresses are not referenced to an array, because bit vectors are not FORTRAN concepts. The vector operand and result descriptors, however, give the start of the vector relative to an array location, the location of the array, the volume of the array, the skip distance in memory between vector elements, and the optional skip distance between the starts of vectors in a nested loop pair. Some consideration should convince the reader that this is the point at which run-time language processing has ceased: the remaining processing, for example array bounds checking, is constant for all operations. Hence, it is seen that the BSP is a FORTRAN language processor.

After the VIVU has processed the vector form and its associated operand and result descriptors, the finished package description is stored in the template descriptor memory (TDM). The TDM functions as a queue, thereby permitting many vector forms to be present in it at once. The template control unit (TCU) fetches information from the TDM, updates it, and stores the updated information back in the TDM. The function of the TCU is to drive the parallel processor. It does this by selecting an appropriate template and then using the address information in the TDM to generate tags and indices for the quantities dealt with by the template. Because a template normally processes sets of 16 elements, the TCU normally updates addresses by 16 and stores the new addresses back into the TDM. However, for a memory conflict case, or for compress, expand, and merge operations, the TCU adds the correct updating number to the addresses before restoring them to the TDM. For retry, checkpointing, and arithmetic fault identification, the TDM also contains a copy of the original vector form, as received from the VIVU. The program address counter value of the instruction which issued the vector form is also stored in the TDM.
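The TCU's bookkeeping loop is easy to caricature. The sketch below (Python again, purely illustrative; the real TDM entry format is not given in this document, and the function shown is an assumption) steps a queued vector form through templates of 16 elements, advancing the working addresses by a set at a time while the original form stays untouched for retry and fault reporting. The "null" padding of the final partial set is described in the text that follows.

    # Illustrative caricature of the TCU/TDM interaction: the TDM keeps
    # the original vector form (for retry, checkpointing, and fault
    # identification) while the TCU advances a working copy by 16
    # elements after each template is driven.

    SET_SIZE = 16  # elements handled per template on the real BSP

    def drive_vector_form(start, length, skip):
        """Yield address groups one template at a time, padding the
        last (partial) group with 'null' operands."""
        done = 0
        while done < length:
            n = min(SET_SIZE, length - done)            # last set may be short
            group = [start + (done + k) * skip for k in range(n)]
            group += ["null"] * (SET_SIZE - n)          # pad the partial set
            yield group
            done += n                                   # TCU-style update by 16

    # A 40-element vector needs three templates: 16, 16, and 8 with padding.
    groups = list(drive_vector_form(start=0, length=40, skip=1))
    assert len(groups) == 3 and groups[-1].count("null") == 8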
In the case of program termination due to arithmetic faults, the programmer is given the index of the first operand pair causing a fault in a vector form, the nature of the fault, the line number of the associated source statement, and the calling sequence used to reach that statement.

If the TCU is processing the last template of a vector form, there will generally not be full sets of 16 elements involved, because the last template requires only enough elements to complete the vector length specified by the program. For this case, the TCU pads the partial lengths out to 16 using "null" operands. The null operands are generated by the input alignment network and have the following properties in the AE's:

operand ← "null" (operator) operand
operand ← operand (operator) "null"

Also, a memory bank will refuse to store a null. Hence, "null" times 1 equals 1, and "null" plus 1 equals 1. And, of course, two nulls produce a null. So a vector whose length is not an integer multiple of 16 is padded out with a quantity which results in no improper stores occurring. But the null is more than that. For example, in a sum reduction, where numbers are being transmitted between AE's, the correct sums are formed because the nulls do not intrude into the result. The same is true for a product reduction, or for a reduction using "maximum" or "minimum" as the operator, in which case a search is occurring.

SCALAR PROCESSING UNIT

The SPU is a conventional, register-oriented processor in most respects. It operates on an 80-nanosecond cycle. It has 16 48-bit general-purpose registers, a full complement of numeric and nonnumeric operators, and an instruction processor which includes a content-addressable, memory-controlled, in-buffer looping capability. The unusual features of the SPU relate to subroutine entry and to the instructions issued to the parallel processor control unit.

In addition to the 16 general-purpose registers, there are 16 120-bit vector data buffers (VDB's). The 120-bit width is the maximum width of any parallel processor control unit instruction. If a vector instruction or descriptor has been completely assembled at compile time, it will be fetched from control memory into a VDB; from there, it is sent into the VIVU input buffer in a single clock. If some run-time vector instruction or vector descriptor processing is necessary, then as many as four of the general-purpose registers can be referenced to insert fields into the 120-bit-wide VIVU input buffer. This data overwrites the associated fields in the 120-bit word coming from the VDB set. This facility permits the formation of skeletal descriptors at compile time, with the expectation that the missing fields will be filled in at run time. It also permits a compile-time-generated descriptor to be overwritten at run time, which can be used to optimize VDB usage and memory references. A sketch of this field-overwrite mechanism appears below.

The SPU uses a push-down stack for subroutine linkage. The data which goes on the stack is the return address and a register address translation table. The 48-bit and 120-bit registers are referred to via an address translation unit. This capability assists in parameter passing and in the minimization of register saves and restores during subroutine entry and exit. A second stack is maintained for the SPU executive state. Control memory is divided into the executive area, the user read/write area, and the user read-only area. The user stack is at one end of the user read/write area.
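As promised above, here is a minimal sketch of the skeletal-descriptor idea (Python for illustration; the field positions and widths are invented for the example, since the document does not give the actual 120-bit layout). A descriptor assembled at compile time has its run-time fields merged in from general-purpose register values just before the word is handed to the VIVU.

    # Hypothetical illustration of completing a compile-time "skeletal"
    # 120-bit descriptor with run-time values. The field layout is
    # invented; only the mechanism -- masking a field and OR-ing in a
    # new value -- follows the text.

    FIELDS = {                  # name: (bit offset, bit width) -- assumed
        "start_address": (0, 24),
        "skip":          (24, 16),
        "length":        (40, 16),
    }

    def overwrite_field(word, name, value):
        """Replace one field of the 120-bit descriptor word with `value`."""
        offset, width = FIELDS[name]
        mask = ((1 << width) - 1) << offset
        return (word & ~mask) | ((value << offset) & mask)

    # Compile time: skeleton with skip and length known, start unknown (0).
    skeleton = overwrite_field(overwrite_field(0, "skip", 4), "length", 5)

    # Run time: a general-purpose register supplies the actual start
    # address just before the word enters the VIVU input buffer.
    descriptor = overwrite_field(skeleton, "start_address", 1)
    assert descriptor & 0xFFFFFF == 1   # start field now filled in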
BSP SOFTWARE

Wherever possible, the BSP takes advantage of existing B 7700/B 7800 software. The necessary additional items are a small operating system to run on the SPU, a BSP compiler with its associated linkage editor and intrinsic functions, and the diagnostic package.

Perhaps the most interesting aspect of the SPU operating system is its facility for staying out of the way. For example, I/O is done in user mode, and assuming I/O is overlapped by computation, the SPU spends less than a microsecond in managing a transfer. Overlay management and chaptered file management are the major operating system functions performed for a running program. The hardware assists in overlay management by allowing presence bits to be assigned to each phase of a program. Hence, the operating system is dropped into automatically if, and only if, the program attempts to enter a routine in a phase which is not present.

The FORTRAN compiler has a number of interesting features. The most important is the vectorization pass over the program, which converts serial FORTRAN into vector constructs. The BSP vectorization pass is a more complete parallel analysis of a program than has previously been built into product compilers. The usual approach has been to attempt to vectorize only loops which contain no branches or cyclic dependencies. The BSP vectorizer issues vector instructions even for branching cases, as long as the branch paths depend on vectors which are known at branch time. The vectorizer also detects cyclic dependencies; if an analysis of the dependency shows it is equivalent to a linear recurrence, then appropriate vector instructions are issued. These types of parallelism are the most frequent which can be detected using rigorously defined algorithms. There are some important cases, such as when a user has undertaken to manage indices himself, for which only ad hoc vectorization techniques are known. These will be vectorized as well, but clear-cut statements about the extent of vectorization are not possible.

The FORTRAN compiler also contains the facility to express vector and array constructs directly. Assignment statements like A = A + 1, where A is an array, are permitted. Index expressions are permitted on array identifiers. For example, if A is a 3-dimensional array, then A(10:50:2, 6, 1:10) refers to a 21 by 10 array which is a subspace of A. Clearly, this reference need only generate a descriptor, which is pushed into the VIVU queue; no actual subset of A is generated as a temporary. (A sketch of the descriptor arithmetic for such a section reference appears at the end of this section.)

An interesting special feature of the compiler is a source code regenerator, which creates standard FORTRAN statements. Hence, a programmer can indulge in the use of the array extensions, yet retain full compatibility with standard FORTRAN.

Because of the way in which the B 7700/B 7800 is connected to the BSP, it is possible to use standard B 7700/B 7800 terminals or network connections to run BSP diagnostics. A field engineer can invoke a sequence of routines which will result in a printout of suspect IC locations. He can quickly replace the suspect IC's and bring the BSP back up again if it is down. This idea is extended to the ability to use BSP development engineers on-line in the diagnostic process. Thus, the field engineer and the customer have excellent assurance of prompt system recovery.
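As referenced above, the following sketch (Python, illustrative only; the Section class and its fields are assumptions, since the BSP's actual descriptor encoding is not given in this document) shows the descriptor arithmetic behind an array section reference such as A(10:50:2, 6, 1:10) on a column-major 3-dimensional array. Computing a start offset, per-dimension skips, and lengths is all the compiler needs to emit; no element data is copied.

    # Illustrative descriptor arithmetic for a FORTRAN-style array
    # section like A(10:50:2, 6, 1:10) on a column-major array. Only a
    # handful of integers describe the section; no temporary copy of A
    # is built. The Section class is assumed for this example.

    from dataclasses import dataclass

    @dataclass
    class Section:
        start: int        # offset of the first element within A
        skips: tuple      # memory distance between elements, per kept dimension
        lengths: tuple    # extent of each kept dimension

    def section(dims, subscripts):
        """dims: array extents, e.g. (50, 6, 10) for A(50,6,10).
        subscripts: per-dimension (lo, hi, step) triples; 1-based, inclusive."""
        strides, s = [], 1
        for d in dims:                   # column-major: stride grows rightward
            strides.append(s)
            s *= d
        start = sum((lo - 1) * st for (lo, _, _), st in zip(subscripts, strides))
        skips, lengths = [], []
        for (lo, hi, step), st in zip(subscripts, strides):
            n = (hi - lo) // step + 1
            if n > 1:                    # dimensions of extent 1 are dropped
                skips.append(step * st)
                lengths.append(n)
        return Section(start, tuple(skips), tuple(lengths))

    # A(10:50:2, 6, 1:10) on a 50 x 6 x 10 array: a 21 by 10 section.
    s = section((50, 6, 10), [(10, 50, 2), (6, 6, 1), (1, 10, 1)])
    assert s.lengths == (21, 10)             # matches the 21 by 10 claim above
    assert s.start == (10 - 1) + (6 - 1) * 50   # offset of element A(10,6,1)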