MIT LCS TM 479
Experience with Fine-Grain Synchronization in MIMD Machines for Preconditioned Conjugate Gradient

Donald Yeung and Anant Agarwal
Laboratory for Computer Science
Massachusetts Institute of Technology
Cambridge, MA 02139

Abstract

This paper discusses our experience with fine-grain synchronization for a variant of the preconditioned conjugate gradient method. This algorithm represents a large class of algorithms that have been widely used but traditionally difficult to implement efficiently on vector and parallel machines. Through a series of experiments conducted using a simulator of a distributed shared-memory multiprocessor, this paper addresses two major questions related to fine-grain synchronization in the context of this application. First, what is the overall impact of fine-grain synchronization on performance? Second, what are the individual contributions of the following three mechanisms typically provided to support fine-grain synchronization: language-level support, full-empty bits for compact storage and communication of synchronization state, and efficient processor operations on the state bits? Our experiments indicate that fine-grain synchronization improves overall performance by a factor of 3.7 on 16 processors using the largest problem size we could simulate; we project that a significant performance advantage will be sustained for larger problem sizes. We also show that the bulk of the performance advantage for this application can be attributed to exposing increased parallelism through language-level expression of fine-grain synchronization. A smaller fraction relies on a compact implementation of synchronization state, while an even smaller fraction results from efficient full-empty bit operations.

1 Introduction

This paper describes an in-depth investigation of the impact of fine-grain synchronization in MIMD machines on the performance of the preconditioned conjugate gradient method.
The method uses the modified incomplete Cholesky factorization of the coefficient matrix to form the preconditioner (we will henceforth refer to this application as MICCG3D). An application study of this sort is important because it tells architects not only how programmers will use the mechanisms provided in parallel machines, but also the relative usefulness of various mechanisms provided in the system as evidenced by their impact on end application performance.

One of the challenges in such a design methodology lies in finding appropriate applications which will provide meaningful information concerning a specific set of mechanisms. The problem of finding an application that is both important and suitable for investigating fine-grain synchronization is particularly difficult because most parallel application studies for MIMD multiprocessors have focused on problems that are uninteresting from the standpoint of synchronization. One of the reasons for this lack of benchmarks is the lack of machines that support fine-grain synchronization. The MICCG3D application meets our criteria because it is an important application with challenging synchronization requirements.

Our investigation of fine-grain synchronization has two major aspects. First, we investigate how fine-grain synchronization can be gainfully employed in MICCG3D and determine quantitatively the resulting performance benefits. We analyze the benefits both with a fixed problem size and when problem size is scaled with machine size. Our study uses a simulator of Alewife, a distributed-memory multiprocessor that provides hardware support for the shared-memory abstraction [1]. Our first result is that applications for which synchronization is challenging do exist. Furthermore, implementations on MIMD machines can achieve good performance by employing fine-grain synchronization. Second, through a sequence of experiments, we try to understand exactly where the "muscle" of fine-grain synchronization lies.
A common conception of fine-grain synchronization, one which has contributed to the preference for coarse-grain approaches, has been that its success relies on efficient, but expensive, hardware-supported synchronization primitives. We demonstrate that the most significant contributions of fine-grain synchronization for MICCG3D do not rely on hardware acceleration; rather, they arise from the expressiveness and flexibility of language-level support.

The rest of this paper proceeds as follows. Section 2 discusses the different styles of synchronization, and identifies the levels of support for fine-grain synchronization that machines can provide. Section 3 describes the MICCG3D application and motivates the need for fine-grain synchronization by discussing why MICCG3D is difficult to parallelize. Section 4 discusses how we actually parallelize MICCG3D using coarse- and fine-grain synchronization. Section 5 describes our experimental environment. Section 6 presents our results and discusses their significance. Finally, Section 7 summarizes our work and makes some concluding remarks.

2 Synchronization

Synchronization in shared-memory MIMD multiprocessors ensures correctness by enforcing two conditions: read-after-write data dependency and mutual exclusion. Read-after-write data dependency is a contract between a producer and a consumer of shared data. It ensures that a consumer reads a value only after it has been written by a producer. Mutual exclusion enforces atomicity. When a data object is accessed by multiple threads, mutual exclusion allows the accesses of a specific thread to proceed without intervening accesses by the other threads. Because the MICCG3D application uses the producer-consumer style of synchronization, the rest of this paper will only address read-after-write data dependency.

A coarse-grain solution to enforcing read-after-write data dependency is barrier synchronization. Barriers are typically used in programs involving several phases of computation where the values produced by one phase are required in the computation of subsequent phases. Parallelism is realized within a single phase, but between phases, a barrier is imposed which requires that all work from one phase be completed before the next phase is begun. Under the producer-consumer model, this means that all the consumers in the system must wait for all the producers at a common synchronization point.

In contrast, a fine-grain solution provides synchronization at the data level. Instead of waiting on all the producers, fine-grain synchronization allows a consumer to wait only for the data that it is trying to consume. Once the needed data is made available by the producer(s), the consumer is allowed to continue processing.

Fine-grain synchronization provides two primary benefits over coarse-grain synchronization. Unnecessary waiting is avoided because a consumer waits only for the data it needs. Global communication is eliminated because consumers communicate only with those producers upon which they depend. The significance of the first benefit is that parallelism is not artificially limited. Barriers impose false dependencies and thus inhibit parallelism because of unnecessary waiting. The significance of the second benefit is that each fine-grain synchronization operation is much less costly than a barrier. This means that synchronizations can occur more frequently without incurring significant overhead.

It is important to emphasize that these benefits are manifested by the expressiveness of fine-grain synchronization; they will be felt regardless of the underlying hardware implementation. This observation is important because it underscores the fact that fine-grain expression of synchronization and the implementation of synchronization primitives are orthogonal issues.

In this paper, we identify three mechanisms to support fine-grain synchronization. They are:

- Language-level support for the expression of fine-grain synchronization.
- Memory hardware support to compactly store synchronization state.
- Processor hardware support to operate efficiently on synchronization state.

The first component of support provides the programmer with a means to express synchronization at a fine granularity, resulting in increased parallelism. Another attractive consequence is simpler, more elegant code [3]. The second component of support addresses the fact that an application using fine-grain synchronization will need a large synchronization name space. Providing special synchronization state can lead to an efficient implementation from the standpoint of the memory system. We refer to this benefit as memory efficiency. Finally, the last component of support addresses the fact that synchronizations will occur frequently. Therefore, support for the manipulation of synchronization objects can reduce the number of processor cycles incurred. We refer to this benefit as cycle efficiency.

3 Preconditioned Conjugate Gradient

The Conjugate Gradient (CG) algorithm is a semi-iterative method for solving a system of linear algebraic equations expressed in matrix notation as Ax = b. The rate of convergence of the CG method can be improved substantially by preconditioning the system of equations with a matrix K^-1 and then applying the CG method to the preconditioned system. The idea is to choose a preconditioner such that K^-1 A is close to the identity matrix I [5]. Since the operations in the basic CG method consist of vector updates, inner products, and sparse matrix-vector multiplies, efficient parallel versions of the algorithm have been demonstrated on many vector machines and MIMD multiprocessors [4, 6].

Preconditioned CG methods, however, have not enjoyed the same success. In many of the most popular preconditioning techniques, the preconditioner steps involve recurrence relations which do not vectorize or parallelize easily. Algorithmic solutions have been proposed which use different preconditioning techniques to obtain better parallel performance [8, 11, 13, 14]; however, these approaches commonly suffer from reduced convergence rates. Also, some of these algorithms target a specific number of processors and are too complex to generalize to arbitrary machine configurations.

In this paper, we study the preconditioned CG method known as the Modified Incomplete Cholesky Factorization Conjugate Gradient in 3 Dimensions (MICCG3D). Although our study centers around a particular implementation, the general problem being addressed involves increasing parallel performance through the recurrence relations in the preconditioner steps, a problem that is common to almost all preconditioned iterative methods. Therefore, our solution using fine-grain synchronization has general consequences for the large number of algorithms that MICCG3D represents.

3.1 MICCG3D

MICCG3D is a preconditioned conjugate gradient method that assumes the coefficient matrix A in the system Ax = b is sparse and symmetric positive definite (SPD). Such a matrix commonly arises from the discretization of elliptic partial differential equations. In this paper, we use MICCG3D to solve Laplace's equation in three dimensions.

Discretizing Laplace's equation in three dimensions using a standard 7-point discretization results in an SPD coefficient matrix A that has 7 non-zero diagonals; all other elements are zero. Since matrix A is symmetric, we can write A = L + diag(A) + L^T, where L is a lower triangular matrix. Using the incomplete Cholesky factorization method, we obtain an approximate L-U factorization of matrix A which we denote as K. K can be computed as follows and is given in [13]:

    K = (L + D) D^-1 (D + L^T)                                    (1)

In this expression, L is the same lower triangular matrix mentioned above, and D is a diagonal matrix which can be easily computed from matrix A (see [13]). Since K is an approximate factorization of A, we use K^-1 as the preconditioning matrix and apply the conjugate gradient method to the preconditioned system.

As mentioned earlier, the challenge of MICCG3D lies in parallelizing the vector solution step involving the preconditioner (which we shall refer to as the "solver operation"). Table 1 shows a cycle breakdown for one iteration of MICCG3D on a problem size of 8x8x8, where the problem size nx x ny x nz signifies the degree of discretization in the x, y, and z dimensions, respectively. The numbers were acquired from a single-processor simulation of the Alewife machine, which will be described in Section 5. Notice the solver is the most costly vector operation. If poor parallel performance is suffered in this part of the application, the potential parallel performance of the entire application will be severely limited.

    Vector Operation                 Cycles      %
    Vector Update                     10589   6.17
    Sparse Matrix-Vector Multiply     56099   32.7
    Inner Product                      6212   3.62
    Vector Update                      9295   5.41
    Vector Update                      9288   5.41
    Solver                            74058   43.1
    Inner Product                      6212   3.62

Table 1: Cycle breakdown of one iteration of MICCG3D.

3.2 Parallelization Issues

MICCG3D is difficult to parallelize because the recurrence relations in the solver operation impose data dependencies which are numerous and complex. The solver computes w_i, the residual vector in the preconditioned system. w_i is given by

    w_i = K^-1 b - K^-1 A x = K^-1 r_i                            (2)

where x is the solution vector of the current iteration step and r_i = b - Ax is the residual vector in the original system without preconditioning. Although K^-1 is the preconditioner, actually calculating it is infeasible because it is the inverse of a sparse matrix (and thus will be dense). Therefore, instead of solving equation 2, we solve

    K w_i = r_i                                                   (3)

Since we have the factorization of K as the product of a lower triangular matrix and an upper triangular matrix (equation 1), we can solve for w_i by first employing back substitution followed by forward substitution. As an example, the backward substitution step can be expressed as follows:

    w_i = (r_i - l_i;2 w_i-1 - l_i;3 w_i-nx - l_i;4 w_i-nx*ny) / l_i;1    (4)

where L is the lower triangular factor in K and l_i;j is the ith element in the jth non-zero diagonal away from the center diagonal in matrix L. Because of the recurrence in w, it is not possible to perform the entire backward substitution step in parallel. A similar problem exists for the forward substitution step.

The dependencies imposed by the recurrence relation in equation 4 result in what is known as "wavefront computation." This form of computation derives its name from the fact that the solutions which can be computed in parallel at any instant in time form a wavefront in the solution space which propagates forward as time elapses. Figure 1 shows a snapshot of the wavefront in the three-dimensional solution space of MICCG3D.

Figure 1: Computational wavefront at an instant in time. nx, ny, and nz denote the discretization degree in the x, y, and z dimensions, respectively.

Wavefront computation in MICCG3D is difficult to parallelize for two reasons. First, the parallelism is not uniform. While there is sufficient parallelism in the middle of the computation, there is very little parallelism at the beginning and the end of the computation. Second, the dependencies exist across all three spatial dimensions. That is, an element can be computed only if all the elements to the left of it, behind it, and below it have been computed. Consequently, it is impossible to choose any cartesian axis in the solution space along which to partition work for the different processors and simultaneously avoid heavy dependencies.

The difficulty in performing computations involving recurrence relations is well known. Table 2 shows performance numbers on some vector computers as presented in [4]. The first column of numbers shows the absolute peak floating point performance of the machine. The remaining columns give the maximum performance in MFlops on each of the four vector operations that appear in MICCG3D. Notice how performance degrades for recurrence relations as compared to the other vector operations. Although only results for 2-term recurrence relations are given, the general trends apply to the solver operation in MICCG3D (which involves a 3-term recurrence relation).

    Machine         Peak MFlops   Vector Update   Inner Product   Sparse MxV   2-Term Recur
    CRAY-1                  160              46              75           54             10
    CRAY-2                 1951              71              88           77            8.5
    CRAY X-MP               941             166             166          178            5.7
    CYBER 205               400             200             100          100            2.9
    ETA-10P                 333             167            83.3         83.3            1.6
    IBM 3090                800              21              29           27            6.5
    NEC SX/2               1300             548             905          843             21
    Alliant FX/80           188            17.1            30.5         16.8              3

Table 2: Performance of vector operations in MICCG3D on vector machines. The last column is for the solution of a 2-term recurrence relation.

4 Parallel Implementation

In this section, we discuss two ways of parallelizing MICCG3D. One uses coarse-grain barrier synchronization and the other uses fine-grain data-level synchronization. Since we will study these implementations side-by-side in Section 6, it is important that they are in some sense a fair comparison. We believe our two implementations present a fair comparison based on the fact that both require equal programming effort. We do not claim that either implementation is the best that can be done; however, we are confident that they are both reasonable implementations.

4.1 Coarse-Grain MICCG3D

In the coarse-grain approach, we partition the solution space along the z dimension and assign nz/P contiguous planes in the solution space to each processor, where P is the number of processors. To maximize physical locality of data reference, we ensure that all planes assigned to a processor are allocated in that processor's local memory. For this partitioning of the data, communication occurs only in the sparse matrix-vector multiply and solver operations. Furthermore, communication occurs only between adjacent processors and only when computing elements in the solution space which reside in planes at partition boundaries.

Barriers are placed in between vector operations to ensure that results are fully completed before being used in subsequent computation. For all but the solver operation, this is sufficient to guarantee correctness. Dependencies arising from the recurrence relation in the solver require further use of barriers. Computation in the solver is further sectioned into k parts along the x dimension. Each processor computes all the results in one block and enters a barrier before it is allowed to move on to the next block. Dependencies between blocks across processors must be enforced by staggering the computation. This process is illustrated in Figure 2 on an example that has been partitioned for four processors with k equal to four. The blocks that are computed in parallel between barriers are filled with the same hash pattern. Staggering the blocks results in a staircase-like propagation of computation.

Figure 2: Coarse-grain parallel implementation of the solver operation for 4 processors. Similarly shaded blocks are computed in parallel. Pi denotes processor i, and k is the number of computational blocks in the x-dimension.

Notice this scheme suffers from limited parallelism. At the beginning of the solver operation, when processor P0 computes its first block, processors P1, P2, and P3 must remain idle. After the first barrier, P0 moves on to its second block, but only P1 is allowed to start computing; P2 and P3 are still idle, and so on. Not until P0 is on its 4th block are all processors busy (note the same problem occurs at the end of the solver operation). The degree to which parallelism is limited depends on the value of k. The larger k is, the finer the grain of data associated with each synchronization and thus the smaller the amount of serialization. For all the simulations reported in Section 6, we used k = P.

To find an upper bound on speedup in the solver computation, we observe that sequential execution time is proportional to kP, the total number of blocks. Parallel execution time is proportional to the number of blocks per processor, k, added to the number of intervals between barriers each processor spends idling, P - 1. Taking the ratio of sequential to parallel execution time gives the upper bound on speedup:

    S_upper = kP / (k + P - 1)                                    (5)

Notice this is only an upper bound because it ignores the overhead of barrier operations, which becomes more significant as k increases.

4.2 Fine-Grain MICCG3D

Like the coarse-grain implementation, each processor is assigned nz/P planes partitioned along the z dimension; however, in the fine-grain implementation, these planes are not contiguous. Instead, processors are allocated planes modulo P. This scheme is illustrated in Figure 3. Notice that compared to the coarse-grain implementation, this partitioning scheme results in substantially more communication in the sparse matrix-vector multiply and solver operations because the computation of every element in the solution space depends on a value belonging to a remote processor.

Figure 3: Fine-grain parallel implementation of the solver operation for 4 processors. nx, ny, and nz denote the discretization degree in the x, y, and z dimensions, respectively.

Synchronization is done at the word level using fine-grain synchronization. The need for barriers is completely eliminated except in the inner product, where an implicit barrier occurs in the accumulate of all the individual scalar multiplies. Word-level synchronization automatically enforces the recurrence dependencies in the solver operation. In the fine-grain version of the solver, each processor can compute results as fast as possible. If a thread tries to read a value that has not yet been computed, the semantics of the data-level synchronization force that thread of execution to stop and wait until the value becomes available. Therefore, processors never wait unnecessarily. For this implementation, the theoretical speedup is linear (i.e., S_upper = P).

5 Implementation Environment

The results we report in Section 6 are in the context of the Alewife machine. Alewife consists of a scalable number of homogeneous processing nodes connected in a 2-D mesh topology. The network channels are bidirectional; end-around connections do not exist at the edges of the network. Each Alewife node consists of a 32-bit RISC processor, a floating point unit, 64K bytes of cache memory, 8M bytes of dynamic RAM, and a network routing chip. Although the memory is distributed across processors, a shared-memory abstraction is implemented in hardware which includes support for maintaining cache coherence.

Alewife supports the fine-grain synchronization capabilities described in Section 2. At the language level, L-structure and J-structure constructs allow the expression of synchronization at data-level granularity. L-structures enforce mutual exclusion and J-structures provide producer-consumer synchronization (a detailed discussion of language-level support for fine-grain synchronization in Alewife appears in [7]). Memory efficiency and cycle efficiency as discussed in Section 2 are facilitated by full-empty bits in the memory hardware and fast operations on full-empty bits in the processor hardware, respectively.

Alewife facilitates memory efficiency by allocating a full-empty bit for every word in memory. Each full-empty bit acts as a dedicated hardware synchronization variable. The memory efficiency benefit has two consequences. First, the memory overhead for synchronization objects is low. Without full-empty bits, a programmer would have to explicitly allocate extra memory for every synchronization variable. Second, memory efficiency reduces communication. A synchronized data access on Alewife brings both the datum and the synchronization variable to the processor in one memory system transaction. If the datum and synchronization variable were stored in separate memory locations, two memory transactions would be needed.

Alewife facilitates cycle efficiency by providing special processor hardware that enables the manipulation of a full-empty bit simultaneously with the load or store of its associated data. Thus, a synchronized data access costs no more than a normal load or store. Without hardware support, a synchronized access would take at least three instructions: one to access the full-empty bit, another to test the bit, and a third to access the datum. Alewife uses the result of full-empty tests to conditionally trap the processor. This trapping mechanism is used to identify exceptional cases such as failed synchronizations so they can be handled by software trap handlers. This approach provides efficient hardware support for successful synchronization operations, but relegates the processing of failed synchronizations to less efficient software handlers. The philosophy driving this approach is that successful synchronization operations will be the common case when using fine-grain synchronization.

In Section 6, we will compare the importance of cycle efficiency, memory efficiency, and the benefit of increased parallelism offered by language-level support.

6 Results

Simulation results were obtained for both the coarse-grain and fine-grain implementations on 1, 4, and 16 processor Alewife configurations. A problem size of 16x16x16 was used.

Figure 4 shows the speedups calculated by normalizing execution times against the uniprocessor execution time of the coarse-grain implementation. We see that the fine-grain implementation does consistently better than the coarse-grain implementation.

Figure 4: Coarse-grain and fine-grain speedups on MICCG3D. Problem size = 16x16x16.

The difference in performance can be predominantly attributed to the solver operation. To show this graphically, we recorded a synchronization trace of both solver implementations in Figure 5. 16 processors are shown executing the solver on a 16x16x16 problem size. Black bars represent useful work and the interspersed white space signifies waiting for synchronization. In the fine-grain trace, failed J-structure references (which we will refer to as "JREF misses") are traced by a cross appearing above the bar line of the processor that experienced the event. After every JREF miss, processors idle until the desired value is filled by a producer.
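The J-structure semantics just described (a consumer's read blocks on an empty slot until the producer fills it; a producer writes each slot exactly once) can be sketched in portable C. This is only an illustrative software model under assumed names (`jstruct_t`, `jalloc`, `jset`, `jref` are hypothetical): on Alewife the full/empty state is a hardware full-empty bit co-located with the datum, not a separate flag word guarded by a lock.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical software J-structure: one full/empty flag per element.
   A single mutex/condvar pair stands in for Alewife's hardware
   full-empty bits and trap-on-empty mechanism. */
typedef struct {
    double *val;            /* element storage */
    char   *full;           /* 0 = empty, 1 = full */
    pthread_mutex_t lock;
    pthread_cond_t  filled;
} jstruct_t;

jstruct_t *jalloc(int n) {
    jstruct_t *j = malloc(sizeof *j);
    j->val  = calloc((size_t)n, sizeof *j->val);
    j->full = calloc((size_t)n, 1);    /* all slots start empty */
    pthread_mutex_init(&j->lock, NULL);
    pthread_cond_init(&j->filled, NULL);
    return j;
}

/* jset: producer writes slot i exactly once and marks it full. */
void jset(jstruct_t *j, int i, double v) {
    pthread_mutex_lock(&j->lock);
    j->val[i] = v;
    j->full[i] = 1;
    pthread_cond_broadcast(&j->filled);
    pthread_mutex_unlock(&j->lock);
}

/* jref: consumer blocks until slot i is full, then reads it.
   A blocked call here corresponds to a "JREF miss" in the paper. */
double jref(jstruct_t *j, int i) {
    pthread_mutex_lock(&j->lock);
    while (!j->full[i])
        pthread_cond_wait(&j->filled, &j->lock);
    double v = j->val[i];
    pthread_mutex_unlock(&j->lock);
    return v;
}
```

A consumer thread that calls `jref(j, 5)` before any `jset(j, 5, x)` simply waits, which is exactly the word-level read-after-write enforcement the fine-grain solver relies on; no barrier is needed.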
Comparing the amount of idle time (white space) between the two traces, we see that the coarse-grain implementation waits much more than the fine-grain implementation. (Notice that the two time axes are on different scales; the coarse-grain trace is approximately four times longer than the fine-grain trace). relative computational progress neighboring processors make with respect to one another. This intuition is supported by the simulation results reported in Figure 6. We ran simulations on 4 and 16 processors while varying the problem size and recorded the JREF miss rates under two different failed synchronization policies which we call spinning and backoff. When spinning on a JREF miss, the processor waiting for the JREF continually spins on the missing value. Once the value gets filled by the producer, the consumer immediately reads it and continues computing. This policy allows consumers to consume values very close to when they are produced. In the backoff policy, whenever a processor encounters a failed JREF, it idles for a fixed number of cycles before retrying the read. Backoff allows the producer to make computational progress ahead of the consumer. Figure 6 verifies that backoff achieves a dramatically lower JREF miss rate and also confirms that for non-trivial problem sizes, the miss rate is constant with respect to problem size. Along with a lower JREF miss rate, backoff has the added benefit of reducing false sharing effects in the cache. Table 3 shows a cycle breakdown of the 16 16 16 problem size simulation for both implementations. There are three sources of overhead: waiting for the memory system labeled “cache”, waiting for JREFs labeled “JREF”, and waiting at barriers. Barrier overhead is split into two components. The first, “Bar Time,” is the number of cycles from when the last thread enters the barrier until the last thread leaves the barrier. 
The second is “Bar Skew” which is the number of cycles from when the first thread arrives at the barrier until the last thread arrives at the barrier. “Bar Time” is a measure of the cost of the barrier operation after all threads have arrived, and “Bar Skew” is a measure of skew between the runtimes of the threads. Both are totals across all barriers averaged for each processor. 6.3 Barrier Overhead The most serious barrier overhead reported in Table 3 appears in the coarse-grain implementation of the solver where barriers are used to enforce the dependencies caused by the recurrence relations. To better understand how this overhead effects performance as a function of problem size and machine size, we rederive the theoretical speedup in the coarse-grain solver operation, equation 5, to include barrier overhead. With barrier overhead (but still ignoring communication costs), the time to execute the solver in parallel, Tpar , is Tpar = Tseq + BnB (6) In the next three sections, we discuss each of these sources of overhead and their significance. In particular, we try to extrapolate the behavior as problem size and machine size are increased beyond what we are able to simulate in a reasonable amount of time. 6.1 Memory System Overhead S The overhead of the memory system is consistently one of the smallest overheads in Table 3, and we expect the effect of the memory system to be even less at larger problem sizes. In the coarsegrain implementation, the number of remote accesses will grow with the surface area of each processor partition while the number of local accesses will grow with the volume. Thus, as problem size is increased, the overall cost of memory system overhead will decrease. In the fine-grain implementation, a new J-structure is allocated on each iteration thereby bypassing the need to reset the J-structure in between iterations. This results in no data reuse. 
In a real implementation, J-structures will be reset between iterations and reused thus giving rise to better cache performance. Moreover, because of the static nature of the computation, we expect very naive prefetching to be effective in hiding most of the memory latency in both implementations. upper where Tseq is the sequential execution time, Supper is the theoretical solver speedup, B is the average cost of a barrier synchronization (includes skew), and nB is the number of barriers encountered per processor. Using equation 5, we get k P , 1) + B (k + P , 1) (7) kP where we have used n = k + P , 1 by recognizing that n is T par = T seq ( + B B equivalent to the number of blocks encountered by each processor as discussed in Section 4.1. The solver speedup including barrier overhead S (k; P ) = Tseq =Tpar is S (k; P ) = = 6.2 Controlling J-Structure Synchronization Overhead In the fine-grain implementation of the solver, processor Pi consumes values produced by processor Pi,1 . We expect the number of JREF misses to be related to how far producers and consumers on neighboring processors are apart in their computations. If producers are well ahead of their consumers, then we expect very few JREF misses. If, however, values are consumed immediately after they are produced, then there is a much greater chance for JREF misses. The significance of this observation is that JREF miss rate is not dependent on the problem size; rather, it depends only on the T seq Tseq (k +P kP ,1) + B (k + P , 1) T kP k+P ,1 seq T seq + BkP (8) From Figure 2, we see that the run-length between barriers is proportional to a single block of computation. If we define this run-length as rl , then we can express Tseq as Tseq = rl kP since there are kP blocks in total. 
Therefore, we can rewrite equation 8 as

    S(k, P) = (kP / (k + P - 1)) * rl / (rl + B)                 (9)

Equation 9 is the product of the ideal theoretical solver speedup and a term comparing the average run-length between barriers to the cost of a barrier operation. Notice that these two terms cannot be optimized simultaneously.

Figure 5: Traces in the solver operation of MICCG3D for the coarse-grain and fine-grain implementations. Problem size = 16 × 16 × 16. Bar lines show useful work, and white space shows waiting for synchronization. Crosses indicate JREF misses. The staircase shape is the signature of the back and forward substitution steps.

Table 3: Cycle breakdown for simulations. Problem size = 16 × 16 × 16.

    Coarse-Grain MICCG3D
    Procs      Total     Cache      %    JREF     %   Bar Time      %   Bar Skew      %
        1    6943004    568540   8.19     N/A   N/A      17921  0.258          0    0.0
        4    2769725     85157   3.07     N/A   N/A      73799   2.66     946855  34.19
       16    2428515    179888   7.41     N/A   N/A     945195  38.92     735474  30.28

    Fine-Grain MICCG3D
    Procs      Total     Cache      %    JREF      %   Bar Time      %   Bar Skew      %
        1    6831696    472287   6.91       0    0.0       1270  0.019          0    0.0
        4    2328728    305456  13.12  111627   4.79      11636   0.50      57288   2.46
       16     662230     65929   9.96   36301  11.55      12496   1.89      68996  10.42
“Cache” is waiting on the memory system, “JREF” is waiting on failed J-structure references, “Bar Time” is the cost of all the barriers (without skew), and “Bar Skew” is the total skew between the runtimes of the threads at all the barriers.

Figure 6: JREF miss rate as a function of problem size under the Spinning and Backoff synchronization failure policies. Problem size is given by nx × ny × nz, where nx, ny, and nz are the degree of discretization in the x, y, and z dimensions, respectively.

6.4 Evaluating Support for Fine-Grain Synchronization

Our results thus far indicate that an implementation of MICCG3D that uses fine-grain synchronization performs better than an implementation that uses coarse-grain synchronization. We now investigate the causes of this difference. In particular, we investigate the impact of the cycle efficiency and memory efficiency provided by the Alewife implementation of fine-grain synchronization (discussed in Sections 2 and 5).

To understand the effect of cycle efficiency, we simulated a 16 × 16 × 16 problem size on 4 and 16 processors, varying the cost of a successful JREF between 1 cycle and 21 cycles. This was accomplished by artificially introducing stall cycles immediately before each JREF. The results of these simulations appear in Figure 8 and show the effect on both the solver operation in isolation and on one entire MICCG3D iteration. As we can see, increasing the cost of a successful JREF does not dramatically impact overall runtime. We simulated the cost out to 21 cycles only to show extreme effects; we expect any realistic JREF implementation to cost less than 10 cycles. Notice that the 4 processor simulation is more sensitive to the cost of a successful JREF than the 16 processor simulation (exhibited by a steeper slope).
This is because, for a fixed problem size, there are more JREFs in the critical path of execution on smaller machines; thus, a given increase in the cost of each JREF produces a greater increase in execution time.

The degree to which performance is affected by the cost of a successful JREF depends on the frequency of JREFs: the more frequent JREFs are, the greater the effect of increasing the successful JREF cost. For MICCG3D on 16 processors, we have observed frequencies of approximately 1 JREF every 80 cycles. We were surprised to find such a low JREF frequency.

Figure 7: Barrier overhead as a function of block size for 16 processors. Block size is given by nx × ny × nz, where nx, ny, and nz are the number of elements in one block in the x, y, and z dimensions, respectively.

Making k large increases the theoretical speedup term; however, it reduces the run-length between barriers, rl, which makes the barrier overhead B more significant. Similarly, decreasing k helps the overhead term but lowers the theoretical speedup term.

To understand how barrier overhead behaves for large problem sizes, we simulated one block of computation in the solver on a 16 processor Alewife machine. Observing the barrier overhead for a single block is equivalent to looking at the total barrier overhead for a problem size that is a factor of kP larger, since there are kP blocks in the entire coarse-grain solver operation (see Section 4.1). Even for modest values of P and k, simulating the largest feasible block size is in fact equivalent to looking at fairly large problem sizes.
The result of this experiment appears in Figure 7, which shows the average barrier operation cost (appearing as “BAR TIME”) and the average skew in thread runtimes (appearing as “BAR SKEW”) for various block sizes. Since the cost of a barrier operation depends only on the number of processors, it is constant with respect to block size and is trivial for most block sizes. Skew, however, is a significant source of overhead even at the largest block sizes simulated (roughly 30%). We have observed that skew in MICCG3D is due to nonuniform cache hit rates and network latency across the machine, not to load imbalance.

The frequency of JREFs is determined by three factors: the amount of computation between JREFs, the amount of waiting on the memory system between JREFs, and the amount of waiting on failed synchronizations between JREFs. For this particular application, computation between JREFs is minimal, but for each JREF, at least one remote data value needs to be fetched; this cost is significant. Waiting on failed synchronization attempts also lowers the JREF rate significantly. This effect tends to be greater on large machines since the number of failed JREF attempts increases with more processors. We expect greater synchronization failure rates to be a trend as machine size grows; therefore, on larger machines, it will be more difficult to sustain high JREF rates.

The effect of memory efficiency can be measured by implementing J-structures with explicit synchronization variables for each J-structure element in place of the full-empty bits. This doubles the memory requirement and forces two memory transactions for each JREF. We applied this modification to the solver operation and compared its performance with the original full-empty bit implementation of J-structures in simulations of 4 and 16 processors on 16 × 16 × 16 and 32 × 32 × 32 problem sizes. The results of this experiment appear in Figure 9.

We draw two conclusions from Figure 7.
First, the barrier overhead of 68% reported for our 16 processor simulation in Table 3 is pessimistic and is due to the fact that we can only simulate small problem sizes. Second, although we expect better performance as problem size is scaled, synchronization overhead will still be significant at large problem sizes.

In the left graph of Figure 9, we compare execution times. For each machine size and problem size, there is a group of three bars. The “H” bar (for Hardware) uses the Alewife support for J-structures. The “S1” bar (for Software1) uses explicit synchronization variables for each J-structure element, and “S2” (for Software2) uses the same J-structure implementation as “S1” except that the cache size has been doubled from 64K bytes to 128K bytes. All three bars in a group have been normalized against the “H” execution time; the normalizing constant in raw cycles appears directly above each group (the units are thousands of cycles).
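The “S1” software variant replaces each full-empty bit with an explicit, separately stored synchronization variable. A minimal sketch of that idea follows; it is our own illustration (the class name and use of Python events are invented for exposition), not Alewife code, but it shows why the software version needs a second storage location and a second memory transaction per element.

```python
# Sketch of a software J-structure: a write-once array where each element
# carries an explicit synchronization variable (here a threading.Event)
# instead of a hardware full-empty bit. Illustration only.

import threading

class SoftwareJStructure:
    """Readers of element i block until element i has been produced."""

    def __init__(self, n):
        self.values = [None] * n                            # data storage
        self.full = [threading.Event() for _ in range(n)]   # one sync var per element

    def write(self, i, value):
        self.values[i] = value   # first transaction: store the datum
        self.full[i].set()       # second transaction: mark the element full

    def read(self, i):           # a "JREF"
        self.full[i].wait()      # a failed reference blocks here
        return self.values[i]

js = SoftwareJStructure(8)
consumer_out = []
t = threading.Thread(
    target=lambda: consumer_out.extend(js.read(i) for i in range(8)))
t.start()
for i in range(8):
    js.write(i, i * i)           # producer fills elements in order
t.join()
print(consumer_out)              # each read waited for the matching write
```

With hardware full-empty bits, the flag travels with the datum in one memory word, so the `write` and `read` above each collapse to a single memory transaction.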
Figure 8: Effect of increasing successful JREF cost in both the solver operation and an entire MICCG3D iteration. Problem size = 16 × 16 × 16.

Figure 9: Execution times and overheads in hardware versus software implementations of J-structures in the solver operation. The “H” bars (for “Hardware”) use full-empty bit support, the “S1” bars (for “Software1”) use explicit software variables, and the “S2” bars (for “Software2”) use the same implementation as the “S1” bars except that the cache size has been doubled to study the effects of cache pollution.

In the right graph of Figure 9, the synchronization and cache overhead components for each of the simulations appearing in the left graph are shown. Again, times have been normalized against the hardware time in each group of three, and the normalizing constant appears over each group. Notice that, indeed, cache overhead is significantly higher in the software versions. In general, the JREF overhead for the software versions goes up as well. This is because, for the backoff mechanism (discussed in Section 6.2) to be effective for the software implementations, the backoff time needs to be increased. This change in JREF overhead is most pronounced in the 16 processor, 16 × 16 × 16 problem size simulation. In this simulation, because of the relatively small problem size for the relatively large machine size, total execution time is very short. This means the JREF overhead is dominated by the JREF misses which occur at the beginning and end of the back and forward substitution steps. In these phases of the solver step, JREF misses are frequent, so the increase in backoff time for the software implementation has a significant impact on total JREF overhead.
For more realistic problem sizes, this “JREF cold start” effect will not be significant; instead, we should see a “steady state” JREF overhead. This trend is already visible in the 4 processor 32 × 32 × 32 simulation. Since the JREF miss rate is independent of problem size (shown in Section 6.2), the asymptotic overhead for real problem sizes will be dominated by the cache overhead.

We see that the software version consistently runs about 35% slower. Notice that doubling the cache size does not help, indicating that the difference in performance is not due to cache pollution from the extra synchronization variables in the software version. This is expected because in MICCG3D, the producer-consumer computation exhibits very little data reuse; in fact, all cache misses to J-structures are cold start misses.

Figure 10: Benefits of the fine-grain implementation added incrementally. Bar I uses only coarse-grain expression, bar II allows fine-grain expression, bar III allows fine-grain expression with full-empty bit support, and bar IV allows fine-grain expression with full-empty bits and fast bit operations.

The impact of cycle efficiency is quantified by the difference in execution times between bars III and IV. In both sets of data, this performance gain is small, on the order of 10%. The impact of memory efficiency is quantified by the difference between bars II and III, and indicates a more significant performance improvement of about 40%. But by far the largest gain in performance comes from increased parallelism due to the expressiveness of fine-grain synchronization, as quantified by the difference between bars I and II.
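The parallelism gained from fine-grain expression can be illustrated with a toy latency model. This model is our own, with invented parameters: with a barrier per block, the pipeline of P processors fills in units of whole blocks, while element-level synchronization lets it fill in units of single elements and drops the barriers entirely.

```python
# Toy latency model (our own illustration, not measurements): a P-processor
# wavefront over k blocks per processor, each block holding m elements that
# cost e cycles apiece; B is the barrier cost.

def coarse_time(k, P, m, e, B):
    # Barrier per block: k + P - 1 block wavefronts, each costing one
    # block of work plus one barrier.
    return (k + P - 1) * (m * e + B)

def fine_time(k, P, m, e):
    # Element-level sync: a processor may consume an element as soon as
    # its neighbor produces it, so the pipeline fills in P - 1 element
    # steps rather than P - 1 block steps, with no barriers.
    return (k * m + P - 1) * e

k, P, m, e, B = 16, 16, 64, 16, 6400   # assumed values for illustration
print("coarse-grain cycles:", coarse_time(k, P, m, e, B))
print("fine-grain cycles:  ", fine_time(k, P, m, e))
```

Even with B set to zero, the fine-grain time is lower in this model, reflecting that the gain comes mostly from overlap exposed by finer synchronization rather than from the barrier cost alone.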
The two sets of data together also show that as machine size is scaled, fine-grain expression of synchronization becomes even more critical to application performance in comparison to memory and cycle efficiency. Notice that 16 processors is still a very modest machine size. With further machine scaling, we expect the sharp difference in performance gains observed for the 16 processor data set to be even more pronounced.

6.5 Interpreting the Fine-Grain Performance Gains

The discussion in the previous section examined in detail the impact on application performance of the two components of support for fine-grain synchronization that involve hardware-level support. In this section, we consider the essential results of this study and relate them to the importance of language-level support.

There are two issues which help define the bounds within which this conclusion is valid. First, we recognize that this discussion has only considered machine scaling on a fixed problem size. This is important if we are interested in running a given problem as fast as possible. What if we are interested in scaling problem size while keeping machine size fixed? We expect that a larger problem size will improve the execution time of the coarse-grain implementation more than the fine-grain implementation, so the dramatic difference between bars I and II should decrease with problem size scaling. The question is, how much will it decrease? Recall from Section 6.3 that even for fairly large problem sizes, barrier overhead will remain significant due to skew in the runtimes of threads. Therefore, we predict that limited parallelism will remain a problem in the MICCG3D application and fine-grain expression of synchronization will continue to be important.

We collect data from earlier parts of the paper to show the increase in performance of the MICCG3D application as the components of support for fine-grain synchronization are added incrementally.
Figure 10 shows normalized execution times for the 16 × 16 × 16 problem size of MICCG3D. Two sets of data are shown, one for 4 processors and one for 16 processors. In each set, bar I shows the execution time of MICCG3D implemented with coarse-grain barriers. Bar II shows the execution time when J-structure style synchronization is used; however, the J-structures are implemented purely in software without the benefit of cycle efficiency and memory efficiency. Bar III shows the execution time when the J-structures are supported by full-empty bits that can be accessed in 5 cycles, thus providing memory efficiency (but not cycle efficiency). Finally, bar IV shows the addition of cycle efficiency by allowing single-cycle access to full-empty bits (i.e., full Alewife support for fine-grain synchronization).

Second, we recognize that the importance of memory and cycle efficiency ultimately depends on the frequency of JREFs (recall the discussion in Section 6.4). JREF frequency is determined by the amount of computation, memory system latency, and synchronization failure latency between every JREF. Therefore, the JREF frequency can be increased by employing the following optimizations: more aggressive compiler optimizations can reduce the cost of the computation, prefetching can reduce the overhead of the memory system, and multithreading combined with fast context switching can hide the latency of synchronization failures. The extent to which these optimizations will influence the importance of memory and cycle efficiency relative to fine-grain expression requires further study.

8 Acknowledgments

This research was funded in part by NSF grant # MIP-9012773, in part by DARPA contract # N00014-87-K-0825, and in part by an NSF Presidential Young Investigator Award.
We would also like to acknowledge Norm Rubin [10] for inspiring us to use the MICCG3D application for our study.

7 Summary

Previous application studies have dealt with problems which do not present a challenge for synchronization. To obtain a better understanding of what the synchronization needs of programmers will be, a more comprehensive look into how applications synchronize is needed. MICCG3D, a preconditioned conjugate gradient algorithm using the incomplete Cholesky factorization of the coefficient matrix as a preconditioner, is an application for which synchronization is a challenging problem. It is an important application, having received much attention in previous work [8, 11, 13, 14], and it represents a larger class of preconditioned iterative methods which have traditionally been hard to parallelize. By implementing MICCG3D using both a coarse- and a fine-grain approach, we have discovered that the application benefits greatly from fine-grain synchronization.

For the problem sizes that we simulated, we observe that the implementation using fine-grain synchronization executed 3.7 times faster than the coarse-grain implementation on a 16 processor Alewife machine. Since JREF overhead does not depend on problem size, we expect the fine-grain version to maintain its performance as problem size is scaled. Although the barrier overhead in the coarse-grain implementation will improve as problem size is scaled, simulations show that skew in the runtimes of threads will remain significant for realistic problem sizes. Fundamentally, this is because an increase in problem size produces an increase in the run-length between barriers that is kP times smaller. Even for modest machine sizes, and especially for large machines, we expect run-lengths to be small enough on realistic problem sizes that skew at the barriers will remain significant. Therefore, we anticipate that the fine-grain implementation will sustain a significant performance advantage over the coarse-grain implementation at large problem sizes.

After evaluating the performance of MICCG3D implemented with both coarse- and fine-grain synchronization, we extended our study to understand how fine-grain synchronization achieves its performance advantage. First, we identified three components of support for fine-grain synchronization and enumerated the benefits they provide for applications: increased parallelism through expressiveness, cycle efficiency, and memory efficiency. Next, we ascertained the degree to which each of these benefits was responsible for the performance gains we observed in the fine-grain implementation of MICCG3D. Our conclusion is that for the MICCG3D application, cycle efficiency has the least impact, while memory efficiency provides a more significant 40% increase in performance. But by far the most important benefit for MICCG3D is the ability to increase parallelism by expressing synchronization at a fine granularity. By employing optimizations to reduce the cost of the computation, memory system, and synchronization failures, we expect the contributions of memory and cycle efficiency to become more significant, but further study is needed to quantify this effect.

References

[1] Anant Agarwal, David Chaiken, Kirk Johnson, David Kranz, John Kubiatowicz, Kiyoshi Kurihara, Beng-Hong Lim, Gino Maa, and Dan Nussbaum. The MIT Alewife Machine: A Large-Scale Distributed-Memory Multiprocessor. MIT/LCS/TM 454, MIT Laboratory for Computer Science, June 1991.

[2] Gail Alverson, Robert Alverson, and David Callahan. Exploiting Heterogeneous Parallelism on a Multithreaded Multiprocessor. Workshop on Multithreaded Computers, Proceedings of Supercomputing '91, ACM & IEEE, November 1991.

[3] Arvind and Rishiyur S. Nikhil. I-Structures: Data Structures for Parallel Computing. ACM Transactions on Programming Languages and Systems, Vol. 11, No. 4, pp. 598-632, October 1989.

[4] Jack J. Dongarra, Iain S. Duff, Danny C. Sorensen, and Henk A. van der Vorst. Solving Linear Systems on Vector and Shared Memory Computers. SIAM, Philadelphia, 1991.

[5] Gene H. Golub and Charles F. van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, Maryland, 1983.

[6] Louis A. Hageman and David M. Young. Applied Iterative Methods. Academic Press, New York, 1981.

[7] David Kranz, Beng-Hong Lim, and Anant Agarwal. Low-Cost Support for Fine-Grain Synchronization in Multiprocessors. In Multithreading: A Summary of the State of the Art, Kluwer Academic Publishers, 1992. Also available as MIT/LCS/TM 470, 1992.

[8] Gerard Meurant. Multitasking the Conjugate Gradient Method on the CRAY X-MP/48. Parallel Computing, Vol. 5, pp. 267-280, 1987.

[9] G. M. Papadopoulos and D. E. Culler. Monsoon: An Explicit Token-Store Architecture. Proceedings of the 17th Annual International Symposium on Computer Architecture, pp. 82-91, New York, June 1990.

[10] Norman Rubin. Data Flow Computing and the Conjugate Gradient Method. MCRC-TR-25, Motorola Cambridge Research Center, 1992.

[11] Youcef Saad and Martin H. Schultz. Parallel Implementations of Preconditioned Conjugate Gradient Methods. Research Report YALEU/DCS/RR-425, Yale University, Department of Computer Science, October 1985.

[12] B. J. Smith. Architecture and Applications of the HEP Multiprocessor Computer System. SPIE, 298:241-248, 1981.

[13] Henk A. van der Vorst. High Performance Preconditioning. SIAM Journal on Scientific and Statistical Computing, Vol. 10, No. 6, pp. 1174-1185, November 1989.

[14] Henk A. van der Vorst. The Performance of FORTRAN Implementations for Preconditioned Conjugate Gradients on Vector Computers. Parallel Computing, Vol. 3, pp. 49-58, 1986.