MIPSpro™ Fortran 77 Programmer’s Guide

Document Number 007-2361-002

CONTRIBUTORS
Written by Chris Hogue
Edited by Christina Carey
Illustrated by Gloria Ackley
Production by Julia Lin
Engineering contributions by Bill Johnson, Bron Nelson, Calvin Vu, Marty Itzkowitz, Dick Lee

© Copyright 1994 Silicon Graphics, Inc. All Rights Reserved. This document contains proprietary and confidential information of Silicon Graphics, Inc. The contents of this document may not be disclosed to third parties, copied, or duplicated in any form, in whole or in part, without the prior written permission of Silicon Graphics, Inc.

RESTRICTED RIGHTS LEGEND
Use, duplication, or disclosure of the technical data contained in this document by the Government is subject to restrictions as set forth in subdivision (c) (1) (ii) of the Rights in Technical Data and Computer Software clause at DFARS 52.227-7013 and/or in similar or successor clauses in the FAR, or in the DOD or NASA FAR Supplement. Unpublished rights are reserved under the Copyright Laws of the United States. Contractor/manufacturer is Silicon Graphics, Inc., 2011 N. Shoreline Blvd., Mountain View, CA 94043-1389.

Silicon Graphics and IRIS are registered trademarks, and CASEVision, CHALLENGE, Crimson, Indigo2, IRIS 4D, IRIX, MIPSpro, and POWER CHALLENGE are trademarks of Silicon Graphics, Inc. UNIX is a registered trademark in the United States and other countries, licensed exclusively through X/Open Company, Ltd. VMS and VAX are trademarks of Digital Equipment Corporation.

Portions of this product and document are derived from material copyrighted by Kuck and Associates, Inc.

Contents

Examples ix
Figures xi
Tables xiii

Introduction xv
    Organization xv
    Additional Reading xvi
    Typographical Conventions xvii

1. Compiling, Linking, and Running Programs 1
    Compiling and Linking 2
        Drivers 2
        Compilation 2
        Compiling Multilanguage Programs 4
        Linking Objects 5
        Specifying Link Libraries 7
    Driver Options 7
        Compiling Simple Programs 8
        Specifying Source File Format 8
        Specifying Compiler Input and Output Files 9
        Specifying Target Machine Features 10
        Specifying Memory Allocation and Alignment 10
        Specifying Debugging and Profiling 11
        Specifying Optimization Levels 11
        Controlling Compiler Execution 14
    Object File Tools 14
    Archiver 15
    Run-Time Considerations 15
        Invoking a Program 15
        Maximum Memory Allocations 16
        File Formats 17
        Preconnected Files 18
        File Positions 18
        Unknown File Status 19
        Quad-Precision Operations 19
        Run-Time Error Handling 19
        Floating Point Exceptions 20
2. Storage Mapping 21
    Alignment, Size, and Value Ranges 22
    Access of Misaligned Data 25
        Accessing Small Amounts of Misaligned Data 26
        Accessing Misaligned Data Without Modifying Source 26

3. Fortran Program Interfaces 27
    How Fortran Treats Subprogram Names 28
        Working with Mixed-Case Names 28
        Preventing a Suffix Underscore with $ 29
        Naming Fortran Subprograms from C 29
        Naming C Functions from Fortran 29
        Testing Name Spelling Using nm 30
    Correspondence of Fortran and C Data Types 30
        Corresponding Scalar Types 30
        Corresponding Character Types 32
        Corresponding Array Elements 32
    How Fortran Passes Subprogram Parameters 33
        Normal Treatment of Parameters 34
    Calling Fortran from C 35
        Calling Fortran Subroutines from C 35
        Calling Fortran Functions from C 38
    Calling C from Fortran 40
        Normal Calls to C Functions 41
        Using Fortran COMMON in C Code 43
        Using Fortran Arrays in C Code 44
        Calls to C Using %LOC, %REF and %VAL 45
        Making C Wrappers with mkf2c 48
        Using mkf2c and extcentry 52
        Makefile Considerations 53

4. System Functions and Subroutines 55
    Library Functions 55
    Extended Intrinsic Subroutines 63
        DATE 64
        IDATE 64
        ERRSNS 64
        EXIT 65
        TIME 65
        MVBITS 66
    Extended Intrinsic Functions 67
        SECNDS 67
        RAN 67

5. Scalar Optimizations 69
    Overview 69
    Performing General Optimizations 71
        Enabling Loop Fusion 71
        Controlling Global Assumptions 71
        Setting Invariant IF Floating Limits 72
        Setting the Optimization Level 74
        Controlling Variations in Round Off 76
        Controlling Scalar Optimizations 78
        Using Vector Intrinsics 79
    Performing Advanced Optimizations 82
        Using Aggressive Optimization 82
        Controlling Internal Table Size 83
        Performing Memory Management Transformations 84
        Enabling Loop Unrolling 86
        Recognizing Directives 88
        Specifying Recursion 89

6. Inlining and Interprocedural Analysis 91
    Overview 91
    Using Command Line Options 92
    Specifying Routines for Inlining or IPA 93
    Specifying Occurrences for Inlining and IPA 94
    Specifying Where to Search for Routines 97
    Creating Libraries 98
    Conditions That Prevent Inlining and IPA 100

7. Fortran Enhancements for Multiprocessors 103
    Overview 104
    Parallel Loops 104
    Writing Parallel Fortran 105
        C$DOACROSS 106
        C$& 112
        C$ 112
        C$MP_SCHEDTYPE and C$CHUNK 113
        Nesting C$DOACROSS 113
    Analyzing Data Dependencies for Multiprocessing 114
    Breaking Data Dependencies 120
    Work Quantum 126
    Cache Effects 128
        Performing a Matrix Multiply 129
        Understanding Trade-Offs 129
        Load Balancing 131
    Advanced Features 133
        mp_block and mp_unblock 133
        mp_setup, mp_create, and mp_destroy 134
        mp_blocktime 134
        mp_numthreads, mp_set_numthreads 135
        mp_my_threadnum 135
        Environment Variables: MP_SET_NUMTHREADS, MP_BLOCKTIME, MP_SETUP 136
        Environment Variables: MP_SUGNUMTHD, MP_SUGNUMTHD_VERBOSE, MP_SUGNUMTHD_MIN, MP_SUGNUMTHD_MAX 137
        Environment Variables: MP_SCHEDTYPE, CHUNK 138
        mp_setlock, mp_unsetlock, mp_barrier 138
        Local COMMON Blocks 138
        Compatibility With sproc 139
    DOACROSS Implementation 140
        Loop Transformation 140
        Executing Spooled Routines 142
    PCF Directives 143
        Parallel Region 145
        PCF Constructs 146
        Restrictions 157
        A Few Words About Efficiency 158

8. Compiling and Debugging Parallel Fortran 159
    Compiling and Running 159
        Using the -static Option 160
        Examples of Compiling 160
    Profiling a Parallel Fortran Program 161
    Debugging Parallel Fortran 162
        General Debugging Hints 162
9. Fine-Tuning Program Execution 165
    Overview 166
        Directives 166
        Assertions 168
    Fine-Tuning Scalar Optimizations 170
        Controlling Internal Table Size 170
        Setting Invariant IF Floating Limits 170
        Optimization Level 172
        Variations in Round Off 173
        Controlling Scalar Optimizations 174
        Enabling Loop Unrolling 174
    Fine-Tuning Inlining and IPA 175
    Using Equivalenced Variables 176
    Using Assertions 176
        Using Aliasing 177
        C*$* ASSERT [NO] ARGUMENT ALIASING 177
        C*$* ASSERT RELATION 178
    Fine-Tuning Global Assumptions 179
        C*$* ASSERT [NO]BOUNDS VIOLATIONS 179
        C*$* ASSERT NO EQUIVALENCE HAZARD 180
        C*$* ASSERT [NO] TEMPORARIES FOR CONSTANT ARGUMENTS 181
    Ignoring Data Dependencies 182

A. Run-Time Error Messages 183

Index 191

Examples

Example 3-1 Example Subroutine Call 34
Example 3-2 Example Function Call 34
Example 3-3 Example Fortran Subroutine with COMPLEX Parameters 36
Example 3-4 C Declaration and Call with COMPLEX Parameters 36
Example 3-5 Example Fortran Subroutine with String Parameters 36
Example 3-6 C Program that Passes String Parameters 37
Example 3-7 C Program that Passes Different String Lengths 37
Example 3-8 Fortran Function Returning COMPLEX*16 38
Example 3-9 C Program that Receives COMPLEX Return Value 39
Example 3-10 Fortran Function Returning CHARACTER*16 39
Example 3-11 C Program that Receives CHARACTER*16 Return 40
Example 3-12 C Function Written to be Called from Fortran 41
Example 3-13 Common Block Usage in Fortran and C 43
Example 3-14 Fortran Program Sharing an Array in Common with C 44
Example 3-15 C Subroutine to Modify a Common Array 44
Example 3-16 Fortran Function Calls Using %VAL 46
Example 3-17 Fortran Call to gmatch() Using %REF 47
Example 3-18 Fortran Call to gmatch() Using %VAL(%LOC()) 48
Example 3-19 C Function Using varargs 51
Example 3-20 C Code to Retrieve Hidden Parameters 51
Example 3-21 Source File for Use with extcentry 52

Figures

Figure 1-1 Compilation Process 3
Figure 1-2 Compiling Multilanguage Programs 5
Figure 1-3 Linking 6
Figure 3-1 Correspondence Between Fortran and C Array Subscripts 33

Tables

Table 1-1 Link Libraries 6
Table 1-2 Compile Options for Source File Format 8
Table 1-3 Compile Options that Select Files 9
Table 1-4 Compile Options for Target Machine Features 10
Table 1-5 Compile Options for Memory Allocation and Alignment 10
Table 1-6 Compile Options for Debugging and Profiling 11
Table 1-7 Compile Options for Optimization Control 12
Table 1-8 Power Fortran Defaults for Optimization Levels 13
Table 1-9 Compile Options for Compiler Phase Control 14
Table 1-10 Preconnected Files 18
Table 2-1 Size, Alignment, and Value Ranges of Data Types 22
Table 2-2 Valid Ranges for REAL*4 and REAL*8 Data Types 23
Table 2-3 Valid Ranges for REAL*16 Data Type 23
Table 3-1 Corresponding Fortran and C Data Types 31
Table 3-2 How mkf2c treats Function Arguments 49
Table 4-1 Summary of System Interface Library Routines 56
Table 4-2 Overview of System Subroutines 63
Table 4-3 Information Returned by ERRSNS 65
Table 4-4 Arguments to MVBITS 66
Table 4-5 Function Extensions 67
Table 5-1 Optimization Options 70
Table 5-2 Vector Intrinsic Function Names 82
Table 5-3 Recommended Cache Option Settings 85
Table 6-1 Inlining and IPA Options 92
Table 6-2 Inlining and IPA Search Command Line Options 97
Table 6-3 Filename Extensions 97
Table 7-1 Summary of PCF Directives 144
Table 9-1 Directives Summary 167
Table 9-2 Assertions and Their Duration 168
Table A-1 Run-Time Error Messages 184

Introduction

This manual provides information on implementing Fortran 77
programs using the MIPSpro™ Fortran 77 compiler on IRIX™ 6.0.1 Power CHALLENGE, Power CHALLENGE Array, and Power Indigo systems. This implementation of Fortran 77 contains full American National Standards Institute (ANSI) Programming Language Fortran (X3.9–1978). Extensions provide full VMS Fortran compatibility to the extent possible without the VMS operating system or VAX data representation. This implementation of Fortran 77 also contains extensions that provide partial compatibility with programs written in SVS Fortran.

Organization

This manual contains the following chapters and appendix:

• Chapter 1, “Compiling, Linking, and Running Programs,” gives an overview of the components of the compiler system, and describes how to compile, link, and execute a Fortran program. It also describes special considerations for programs running on IRIX systems, such as file formats and error handling.
• Chapter 2, “Storage Mapping,” describes how the Fortran compiler implements size and value ranges for various data types and how they are mapped to storage. It also describes how to access misaligned data.
• Chapter 3, “Fortran Program Interfaces,” provides reference and guide information on writing programs in Fortran and C that can communicate with each other. It also describes the process of generating wrappers for C routines called by Fortran.
• Chapter 4, “System Functions and Subroutines,” describes functions and subroutines that a program can use to communicate with the IRIX operating system.
• Chapter 5, “Scalar Optimizations,” describes the scalar optimizations you can enable from the command line.
• Chapter 6, “Inlining and Interprocedural Analysis,” explains how to perform inlining and interprocedural analysis by specifying options to the compiler.
• Chapter 7, “Fortran Enhancements for Multiprocessors,” describes programming directives for running Fortran programs in a multiprocessor mode.
• Chapter 8, “Compiling and Debugging Parallel Fortran,” describes and illustrates compilation and debugging techniques for running Fortran programs in a multiprocessor mode.
• Chapter 9, “Fine-Tuning Program Execution,” describes how to fine-tune program execution by specifying assertions and directives in your source program.
• Appendix A, “Run-Time Error Messages,” lists the error messages that can be generated during program execution.

Additional Reading

Refer to the MIPSpro Fortran 77 Language Reference Manual for a description of the Fortran 77 language as implemented on Silicon Graphics systems.

Refer to the MIPS Compiling and Performance Tuning Guide for information on the following topics:

• an overview of the compiler system
• improving program performance by using the profiling and optimization facilities of the compiler system
• general discussion of performance tuning
• the dump utilities, archiver, debugger, and other tools used to maintain Fortran programs

Refer to the MIPSpro Porting and Transition Guide for information on:

• an overview of the 64-bit compiler system
• language implementation differences
• porting source code to the 64-bit system
• compilation and run-time issues

For information on interfaces to programs written in assembly language, refer to the MIPSpro Assembly Language Programmer's Guide.

Refer to the CASEVision™/WorkShop Pro MPF User’s Guide for information about using WorkShop Pro MPF.
Typographical Conventions

The following conventions and symbols are used in the text to describe the form of Fortran statements:

Bold          Indicates literal command line options, filenames, keywords, function/subroutine names, pathnames, and directory names.
Italics       Represents user-defined values. Replace the item in italics with a legal value. Italics are also used for command names, manual page names, and manual titles.
Courier       Indicates command syntax, program listings, computer output, and error messages.
Courier bold  Indicates user input.
[ ]           Enclose optional command arguments.
( )           Following function/subroutine names, surround the arguments (or are empty if the function has no arguments). Following IRIX commands, surround the manual page section in which the command is described.
{ }           Enclose two or more items from which you must specify exactly one.
|             Separates two or more optional items.
...           Indicates that the preceding optional items can appear more than once in succession.
#             IRIX shell prompt for the superuser.
%             IRIX shell prompt for users other than the superuser.

Here are two examples illustrating the syntax conventions.

DIMENSION a(d) [,a(d)] …

indicates that the Fortran keyword DIMENSION must be written as shown, that the user-defined entity a(d) is required, and that one or more of a(d) can be optionally specified. Note that the pair of parentheses ( ) enclosing d is required.

{STATIC | AUTOMATIC} v [,v] …

indicates that either the STATIC or AUTOMATIC keyword must be written as shown, that the user-defined entity v is required, and that one or more v items can be optionally specified.

1. Compiling, Linking, and Running Programs

This chapter contains the following major sections:

• “Compiling and Linking” describes the compilation environment and how to compile and link Fortran programs. This section also contains examples that show how to create separate linkable objects written in Fortran, C, or other languages supported by the compiler system and how to link them into an executable object program.
• “Driver Options” gives an overview of debugging, profiling, optimizing, and other options provided with the Fortran f77 driver.
• “Object File Tools” briefly summarizes the capabilities of the dump, dis, nm, file, size and strip programs that provide listing and other information on object files.
• “Archiver” summarizes the functions of the ar program that maintains archive libraries.
• “Run-Time Considerations” describes how to invoke a Fortran program, how the operating system treats files, and how to handle run-time errors.

Also refer to the Fortran Release Notes for a list of compiler enhancements, possible compiler errors, and instructions on how to circumvent them.

Compiling and Linking

Drivers

Programs called drivers invoke the major components of the compiler system: the C preprocessor, the Fortran compiler, the optimizing code generator, and the linker. The f77 command runs the driver that causes your programs to be compiled, optimized, assembled, and linked.

The format of the f77 driver command is as follows:

f77 [option] … filename [option]

where

f77       invokes the various processing phases that compile, optimize, assemble, and link the program.
option    represents the driver options through which you provide instructions to the processing phases. Options can appear anywhere in the command line. These options are discussed later in this chapter.
filename  is the name of the file that contains the Fortran source statements. The filename must always have the suffix .f, .F, .for, .FOR, or .i; for example, myprog.f.

Compilation

The driver command f77 can both compile and link a source module. Figure 1-1 shows the primary driver phases and their principal inputs and outputs for the source module more.f.

[Figure 1-1 Compilation Process: more.f is processed in turn by cpp, the Fortran front end, the optimizing code generator, and the linker, producing the object file more.o and the executable a.out.]

Note the following:

• The source file ends with one of the required suffixes .f, .F, .for, .FOR, or .i.
• The source file is passed through the C preprocessor, cpp, by default. cpp does not recognize Hollerith strings and may misinterpret a character sequence in a Hollerith string that looks like a C-style comment or a macro. The -nocpp option prevents this misinterpretation. (See the -nocpp option in “Driver Options” on page 7 for details.) In the example

  % f77 myprog.f -nocpp

  the file myprog.f is not preprocessed by cpp.
• The driver produces a linkable object file when you specify the -c driver option. This file has the same name as the source file, except with the suffix .o. For example, the command line

  % f77 more.f -c

  produces the more.o file in the above example.
• The default name of the executable object file is a.out. For example, the command line

  % f77 myprog.f

  produces the executable object a.out.
• You can specify a name other than a.out for the executable object by using the driver option -o name, where name is the name of the executable object. For example, the command line

  % f77 myprog.o -o myprog

  links the object module myprog.o and produces an executable object named myprog.
• The command line

  % f77 myprog.f -o myprog

  compiles and links the source module myprog.f and produces an executable object named myprog.

Compiling Multilanguage Programs

The compiler system provides drivers for other languages, including C and C++. If one of these drivers is installed in your system, you can compile and link your Fortran programs with modules written in the language supported by the driver. (See the MIPS Compiling and Performance Tuning Guide for a list of available drivers and the commands that invoke them; refer to Chapter 3, “Fortran Program Interfaces,” in this manual for conventions you must follow when writing Fortran program interfaces to C programs.)

When your application has two or more source programs written in different languages, you should compile each program module separately with the appropriate driver and then link them in a separate step. Create objects suitable for linking by specifying the -c option, which stops the driver immediately after the assembler phase. For example,

% cc -c main.c
% f77 -c rest.f

The two command lines shown above produce linkable objects named main.o and rest.o, as illustrated in Figure 1-2.

[Figure 1-2 Compiling Multilanguage Programs: main.c is processed by the C preprocessor, the C front end, and the code generator to produce main.o; rest.f is processed by the C preprocessor, the Fortran front end, and the code generator to produce rest.o.]

Linking Objects

You can use the f77 driver command to link separate objects into one executable program when any one of the objects is compiled from a Fortran source. The driver recognizes the .o suffix as the name of a file containing object code suitable for linking and immediately invokes the linker.
The following command links the objects created in the last example:

% f77 -o myprog main.o rest.o

You can also use the cc driver command, as shown below:

% cc -o myprog main.o rest.o -lftn -lm

Figure 1-3 shows the flow of control for this link.

[Figure 1-3 Linking: the linker combines main.o and rest.o with the C and Fortran link libraries to produce the executable.]

Both f77 and cc use the C link library by default. However, the cc driver command does not know the names of the link libraries required by the Fortran objects; therefore, you must specify them explicitly to the linker using the -l option as shown in the example. The characters following -l are shorthand for link library files, as shown in Table 1-1.

Table 1-1 Link Libraries

-l    Link Library                     Contents
ftn   /usr/lib64/nonshared/libftn.a    Intrinsic function, I/O, multiprocessing, IRIX interface, and indexed sequential access method library for nonshared linking and compiling
ftn   /usr/lib64/libftn.so             Same as above, except for shared linking and compiling (this is the default library)
m     /usr/lib64/libm.so               Mathematics library

See the section called “FILES” in the f77(1) manual page for a complete list of the files used by the Fortran driver. Also refer to the ld(1) manual page for information on specifying the -l option.

Specifying Link Libraries

You may need to specify libraries when you use IRIX system packages that are not part of a particular language. Most of the manual pages for these packages list the required libraries. For example, the getwd(3B) subroutine requires the BSD compatibility library libbsd.a. Specify this library as follows:

% f77 main.o more.o rest.o -lbsd

To specify a library created with the archiver, type in the pathname of the library, as shown below.

% f77 main.o more.o rest.o libfft.a

Note: The linker searches libraries in the order you specify. Therefore, if you have a library (for example, libfft.a) that uses data or procedures from -lm, you must specify libfft.a first.

Driver Options

This section contains an overview of the Fortran-specific driver options. The f77(1) reference page has a complete description of the compiler options; this discussion covers only the relationships between some of the options, to help you make sense of the many options in the reference page. For further information, you can review:

• The MIPS Compiling and Performance Tuning Guide for a discussion of the compiler options that are common to all MIPSpro compilers.
• The fopt(1) reference page for options related to the scalar optimizer.
• The pfa(1) reference page for options related to the parallel optimizer.
• The ld(1) reference page for a description of the linker options.

Tip: The command f77 -help lists all compiler options for quick reference. Use the -show option to have the compiler document each phase of execution, showing the exact default and nondefault options passed to each.

Compiling Simple Programs

You need only a very few compiler options when you are compiling a simple program. Examples of simple programs include:

• Test cases used to explore algorithms or Fortran language features
• Programs that are principally interactive
• Programs whose performance is limited by disk I/O
• Programs you will execute under a debugger

In these cases you need only specify -g for debugging, the target machine architecture, and the word length. For example, to compile a single source file to execute under dbx on a Power Challenge XL, you could use the following commands:
f77 -g -mips4 -64 -o testcase testcase.f
dbx testcase

However, a program compiled in this way takes little advantage of the performance features of the machine. In particular, its speed when doing heavy floating-point calculations will be far slower than the machine is capable of. For simple programs, that is not important.

Specifying Source File Format

The options summarized in Table 1-2 tell the compiler how to treat the program source file.

Table 1-2 Compile Options for Source File Format

Options                                              Purpose
-ansi                                                Report any nonstandard usages.
-backslash                                           Treat \ in character literals as a character, not as the first character of an escape sequence.
-col72, -col120, -extend_source, -noextend_source    Specify margin columns of source lines.
-d_lines                                             Compile lines with D in column 1.
-Dname, -Dname=def, -Uname                           Define, undefine names to the C preprocessor.

Specifying Compiler Input and Output Files

The options summarized in Table 1-3 tell the compiler what output files to generate.

Table 1-3 Compile Options that Select Files

Options                 Purpose
-c                      Generate a single object file for each input file; do not link.
-E                      Run only the macro preprocessor and write its output to standard output.
-I, -Idir, -nostdinc    Specify location of include files.
-listing                Request a listing file.
-MDupdate               Request Makefile dependency output data.
-o                      Specify name of output file.
-S                      Specify only assembly-language source output.

Specifying Target Machine Features

The options summarized in Table 1-4 are used to specify the characteristics of the machine where the compiled program will be used.

Table 1-4 Compile Options for Target Machine Features

Options            Purpose
-32, -64           Whether the target machine runs 64-bit mode (the usual) or 32-bit mode. The -64 option is allowed only with the -mips3 and -mips4 architecture options.
-mips3, -mips4     The instruction architecture available in the target machine: use -mips3 for MIPS R4x00 machines in 64-bit mode; use -mips4 for MIPS R8000 and R10000 machines.
-TARG:option,...   Specify certain details of the target CPU. Most of these options have correct default values based on the preceding options.
-TENV:option,...   Specify certain details of the software environment in which the source module will execute. Most of these options have correct default values based on other, more general values.

Specifying Memory Allocation and Alignment

The options summarized in Table 1-5 tell the compiler how to allocate memory and how to align variables in it. These options can have a strong effect on both program size and program speed.

Table 1-5 Compile Options for Memory Allocation and Alignment

Options                                   Purpose
-align8, -align16, -align32, -align64    Align all variables of size n on n-byte address boundaries.
-d8, -d16                                 Specify the size of DOUBLE and DOUBLE COMPLEX variables.
-i2, -i4, -i8                             Specify the size of INTEGER and LOGICAL variables.
-r4, -r8                                  Specify the size of REAL and COMPLEX variables.
-static                                   Allocate all local variables statically, not dynamically on the stack.
-Gsize, -xgot                             Specify use of the global offset table.

Specifying Debugging and Profiling

The options summarized in Table 1-6 direct the compiler to include more or less extra information in the object file for debugging or profiling.
Table 1-6 Compile Options for Debugging and Profiling

Options               Purpose
-g0, -g2, -g3, -g     Leave more or less symbol-table information in the object file for use with dbx or WorkShop Pro cvd.
-p                    Cause profiling to be enabled when the program is loaded.

For more information on debugging and profiling, see the manuals listed in the preface.

Specifying Optimization Levels

The MIPSpro Fortran 77 compiler contains three optimizer phases. One is part of the compiler “back end”; that is, it operates on the generated code, after all syntax analysis and source transformations are complete. The use of this standard optimizer, which is common to all MIPSpro compilers, is discussed in the MIPS Compiling and Performance Tuning Guide.

In addition, MIPSpro Fortran 77 contains two phases of accelerators, one for scalar optimization and one for parallel array optimization. These operate during the initial phases of the compilation, transforming the source statements before they are compiled to machine language. The options of the scalar optimizer are detailed in the fopt(1) reference page. The options of the parallel optimizer are detailed in the pfa(1) reference page.

Note: These optimizer phases are documented in separate reference pages because, when compiling for 32-bit machines, these phases use a separate product, the Power Fortran Accelerator, which has been integrated into the MIPSpro Fortran 77 compiler.

The options summarized in Table 1-7 are used to communicate with the different optimization phases.

Table 1-7 Compile Options for Optimization Control

Options                    Purpose
-O, -O0, -O1, -O2, -O3     Select basic level of optimization, setting defaults for all optimization phases.
-GCM:option,...            Specify details of global code motion performed by the back-end optimizer.
-OPT:option,...            Specify miscellaneous details of optimization.
-SWP:option,...            Specify details of pipelining done by the back-end optimizer.
-sopt[,option,...]         Request execution of the scalar optimizer, and pass options to it.
-pfa                       Request execution of the parallel source-to-source optimizer.
-WK,option,...             Pass options to either phase of Power Fortran.

When you use -O to specify the optimization level, the compiler assumes default options for the accelerator phases. These defaults are listed in Table 1-8. Remember, to see all options that are passed to a compiler phase, use the -show option.

Table 1-8 Power Fortran Defaults for Optimization Levels

Optimization Level    Power Fortran Defaults Passed
-O0                   -WK,-roundoff=0,-scalaropt=0,-optimize=0
-O1                   -WK,-roundoff=0,-scalaropt=0,-optimize=0
-O2                   -WK,-roundoff=0,-scalaropt=0,-optimize=0
-O3                   -WK,-roundoff=2,-scalaropt=3,-optimize=5
-sopt                 -WK,-roundoff=0,-scalaropt=3,-optimize=5

In addition to optimizing options, the compiler system provides other options that can improve the performance of your programs:

• Two linker options, -G and -bestG, control the size of the global data area, which can produce significant performance improvements. See Chapter 2 of the Compiling, Debugging, and Performance Tuning Guide and the ld(1) reference page for more information.
• The -jmpopt option permits the linker to fill certain instruction delay slots not filled by the compiler front end. This option can improve the performance of smaller programs not requiring extremely large blocks of virtual memory. See the ld(1) reference page for more information.
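To make the defaults in Table 1-8 concrete, the two hypothetical command lines below should behave identically; the second merely spells out the Power Fortran options that -O3 passes implicitly (the source file name is a placeholder). You can confirm the options actually passed on your system with -show.

% f77 -64 -mips4 -O3 -c myprog.f
% f77 -64 -mips4 -O3 -WK,-roundoff=2,-scalaropt=3,-optimize=5 -c myprog.f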
Controlling Compiler Execution

The options summarized in Table 1-9 control the execution of the compiler phases.

Table 1-9 Compile Options for Compiler Phase Control

Options           Purpose
-E, -P            Execute only the C preprocessor.
-fe               Stop compilation immediately after the front-end (syntax analysis) runs.
-M                Run only the macro preprocessor.
-Yc,path          Load the compiler phase specified by c from the specified path.
-Wc,option,...    Pass the specified list of options to the compiler phase specified by c.

Object File Tools

The following tools provide information on object files as indicated:

elfdump    Lists headers, tables, and other selected parts of an ELF-format object or archive file.
dis        Disassembles object files into machine instructions.
nm         Prints symbol table information for object and archive files.
file       Lists the properties of program source, text, object, and other files. This tool often erroneously recognizes command files as C programs. It does not recognize Pascal or LISP programs.
size       Prints information about the text, rdata, data, sdata, bss, and sbss sections of the specified object or archive files. See the a.out(4) manual page for a description of the contents and format of section data.
strip      Removes symbol table and relocation bits.

For more information on these tools, see the MIPS Compiling and Performance Tuning Guide and the dis(1), elfdump(1), file(1), nm(1), size(1), and strip(1) manual pages.

Archiver

An archive library is a file that contains one or more routines in object (.o) file format. The term object as used in this chapter refers to an .o file that is part of an archive library file. When a program calls an object not explicitly included in the program, the link editor ld looks for that object in an archive library. The link editor then loads only that object (not the whole library) and links it with the calling program.

The archiver (ar) creates and maintains archive libraries and has the following main functions:

• copying new objects into the library
• replacing existing objects in the library
• moving objects about the library
• copying individual objects from the library into individual object files

See the Compiling, Debugging, and Performance Tuning Guide and the ar(1) manual page for additional information on the archiver.

Run-Time Considerations

Invoking a Program

To run a Fortran program, invoke the executable object module produced by the f77 command by entering the name of the module as a command. By default, the name of the executable module is a.out. If you included the -o filename option on the ld (or f77) command line, the executable object module has the name that you specified.

Maximum Memory Allocations

The total memory allocation for a program, and in some cases individual arrays, can exceed 2 gigabytes (2 GB, or 2,048 MB).

Previous implementations of Fortran 77 limited the total program size, as well as the size of any single array, to 2 GB. The current release allows the total memory in use by the program to far exceed this. (For details on the memory use of individual scalar values, see “Alignment, Size, and Value Ranges” on page 22.)

Local Variable (Stack Frame) Sizes

Arrays that are allocated on the process stack must not exceed 2 GB, but the total of all stack variables can exceed that limit.
For example:

      parameter (ndim = 16380)
      integer*8 xmat(ndim,ndim), ymat(ndim,ndim),
     &          zmat(ndim,ndim)
      integer k(1073741824)
      integer l(33554432, 256)

However, when an array is passed as an argument, it is not limited in size.

      subroutine abc(k)
      integer k(8589934592_8)

Static and Common Sizes

When compiling with the -static flag, global data is allocated as part of the compiled object (.o) file. The total size of any .o file may not exceed 2 GB. However, the total size of a program linked from multiple .o files may exceed 2 GB.

An individual common block may not exceed 2 GB. However, you can declare multiple common blocks, each having that size.

Pointer-based Memory

There is no limit on the size of a pointer-based array. For example:

      integer*8 ndim
      parameter (ndim = 20001)
      pointer (xptr, xmat), (yptr, ymat), (zptr, zmat),
     &        (aptr, amat)
      xptr = malloc(ndim*ndim*8)
      yptr = malloc(ndim*ndim*8)
      zptr = malloc(ndim*ndim*8)
      aptr = malloc(ndim*ndim*8)

It is important to make sure that malloc is called with an INTEGER*8 value; a count greater than 2 GB would be truncated if assigned to an INTEGER*4.

File Formats

Fortran supports five kinds of external files:

• sequential formatted
• sequential unformatted
• direct formatted
• direct unformatted
• key indexed

The operating system implements other files as ordinary files and makes no assumptions about their internal structure.

Fortran I/O is based on records. When a program opens a direct file or key indexed file, the length of the records must be given. The Fortran I/O system uses the length to make the file appear to be made up of records of the given length. When the record length of a direct file is 1 byte, the system treats the file as an ordinary system file (a byte string, in which each byte is addressable). A READ or WRITE request on such a file consumes bytes until satisfied, rather than restricting itself to a single record.

Because of special requirements, sequential unformatted files will probably be read or written only by Fortran I/O statements. Each record is preceded and followed by an integer containing the length of the record in bytes.

During a READ, Fortran I/O breaks sequential formatted files into records by using each newline indicator as a record separator. The Fortran 77 standard does not define the required result after reading past the end of a record; the I/O system treats the record as being extended by blanks. On output, the I/O system writes a newline indicator at the end of each record. If a user program also writes a newline indicator, the I/O system treats it as a separate record.

Preconnected Files

Table 1-10 shows the standard preconnected files at program start.

Table 1-10 Preconnected Files

Unit #    Unit
5         Standard input
6         Standard output
0         Standard error

All other units are also preconnected when execution begins. Unit n is connected to a file named fort.n. These files need not exist, nor will they be created unless their units are used without first executing an open. The default connection is for sequential formatted I/O.

File Positions

The Fortran 77 standard does not specify where OPEN should initially position a file explicitly opened for sequential I/O. The I/O system positions the file at the start of the file for both input and output. The execution of an OPEN statement followed by a WRITE on an existing file causes the file to be overwritten, erasing any data in the file.
In a program called from a parent process, units 0, 5, and 6 remain where they were positioned by the parent process.

Unknown File Status

When the parameter STATUS="UNKNOWN" is specified in an OPEN statement, the following occurs:

• If the file does not exist, it is created and positioned at the start of the file.
• If the file exists, it is opened and positioned at the beginning of the file.

Quad-Precision Operations

When running programs that contain quad-precision operations, you must run the compiler in round-to-nearest mode. Because this mode is the default, you usually do not need to be concerned with setting it. You usually need to set this mode when writing programs that call your own assembly routines. Refer to the swapRM manual page for details.

Run-Time Error Handling

When the Fortran run-time system detects an error, the following action takes place:

• A message describing the error is written to the standard error unit (unit 0). See Appendix A, “Run-Time Error Messages,” for a list of the error messages.
• A core file is produced if the f77_dump_flag environment variable is set, as described in Appendix A, “Run-Time Error Messages.” You can use dbx to inspect this file and determine the state of the program at termination. For more information, see the dbx Reference Manual.

To invoke dbx using the core file, enter the following:

% dbx binary-file core

where binary-file is the name of the object file output (the default is a.out). For more information on dbx, see the dbx User's Guide.

Floating Point Exceptions

The library libfpe provides two methods for handling floating point exceptions.

Note: Owing to the different architecture of the MIPS R8000 and R10000 processors, library libfpe is not available with the current compiler. It will be provided in a future release. When porting 32-bit programs that depend on trapping exceptions using the facilities in libfpe, you will have to temporarily change the programs to do without it.

The library provides the subroutine handle_sigfpes and the environment variable TRAP_FPE. Both methods provide mechanisms for handling and classifying floating point exceptions, and for substituting new values. They also provide mechanisms to count, trace, exit, or abort on enabled exceptions. See the handle_sigfpes(3F) manual page for more information.

2. Storage Mapping

This chapter contains two sections:

• “Alignment, Size, and Value Ranges” describes how the Fortran compiler implements size and value ranges for various data types, as well as how data alignment occurs under normal conditions.
• “Access of Misaligned Data” describes two methods of accessing misaligned data.

Alignment, Size, and Value Ranges

Table 2-1 contains information about various Fortran scalar data types. (For details on the maximum sizes of arrays, see “Maximum Memory Allocations” on page 16.)

Table 2-1 Size, Alignment, and Value Ranges of Data Types

Type                Synonym          Size        Alignment        Value Range
BYTE                INTEGER*1        8 bits      Byte             -128…127
INTEGER*2                            16 bits     Half word (a)    -32,768…32,767
INTEGER             INTEGER*4 (b)    32 bits     Word (c)         -2^31…2^31-1
INTEGER*8                            64 bits     Double word      -2^63…2^63-1
LOGICAL*1                            8 bits      Byte             0…1
LOGICAL*2                            16 bits     Half word (a)    0…1
LOGICAL             LOGICAL*4 (d)    32 bits     Word (c)         0…1
LOGICAL*8                            64 bits     Double word      0…1
REAL                REAL*4 (e)       32 bits     Word (c)         See Table 2-2
DOUBLE PRECISION    REAL*8 (f)       64 bits     Double word (g)  See Table 2-2
REAL*16                              128 bits    Double word      See Table 2-3
COMPLEX             COMPLEX*8 (h)    64 bits     Double word (c)  See the fourth bullet item below
DOUBLE COMPLEX      COMPLEX*16 (i)   128 bits    Double word (g)  See the fourth bullet item below
COMPLEX*32                           256 bits    Double word      See the fourth bullet item below
CHARACTER                            8 bits      Byte             -128…127

a. Byte boundary divisible by two.
b. When the -i2 option is used, type INTEGER is equivalent to INTEGER*2; when the -i8 option is used, INTEGER is equivalent to INTEGER*8.
c. Byte boundary divisible by four.
d. When the -i2 option is used, type LOGICAL is equivalent to LOGICAL*2; when the -i8 option is used, type LOGICAL is equivalent to LOGICAL*8.
e. When the -r8 option is used, type REAL is equivalent to REAL*8.
f. When the -d16 option is used, type DOUBLE PRECISION is equivalent to REAL*16.
g. Byte boundary divisible by eight.
h. When the -r8 option is used, type COMPLEX is equivalent to COMPLEX*16.
i. When the -d16 option is used, type DOUBLE COMPLEX is equivalent to COMPLEX*32.

The following notes provide details on some of the items in Table 2-1.

• Table 2-2 lists the approximate valid ranges for REAL*4 and REAL*8.

Table 2-2 Valid Ranges for REAL*4 and REAL*8 Data Types

Range                   REAL*4                   REAL*8
Maximum                 3.40282356 * 10^38       1.7976931348623158 * 10^308
Minimum normalized      1.17549424 * 10^-38      2.2250738585072012 * 10^-308
Minimum denormalized    1.40129846 * 10^-45      1.1125369292536006 * 10^-308

• REAL*16 constants have the same form as DOUBLE PRECISION constants, except the exponent indicator is Q instead of D. Table 2-3 lists the approximate valid range for REAL*16. REAL*16 values have an 11-bit exponent and a 107-bit mantissa; they are represented internally as the sum or difference of two doubles. So, for REAL*16, “normal” means that both the high and the low parts are normals.

Table 2-3 Valid Ranges for REAL*16 Data Type

Range                   Precise Exception Mode w/FS Bit Clear               Fast Mode or Precise Exception Mode w/FS Bit Set
Maximum                 1.797693134862315807937289714053023 * 10^308        1.797693134862315807937289714053023 * 10^308
Minimum normalized      2.0041683600089730005034939020703004 * 10^-292      2.0041683600089730005034939020703004 * 10^-292
Minimum denormalized    4.940656458412465441765687928682214 * 10^-324       2.225073858507201383090232717332404 * 10^-308

• Table 2-1 states that REAL*8 (that is, DOUBLE PRECISION) variables always align on a double-word boundary. However, Fortran permits these variables to align on a word boundary if a COMMON statement or equivalencing requires it.
• Forcing INTEGER, LOGICAL, REAL, and COMPLEX variables to align on a halfword boundary is not allowed, except as permitted by the -align8, -align16, and -align32 command line options. See Chapter 1, “Compiling, Linking, and Running Programs.”
• A COMPLEX data item is an ordered pair of REAL*4 numbers; a DOUBLE COMPLEX data item is an ordered pair of REAL*8 numbers; a COMPLEX*32 data item is an ordered pair of REAL*16 numbers.
In each case, the first number represents the real part and the second represents the imaginary part. Therefore, refer to Table 2-2 and Table 2-3 for valid ranges.

• LOGICAL data items denote only the logical values TRUE and FALSE (written as .TRUE. or .FALSE.). However, to provide VMS compatibility, LOGICAL variables can be assigned all integral values of the same size.
• You must explicitly declare an array in a DIMENSION declaration or in a data type declaration. To support DIMENSION, the compiler
  – allows up to seven dimensions
  – assigns a default of 1 to the lower bound if a lower bound is not explicitly declared in the DIMENSION statement
  – creates an array the size of its element type times the number of elements
  – stores arrays in column-major mode
• The following rules apply to shared blocks of data set up by COMMON statements:
  – The compiler assigns data items in the same sequence as they appear in the COMMON statements defining the block. Data items are padded according to the alignment compiler options or the compiler defaults. See “Access of Misaligned Data” on page 25 for more information.
  – You can allocate both character and noncharacter data in the same common block.
  – When a common block appears in multiple program units, the compiler allocates the same size for that block in each unit, even though the size required may differ (due to varying element names, types, and ordering sequences) from unit to unit. The size allocated corresponds to the maximum size required by the block among all the program units, except when a common block is defined by using DATA statements, which initialize one or more of the common block variables. In this case the common block is allocated the same size as when it is defined.

Access of Misaligned Data

The Fortran compiler allows misalignment of data if specified by the use of special options.

As discussed in the previous section, the architecture of the IRIS-4D series assumes a particular alignment of data. ANSI standard Fortran 77 cannot violate the rules governing this alignment. However, many opportunities for misalignment can arise when using common extensions to the dialect. This is particularly true for small integer types, which

• allow intermixing of character and non-character data in COMMON and EQUIVALENCE statements
• allow mismatching the types of formal and actual parameters across a subroutine interface
• provide many opportunities for misalignment to occur

Code using these extensions that compiled and executed correctly on other systems with less stringent alignment requirements may fail during compilation or execution on the IRIS-4D. This section describes a set of options to the Fortran compilation system that allow the compilation and execution of programs whose data may be misaligned. Be forewarned that the execution of programs that use these options is significantly slower than the execution of a program with aligned data.

This section describes the two methods that can be used to create an executable object file that accesses misaligned data.

Accessing Small Amounts of Misaligned Data

Use the first method if the number of instances of misaligned data access is small, or to provide information on the occurrence of such accesses so that misalignment problems can be corrected at the source level.

This method catches and corrects bus errors due to misaligned accesses, which ties the extent of program degradation to the frequency of these accesses.
This method also includes capabilities for producing a report of these accesses to enable their correction.

To use this method, keep the Fortran front end from padding data to force alignment by compiling your program with one of two options to f77:

• Use the -align8 option if your program expects no restrictions on alignment.
• Use the -align16 option if your program expects to be run on a machine that requires half-word alignment.

You must also use the misalignment trap handler. This requires minor source code changes to initialize the handler and the addition of the handler binary to the link step (see the fixade(3f) manual page).

Accessing Misaligned Data Without Modifying Source

Use the second method for programs with widespread misalignment or whose source may not be modified. In this method, the IRIS-4D assembler substitutes a set of special instructions for data accesses whose alignment cannot be guaranteed. The generation of these more forgiving instructions can be selected for each source file independently.

You can invoke this method by specifying one of the alignment options (-align8, -align16) to f77 when compiling any source file that references misaligned data (see the f77(1) manual page). If your program passes misaligned data to system libraries, you might also need to link it with the trap handler. See the fixade(3f) manual page for more information.

3. Fortran Program Interfaces

Sometimes it is necessary to create a program that combines modules written in Fortran and another language. For example:

• In a Fortran program, you need access to a facility that is only available as a C function, such as a member of a graphics library.
• In a program in another language, you need access to a computation that has been implemented as a Fortran subprogram, for example one of the many well-tested, efficient routines in the BLAS library.

Tip: Fortran subroutines and functions that give access to the IRIX system functions and other IRIX facilities already exist, and are documented in Chapter 4 of this manual.

This chapter focuses on the interface between Fortran and the most common other language, C. However, other languages can be called, for example C++.

Note: You should be aware that all compilers for a given version of IRIX use identical standard conventions for passing parameters in generated code. These conventions are documented at the machine instruction level in the MIPSpro Assembly Language Programmer's Guide, which also details the differences in the conventions used in different releases.

How Fortran Treats Subprogram Names

The Fortran compiler normally changes the names of subprograms and named common blocks while it translates the source file. When these names appear in the object file for reference by other modules, they are normally changed in two ways:

• converted to all lowercase letters
• extended with a final underscore ( _ ) character

Normally the following declarations

      SUBROUTINE MATRIX
      function MixedCase()
      COMMON /CBLK/a,b,c

produce the identifiers matrix_, mixedcase_, and cblk_ (all lowercase with an appended underscore) in the generated object file.

Note: The Fortran intrinsic functions are not named according to these rules. The external names of intrinsic functions as defined in the Fortran library are not directly related to the intrinsic function names as they are written in a program. The use of intrinsic function names is discussed in the MIPSpro Fortran 77 Language Reference Manual.
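As a quick check of these naming rules (a hypothetical session; the file name is a placeholder), compile a file containing the declarations above and list the symbols in the resulting object file, as described under “Testing Name Spelling Using nm” later in this chapter:

% f77 -c names.f
% nm names.o

The symbol listing should show matrix_, mixedcase_, and cblk_ rather than the names as written in the source.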
Working with Mixed-Case Names

There is no way to make the Fortran compiler generate an external name containing uppercase letters. If you are porting a program that depends on the ability to call such a name, you must write a C function that takes the same arguments but has a name composed of lowercase letters only. This C function can then call the function whose name contains mixed-case letters.

Note: Previous versions of the Fortran 77 compiler for 32-bit systems supported the -U compiler option, telling the compiler to not force all uppercase input to lowercase. As a result, uppercase letters could be preserved in external names in the object file. As now implemented, this option does not affect the case of external names in the object file.

Preventing a Suffix Underscore with $

You can prevent the compiler from appending an underscore to a name by writing the name with a terminal currency symbol ( $ ). The ‘$’ is not reproduced in the object file; it is dropped, but it prevents the compiler from appending an underscore. The declaration

      EXTERNAL NOUNDER$

produces the name nounder (lowercase, but no trailing underscore) in the object file.

Note: This meaning of ‘$’ in names applies only to subprogram names. If you end the name of a COMMON block with ‘$,’ the name in the object file includes the ‘$’ and ends with an underscore regardless.

Naming Fortran Subprograms from C

In order to call a Fortran subprogram from a C module, you must spell the name the way the Fortran compiler spells it, normally using all lowercase letters and a trailing underscore. A Fortran subprogram declared as follows:

      SUBROUTINE HYPOT()

would typically be declared in a C function as follows (lowercase with a trailing underscore):

extern int hypot_()

You must find out if the subprogram is declared with a terminal ‘$’ to suppress the underscore.

Naming C Functions from Fortran

The C compiler does not modify the names of C functions. C functions can have uppercase or mixed-case names, and they have terminal underscores only when the programmer writes them that way.

In order to call a C function from a Fortran program, you must ensure that the Fortran compiler spells the name correctly. When you control the name of the C function, the simplest solution is to give it a name that consists of lowercase letters with a terminal underscore. For example, the following C function:

int fromfort_() {...}

could be declared in a Fortran program as follows:

      EXTERNAL FROMFORT

When you do not control the name of a C function, you must cause the Fortran compiler to generate the correct name in the object file. Write the C function’s name using a terminal ‘$’ character to suppress the terminal underscore. (You cannot cause the compiler to generate an external name with uppercase letters in it.)

Testing Name Spelling Using nm

You can verify the spelling of names in an object file using the nm command (or the elfdump command with the -t or -Dt options). To see the subroutine and common names generated by the compiler, apply nm to the generated .o (object) or executable file.

Correspondence of Fortran and C Data Types

When you exchange data values between Fortran and C, either as parameters, as function results, or as elements of common blocks, you must make sure that the two languages agree on the size, alignment, and subscript of each data value.

Corresponding Scalar Types

The correspondence between Fortran and C scalar data types is shown in Table 3-1. This table assumes the default precisions. Use of compiler options such as -i2 or -r8 affects the meaning of the words LOGICAL, INTEGER, and REAL.

Table 3-1 Corresponding Fortran and C Data Types

Fortran Data Type                                  Corresponding C Type
BYTE, INTEGER*1, LOGICAL*1                         signed char
CHARACTER*1                                        unsigned char
INTEGER*2, LOGICAL*2                               short
INTEGER (a), INTEGER*4, LOGICAL (a), LOGICAL*4     int or long
INTEGER*8, LOGICAL*8                               long long
REAL (a), REAL*4                                   float
DOUBLE PRECISION, REAL*8                           double
REAL*16                                            long double
COMPLEX (a), COMPLEX*8                             typedef struct{ float real, imag; } cpx8;
DOUBLE COMPLEX, COMPLEX*16                         typedef struct{ double real, imag; } cpx16;
COMPLEX*32                                         typedef struct{ long double real, imag; } cpx32;
CHARACTER*n (n>1)                                  typedef char fstr_n[n];

a. Assuming default precision

The rules governing alignment of variables within common blocks are covered under “Alignment, Size, and Value Ranges” on page 22.

Corresponding Character Types

The Fortran CHARACTER*1 data type corresponds to the C type unsigned char. However, the two languages differ in the treatment of strings of characters.

A Fortran CHARACTER*n (n>1) variable contains exactly n characters at all times. When a shorter character expression is assigned to it, it is padded on the right with spaces to reach n characters.

A C vector of characters is normally sized 1 greater than the longest string assigned to it. It may contain fewer meaningful characters than its size allows, and the end of meaningful data is marked by a null byte. There is no null byte at the end of a Fortran string. (The programmer can create a null byte using the Hollerith constant '\0' but this is not normally done.)

Since there is no terminal null byte, most of the string library functions familiar to C programmers (strcpy(), strcat(), strcmp(), and so on) cannot be used with Fortran string values. The strncpy(), strncmp(), bcopy(), and bcmp() functions can be used because they depend on a count rather than a delimiter.

Corresponding Array Elements

Fortran and C use different arrangements for the elements of an array in memory. Fortran uses column-major order (when iterating sequentially through memory, the leftmost subscript varies fastest), whereas C uses row-major order (the rightmost subscript varies fastest to generate sequential storage locations). In addition, Fortran array indices are normally origin-1, while C indices are origin-0.

To use a Fortran array in C:

• Reverse the order of dimension limits when declaring the array
• Reverse the sequence of subscript variables in a subscript expression
• Adjust the subscripts to origin-0 (usually, decrement by 1)

The correspondence between Fortran and C subscript values is depicted in Figure 3-1. You derive the C subscripts for a given element by decrementing the Fortran subscripts and using them in reverse order; for example, Fortran (99,9) corresponds to C [8][98].

[Figure 3-1 Correspondence Between Fortran and C Array Subscripts: a Fortran element (x,y) corresponds to the C element [y-1][x-1], and a C element [x][y] corresponds to the Fortran element (y+1,x+1).]

For a coding example, see “Using Fortran Arrays in C Code” on page 44.

Note: A Fortran array can be declared with some other lower bound than the default of 1. If the Fortran subscript is origin-0, no adjustment is needed. If the Fortran lower bound is greater than 1, the C subscript is adjusted by that amount.
How Fortran Passes Subprogram Parameters

The Fortran compiler generates code to pass parameters according to simple, uniform rules, and it generates subprogram code that expects parameters to be passed according to these rules. When calling non-Fortran functions, you must know how parameters will be passed; and when calling Fortran subprograms from other languages, you must cause the other language to pass parameters correctly.

Normal Treatment of Parameters

Every parameter passed to a subprogram, regardless of its data type, is passed as the address of the actual parameter value in memory. This simple rule is extended for two special cases:

• The length of each CHARACTER*n parameter (when n>1) is passed as an additional INTEGER value, following the explicit parameters.

• When a function returns a CHARACTER*n (n>1) value, the address of the space to receive the result is passed as the first parameter to the function and the length of the result space is passed as the second parameter, preceding all explicit parameters.

Example 3-1    Example Subroutine Call

      COMPLEX*8 cp8
      CHARACTER*16 creal, cimag
      CALL CPXASC(creal,cimag,cp8)

The code generated from the CALL in Example 3-1 prepares the following five argument values:

1. The address of creal
2. The address of cimag
3. The address of cp8
4. The length of creal, an integer value of 16
5. The length of cimag, an integer value of 16

Example 3-2    Example Function Call

      CHARACTER*8 symbl,picksym
      CHARACTER*100 sentence
      INTEGER nsym
      symbl = picksym(sentence,nsym)

The code generated from the function call in Example 3-2 prepares the following five argument values:

1. The address of variable symbl, the function result space
2. The length of symbl, an integer value of 8
3. The address of sentence, the first explicit parameter
4. The address of nsym, the second explicit parameter
5. The length of sentence, an integer value of 100

You can force changes in these conventions using %VAL and %LOC; this is covered under "Calls to C Using LOC%, REF% and VAL%" on page 45.

Calling Fortran from C

There are two types of callable Fortran subprograms: subroutines and functions (these units are documented in the MIPSpro Fortran 77 Language Reference Manual). In C terminology, both types of subprogram are external functions. The difference is the use of the function return value from each.

Calling Fortran Subroutines from C

From the standpoint of a C module, a Fortran subroutine is an external function returning int. The integer return value is normally ignored by a C caller (its meaning is discussed in "Alternate Subroutine Returns" on page 38).

The following two examples show a simple Fortran subroutine and a sketch of a call to it.

Example 3-3    Example Fortran Subroutine with COMPLEX Parameters

      SUBROUTINE ADDC32(Z,A,B,N)
      COMPLEX*32 Z(1),A(1),B(1)
      INTEGER N,I
      DO 10 I = 1,N
         Z(I) = A(I) + B(I)
10    CONTINUE
      RETURN
      END

Example 3-4    C Declaration and Call with COMPLEX Parameters

typedef struct{ long double real, imag; } cpx32;
extern int addc32_(cpx32*pz, cpx32*pa, cpx32*pb, int*pn);
cpx32 z[MAXARRAY], a[MAXARRAY], b[MAXARRAY];
...
int n = MAXARRAY;
(void)addc32_(z, a, b, &n);

The Fortran subroutine in Example 3-3 is named in Example 3-4 using lowercase letters and a terminal underscore. It is declared as returning an integer. For clarity, the actual call is cast to (void) to show that the return value is intentionally ignored.
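The two halves are then compiled separately and linked with the Fortran driver, so that the Fortran run-time libraries are included automatically. A possible build sequence (the file names are our own illustration):

% cc -c cmain.c
% f77 -c addc32.f
% f77 -o test cmain.o addc32.o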
The trivial subroutine in the following example takes adjustable-length character parameters.

Example 3-5    Example Fortran Subroutine with String Parameters

      SUBROUTINE PRT(BEF,VAL,AFT)
      CHARACTER*(*)BEF,AFT
      REAL VAL
      PRINT *,BEF,VAL,AFT
      RETURN
      END

Example 3-6    C Program that Passes String Parameters

typedef char fstr_16[16];
extern int prt_(fstr_16*pbef, float*pval, fstr_16*paft,
                int lbef, int laft);
main()
{
    float val = 2.1828e0;
    fstr_16 bef,aft;
    strncpy(bef,"Before..........",sizeof(bef));
    strncpy(aft,"...........After",sizeof(aft));
    (void)prt_(bef,&val,aft,sizeof(bef),sizeof(aft));
}

The C program in Example 3-6 prepares CHARACTER*16 values and passes them to the subroutine in Example 3-5. Observe that the subroutine call requires five parameters, including the lengths of the two string parameters. In Example 3-6, the string length parameters are generated using sizeof(), derived from the typedef fstr_16.

Example 3-7    C Program that Passes Different String Lengths

extern int prt_(char*pbef, float*pval, char*paft,
                int lbef, int laft);
main()
{
    float val = 2.1828e0;
    char *bef = "Start:";
    char *aft = ":End";
    (void)prt_(bef,&val,aft,strlen(bef),strlen(aft));
}

When the Fortran code does not require a specific length of string, the C code that calls it can pass an ordinary C character vector, as shown in Example 3-7, where the string length values are calculated dynamically using strlen().

Alternate Subroutine Returns

In Fortran, a subroutine can be defined with one or more asterisks ( * ) in the position of dummy parameters. When such a subroutine is called, the places of these parameters in the CALL statement are filled with statement numbers or statement labels. The subroutine returns an integer that selects among the statement numbers, so that the subroutine call acts as both a call and a computed GOTO (for more details, see the discussions of the CALL and RETURN statements in the MIPSpro Fortran 77 Language Reference Manual).

Fortran does not generate code to pass statement numbers or labels to a subroutine. No actual parameters are passed to correspond to dummy parameters given as asterisks. When you code a C prototype for such a subroutine, simply ignore these parameter positions. A CALL statement such as

      CALL NRET (*1,*2,*3)

is treated exactly as if it were the computed GOTO written as

      GOTO (1,2,3), NRET()

The value returned by a Fortran subroutine is the value specified on the RETURN statement, and will vary between 0 and the number of asterisk dummy parameters in the subroutine definition.

Calling Fortran Functions from C

A Fortran function returns a scalar value as its explicit result. This corresponds exactly to the C concept of a function with an explicit return value. When the Fortran function returns any type shown in Table 3-1 other than CHARACTER*n (n>1), you can call the function from C and handle its return value exactly as if it were a C function returning that data type.

Example 3-8    Fortran Function Returning COMPLEX*16

      COMPLEX*16 FUNCTION FSUB16(INP)
      COMPLEX*16 INP
      FSUB16 = INP
      END

The trivial function shown in Example 3-8 accepts and returns COMPLEX*16 values. Although a COMPLEX value is declared as a structure in C, it can be used as the return type of a function.
Example 3-9    C Program that Receives COMPLEX Return Value

typedef struct{ double real, imag; } cpx16;
extern cpx16 fsub16_( cpx16 * inp );
main()
{
    cpx16 inp = { -3.333, -5.555 };
    cpx16 oup = { 0.0, 0.0 };
    printf("testing fsub16...");
    oup = fsub16_( &inp );
    if ( inp.real == oup.real && inp.imag == oup.imag )
        printf("Ok\n");
    else
        printf("Nope\n");
}

The C program in Example 3-9 shows how the function in Example 3-8 is declared and called. Observe that the parameters to a function, like the parameters to a subroutine, are passed as pointers, but the value returned is a value, not a pointer to a value.

Note: In IRIX 5.3 and earlier, you cannot call a Fortran function that returns COMPLEX (although you can call one that returns any other arithmetic type). The register conventions used by compilers prior to IRIX 6.0 do not permit returning a structure value from a Fortran function to a C caller.

Example 3-10    Fortran Function Returning CHARACTER*16

      CHARACTER*16 FUNCTION FS16(J,K,S)
      CHARACTER*16 S
      INTEGER J,K
      FS16 = S(J:K)
      RETURN
      END

The function in Example 3-10 has a CHARACTER*16 return value. When a Fortran function returns a CHARACTER*n (n>1) value, the returned value is not the explicit result of the function. Instead, you must pass the address and length of the result area as the first two parameters of the function.

Example 3-11    C Program that Receives CHARACTER*16 Return

typedef char fstr_16[16];
extern void fs16_ (fstr_16 *pz, int lz, int *pj, int *pk,
                   fstr_16 *ps, int ls);
main()
{
    char work[64];
    fstr_16 inp,oup;
    int j=7;
    int k=11;
    strncpy(inp,"0123456789abcdef",sizeof(inp));
    fs16_ ( oup, sizeof(oup), &j, &k, inp, sizeof(inp) );
    strncpy(work,oup,sizeof(oup));
    work[sizeof(oup)] = '\0';
    printf("FS16 returns <%s>\n",work);
}

The C program in Example 3-11 calls the function in Example 3-10. The address and length of the function result are the first two parameters of the function. (Since type fstr_16 is an array, its name, oup, evaluates to the address of its first element.) The next three parameters are the addresses of the three named parameters, and the final parameter is the length of the string parameter.

Calling C from Fortran

In general, you can call units of C code from Fortran as if they were written in Fortran, provided that the C modules follow the Fortran conventions for passing parameters (see "How Fortran Passes Subprogram Parameters" on page 33). When the C program expects parameters passed using other conventions, you can either write special forms of CALL, or you can build a "wrapper" for the C functions using the mkf2c command.

Normal Calls to C Functions

The C function in this section is written to use the Fortran conventions for its name (lowercase with final underscore) and for parameter passing.

Example 3-12    C Function Written to be Called from Fortran

/*
|| C functions to export the facilities of strtoll()
|| to Fortran 77 programs. Effective Fortran declaration:
||
||    INTEGER*8 FUNCTION ISCAN(S,J)
||    CHARACTER*(*) S
||    INTEGER J
||
|| String S(J:) is scanned for the next signed long value
|| as specified by strtoll(3c) for a "base" argument of 0
|| (meaning that octal and hex literals are accepted).
||
|| The converted long long is the function value, and J is
|| updated to the nonspace character following the last
|| converted character, or to 1+LEN(S).
||
|| Note: if this routine is called when S(J:J) is neither
|| whitespace nor the initial of a valid numeric literal,
|| it returns 0 and does not advance J.
*/
#include <ctype.h>    /* for isspace() */
long long iscan_(char *ps, int *pj, int ls)
{
    int scanPos, scanLen;
    long long ret = 0;
    char wrk[1024];
    char *endpt;
    /* when J>LEN(S), do nothing, return 0 */
    if (ls >= *pj)
    {
        /* convert J to origin-0, permit J=0 */
        scanPos = (0 < *pj)? *pj-1 : 0 ;
        /* calculate effective length of S(J:) */
        scanLen = ls - scanPos;
        /* copy S(J:) and append a null for strtoll() */
        strncpy(wrk,(ps+scanPos),scanLen);
        wrk[scanLen] = '\0';
        /* scan for the integer */
        ret = strtoll(wrk, &endpt, 0);
        /*
        || Advance over any whitespace following the number.
        || Trailing spaces are common at the end of Fortran
        || fixed-length char vars.
        */
        while(isspace(*endpt)) { ++endpt; }
        *pj = (endpt - wrk)+scanPos+1;
    }
    return ret;
}

The following program demonstrates a call to the function in Example 3-12.

      EXTERNAL ISCAN
      INTEGER*8 ISCAN
      INTEGER*8 RET
      INTEGER J,K
      CHARACTER*50 INP
      INP = '1 -99 3141592 0xfff 033 '
      J = 0
      DO 10 WHILE (J .LT. LEN(INP))
         K = J
         RET = ISCAN(INP,J)
         PRINT *, K,': ',RET,' -->',J
10    CONTINUE
      END

Using Fortran COMMON in C Code

A C function can refer to the contents of a COMMON block defined in a Fortran program. The name of the block as given in the COMMON statement is altered as described in "How Fortran Treats Subprogram Names" on page 28 (that is, forced to lowercase and extended with an underscore). The name of the "blank common" is _BLNK__ (one leading, two final underscores). To refer to the contents of a common block, take these steps:

• Declare a C structure whose fields have the appropriate data types to match the successive elements of the Fortran common block. (See Table 3-1 for corresponding data types.)

• Declare the common block name as an external structure of that type, as shown in the following example.

Example 3-13    Common Block Usage in Fortran and C

      INTEGER STKTOP,STKLEN,STACK(100)
      COMMON /WITHC/STKTOP,STKLEN,STACK

struct fstack {
    int stktop, stklen;
    int stack[100];
};
extern struct fstack withc_;
int peektop_()
{
    if (withc_.stktop) /* stack not empty */
        return withc_.stack[withc_.stktop-1];
    else ...
}

Using Fortran Arrays in C Code

As described under "Corresponding Array Elements" on page 32, a C program must take special steps to access arrays created in Fortran.

Example 3-14    Fortran Program Sharing an Array in Common with C

      INTEGER IMAT(10,100),R,C
      COMMON /WITHC/IMAT
      R = 74
      C = 6
      CALL CSUB(C,R,746)
      PRINT *,IMAT(6,74)
      END

The Fortran fragment in Example 3-14 prepares a matrix in a common block, then calls a C subroutine to modify the array.

Example 3-15    C Subroutine to Modify a Common Array

extern struct { int imat[100][10]; } withc_;
int csub_(int *pc, int *pr, int *pval)
{
    withc_.imat[*pr-1][*pc-1] = *pval;
    return 0; /* all Fortran subrtns return int */
}

The C function in Example 3-15 stores its third argument in the common array using the subscripts passed in the first two arguments. In the C function, the order of the dimensions of the array is reversed; the subscript values are reversed to match, and decremented by 1 to match the C assumption of 0-origin indexing.
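When many elements are accessed this way, the reversal and origin shift can be hidden behind a macro. This is our own sketch, not part of the manual's examples, and the macro name is hypothetical:

/* Fortran-style access to the array of Example 3-15:
   IMAT_C(i,j) names the element Fortran calls IMAT(i,j). */
#define IMAT_C(i,j) (withc_.imat[(j)-1][(i)-1])

With this definition, the assignment in Example 3-15 could be written IMAT_C(*pc,*pr) = *pval.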
Calls to C Using LOC%, REF% and VAL%

Using the special intrinsic functions %VAL, %REF, and %LOC, you can pass parameters in ways other than the standard Fortran conventions described under "How Fortran Passes Subprogram Parameters" on page 33. These intrinsic functions are documented in the MIPSpro Fortran 77 Language Reference Manual.

Using %VAL

%VAL is used in parameter lists to cause parameters to be passed by value rather than by reference. Examine the following function prototype (from the random(3b) reference page):

char *initstate(unsigned int seed, char *state, int n);

This function takes an integer value as its first parameter. Fortran would normally pass the address of an integer value, but %VAL can be used to make it pass the integer itself. Example 3-16 demonstrates a call to function initstate() and the other functions of the random() group.

Example 3-16    Fortran Function Calls Using %VAL

C     declare the external functions in random(3b)
C     random() returns i*4, the others return char*
      EXTERNAL RANDOM$, INITSTATE$, SETSTATE$
      INTEGER*4 RANDOM$
      INTEGER*8 INITSTATE$,SETSTATE$
C     We use "states" of 128 bytes, see random(3b)
C     Note: An undocumented assumption of random() is that
C     a "state" is dword-aligned! Hence, use a common.
      CHARACTER*128 STATE1, STATE2
      COMMON /RANSTATES/STATE1,STATE2
C     working storage for state pointers
      INTEGER*8 PSTATE0, PSTATE1, PSTATE2
C     initialize two states to the same value
      PSTATE0 = INITSTATE$(%VAL(8191),STATE1)
      PSTATE1 = INITSTATE$(%VAL(8191),STATE2)
      PSTATE2 = SETSTATE$(%VAL(PSTATE1))
C     pull 8 numbers from state 1, print
      DO 10 I=1,8
         PRINT *,RANDOM$()
10    CONTINUE
C     set the other state, pull 8 numbers & print
      PSTATE1 = SETSTATE$(%VAL(PSTATE2))
      DO 20 I=1,8
         PRINT *,RANDOM$()
20    CONTINUE
      END

The use of %VAL(8191) or %VAL(PSTATE1) causes that value to be passed, rather than an address of that value.

Using %REF

%REF is used in parameter lists to cause parameters to be passed by reference, that is, to pass the address of a value rather than the value itself. Passing parameters by reference is the normal behavior of Silicon Graphics Fortran 77 compilers, so there is no effective difference between writing %REF(parm) and writing parm alone in a parameter list. However, this may not be the case with Fortran compilers from other manufacturers, where %REF(parm) might be effective and different from parm alone. Hence, when calling a C function that expects the address of a value rather than the value itself, you can write %REF(parm) simply as documentation of the kind of parameter. Examine this C prototype (see the gmatch(3G) reference page):

int gmatch (const char *str, const char *pattern);

This function gmatch() could be declared and called from Fortran as follows.

Example 3-17    Fortran Call to gmatch() Using %REF

      LOGICAL GMATCH$
      CHARACTER*8 FNAME,FPATTERN
      FNAME = 'foo.f\0'
      FPATTERN = '*.f\0'
      IF ( GMATCH$(%REF(FNAME),%REF(FPATTERN)) )...

The use of %REF() in Example 3-17 simply documents the fact that gmatch() expects addresses of character strings.

Note: The code in Example 3-17 passes two additional hidden parameters, the lengths of the two string parameters. A C function such as gmatch() would probably ignore these. However, they can be suppressed using %LOC, as discussed in the following topic.

Using %LOC

%LOC returns the address of its argument. It can be used in any expression (not only within parameter lists), and is often used to set POINTER variables.
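For instance, the following small sketch (ours, not the manual's; it assumes the compiler's Cray-style POINTER extension) aims a POINTER variable at a local buffer:

C     a sketch: setting a POINTER variable with %LOC
      INTEGER IBUF(100)
      POINTER (P, IPTEE)
      INTEGER IPTEE(100)
      P = %LOC(IBUF)
C     IPTEE(1) now refers to the same storage as IBUF(1)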
However, %LOC can also be used with %VAL to prevent passing the lengths of character values as hidden parameters. Refer again to the prototype of gmatch(). This function expects the addresses of two character strings in memory, but it is not written to expect the Fortran convention of also passing the lengths of character parameters.

Example 3-18    Fortran Call to gmatch() Using %VAL(%LOC())

      LOGICAL GMATCH$
      CHARACTER*8 FNAME,FPATTERN
      FNAME = 'foo.f\0'
      FPATTERN = '*.f\0'
      IF ( GMATCH$(%VAL(%LOC(FNAME)),%VAL(%LOC(FPATTERN))) )...

The code fragment in Example 3-18 shows how to pass only the addresses. Each parameter consists of an address (%LOC) passed by value (%VAL). Since neither parameter is a character string as far as Fortran is concerned, the character string lengths are not passed as hidden parameters.

Making C Wrappers with mkf2c

The program mkf2c provides an alternate interface for C routines called by Fortran. (Some details of mkf2c are covered in the mkf2c(1) reference page.) The mkf2c program reads a file of C function prototype declarations and generates an assembly language module. This module contains one callable entry point for each C function. The entry point, or "wrapper," accepts parameters in the Fortran calling convention and passes the same values to the C function using the C conventions.

A simple case of a function used as input to mkf2c is

simplefunc (int a, double df)
{ /* function body ignored */ }

For this function, mkf2c (with no options) generates a wrapper function named simple_ (truncated to six characters, made lowercase, with an underscore appended). The wrapper function expects two parameters, an INTEGER and a REAL*8, passed according to Fortran conventions, that is, by reference. The code of the wrapper loads the values of the parameters into registers using C conventions for passing parameters by value, and calls simplefunc().
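On the Fortran side, the generated wrapper is then called like any other subprogram. A minimal sketch (ours, following the naming just described):

C     call the mkf2c-generated wrapper for simplefunc()
      INTEGER IA
      REAL*8 DF
      IA = 1
      DF = 2.5D0
      CALL SIMPLE(IA, DF)

Fortran passes both arguments by reference, and the wrapper converts them to the by-value int and double that simplefunc() expects.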
Parameter Assumptions by mkf2c

Since mkf2c processes only the C source, not the Fortran source, it treats the Fortran parameters based on the data types specified in the C function header. These treatments are summarized in Table 3-2.

Note: Through compiler release 6.0.2, mkf2c does not recognize the C data types "long long" and "long double" (INTEGER*8 and REAL*16). It treats arguments of these types as "long" and "double" respectively.

Table 3-2    How mkf2c Treats Function Arguments

Data Type in C Prototype        Treatment by Generated Wrapper Code
unsigned char                   Load CHARACTER*1 from memory to register, no sign extension
char                            Load CHARACTER*1 from memory to register; sign extension only when -signed is specified
unsigned short, unsigned int    Load INTEGER*2 or INTEGER*4 from memory to register, no sign extension
short                           Load INTEGER*2 from memory to register with sign extension
int, long                       Load INTEGER*4 from memory to register with sign extension
long long                       (Not supported through 6.0.2)
float                           Load REAL*4 from memory to register, extending to double unless -f is specified
double                          Load REAL*8 from memory to register
long double                     (Not supported through 6.0.2)
char name[], name[n]            Pass address of CHARACTER*n and pass length as integer parameter, as Fortran does
char *                          Copy CHARACTER*n value into allocated space, append null byte, pass address of copy

Character String Treatment by mkf2c

In Table 3-2, notice the different treatments for an argument declared as a character array and one declared as a character address (even though these two declarations are semantically the same in C).

When the C function expects a character address, mkf2c generates the code to dynamically allocate memory and to copy the Fortran character value, for its specified length, to memory. This creates a null-terminated string. In this case,

• The address passed to C points to allocated memory
• The length of the value is not passed as an implicit argument
• There is a terminating null byte in the value
• Changes in the string are not reflected back to Fortran

A character array is passed by mkf2c as a Fortran CHARACTER*n value. In this case,

• The address prepared by Fortran is passed to the C function
• The length of the value is passed as an implicit argument (see "Normal Treatment of Parameters" on page 34)
• The character array contains no terminating null byte (unless the Fortran programmer supplies one)
• Changes in the array by the C function will be visible to Fortran

Since the C function cannot declare the extra string-length parameter (if it declared the parameter, mkf2c would process it as an explicit argument), the C programmer has a choice of ways to access the string length. When the Fortran program always passes character values of the same size, the length parameter can simply be ignored. If its value is needed, the varargs macros can be used to retrieve it.

For example, if the C function prototype is specified as follows:

void func1 (char carr1[], int i, char *str, char carr2[]);

mkf2c passes a total of six parameters to C. The fifth parameter is the length of the Fortran value corresponding to carr1; the sixth is the length of carr2. The C function can use the varargs macros to retrieve these hidden parameters. mkf2c ignores the varargs macro va_alist appearing at the end of the parameter name list. When func1 is changed to use varargs, the C source file is as follows.

Example 3-19    C Function Using varargs

#include "varargs.h"
void func1 (char carr1[], int i, char *str, char carr2[], va_alist)
{}

The C routine would retrieve the lengths of carr1 and carr2, placing them in the local variables carr1_len and carr2_len, using code like the following fragment.
Example 3-20    C Code to Retrieve Hidden Parameters

va_list ap;
int carr1_len, carr2_len;
va_start(ap);
carr1_len = va_arg (ap, int);
carr2_len = va_arg (ap, int);

Restrictions of mkf2c

When mkf2c does not recognize the data type specified in the C function, it issues a warning message and generates code to simply pass the pointer passed by Fortran. It does this in the following cases:

• Any nonstandard data type name, for example a data type that might be declared using typedef or a data type defined as a macro

• Any structure argument

• Any argument with multiple indirection (two or more asterisks, for example char**)

Since mkf2c does not support structure-valued arguments, it does not support passing COMPLEX*n values.

Using mkf2c and extcentry

mkf2c understands only a limited subset of the C grammar. This subset includes common C syntax for function entry points, C-style comments, and function bodies. However, it does not include constructs such as typedefs, external function declarations, or C preprocessor directives. To ensure that only the constructs understood by mkf2c are included in wrapper input, you need to place special comments around each function for which a Fortran-to-C wrapper is to be generated (see the example below).

Once these special comments, /* CENTRY */ and /* ENDCENTRY */, are placed around the code, use the program extcentry(1) before mkf2c to generate the input file for mkf2c.

Example 3-21    Source File for Use with extcentry

typedef unsigned short grunt [4];
struct { long l,ll; char *str; } bar;
main ()
{
    int kappa =7;
    foo (kappa,bar.str);
}
/* CENTRY */
foo (integer, cstring)
int integer;
char *cstring;
{
    if (integer==1) printf("%s",cstring);
}
/* ENDCENTRY */

Example 3-21 illustrates the use of extcentry. It shows the C file foo.c containing the function foo, which is to be made Fortran callable. The special comments /* CENTRY */ and /* ENDCENTRY */ surround the section that is to be made Fortran callable. To generate the assembly language wrapper foowrp.s from the file foo.c, use the following commands:

% extcentry foo.c foowrp.fc
% mkf2c foowrp.fc foowrp.s

The programs mkf2c and extcentry are found in the directory /usr/bin.

Makefile Considerations

make(1) contains default rules to help automate the control of wrapper generation. The following example of a makefile illustrates the use of these rules. In the example, an executable object file is created from the files main.f (a Fortran main program) and callc.c:

test: main.o callc.o
	f77 -o test main.o callc.o
callc.o: callc.fc
clean:
	rm -f *.o test *.fc

In this program, main calls a C routine in callc.c. The extension .fc has been adopted for Fortran-to-call-C wrapper source files. The wrappers created from callc.fc will be assembled and combined with the binary created from callc.c. Also, the dependency of callc.o on callc.fc will cause callc.fc to be recreated from callc.c whenever the C source file changes. (The programmer is responsible for placing the special comments for extcentry in the C source as required.)

Note: Options to mkf2c can be specified when make is invoked by setting the make variable F2CFLAGS. Also, do not create a .fc file for the modules that need wrappers created; these files are both created and removed by make in response to the file.o: file.fc dependency.

The makefile above controls the generation of wrappers and Fortran objects.
You can add modules to the executable object file in one of the following ways:

• If the file is a native C file whose routines are not to be called from Fortran using a wrapper interface, or if it is a native Fortran file, add the .o specification to the final make target and dependencies.

• If the file is a C file containing routines to be called from Fortran using a wrapper interface, place the comments for extcentry in the C source and add the .o file to the target list. In addition, put the dependency of the .o file on the .fc file in the makefile. This dependency is illustrated in the example makefile above, where callc.o depends on callc.fc.

Chapter 4

4. System Functions and Subroutines

This chapter describes extensions to Fortran 77 that are related to the IRIX compiler and operating system.

• "Library Functions" summarizes the Fortran run-time library functions.
• "Extended Intrinsic Subroutines" describes the extensions to the Fortran intrinsic subroutines.
• "Extended Intrinsic Functions" describes the extensions to the Fortran functions.

Library Functions

The Fortran library functions provide an interface from Fortran programs to the IRIX system functions. System functions are facilities provided by the IRIX system kernel directly, as opposed to functions supplied by library code linked with your program. System functions are documented in volume 2 of the reference pages, with an overview in the intro(2) reference page.

Table 4-1 summarizes the functions in the Fortran run-time library. In general, the name of the interface routine is the same as the name of the system function as it would be called from a C program. For details on any function, use the command

man 2 name_of_function

Note: You must declare the time function as EXTERNAL; if you do not, the compiler will assume you mean the VMS-compatible intrinsic time function rather than the IRIX system function. (In general it is a good idea to declare any library function in an EXTERNAL statement as documentation.)
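For instance, the following minimal sketch (ours, not from the manual) obtains the system time through the IRIX interface; without the EXTERNAL statement, the VMS-compatible intrinsic would be taken instead:

      EXTERNAL TIME
      INTEGER*4 TIME, NOW
      NOW = TIME()
      PRINT *, NOW
      END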
Table 4-1    Summary of System Interface Library Routines

Function          Purpose
abort             abnormal termination
access            determine accessibility of a file
acct              enable/disable process accounting
alarm             execute a subroutine after a specified time
barrier           perform barrier operations
blockproc         block processes
brk               change data segment space allocation
chdir             change default directory
chmod             change mode of a file
chown             change owner
chroot            change root directory for a command
close             close a file descriptor
creat             create or rewrite a file
ctime             return system time
dtime             return elapsed execution time
dup               duplicate an open file descriptor
etime             return elapsed execution time
exit              terminate process with status
fcntl             file control
fdate             return date and time in an ASCII string
fgetc             get a character from a logical unit
fork              create a copy of this process
fputc             write a character to a Fortran logical unit
free_barrier      free barrier
fseek             reposition a file on a logical unit
fseek64           reposition a file on a logical unit for 64-bit architecture
fstat             get file status
ftell             reposition a file on a logical unit
ftell64           reposition a file on a logical unit for 64-bit architecture
gerror            get system error messages
getarg            return command line arguments
getc              get a character from a logical unit
getcwd            get pathname of current working directory
getdents          read directory entries
getegid           get effective group ID
gethostid         get unique identifier of current host
getenv            get value of environment variables
geteuid           get effective user ID
getgid            get user or group ID of the caller
gethostname       get current host ID
getlog            get user's login name
getpgrp           get process group ID
getpid            get process ID
getppid           get parent process ID
getsockopt        get options on sockets
getuid            get user or group ID of caller
gmtime            return system time
iargc             return command line arguments
idate             return date or time in numerical form
ierrno            get system error messages
ioctl             control device
isatty            determine if unit is associated with tty
itime             return date or time in numerical form
kill              send a signal to a process
link              make a link to an existing file
loc               return the address of an object
lseek             move read/write file pointer
lseek64           move read/write file pointer for 64-bit architecture
lstat             get file status
ltime             return system time
m_fork            create parallel processes
m_get_myid        get task ID
m_get_numprocs    get number of subtasks
m_kill_procs      kill process
m_lock            set global lock
m_next            return value of counter
m_park_procs      suspend child processes
m_rele_procs      resume child processes
m_set_procs       set number of subtasks
m_sync            synchronize all threads
m_unlock          unset a global lock
mkdir             make a directory
mknod             make a directory/file
mount             mount a filesystem
new_barrier       initialize a barrier structure
nice              lower priority of a process
open              open a file
oserror           get/set system error
pause             suspend process until signal
perror            get system error messages
pipe              create an interprocess channel
plock             lock process, text, or data in memory
prctl             control processes
profil            execution-time profile
ptrace            process trace
putc              write a character to a Fortran logical unit
putenv            set environment variable
qsort             quick sort
read              read from a file descriptor
readlink          read value of symbolic link
rename            change the name of a file
rmdir             remove a directory
sbrk              change data segment space allocation
schedctl          call to scheduler control
send              send a message to a socket
setblockproccnt   set semaphore count
setgid            set group ID
sethostid         set current host ID
setoserror        set system error
setpgrp           set process group ID
setsockopt        set options on sockets
setuid            set user ID
sginap            put process to sleep
sginap64          put process to sleep in 64-bit environment
shmat             attach shared memory
shmdt             detach shared memory
sighold           raise priority and hold signal
sigignore         ignore signal
signal            change the action for a signal
sigpause          suspend until receive signal
sigrelse          release signal and lower priority
sigset            specify system signal handling
sleep             suspend execution for an interval
socket            create an endpoint for communication
sproc             create a new share group process
stat              get file status
stime             set time
symlink           make symbolic link
sync              update superblock
sysmp             control multiprocessing
sysmp64           control multiprocessing in 64-bit environment
system            issue a shell command
taskblock         block tasks
taskcreate        create a new task
taskctl           control task
taskdestroy       kill task
tasksetblockcnt   set task semaphore count
taskunblock       unblock task
time              return system time (must be declared EXTERNAL)
ttynam            find name of terminal port
uadmin            administrative control
ulimit            get and set user limits
ulimit64          get and set user limits in 64-bit architecture
umask             get and set file creation mask
umount            dismount a file system
unblockproc       unblock processes
unlink            remove a directory entry
uscalloc          shared memory allocator
uscalloc64        shared memory allocator in 64-bit environment
uscas             compare and swap operator
usclosepollsema   detach file descriptor from a pollable semaphore
usconfig          semaphore and lock configuration operations
uscpsema          acquire a semaphore
uscsetlock        conditionally set lock
usctlsema         semaphore control operations
usdumplock        dump lock information
usdumpsema        dump semaphore information
usfree            user shared memory allocation
usfreelock        free a lock
usfreepollsema    free a pollable semaphore
usfreesema        free a semaphore
usgetinfo         exchange information through an arena
usinit            semaphore and lock initialize routine
usinitlock        initialize a lock
usinitsema        initialize a semaphore
usmalloc          allocate shared memory
usmalloc64        allocate shared memory in 64-bit environment
usmallopt         control allocation algorithm
usnewlock         allocate and initialize a lock
usnewpollsema     allocate and initialize a pollable semaphore
usnewsema         allocate and initialize a semaphore
usopenpollsema    attach a file descriptor to a pollable semaphore
uspsema           acquire a semaphore
usputinfo         exchange information through an arena
usrealloc         user shared memory allocation
usrealloc64       user shared memory allocation in 64-bit environment
ussetlock         set lock
ustestlock        test lock
ustestsema        return value of semaphore
ustrace           trace
usunsetlock       unset lock
usvsema           free a resource to a semaphore
uswsetlock        set lock
wait              wait for a process to terminate
write             write to a file
Extended Intrinsic Subroutines

This section describes the intrinsic subroutines that are extensions to Fortran 77. (The intrinsic functions that are standard to Fortran 77 are documented in Appendix A of the MIPSpro Fortran 77 Language Reference Manual; the rules for using the names of intrinsic subroutines are also discussed in that appendix.) Table 4-2 gives an overview of the intrinsic subroutines and their functions; they are described in detail in the sections that follow.

Table 4-2    Overview of System Subroutines

Subroutine   Information Returned
DATE         Current date as nine-byte string in ASCII representation
IDATE        Current month, day, and year, each represented by a separate integer
ERRSNS       Description of the most recent error
EXIT         Terminates program execution
TIME         Current time in hours, minutes, and seconds as an eight-byte string in ASCII representation
MVBITS       Moves a bit field to a different storage location

DATE

The DATE routine returns the current date as set by the system; the format is as follows:

CALL DATE (buf)

where buf is a variable, array, array element, or character substring nine bytes long. After the call, buf contains an ASCII value in the format dd-mmm-yy, where dd is the date in digits, mmm is the month in alphabetic characters, and yy is the year in digits.

IDATE

The IDATE routine returns the current date as three integer values representing the month, date, and year; the format is as follows:

CALL IDATE (m, d, y)

where m, d, and y are either INTEGER*4 or INTEGER*2 values representing the current month, day, and year. For example, the values of m, d, and y on August 10, 1989, are

m = 8
d = 10
y = 89

ERRSNS

The ERRSNS routine returns information about the most recent program error; the format is as follows:

CALL ERRSNS (arg1, arg2, arg3, arg4, arg5)

The arguments (arg1, arg2, and so on) can be either INTEGER*4 or INTEGER*2 variables. On return from ERRSNS, the arguments contain the information shown in Table 4-3.

Table 4-3    Information Returned by ERRSNS

Argument   Contents
arg1       IRIX global variable errno, which is reset to zero after the call
arg2       Zero
arg3       Zero
arg4       Logical unit number of the file that was being processed when the error occurred
arg5       Zero

Although only arg1 and arg4 return relevant information, arg2, arg3, and arg5 are always required.

EXIT

The EXIT routine causes normal program termination and optionally returns an exit-status code; the format is as follows:

CALL EXIT (status)

where status is an INTEGER*4 or INTEGER*2 argument containing a status code.

TIME

The TIME routine returns the current time in hours, minutes, and seconds; the format is as follows:

CALL TIME (clock)

where clock is a variable, array, array element, or character substring; it must be eight bytes long. After execution, clock contains the time in the format hh:mm:ss, where hh, mm, and ss are numerical values representing the hour, the minute, and the second.

MVBITS

The MVBITS routine transfers a bit field from one storage location to another; the format is as follows:

CALL MVBITS (source,sbit,length,destination,dbit)

Table 4-4 defines the arguments. Arguments can be declared as INTEGER*2, INTEGER*4, or INTEGER*8.

Table 4-4    Arguments to MVBITS

Argument      Type                                Contents
source        Integer variable or array element   Source location of bit field to be transferred
sbit          Integer expression                  First bit position in the field to be transferred from source
length        Integer expression                  Length of the field to be transferred from source
destination   Integer variable or array element   Destination location of the bit field
dbit          Integer expression                  First bit in destination to which the field is transferred
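For example, this small sketch (ours, not from the manual) copies the low-order eight bits of ISRC into bits 8 through 15 of IDST, leaving the other bits of IDST unchanged:

      INTEGER*4 ISRC, IDST
      ISRC = 255
      IDST = 0
      CALL MVBITS(ISRC, 0, 8, IDST, 8)
C     IDST is now 65280 (bits 8 through 15 set)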
Extended Intrinsic Functions

Table 4-5 gives an overview of the intrinsic functions added as extensions of Fortran 77.

Table 4-5    Function Extensions

Function   Information Returned
SECNDS     Elapsed time as a floating point value in seconds. This is an intrinsic routine.
RAN        The next number from a sequence of pseudo-random numbers. This is not an intrinsic routine.

These functions are described in detail in the following sections.

SECNDS

SECNDS is an intrinsic routine that returns the number of seconds since midnight, minus the value of the passed argument; the format is as follows:

s = SECNDS(n)

After execution, s contains the number of seconds past midnight less the value specified by n. Both s and n are single-precision, floating point values.

RAN

RAN generates a pseudo-random number. The format is as follows:

v = RAN(s)

The argument s is an INTEGER*4 variable or array element. This variable serves as a seed in determining the next random number; it should initially be set to a large, odd integer value. You can compute multiple random number series by supplying different variables or array elements as the seed argument to different calls of RAN.

Note: Because RAN modifies the argument s, calling the function with a constant can cause a core dump.

The algorithm used in RAN is the linear congruential method. The code is similar to the following fragment:

S = S * 1103515245L + 12345
RAN = FLOAT(IAND(RSHIFT(S,16),32767))/32768.0

RAN is supplied for compatibility with VMS. For demanding applications, consider using the functions described in the random(3b) reference page. These can all be called using techniques described under "Using %VAL" on page 45.

Chapter 5

5. Scalar Optimizations

This chapter contains the following sections:

• "Overview" provides an overview of the scalar optimization command line options.
• "Performing General Optimizations" describes the general scalar optimizations you can enable from the command line.
• "Performing Advanced Optimizations" describes the advanced scalar optimizations you can enable from the command line.

Overview

You can use the compiler to perform various scalar optimizations by specifying any of the options listed in Table 5-1 from the command line. Specify the options in a comma-separated list following the –WK option, without any intervening blanks, as follows:

% f77 f77options -WK,option[,option] ... file

Note: These options specifically control optimizations performed by the Fortran front end. The defaults are usually sufficient; use these options when trying to obtain the last bit of performance from your code.
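As a concrete illustration (ours; any of the options in Table 5-1 can appear in the list), the following command enables loop fusion together with the scalar optimization level it requires:

% f77 -WK,-fuse,-so=2 source.f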
You can also initiate many of these optimizations with compiler directives (see Chapter 9, "Fine-Tuning Program Execution").

Table 5-1    Optimization Options

Long Name                             Short Name        Default Value
–aggressive=letter                    –ag=letter        option off
–arclimit=integer                     –arclm=integer    5000
–[no]assume=list                      –[n]as=list       CEL
–cacheline=integer                    –chl=integer      4
–cachesize=integer                    –chs=integer      256
–[no]directives=list                  –[n]dr=list       ackpv
–dpregisters=integer                  –dpr=integer      16
–each_invariant_if_growth=integer     –eiifg=integer    20
–fpregisters=integer                  –fpr=integer      16
–fuse                                 –fuse             option on with –scalaropt=2 or –optimize=5
–max_invariant_if_growth=integer      –miifg=integer    500
–optimize=integer                     –o=integer        depends on –O option
–recursion                            –rc               option on
–roundoff=integer                     –r=integer        depends on –O option
–scalaropt=integer                    –so=integer       depends on –O option
–setassociativity=integer             –sasc=integer     1
–unroll=integer                       –ur=integer       4
–unroll2=weight                       –ur2=weight       100

The –On option directly initiates basic optimizations; refer to Chapter 1, "Compiling, Linking, and Running Programs" for details.

Performing General Optimizations

This section discusses the general optimizations that you can enable.

Enabling Loop Fusion

The –fuse option enables loop fusion, an optimization that transforms two adjacent loops into a single loop. The use of data-dependence tests allows fusion of more loops than is possible with standard techniques. You must also specify –scalaropt=2 or –optimize=5 to enable loop fusion.

Controlling Global Assumptions

The –assume=list option (or –as=list) controls certain global assumptions of a program. You can also control most of these assumptions with various assertions (see Chapter 9, "Fine-Tuning Program Execution"). The default is –assume=cel.

list can contain the following characters:

a    Allows procedure argument aliasing, which occurs when different subroutine or function parameters refer to the same object. This practice is forbidden by the Fortran 77 standard; this option provides a method of dealing with programs that use argument aliasing anyway.

b    Allows array subscripts to go outside the declared bounds.

c    Places constants used in subroutine or function calls in temporary variables.

e    Allows variables in EQUIVALENCE statements to refer to the same memory location inside one DO loop nest.

l    Uses temporary variables within an optimized loop and assigns the last value to the original scalar, if the compiler determines that the scalar can be reused before it is assigned.

By default, the compiler assumes that a program conforms to the Fortran 77 standard, that is, –assume=el, and includes –assume=c to simplify some analysis and inlining. You can disable the default values by specifying the –noassume option.

Example

The following command compiles the Fortran program source.f, permitting argument aliasing and subscripts out of bounds:

% f77 -WK,-assume=ab source.f

Setting Invariant IF Floating Limits

When a loop contains an IF statement whose condition does not change from one iteration to another (loop-invariant), the compiler performs the same test for every iteration. The code can often be made more efficient by floating the IF statement out of the loop and putting the THEN and ELSE sections into their own loops. This process is called invariant IF floating. The –each_invariant_if_growth and –max_invariant_if_growth options control limits on invariant IF floating.
This process generally involves duplicating the body of the loop, which can increase the amount of code considerably.

The –each_invariant_if_growth=integer option (or –eiifg=integer) controls the rewriting of IF statements nested within loops. This option specifies a limit on the number of executable statements in a nested IF statement. If the number of statements in the loop exceeds this limit, the compiler does not rewrite the code; if there are fewer statements, the compiler improves execution speed by interchanging the loop and IF statements. Valid values for integer are from 0 to 100; the default is 20.

This process becomes complicated when there is other code in the loop, since a copy of the other code must be included in both the THEN and ELSE loops. For example, the following code:

DO I = ...
   section-1
   IF ( ) THEN
      section-2
   ELSE
      section-3
   ENDIF
   section-4
ENDDO

becomes

IF ( ) THEN
   DO I = ...
      section-1
      section-2
      section-4
   ENDDO
ELSE
   DO I = ...
      section-1
      section-3
      section-4
   ENDDO
ENDIF

When sections 1 and 4 are large, the extra code generated can slow a program down (through cache contention, extra paging, and so on) more than the reduced number of IF tests speeds it up. The –each_invariant_if_growth option provides a maximum size (in number of lines of executable code) of sections 1 and 4, below which the compiler will try to float an invariant IF statement outside a loop. This can be controlled on a loop-by-loop basis with the C*$* EACH_INVARIANT_IF_GROWTH (integer) directive within the source (see "Setting Invariant IF Floating Limits" in Chapter 9).

You can limit the total amount of additional code generated in a program unit through invariant IF floating by specifying the –max_invariant_if_growth option. The –max_invariant_if_growth=integer option (or –miifg=integer) specifies an upper bound on the total number of additional lines of code the compiler can generate in each program unit through invariant IF floating. This limit is applied on a per-subroutine basis. For example, if a subroutine is 400 lines long and –miifg=500, the compiler can add at most 100 lines in the process of invariant IF floating. The default for integer is 500.

Note: Other compiler optimizations can add or delete lines, so the final number of lines might differ from the value specified with –miifg. This can be controlled on a loop-by-loop basis with the C*$* MAX_INVARIANT_IF_GROWTH (integer) directive within the source (see "Setting Invariant IF Floating Limits" in Chapter 9).

Setting the Optimization Level

The –optimize=integer option (or –o=integer) sets the optimization level. Each optimization level is cumulative (that is, level 5 performs everything up to and including level 5). You can also modify the optimization level on a loop-by-loop basis by using the C*$* OPTIMIZE(integer) directive within the source (see "Optimization Level" in Chapter 9). Valid values for integer are:

0    Disables optimization.

1    Performs only simple optimizations. Enables induction variable recognition.

2    Performs lifetime analysis to determine when last-value assignment of scalars is necessary.

3    Recognizes triangular loops and attempts loop interchanging to improve memory referencing. Uses special-case data dependence tests. Also recognizes special index sets called wrap-around variables.

4    Generates two versions of a loop, if necessary, to break a data dependence arc.

5    Enables array expansion and loop fusion.
There is no default value for this option; if you do not specify it, this option can still be in effect through the –O option. Although higher optimization levels increase performance, they also increase compilation time.

The output of the following example is described for –optimize=1, –optimize=2, and –optimize=5 to illustrate the range of this option. (This example also uses –minconcurrent=0.)

      ASUM = 0.0
      DO 10 I = 1,M
      DO 10 J = 1,N
         ASUM = ASUM + A(I,J)
         C(I,J) = A(I,J) + 2.0
10    CONTINUE

At –optimize=1, the compiler sees the summation in ASUM as an intractable data dependence between iterations and does not try to optimize the loop. At –optimize=2 (perform lifetime analysis and do not interchange around reduction):

      ASUM = 0.
C$DOACROSS SHARE(M,N,A,C),LOCAL(I,J),REDUCTION(ASUM)
      DO 3 I=1,M
         DO 2 J=1,N
            ASUM = ASUM + A(I,J)
            C(I,J) = 2. + A(I,J)
 2       CONTINUE
 3    CONTINUE

Specifying –optimize=5 (loop interchange around reduction to improve memory referencing) produces the following:

      ASUM = 0.
C$DOACROSS SHARE(N,M,A,C),LOCAL(J,I),REDUCTION(ASUM)
      DO 3 J=1,N
         DO 2 I=1,M
            ASUM = ASUM + A(I,J)
            C(I,J) = 2. + A(I,J)
 2       CONTINUE
 3    CONTINUE

Controlling Variations in Round Off

The –roundoff=integer option (or –r=integer) controls the amount of variation in round-off error produced by optimization. If an arithmetic reduction is accumulated in a different order than in the scalar program, the round-off error is accumulated differently and the final result might differ from the output of the original program. Although the difference is usually insignificant, certain restructuring transformations performed by the compiler must be disabled to obtain exactly the same answers as the scalar program.

The values you can specify for integer are cumulative; for example, –roundoff=3 performs what is described for level 3 in addition to what is listed for the previous levels. Valid values for integer are:

0    Suppresses any transformations that change round-off error.

1    Performs expression simplification (which might generate various overflow or underflow errors) for expressions with operands between binary and unary operators, expressions that are inside trigonometric intrinsic functions returning integer values, and after forward substitution. Enables strength reduction. Performs intrinsic function simplification for max and min. Enables code floating if –scalaropt is at least 1. Allows loop interchanging around serial arithmetic reductions, if –optimize is at least 4. Allows loop rerolling, if –scalaropt is at least 2.

2    Allows loop interchanging around arithmetic reductions if –optimize is at least 4. For example, the floating point expression A/B/C is computed as A/(B*C).

3    Recognizes REAL (float) induction variables if –scalaropt is greater than 2 or –optimize is at least 1. Enables sum reductions. Enables memory management optimizations if –scalaropt=3 (see "Performing Memory Management Transformations" on page 84 for details about memory management transformations).

There is no default value for this option; if you do not specify it, this option can still be in effect through the –O option.

Example

Consider the following code segment:

      ASUM = 0.0
      DO 10 I = 1,M
      DO 10 J = 1,N
         ASUM = ASUM + A(I,J)
         C(I,J) = A(I,J) + 2.0
10    CONTINUE

When –roundoff=1, the compiler does not transform the summation reduction; it distributes the loop:

      ASUM = 0.
      DO 2 J=1,N
      DO 2 I=1,M
         ASUM = ASUM + A(I,J)
 2    CONTINUE
      DO 3 J=1,N
      DO 3 I=1,M
         C(I,J) = A(I,J) + 2.
 3    CONTINUE

When –roundoff=2 and –optimize=5 (reduction variable identification and loop interchange around arithmetic reduction), the original code becomes:

      ASUM = 0.
      DO 10 J=1,N
         DO 2 I=1,M
            ASUM = ASUM + A(I,J)
            C(I,J) = A(I,J) + 2.
 2       CONTINUE
10    CONTINUE

When –roundoff=3 and –optimize=5, the compiler recognizes REAL induction variables. In this example, the compiler performs forward substitution of the transformed induction variable X. The following code:

      ASUM = 0.0
      X = 0.0
      DO 10 I = 1,N
         ASUM = ASUM + A(I)*COS(X)
         X = X + 0.01
10    CONTINUE

becomes

      ASUM = 0.
      X = 0.
      DO 10 I=1,N
         ASUM = ASUM + A(I) * COS ((I - 1) * 0.01)
10    CONTINUE

Controlling Scalar Optimizations

The –scalaropt=integer option (or –so=integer) controls the level of scalar optimizations that the compiler performs. Valid values for integer are:

0    Disables all scalar optimizations.

1    Enables simple scalar optimizations—dead-code elimination, global forward substitution of variables, and conversion of IF-GOTO to IF-THEN-ELSE.

2    Enables the full range of scalar optimizations—floating invariant IF statements out of loops, loop rerolling and unrolling (if –roundoff is greater than zero), array expansion, loop fusion, loop peeling, and induction variable recognition.

3    Enables memory management transformations if –roundoff=3 (see "Performing Memory Management Transformations" on page 84 for details about memory management transformations). Performs dead-code elimination during output conversion.

There is no default value for this option; if you do not specify it, this option can still be in effect through the –O option.

Unlike the –scalaropt command line option, the C*$* SCALAR OPTIMIZE directive sets the level of loop-based optimizations (for example, loop fusion) only, not straight-code optimizations (for example, dead-code elimination). Refer to "Controlling Scalar Optimizations" in Chapter 9 for details about the C*$* SCALAR OPTIMIZE directive.

Using Vector Intrinsics

The nine intrinsic functions ASIN, ACOS, ATAN, COS, EXP, LOG, SIN, TAN, and SQRT have a scalar (element-by-element) version and a special version optimized for vectors. When you use -O3 optimization, the compiler uses the vector versions if it can. On the MIPS R8000 and R10000 processors, the vector function is significantly faster than the scalar version, but has a few restrictions on use.

Finding Vector Intrinsics

To apply the vector intrinsics, the compiler searches for loops of the following form:

      real a(10000), b(10000)
      do j = 1, 1000
        b(2*j) = sin(a(3*j))
      enddo

The compiler can recognize the eight functions ASIN, ACOS, ATAN, COS, EXP, LOG, SIN, and TAN when they are applied between elements of named variables in a loop (SQRT is not recognized automatically). The compiler automatically replaces the loop with a single call to a special, vectorized version of the function.

The compiler cannot use the vector intrinsic when the input is based on a temporary result or when the output replaces the input. In the following example, only certain functions can be vectorized.
      real a(400,400), b(400,400), c(400,400), d(400,400)
      call xx(a,b,c,d)
      do j = 100,300,2
        do i = 100, 300,3
          a(i,j) = 1.23*i + a(i,j)
          b(i,j) = sin(a(i,j) + 1.0)
          a(i,j) = log(a(i,j))
          c(i,j) = sin(c(i,j)) / cos(d(i,j))
          d(i+30,j-10) = tan( d(j,i) )
        enddo
      enddo
      call xx(a,b,c,d)
      end

In the preceding code:

• The first SIN call is applied to a temporary value and cannot be vectorized
• The LOG call can be vectorized
• Results from the second SIN call and the first COS call are used in temporary expressions and cannot be vectorized
• The TAN call can be vectorized

Limitations of the Vector Intrinsics

The vector intrinsics are limited in the following ways:

• The SQRT function is not used automatically in the current release (but it can be called directly; see "Calling Vector Functions Directly" on page 81).

• The single-precision COS, SIN, and TAN functions are valid only for arguments whose absolute value is less than or equal to 2**28.

• The double-precision COS, SIN, and TAN functions are valid only for arguments whose absolute value is less than or equal to PI*2**19.

The vector functions assume that the input and output arrays either coincide completely or do not overlap. They do not check for partial overlap, and will produce unpredictable results if it occurs.

Disabling Vector Intrinsics

If you need to disable the use of vector intrinsics while still compiling at the -O3 level, specify the option -OPT:vector_intrinsics=OFF:

f77 -64 -mips4 -O3 -OPT:vector_intrinsics=OFF trig.f

Calling Vector Functions Directly

The vector intrinsic functions are C functions that can be called directly using the techniques discussed under "Calls to C Using LOC%, REF% and VAL%" on page 45. The prototype of one function is as follows:

__vsinf( void*from, void*dest, int count, int fromstride, int deststride )

Note the two leading underscore characters in the name. The arguments are:

from          Address of the first element of the source array
dest          Address of the first element of the destination array
count         Number of elements to process
fromstride    Number of elements to advance in the source array
deststride    Number of elements to advance in the destination array

For example, the compiler converts a loop of this form:

      real a(10000), b(10000)
      do j = 1, 1000
        b(2*j) = sin(a(3*j))
      enddo

into nonlooping code of this form:

      real a(10000), b(10000)
      call __VSINF$(%REF(A(3)),%REF(B(2)),%VAL(1000),%VAL(3),%VAL(2))

All the vector intrinsic functions have the same prototype as the one shown above for __vsinf. The names of the available vector functions are shown in Table 5-2.

Table 5-2    Vector Intrinsic Function Names

Operation   REAL*4 Function Name   REAL*8 Function Name
acos        __vacosf               __vacos
asin        __vasinf               __vasin
atan        __vatanf               __vatan
cos         __vcosf                __vcos
exp         __vexpf                __vexp
log         __vlogf                __vlog
sin         __vsinf                __vsin
sqrt        __vsqrtf               __vsqrt
tan         __vtanf                __vtan

Performing Advanced Optimizations

This section describes advanced optimization techniques you can use to obtain maximum performance.

Using Aggressive Optimization

The –aggressive=letter option (or –ag=letter) performs optimizations that are normally forbidden. When using this option, your program must be a single file, so that the compiler can analyze all of it simultaneously. The only available value for letter is a, which instructs the compiler to add padding to Fortran COMMON blocks. This optimization provides favorable alignments of the virtual addresses. This option does not have a default value. The following command selects it:
% f77 -WK,-ag=a program.f

For example, on a machine with a 64-kilobyte direct-mapped cache, a COMMON definition such as:

      COMMON /alpha/ a(128,128),b(128,128),c(128,128)

can degrade performance if your program contains the following statement:

      a(i,j) = b(i,j) * c(i,j)

All three of the arrays a, b, and c have the same starting virtual address modulo the cache size, and so every access to the array elements causes a cache miss. It would be much better to add some padding between each of the arrays to force the virtual addresses to be different. The –aggressive=a option does exactly this. Unfortunately, this transformation is not always possible. Fortran allows different routines to have different definitions of COMMON. If some other routine contained the definition

      COMMON /alpha/ scratch(49152)

the compiler could not arbitrarily add padding. Therefore, when using this option the entire program must be in a single source file, so the compiler can check for this sort of occurrence.

Controlling Internal Table Size

The –arclimit=integer option (or –arclm=integer) sets the size of the internal table that the compiler uses to store data dependence information. The default value for integer is 5000. The compiler dynamically allocates the dependence data structure on a loop-nest-by-loop-nest basis. If a loop contains too many dependence relationships and cannot be represented in the dependence data structure, the compiler will stop analyzing the loop. Increasing the value of –arclimit allows the compiler to analyze larger loops.

Note: The number of data dependencies (and the time required to do the analysis) is potentially non-linear in the length of the loop. Very long loops (several hundred lines) may be impossible to analyze regardless of the value of –arclimit.

You can use the –arclimit option to increase the size of the data structure to enable the compiler to perform more optimizations. (Most users do not need to change this value.)

Performing Memory Management Transformations

Memory management transformations are advanced optimizations you can enable by specifying options along with the –WK option.

Memory Management Techniques

When both –roundoff and –scalaropt are set to 3, the compiler attempts to perform outer loop unrolling (to improve register utilization) and automatic loop blocking (to improve cache utilization).

Normal loop unrolling (enabled with the –unroll and –unroll2 options) applies to the innermost loop in a nest of loops. In outer loop unrolling, one of the other loops (typically the next innermost) is unrolled. In certain situations, this technique (also called “unroll and jam”) can greatly improve the register utilization.

Loop blocking is a transformation that can be applied when the loop nesting depth is greater than the dimensions of the data arrays being manipulated. For example, the simple matrix multiply uses a nest of three loops operating on two-dimensional arrays. The simple approach repeatedly sweeps across the entire arrays. A better approach is to break the arrays up into blocks, each block being small enough to fit into the cache, and then make repeated sweeps over each (in cache) block. (This technique is also sometimes called “tiles” or “tiling.”) However, the code needed to implement a block style algorithm is often very complex and messy.
This automatic transformation allows you to write the simpler method and have the compiler transform it into the more complex and efficient block method.

Memory Management Options

The compiler recognizes the following memory management command line options when specified with the –WK option:

•  –cacheline specifies the width of the memory channel between cache and main memory.

•  –cachesize specifies the data cache size.

•  –fpregisters specifies an unrolling factor.

•  –dpregisters ensures that registers do not overflow during loop unrolling.

•  –setassociativity specifies which memory management transformation to use.

The –cacheline=integer option (or –chl=integer) specifies the width of the memory channel, in bytes, between the cache and main memory. The default value for integer is 4. Refer to Table 5-3 for the recommended setting for your machine.

The –cachesize=integer option (or –chs=integer) specifies the size of the data cache, in kilobytes, for which to optimize. The default value for integer is 256 kilobytes. Refer to Table 5-3 for the recommended setting for your machine. You can obtain the cache size for a given machine with the hinv(1) command. This option is generally useful only in conjunction with the other memory management transformations.

Table 5-3  Recommended Cache Option Settings

Machine                                    Cacheline Value   Cache Size Value
POWER Series 4D/100                        16                64
POWER Series 4D/200                        64                64
R4000 (including Crimson™ and Indigo2™)    16                8
CHALLENGE™ and POWER CHALLENGE™ Series     128               16

The –setassociativity=integer option (or –sasc=integer) provides information on the mapping of physical addresses in main memory to cache pages. The default value for integer, 1, says a datum in main memory can be put in only one place in the cache. If this cache page is already in use, its contents must be rewritten or flushed so that the newly accessed page can be copied into the cache. SGI recommends you set this value to 1 for all machines, except the POWER CHALLENGE series, where you should set it to 4.

The –dpregisters=integer option (or –dpr=integer) specifies the number of DOUBLE PRECISION registers each processor has. The –fpregisters option (or –fpr=integer) specifies the number of single-precision (that is, ordinary floating point) registers each processor has. Silicon Graphics recommends you specify the same value for both –dpregisters and –fpregisters. The default values for integer are 16 for both options.

When compiling in 32-bit mode, SGI recommends that you do not specify 16, although that is what the hardware supports. It is better to specify a smaller value for integer, like 12, to provide extra registers in case the compiler needs them. In 64-bit mode, where the hardware supports 32 registers, specify 28 for integer.

Enabling Loop Unrolling

The –unroll and –unroll2 options control how the compiler unrolls scalar loops. When loops cannot be optimized for concurrent execution, loop execution is often more efficient when the loops are unrolled. (Fewer iterations with more work per iteration require less overhead overall.) You must also specify –scalaropt=2 when using these options.

The –unroll=integer (or –ur=integer) option directs the compiler to unroll inner loops. integer specifies the number of times to replicate the loop. The default value is 4.

0    Uses default values to unroll.

1    Disables unrolling.

2-n  Unrolls, at most, this many iterations.
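For example, the following command line (the file name is illustrative) enables the full range of scalar optimizations and unrolls inner loops up to eight times:

% f77 -WK,-scalaropt=2,-unroll=8 prog.f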
The –unroll2=weight (or –ur2=weight) option specifies an upper bound on the number of operations in a loop when unrolling it with the –unroll option. The default value for weight is 100. The compiler unrolls an inner loop until the number of operations (the amount of work) in the unrolled loop is close to this upper bound, or until the number of iterations specified in the –unroll option is reached, whichever occurs first.

For the –unroll2 option the compiler analyzes a given loop by computing an estimate of the computational work that is inside the loop for one iteration. This rough estimate is obtained by adding the number of:

•  assignments

•  IF statements

•  subscripts

•  arithmetic operations

The following example uses the C*$* UNROLL directive (see “Enabling Loop Unrolling” in Chapter 9) to specify 8 for the maximum number of iterations to unroll and 100 for the maximum “work per unrolled iteration.” (This is equivalent to specifying –WK,–unroll=8,–unroll2=100.)

C*$*UNROLL(8,100)
      DO 10 I = 2,N
         A(I) = B(I)/A(I-1)
 10   CONTINUE

This example has:

   1 assignment
   0 IF statements
   3 subscripts
   2 arithmetic operators
   -----------------------
   6 is the weighted sum (the work for 1 iteration)

This weighted sum is then divided into 100 to give a potential unrolling factor of 16. However, the example has also specified 8 for the maximum number of unrolled iterations. The compiler takes the minimum of the two values (8) and unrolls that many iterations. (The maximum number of iterations the compiler unrolls is 100.)

In this case (an unknown number of iterations), the compiler generates two loops: the primary unrolled loop and a cleanup loop to ensure that the number of iterations in the main loop is a multiple of the unrolling factor. The result is the following:

      INTEGER I1
C*$*UNROLL(8,100)
      I1 = MOD (N - 1, 8)
      DO 2 I=2,I1+1
         A(I) = B(I) / A(I-1)
 2    CONTINUE
      DO 10 I=I1+2,N,8
         A(I) = B(I)/A(I-1)
         A(I+1) = B(I+1) / A(I)
         A(I+2) = B(I+2) / A(I+1)
         A(I+3) = B(I+3) / A(I+2)
         A(I+4) = B(I+4) / A(I+3)
         A(I+5) = B(I+5) / A(I+4)
         A(I+6) = B(I+6) / A(I+5)
         A(I+7) = B(I+7) / A(I+6)
 10   CONTINUE

Recognizing Directives

The –directives=list option (or –dr=list) specifies which type of directives to accept. list can contain any combination of the following values:

a    Accepts Silicon Graphics C*$* ASSERT assertions.

c    Accepts Cray CDIR$ directives.

k    Accepts Silicon Graphics C*$* and C$PAR directives.

p    Accepts parallel programming directives.

s    Accepts Sequent C$ directives.

v    Accepts VAST CVD$ directives.

The default value for list is ackpv. For example, –WK,–directives=k enables Silicon Graphics directives only, whereas –WK,–directives=kas enables Silicon Graphics directives and assertions and Sequent directives.

To disable all of the above options, enter –nodirectives or –directives (without any values for list) on the command line. Chapter 9, “Fine-Tuning Program Execution,” describes the Silicon Graphics, Cray, Sequent, and VAST directives the compiler accepts.

Assertions are similar in form to directives, but they assert program characteristics that the compiler can use in its optimizations. In addition to specifying a in list, you can control whether the compiler accepts assertions using the C*$* ASSERTIONS and C*$* NOASSERTIONS directives (refer to “Using Assertions” in Chapter 9).
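As a minimal sketch of how the two directives might bracket a region (the loop shown is illustrative; refer to “Using Assertions” in Chapter 9 for the precise rules), assertions between the two directives are ignored, and assertions after the C*$* ASSERTIONS directive are honored again:

C*$* NOASSERTIONS
      DO 30 I = 1, N
         A(I) = B(I)
 30   CONTINUE
C*$* ASSERTIONS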
Specifying Recursion

The –recursion option (or –rc) allows subroutines and functions in the source program to be called recursively (that is, a subroutine or function calls itself, or it calls another routine that calls it). Recursion affects storage allocation decisions. This option is enabled by default. To disable it, specify –norecursion (or –nrc).

Unsafe transformations can occur unless the –recursion option is enabled for each recursive routine that the compiler processes.

Chapter 6

6. Inlining and Interprocedural Analysis

This chapter contains the following sections:

•  “Overview” describes inlining and interprocedural analysis.

•  “Using Command Line Options” explains how to use command line options to perform inlining and interprocedural analysis (IPA).

•  “Conditions That Prevent Inlining and IPA” lists several conditions that prevent inlining and interprocedural analysis.

Overview

Inlining is the process of replacing a function reference with the text of the function. This process eliminates the overhead of the function call and can assist other optimizations by making relationships between function arguments, returned values, and the surrounding code easier to find.

Interprocedural analysis (IPA) is the process of inspecting called functions for information on relationships between arguments, returned values, and global data. This process can provide many of the benefits of inlining without replacing the function reference.

You can perform inlining and IPA from the command line and using directives in your source code.

Using Command Line Options

The compiler performs inlining and IPA when you specify the options listed in Table 6-1 along with the –WK option using the following syntax:

% f77 [f77option ...] -WK,option[,option]... file

f77option is any option you can specify directly to the compiler and option is any of the options listed in Table 6-1.

Table 6-1  Inlining and IPA Options

Long Option Name               Short Option Name   Default Value
–inline[=list]                 –inl[=list]         option off
–ipa[=list]                    –ipa[=list]         option off
–inline_and_copy               –inlc               option off
–inline_looplevel=integer      –inll=integer       2
–ipa_looplevel=integer         –ipall=integer      2
–inline_depth=integer          –ind=integer        2
–inline_man                    –inm                option off
–ipa_man                       –ipam               option off
–inline_from_files=list        –inff=list          option off
–ipa_from_files=list           –ipaff=list         option off
–inline_from_libraries=list    –infl=list          option off
–ipa_from_libraries=list       –ipafl=list         option off
–inline_create[=name]          –incr[=name]        option off
–ipa_create[=name]             –ipacr[=name]       option off

Specifying Routines for Inlining or IPA

The –inline[=list] option (or –inl[=list]) provides a list of routines to be expanded inline; the –ipa[=list] option provides a list of routines to be analyzed. The routine names in list must be separated by colons. If you do not specify a list of routines, the compiler expands all eligible routines. The compiler looks for the routines in the current source file, unless you specify an –inline_from or –ipa_from option. Refer to “Specifying Where to Search for Routines” on page 97 for details.

Example

The following command performs inline expansion on the two routines saxpy and daxpy from the file foo.f:

% f77 -WK,-inline=saxpy:daxpy foo.f

Refer to “Conditions That Prevent Inlining and IPA” on page 100 for information about conditions that prevent inlining and IPA.
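To make the effect of inline expansion concrete, consider a hypothetical saxpy routine (the name and body here are illustrative, not taken from any library):

      SUBROUTINE SAXPY(N, A, X, Y)
      REAL A, X(N), Y(N)
      DO 10 I = 1, N
         Y(I) = Y(I) + A*X(I)
 10   CONTINUE
      END

When this routine is inlined, a reference such as CALL SAXPY(N, 2.0, X, Y) in the calling routine is replaced by the loop itself, with the dummy arguments replaced by the actual arguments. The call overhead disappears, and the optimizer can then work on the loop in the context of the surrounding code.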
The –inline_and_copy (or –inlc) option functions like the –inline option, except that the compiler copies the unoptimized text of a routine into the transformed code file each time the routine is called or referenced. Use this option when inlining routines that are called from the file in which they are located. This option has no special effect when the routines being inlined are taken from a library or separate source file. When a routine has been inlined everywhere it is used, leaving it unoptimized saves compilation time. When a program involves multiple source files, the unoptimized routine is still available in case another source file contains a reference to it.

Note: The –inline_and_copy algorithm assumes that all CALLs and references to the routine precede the routine itself in the source file. If the routine is referenced after the text of the routine and the compiler cannot inline that particular call site, it invokes the unoptimized version of the routine.

Specifying Occurrences for Inlining and IPA

The loop level, depth, and manual options allow you to specify which instances of the routines specified with the –inline or –ipa options to process.

Loop Level

The –inline_looplevel=integer (or –inll=integer) and –ipa_looplevel=integer (or –ipall=integer) options enable you to limit inlining and interprocedural analysis to routines that are referenced in deeply nested loops, where the reduced call overhead or enhanced optimization is multiplied. integer is defined from the most deeply nested leaf of the call graph. To determine which loops are most deeply nested, the compiler constructs a call graph to account for nesting of loops farther up the call chain.

For example, if you specify 1 for integer, the compiler expands routines in only the most deeply nested loop. If you specify 2 for integer, the compiler expands routines in the deepest and second-deepest nested loops, and so on. Specifying a large number for integer enables inlining/IPA at any nesting level up to and including the integer value. If you do not specify –inline_looplevel or –ipa_looplevel, the loop level is 2.

Example

Consider the following code:

      PROGRAM MAIN
      ..
      CALL A   ------>   SUBROUTINE A
                         ..
                         DO
                           DO
                             CALL B   ----->   SUBROUTINE B
                           ENDDO               DO
                         ENDDO                   DO
                                                   CALL C   ------->   SUBROUTINE C
                                                 ENDDO
                                               ENDDO

The CALL B is inside a doubly nested loop and is therefore more profitable for the compiler to expand than the CALL A. The CALL C is quadruply nested (counting the loops in both A and B), so inlining C yields the greatest gain of the three.

For –inline_looplevel=1, only the routines referenced in the most deeply nested call sites are inlined (subroutine C in the above example). (If more than one routine is called at the same loop nest level, the compiler selects all of them when that level is inlined/analyzed.) –inline_looplevel=2 inlines only routines called at the most deeply nested level and one loop less deeply nested. (–inline_looplevel=3 would be required to inline subroutine B, because its call is two loops less nested than the call to subroutine C. A value of 3 or greater causes the compiler to inline C into B, then the new B to be inlined into the main program.)

The calling tree written to the listing file includes the nesting depth level of each call in each program unit and the aggregate nesting depth (the sum of the nesting depths for each call site, starting from the main program). You can use this information to identify the best routines for inlining.
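For example, given the call graph above, a command line such as the following (the file name is illustrative) restricts inlining to routines referenced at the most deeply nested call sites, such as subroutine C:

% f77 -WK,-inline,-inline_looplevel=1 main.f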
A routine that passes the –inline_looplevel test is inlined everywhere it is used, even in places that are not in deeply nested loops. If some, but not all, invocations of a routine are to be expanded, use the C*$* INLINE or C*$* IPA directives just before each CALL/reference to be expanded (refer to “Fine-Tuning Inlining and IPA” in Chapter 9).

Because inlining increases the size of the code, the extra paging and cache contention can actually slow down a program. Restricting inlining to routines used in DO loops multiplies the benefits of eliminating subroutine and function call overhead for a given amount of code space expansion. (If inlining appears to have slowed an application code, investigate using IPA, which has little effect on code space and the number of temporary variables.)

Depth

The –inline_depth=integer option (or –ind=integer) restricts the number of times the compiler continues to attempt inlining already inlined routines. Valid values for integer are

1-10  Specifies a depth to which inlining is limited. The default is 2.

0     Uses the default value.

-1    Limits inline expansion to only those routines that do not reference other routines (that is, only leaf routines are inlined).

The compiler does not support any other negative values.

When a routine is expanded inline, it can contain references to other routines. The compiler must decide whether to recursively expand these references (which might themselves contain yet other references, and so on). This option limits the number of times the compiler performs this recursive expansion. Note that the default setting is quite low; if you know inlining is useful for a particular program, increase this setting.

Note: There is no –ipa_depth option. Recursive inlining can be quite expensive in compilation time. Exercise discretion in its use.

Manual Control

The –inline_man (or –inm) option enables recognition of the C*$* INLINE directive. This directive, described in “Fine-Tuning Inlining and IPA” in Chapter 9, allows you to select individual instances of routines to be inlined. The –ipa_man (or –ipam) option is the analogous option for the C*$* IPA directive.

Specifying Where to Search for Routines

The options listed in Table 6-2 tell the compiler where to search for the routines specified with the –inline or –ipa options. If you do not specify either option, the compiler searches the current source file by default.

Table 6-2  Inlining and IPA Search Command Line Options

Long Option Name               Short Option Name
–inline_from_files=list        –inff=list
–ipa_from_files=list           –ipaff=list
–inline_from_libraries=list    –infl=list
–ipa_from_libraries=list       –ipafl=list

If one of the names in list is a directory, the compiler uses all appropriate files in that directory. You can specify multiple files and directories simultaneously using a colon-separated list. For example:

-WK,-inline_from_files=file1:file2:file3

The compiler recognizes the type of file from its extension, or lack of one, as described in Table 6-3.
Table 6-3  Filename Extensions

Extension            Type of File
.f, .F, .for, .FOR   Fortran source
.i                   Fortran source run through cpp
.klib                Library created with –inline_create or –ipa_create option
Other                Directory

The compiler recognizes two special abbreviations when specified in list:

•  “-” means the current source file (as listed on the command line or specified in an –input=file command line option)

•  “.” means the current working directory

Example

The following command specifies inline expansion on the source file calc.f:

% f77 -WK,-inline,-inline_from_files=-:input.f calc.f

When executed, the compiler searches the current source file calc.f and input.f for routines to expand. It searches for all eligible routines because the –inline option was specified without a list.

If you specify a nonexistent file or directory, the compiler issues an error. If you specify multiple –inline_from or –ipa_from options, the compiler concatenates their lists to produce a bigger universe. The lists are searched in the order that they appear on the command line. The compiler resolves routine name references by searching for them in the order that they appear in –inline_from/–ipa_from options on the command line. Libraries are searched in their original lexical order.

Note: These options by themselves do not initiate inlining or IPA. They only specify where to look for the routines. Use them in conjunction with the appropriate –inline or –ipa option.

Creating Libraries

When performing inlining and IPA, the compiler analyzes the routines in the source program. Normally, inlining is done directly from a source file. However, when inlining the same set of routines in many different programs, it is more efficient to create a pre-analyzed library of the routines.

Use the –inline_create[=name] option (or –incr[=name]) to create a library of prepared routines (for later use with the –inline_from_libraries option). The compiler assigns name to the library file it creates; for maximum compatibility, use the filename extension .klib, for example: samp.klib. The –ipa_create[=name] option (or –ipacr[=name]) is the analogous option for IPA.

You do not have to generate your inlining/IPA library from the same source that will actually be linked into the running program. This capability can cause errors, but it can also be quite useful. For example, you can write a library of hand-optimized assembly language routines, then construct an IPA library using Fortran routines that mimic the behavior of the assembly code. Thus, you can do parallelism analysis with IPA correctly, but still actually call the hand-optimized assembly routines.

The procedure for creating and using a library for inlining or IPA is given below.

1. Create a library using the –inline_create option (or the –ipa_create option for IPA). For example, the following command line creates a library called prog.klib for the source program prog.f:

% f77 -WK,-inline_create=prog.klib prog.f

When you specify this option, the compiler creates only the library; it does not compile the source program or create a transformed version of the file.

2. Compile the program with inlining enabled and specify the new library:

% f77 -WK,-inl,-infl=prog.klib samp.f

Note: Libraries created for inlining contain complete information and can be used for both inlining and IPA. Libraries created for IPA contain only summary information and can be used only for IPA.
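The analogous IPA sequence uses the IPA forms of the same options. A sketch, with illustrative file names:

% f77 -WK,-ipa_create=prog.klib prog.f
% f77 -WK,-ipa,-ipafl=prog.klib samp.f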
When creating a library, you can specify only one –inline_create (–ipa_create) option. Therefore, you can create only one library at a time. The compiler overwrites any existing file with the same name as the library.

If you do not specify the –inline (–ipa) option along with the –inline_create (–ipa_create) option, the compiler includes all routines from the inlining universe in the library, if possible. If you specify –inline=list or –ipa=list, the compiler includes only the named routines in the library.

Conditions That Prevent Inlining and IPA

This section lists conditions that prevent the compiler from inlining and analyzing subroutines and functions, whether from a library or source file. Many constructs that prevent inlining will also stop or restrict interprocedural analysis.

Conditions that inhibit inlining:

•  Dummy and actual parameters are mismatched in type or class.

•  Dummy parameters are missing.

•  Actual parameters are missing and the corresponding dummy parameters are arrays.

•  An actual parameter is a non-scalar expression (for example, A+B, where A and B are arrays).

•  The number of actual parameters differs from the number of dummy parameters.

•  The size of an array actual parameter differs from the array dummy parameter and the arrays cannot be made linear.

•  The calling routine and called routine have mismatched COMMON declarations.

•  The called routine has EQUIVALENCE statements (some of these can be handled).

•  The called routine contains NAMELIST statements.

•  The called routine has dynamic arrays.

•  The CALL to be expanded has alternate return parameters.

Inlining is also inhibited when the routine to be inlined

•  is too long (the limit is about 600 lines)

•  contains a SAVE statement

•  contains variables that are live-on-entry, even if they are not in explicit SAVE statements

•  contains a DATA statement (DATA implies SAVE) and the variable is live-on-entry

•  contains a CALL with a subroutine or function name as an argument

•  contains a C*$* INLINE directive

•  contains unsubscripted array references in I/O statements

•  contains POINTER statements

Chapter 7

7. Fortran Enhancements for Multiprocessors

This chapter contains these sections:

•  “Overview” provides an overview of this chapter.

•  “Parallel Loops” discusses the concept of parallel DO loops.

•  “Writing Parallel Fortran” explains how to use compiler directives to generate code that can be run in parallel.

•  “Analyzing Data Dependencies for Multiprocessing” describes how to analyze DO loops to determine whether they can be parallelized.

•  “Breaking Data Dependencies” explains how to rewrite DO loops that contain data dependencies so that some or all of the loop can be run in parallel.

•  “Work Quantum” describes how to determine whether the work performed in a loop is greater than the overhead associated with multiprocessing the loop.

•  “Cache Effects” explains how to write loops that account for the effect of the cache.

•  “Advanced Features” describes features that override multiprocessing defaults and customize parallelism.

•  “DOACROSS Implementation” discusses how multiprocessing is implemented in a DOACROSS routine.

•  “PCF Directives” describes how the PCF directives implement a general model of parallelism.
Overview

The Silicon Graphics Fortran compiler allows you to apply the capabilities of a Silicon Graphics multiprocessor workstation to the execution of a single job. By coding a few simple directives, the compiler splits the job into concurrently executing pieces, thereby decreasing the wall-clock run time of the job. This chapter discusses techniques for analyzing your program and converting it to multiprocessing operations. Chapter 8, “Compiling and Debugging Parallel Fortran,” gives compilation and debugging instructions for parallel processing.

Parallel Loops

The model of parallelism used focuses on the Fortran DO loop. The compiler executes different iterations of the DO loop in parallel on multiple processors. For example, suppose a DO loop consisting of 200 iterations will run on a machine with four processors using the SIMPLE scheduling method (described in “CHUNK, MP_SCHEDTYPE” on page 108). The first 50 iterations run on one processor, the next 50 on another, and so on. The multiprocessing code adjusts itself at run time to the number of processors actually present on the machine. Thus, if the above 200-iteration loop were moved to a machine with only two processors, it would be divided into two blocks of 100 iterations each, without any need to recompile or relink. In fact, multiprocessing code can even be run on single-processor machines. The above loop would be divided into one block of 200 iterations. This allows code to be developed on a single-processor Silicon Graphics workstation, and later run on an IRIS POWER Series multiprocessor.

The processes that participate in the parallel execution of a task are arranged in a master/slave organization. The original process is the master. It creates zero or more slaves to assist. When a parallel DO loop is encountered, the master asks the slaves for help. When the loop is complete, the slaves wait on the master, and the master resumes normal execution. The master process and each of the slave processes are called a thread of execution or simply a thread. By default, the number of threads is set equal to the number of processors on the particular machine (this number cannot exceed four). If you want, you can override the default and explicitly control the number of threads of execution used by a Fortran job.

For multiprocessing to work correctly, the iterations of the loop must not depend on each other; each iteration must stand alone and produce the same answer regardless of when any other iteration of the loop is executed. Not all DO loops have this property, and loops without it cannot be correctly executed in parallel. However, many of the loops encountered in practice fit this model. Further, many loops that cannot be run in parallel in their original form can be rewritten to run wholly or partially in parallel.

To provide compatibility for existing parallel programs, Silicon Graphics has chosen to adopt the syntax for parallelism used by Sequent Computer Corporation. This syntax takes the form of compiler directives embedded in comments. These fairly high-level directives provide a convenient method for you to describe a parallel loop, while leaving the details to the Fortran compiler. For advanced users, the proposed Parallel Computing Forum (PCF) standard (ANSI-X3H5 91-0023-B Fortran language binding) is available (refer to “PCF Directives” on page 143).
Additionally, there are a number of special routines that permit more direct control over the parallel execution (refer to “Advanced Features” on page 133 for more information).

Writing Parallel Fortran

The Fortran compiler accepts directives that cause it to generate code that can be run in parallel. The compiler directives look like Fortran comments: they begin with a C in column one. If multiprocessing is not turned on, these statements are treated as comments. This allows the identical source to be compiled with a single-processing compiler or by Fortran without the multiprocessing option. The directives are distinguished by having a $ as the second character. There are six directives that are supported: C$DOACROSS, C$&, C$, C$MP_SCHEDTYPE, C$CHUNK, and C$COPYIN. The C$COPYIN directive is described in “Local COMMON Blocks” on page 138. This section describes the others.

C$DOACROSS

The essential compiler directive for multiprocessing is C$DOACROSS. This directive directs the compiler to generate special code to run iterations of a DO loop in parallel. The C$DOACROSS directive applies only to the next statement (which must be a DO loop). The C$DOACROSS directive has the form

C$DOACROSS [clause [[,] clause ...]]

where valid values for the optional clause are

[IF (logical_expression)]
[{LOCAL | PRIVATE} (item[,item ...])]
[{SHARED | SHARE} (item[,item ...])]
[{LASTLOCAL | LAST LOCAL} (item[,item ...])]
[REDUCTION (item[,item ...])]
[MP_SCHEDTYPE=mode]
[{CHUNK=integer_expression | BLOCKED(integer_expression)}]

The preferred form of the directive (as generated by WorkShop Pro MPF) uses the optional commas between clauses. This section discusses the meaning of each clause.

IF

The IF clause determines whether the loop is actually executed in parallel. If the logical expression is TRUE, the loop is executed in parallel. If the expression is FALSE, the loop is executed serially. Typically, the expression tests the number of times the loop will execute to be sure that there is enough work in the loop to amortize the overhead of parallel execution. Currently, the break-even point is about 4000 CPU clocks of work, which normally translates to about 1000 floating point operations.

LOCAL, SHARE, LASTLOCAL

The LOCAL, SHARE, and LASTLOCAL clauses specify lists of variables used within parallel loops. A variable can appear in only one of these lists. To make the task of writing these lists easier, there are several defaults. The loop-iteration variable is LASTLOCAL by default. All other variables are SHARE by default.
LASTLOCAL Specifies variables that are local to each process.Unlike with the LOCAL clause, the compiler saves only the value of the logically last iteration of the loop when it exits. The name LASTLOCAL is preferred over LAST LOCAL. LOCAL is a little faster than LASTLOCAL, so if you do not need the final value, it is good practice to put the DO loop index variable into the LOCAL list, although this is not required. Only variables can appear in these lists. In particular, COMMON blocks cannot appear in a LOCAL list (but see the discussion of local COMMON blocks in “Advanced Features” on page 133). The SHARE, LOCAL, and LASTLOCAL lists give only the names of the variables. If any member of the list is an array, it is listed without any subscripts. REDUCTION The REDUCTION clause specifies variables involved in a reduction operation. In a reduction operation, the compiler keeps local copies of the variables and combines them when it exits the loop. For an example and details see “Example 4: Sum Reduction” on page 123 of “Breaking Data Dependencies.” An element of the REDUCTION list must be an individual variable (also called a scalar variable) and cannot be an array. However, it 107 Chapter 7: Fortran Enhancements for Multiprocessors can be an individual element of an array. In a REDUCTION clause, it would appear in the list with the proper subscripts. One element of an array can be used in a reduction operation, while other elements of the array are used in other ways. To allow for this, if an element of an array appears in the REDUCTION list, the entire array can also appear in the SHARE list. The four types of reductions supported are sum(+), product(*), min(), and max(). Note that min(max) reductions must use the min(max) intrinsic functions to be recognized correctly. The compiler confirms that the reduction expression is legal by making some simple checks. The compiler does not, however, check all statements in the DO loop for illegal reductions. You must ensure that the reduction variable is used correctly in a reduction operation. CHUNK, MP_SCHEDTYPE The CHUNK and MP_SCHEDTYPE clauses affect the way the compiler schedules work among the participating tasks in a loop. These clauses do not affect the correctness of the loop. They are useful for tuning the performance of critical loops. See “Load Balancing” on page 131 for more details. For the MP_SCHEDTYPE=mode clause, mode can be one of the following: [SIMPLE | simple | STATIC | static] [DYNAMIC | dynamic] [INTERLEAVE | interleave | INTERLEAVED | interleaved] [GUIDED | guided | GSS | gss] [RUNTIME | runtime] You can use any or all of these modes in a single program. The CHUNK clause is valid only with the DYNAMIC and INTERLEAVE modes. SIMPLE, DYNAMIC, INTERLEAVE, GSS, and RUNTIME are the preferred names for each mode. The simple method (MP_SCHEDTYPE=SIMPLE) divides the iterations among processes by dividing them into contiguous pieces and assigning one piece to each process. 108 Writing Parallel Fortran In dynamic scheduling (MP_SCHEDTYPE=DYNAMIC) the iterations are broken into pieces the size of which is specified with the CHUNK clause. As each process finishes a piece, it enters a critical section to grab the next available piece. This gives good load balancing at the price of higher overhead. The interleave method (MP_SCHEDTYPE=INTERLEAVE) breaks the iterations into pieces of the size specified by the CHUNK option, and execution of those pieces is interleaved among the processes. 
Instead of the CHUNK option, you can specify the –WK,–chunk command line option (see “Memory Management Options” in Chapter 5 for details). For example, if there are four processes and CHUNK=2, then the first process will execute iterations 1–2, 9–10, 17–18, …; the second process will execute iterations 3–4, 11–12, 19–20,…; and so on. Although this is more complex than the simple method, it is still a fixed schedule with only a single scheduling decision. The fourth method is a variation of the guided self-scheduling algorithm (MP_SCHEDTYPE=GSS). Here, the piece size is varied depending on the number of iterations remaining. By parceling out relatively large pieces to start with and relatively small pieces toward the end, the system can achieve good load balancing while reducing the number of entries into the critical section. In addition to these four methods, you can specify the scheduling method at run time (MP_SCHEDTYPE=RUNTIME). Here, the scheduling routine examines values in your run-time environment and uses that information to select one of the other four methods. See “Advanced Features” on page 133 for more details. If both the MP_SCHEDTYPE and CHUNK clauses are omitted, SIMPLE scheduling is assumed. If MP_SCHEDTYPE is set to INTERLEAVE or DYNAMIC and the CHUNK clause are omitted, CHUNK=1 is assumed. If MP_SCHEDTYPE is set to one of the other values, CHUNK is ignored. If the MP_SCHEDTYPE clause is omitted, but CHUNK is set, then MP_SCHEDTYPE=DYNAMIC is assumed. 109 Chapter 7: Fortran Enhancements for Multiprocessors Example 1 The code fragment DO 10 I = 1, 100 A(I) = B(I) 10 CONTINUE could be multiprocessed with the directive C$DOACROSS LOCAL(I), SHARE(A, B) DO 10 I = 1, 100 A(I) = B(I) 10 CONTINUE Here, the defaults are sufficient, provided A and B are mentioned in a nonparallel region or in another SHARE list. The following then works: C$DOACROSS DO 10 I = 1, 100 A(I) = B(I) 10 CONTINUE Example 2 Consider the following code fragment: DO 10 I = 1, N X = SQRT(A(I)) B(I) = X*C(I) + X*D(I) 10 CONTINUE You can be fully explicit, as shown below: C$DOACROSS LOCAL(I, X), share(A, B, C, D, N) DO 10 I = 1, N X = SQRT(A(I)) B(I) = X*C(I) + X*D(I) 10 CONTINUE 110 Writing Parallel Fortran You can also use the defaults: C$DOACROSS LOCAL(X) DO 10 I = 1, N X = SQRT(A(I)) B(I) = X*C(I) + X*D(I) 10 CONTINUE See Example 5 in “Analyzing Data Dependencies for Multiprocessing” on page 114 for more information on this example. Example 3 Consider the following code fragment: DO 10 I = M, K, N X = D(I)**2 Y = X + X DO 20 J = I, MAX A(I,J) = A(I,J) + B(I,J) * C(I,J) * X + Y 20 CONTINUE 10 CONTINUE PRINT*, I, X Here, the final values of I and X are needed after the loop completes. A correct directive is shown below: C$DOACROSS LOCAL(Y,J), LASTLOCAL(I,X), C$& SHARE(M,K,N,ITOP,A,B,C,D) DO 10 I = M, K, N X = D(I)**2 Y = X + X DO 20 J = I, ITOP A(I,J) = A(I,J) + B(I,J) * C(I,J) *X + Y 20 CONTINUE 10 CONTINUE PRINT*, I, X 111 Chapter 7: Fortran Enhancements for Multiprocessors You can also use the defaults: C$DOACROSS LOCAL(Y,J), LASTLOCAL(X) DO 10 I = M, K, N X = D(I)**2 Y = X + X DO 20 J = I, MAX A(I,J) = A(I,J) + B(I,J) * C(I,J) *X + Y 20 CONTINUE 10 CONTINUE PRINT*, I, X I is a loop index variable for the C$DOACROSS loop, so it is LASTLOCAL by default. However, even though J is a loop index variable, it is not the loop index of the loop being multiprocessed and has no special status. If it is not declared, it is assigned the default value of SHARE, which produces an incorrect answer. 
C$&

Occasionally, the clauses in the C$DOACROSS directive are longer than one line. Use the C$& directive to continue the directive onto multiple lines. For example:

C$DOACROSS share(ALPHA, BETA, GAMMA, DELTA,
C$&   EPSILON, OMEGA), LASTLOCAL(I, J, K, L, M, N),
C$&   LOCAL(XXX1, XXX2, XXX3, XXX4, XXX5, XXX6, XXX7,
C$&   XXX8, XXX9)

C$

The C$ directive is considered a comment line except when multiprocessing. A line beginning with C$ is treated as a conditionally compiled Fortran statement. The rest of the line contains a standard Fortran statement. The statement is compiled only if multiprocessing is turned on. In this case, the C and $ are treated as if they are blanks. They can be used to insert debugging statements, or an experienced user can use them to insert arbitrary code into the multiprocessed version.

The following code demonstrates the use of the C$ directive:

C$    PRINT 10
C$ 10 FORMAT('BEGIN MULTIPROCESSED LOOP')
C$DOACROSS LOCAL(I), SHARE(A,B)
      DO I = 1, 100
         CALL COMPUTE(A, B, I)
      END DO

C$MP_SCHEDTYPE and C$CHUNK

The C$MP_SCHEDTYPE=mode directive acts as an implicit MP_SCHEDTYPE clause for all C$DOACROSS directives in scope. mode is any of the modes listed in the section called “CHUNK, MP_SCHEDTYPE” on page 108. A C$DOACROSS directive that does not have an explicit MP_SCHEDTYPE clause is given the value specified in the last directive prior to the loop, rather than the normal default. If the C$DOACROSS does have an explicit clause, then the explicit value is used.

The C$CHUNK=integer_expression directive affects the CHUNK clause of a C$DOACROSS in the same way that the C$MP_SCHEDTYPE directive affects the MP_SCHEDTYPE clause for all C$DOACROSS directives in scope. Both directives are in effect from the place they occur in the source until another corresponding directive is encountered or the end of the procedure is reached.

You can also invoke this functionality from the command line during a compile. The –mp_schedtype=schedule_type and –chunk=integer command line options have the effect of implicitly putting the corresponding directive(s) as the first lines in the file.

Nesting C$DOACROSS

The Fortran compiler does not support direct nesting of C$DOACROSS loops. For example, the following is illegal and generates a compilation error:

C$DOACROSS LOCAL(I)
      DO I = 1, N
C$DOACROSS LOCAL(J)
         DO J = 1, N
            A(I,J) = B(I,J)
         END DO
      END DO

However, to simplify separate compilation, a different form of nesting is allowed. A routine that uses C$DOACROSS can be called from within a multiprocessed region. This can be useful if a single routine is called from several different places: sometimes from within a multiprocessed region, sometimes not. Nesting does not increase the parallelism. When the first C$DOACROSS loop is encountered, that loop is run in parallel. If while in the parallel loop a call is made to a routine that itself has a C$DOACROSS, this subsequent loop is executed serially.

Analyzing Data Dependencies for Multiprocessing

The essential condition required to parallelize a loop correctly is that each iteration of the loop must be independent of all other iterations. If a loop meets this condition, then the order in which the iterations of the loop execute is not important. They can be executed backward or even at the same time, and the answer is still the same. This property is captured by the notion of data independence.
For a loop to be data-independent, no iteration of the loop can write a value into a memory location that is read or written by any other iteration of that loop. It is all right if the same iteration reads and/or writes a memory location repeatedly, as long as no other iteration does; it is all right if many iterations read the same location, as long as none of them write to it. In a Fortran program, memory locations are represented by variable names. So, to determine if a particular loop can be run in parallel, examine the way variables are used in the loop. Because data dependence occurs only when memory locations are modified, pay particular attention to variables that appear on the left-hand side of assignment statements. If a variable is not modified or if it is passed to a function or subroutine, there is no data dependence associated with it.

The Fortran compiler supports four kinds of variable usage within a parallel loop: SHARE, LOCAL, LASTLOCAL, and REDUCTION. If a variable is declared as SHARE, all iterations of the loop use the same copy. If a variable is declared as LOCAL, each iteration is given its own uninitialized copy. A variable is declared SHARE if it is only read (not written) within the loop or if it is an array where each iteration of the loop uses a different element of the array. A variable can be LOCAL if its value does not depend on any other iteration and if its value is used only within a single iteration. In effect the LOCAL variable is just temporary; a new copy can be created in each loop iteration without changing the final answer. As a special case, if only the very last value of a variable computed on the very last iteration is used outside the loop (but would otherwise qualify as a LOCAL variable), the loop can be multiprocessed by declaring the variable to be LASTLOCAL. “REDUCTION” on page 107 describes the use of REDUCTION variables.

It is often difficult to analyze loops for data dependence information. Each use of each variable must be examined to see if it fulfills the criteria for LOCAL, LASTLOCAL, SHARE, or REDUCTION. If all of the variables’ uses conform, the loop can be parallelized. If not, the loop cannot be parallelized as it stands, but possibly can be rewritten into an equivalent parallel form. (See “Breaking Data Dependencies” on page 120 for information on rewriting code in parallel form.)

An alternative to analyzing variable usage by hand is to use Power Fortran. This optional software package is a Fortran preprocessor that analyzes loops for data dependence. If Power Fortran determines that a loop is data-independent, it automatically inserts the required compiler directives (see “Writing Parallel Fortran” on page 105). If Power Fortran cannot determine whether the loop is independent, it produces a listing file detailing where the problems lie. You can use Power Fortran in conjunction with WorkShop Pro MPF to visualize these dependencies and make it easier to understand the obstacles to parallelization.

The rest of this section is devoted to analyzing sample loops, some parallel and some not parallel.

Example 1: Simple Independence

      DO 10 I = 1,N
 10      A(I) = X + B(I)*C(I)
All the variables are SHARE except for I, which is either LOCAL or LASTLOCAL, depending on whether the last value of I is used later in the code. Example 2: Data Dependence DO 20 I = 2,N 20 A(I) = B(I) - A(I-1) This fragment contains A(I) on the left-hand side and A(I-1) on the right. This means that one iteration of the loop writes to a location in A and the next iteration reads from that same location. Because different iterations of the loop read and write the same memory location, this loop cannot be run in parallel. Example 3: Stride Not 1 DO 20 I = 2,N,2 20 A(I) = B(I) - A(I-1) This example looks like the previous example. The difference is that the stride of the DO loop is now two rather than one. Now A(I) references every other element of A, and A(I-1) references exactly those elements of A that are not referenced by A(I). None of the data locations on the right-hand side is ever the same as any of the data locations written to on the left-hand side. The data are disjoint, so there is no dependence. The loop can be run in parallel. Arrays A and B can be declared SHARE, while variable I should be declared LOCAL or LASTLOCAL. Example 4: Local Variable DO I = 1, N X = A(I)*A(I) + B(I) B(I) = X + B(I)*X END DO In this loop, each iteration of the loop reads and writes the variable X. However, no loop iteration ever needs the value of X from any other iteration. X is used as a temporary variable; its value does not survive from 116 Analyzing Data Dependencies for Multiprocessing one iteration to the next. This loop can be parallelized by declaring X to be a LOCAL variable within the loop. Note that B(I) is both read and written by the loop. This is not a problem because each iteration has a different value for I, so each iteration uses a different B(I). The same B(I) is allowed to be read and written as long as it is done by the same iteration of the loop. The loop can be run in parallel. Arrays A and B can be declared SHARE, while variable I should be declared LOCAL or LASTLOCAL. Example 5: Function Call DO 10 I = 1, N X = SQRT(A(I)) B(I) = X*C(I) + X*D(I) 10 CONTINUE The value of X in any iteration of the loop is independent of the value of X in any other iteration, so X can be made a LOCAL variable. The loop can be run in parallel. Arrays A, B, C, and D can be declared SHARE, while variable I should be declared LOCAL or LASTLOCAL. The interesting feature of this loop is that it invokes an external routine, SQRT. It is possible to use functions and/or subroutines (intrinsic or user defined) within a parallel loop. However, make sure that the various parallel invocations of the routine do not interfere with one another. In particular, SQRT returns a value that depends only on its input argument, does not modify global data, and does not use static storage. We say that SQRT has no side effects. All the Fortran intrinsic functions listed in Appendix A of the MIPSpro Fortran 77 Language Reference Manual have no side effects and can safely be part of a parallel loop. For the most part, the Fortran library functions and VMS intrinsic subroutine extensions (listed in Chapter 4, “System Functions and Subroutines,”) cannot safely be included in a parallel loop. In particular, rand is not safe for multiprocessing. For user-written routines, it is the responsibility of the user to ensure that the routines can be correctly multiprocessed. Caution: Do not use the –static option when compiling routines called within a parallel loop. 
Example 6: Rewritable Data Dependence

      INDX = 0
      DO I = 1, N
         INDX = INDX + I
         A(I) = B(I) + C(INDX)
      END DO

Here, the value of INDX survives the loop iteration and is carried into the next iteration. This loop cannot be parallelized as it is written. Making INDX a LOCAL variable does not work; you need the value of INDX computed in the previous iteration. It is possible to rewrite this loop to make it parallel (see Example 1 in “Breaking Data Dependencies” on page 120).

Example 7: Exit Branch

      DO I = 1, N
         IF (A(I) .LT. EPSILON) GOTO 320
         A(I) = A(I) * B(I)
      END DO
 320  CONTINUE

This loop contains an exit branch; that is, under certain conditions the flow of control suddenly exits the loop. The Fortran compiler cannot parallelize loops containing exit branches.

Example 8: Complicated Independence

      DO I = K+1, 2*K
         W(I) = W(I) + B(I,K) * W(I-K)
      END DO

At first glance, this loop looks like it cannot be run in parallel because it uses both W(I) and W(I-K). Closer inspection reveals that because the value of I varies between K+1 and 2*K, I-K goes from 1 to K. This means that the W(I-K) term varies from W(1) up to W(K), while the W(I) term varies from W(K+1) up to W(2*K). So W(I-K) in any iteration of the loop is never the same memory location as W(I) in any other iteration. Because there is no data overlap, there are no data dependencies. This loop can be run in parallel. W, B, and K can be declared SHARE, while variable I should be declared LOCAL or LASTLOCAL.

This example points out a general rule: the more complex the expression used to index an array, the harder it is to analyze. If the arrays in a loop are indexed only by the loop index variable, the analysis is usually straightforward, though tedious. Fortunately, in practice most array indexing expressions are simple.

Example 9: Inconsequential Data Dependence

      INDEX = SELECT(N)
      DO I = 1, N
         A(I) = A(INDEX)
      END DO

There is a data dependence in this loop because it is possible that at some point I will be the same as INDEX, so there will be a data location that is being read and written by different iterations of the loop. In this special case, you can simply ignore it. You know that when I and INDEX are equal, the value written into A(I) is exactly the same as the value that is already there. The fact that some iterations of the loop read the value before it is written and some read it after it is written is not important, because they all get the same value. Therefore, this loop can be parallelized. Array A can be declared SHARE, while variable I should be declared LOCAL or LASTLOCAL.

Example 10: Local Array

      DO I = 1, N
         D(1) = A(I,1) - A(J,1)
         D(2) = A(I,2) - A(J,2)
         D(3) = A(I,3) - A(J,3)
         TOTAL_DISTANCE(I,J) = SQRT(D(1)**2 + D(2)**2 + D(3)**2)
      END DO

In this fragment, each iteration of the loop uses the same locations in the D array. However, closer inspection reveals that the entire D array is being used as a temporary. This can be multiprocessed by declaring D to be LOCAL. The Fortran compiler allows arrays (even multidimensional arrays) to be LOCAL variables with one restriction: the size of the array must be known at compile time. The dimension bounds must be constants; the LOCAL array cannot have been declared using a variable or the asterisk syntax. Therefore, this loop can be parallelized.
Arrays TOTAL_DISTANCE and A can be declared SHARE, the array D should be declared LOCAL, and variable I should be declared LOCAL or LASTLOCAL.

Breaking Data Dependencies

Many loops that have data dependencies can be rewritten so that some or all of the loop can be run in parallel. The essential idea is to locate the statement(s) in the loop that cannot be made parallel and try to find another way to express them that does not depend on any other iteration of the loop. If this fails, try to pull the statements out of the loop and into a separate loop, allowing the remainder of the original loop to be run in parallel.

The first step is to analyze the loop to discover the data dependencies (see “Writing Parallel Fortran” on page 105). You can use WorkShop Pro MPF with MIPSpro Power Fortran 77 to identify the problem areas. Once you have identified these areas, you can use various techniques to rewrite the code to break the dependence. Sometimes the dependencies in a loop cannot be broken, and you must either accept the serial execution rate or try to discover a new parallel method of solving the problem. The rest of this section is devoted to a series of “cookbook” examples on how to deal with commonly occurring situations. These are by no means exhaustive but cover many situations that happen in practice.

Example 1: Loop Carried Value

      INDX = 0
      DO I = 1, N
         INDX = INDX + I
         A(I) = B(I) + C(INDX)
      END DO

This code segment is the same as in “Example 6: Rewritable Data Dependence” on page 118. INDX has its value carried from iteration to iteration. However, you can compute the appropriate value for INDX without making reference to any previous value. For example, consider the following code:

C$DOACROSS LOCAL (I, INDX)
      DO I = 1, N
         INDX = (I*(I+1))/2
         A(I) = B(I) + C(INDX)
      END DO

In this loop, the value of INDX is computed without using any values computed on any other iteration. INDX can correctly be made a LOCAL variable, and the loop can now be multiprocessed.

Example 2: Indirect Indexing

      DO 100 I = 1, N
         IX = INDEXX(I)
         IY = INDEXY(I)
         XFORCE(I) = XFORCE(I) + NEWXFORCE(IX)
         YFORCE(I) = YFORCE(I) + NEWYFORCE(IY)
         IXX = IXOFFSET(IX)
         IYY = IYOFFSET(IY)
         TOTAL(IXX, IYY) = TOTAL(IXX, IYY) + EPSILON
 100  CONTINUE

It is the final statement that causes problems. The indexes IXX and IYY are computed in a complex way and depend on the values from the IXOFFSET and IYOFFSET arrays. We do not know if TOTAL(IXX,IYY) in one iteration of the loop will always be different from TOTAL(IXX,IYY) in every other iteration of the loop.

We can pull the statement out into its own separate loop by expanding IXX and IYY into arrays to hold intermediate values:

C$DOACROSS LOCAL(IX, IY, I)
      DO I = 1, N
         IX = INDEXX(I)
         IY = INDEXY(I)
         XFORCE(I) = XFORCE(I) + NEWXFORCE(IX)
         YFORCE(I) = YFORCE(I) + NEWYFORCE(IY)
         IXX(I) = IXOFFSET(IX)
         IYY(I) = IYOFFSET(IY)
      END DO
      DO 100 I = 1, N
         TOTAL(IXX(I),IYY(I)) = TOTAL(IXX(I),IYY(I)) + EPSILON
 100  CONTINUE

Here, IXX and IYY have been turned into arrays to hold all the values computed by the first loop. The first loop (containing most of the work) can now be run in parallel. Only the second loop must still be run serially.

Before we leave this example, note that if we were certain that the value for IXX was always different in every iteration of the loop, then the original loop could be run in parallel. It could also be run in parallel if IYY was always different. This will be true if IXOFFSET or IYOFFSET, respectively, are permutation vectors. If IXX (or IYY) is always different in every iteration, then TOTAL(IXX,IYY) is never the same location in any iteration of the loop, and so there is no data conflict. This sort of knowledge is, of course, program-specific and should always be used with great care. It may be true for a particular data set, but to run the original code in parallel as it stands, you need to be sure it will always be true for all possible input data sets.
Example 3: Recurrence

      DO I = 1, N
         X(I) = X(I-1) + Y(I)
      END DO

This is an example of a recurrence, which exists when a value computed in one iteration is immediately used by another iteration. There is no good way of running this loop in parallel. If this type of construct appears in a critical loop, try pulling the statement(s) out of the loop, as in the previous example. Sometimes another loop encloses the recurrence; in that case, try to parallelize the outer loop.
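For instance, if the recurrence is nested inside an independent outer loop, the outer loop can carry the parallelism. A minimal sketch follows; the arrays and bounds here are illustrative, not from the original example:

C$DOACROSS LOCAL(I, J)
      DO J = 1, M
         DO I = 2, N
            X(I,J) = X(I-1,J) + Y(I,J)
         END DO
      END DO

Each column J is still computed serially, but the columns do not depend on one another, so different threads can process different columns.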
Example 4: Sum Reduction

      SUM = 0.0
      DO I = 1, N
         SUM = SUM + A(I)
      END DO

This operation is known as a reduction. Reductions occur when an array of values is combined and reduced into a single value. This example is a sum reduction because the combining operation is addition. Here, the value of SUM is carried from one loop iteration to the next, so this loop cannot be multiprocessed. However, because this loop simply sums the elements of A(I), we can rewrite the loop to accumulate multiple, independent subtotals. Then we can do much of the work in parallel:

      NUM_THREADS = MP_NUMTHREADS()
C
C     IPIECE_SIZE = N/NUM_THREADS ROUNDED UP
C
      IPIECE_SIZE = (N + (NUM_THREADS-1)) / NUM_THREADS
      DO K = 1, NUM_THREADS
         PARTIAL_SUM(K) = 0.0
C
C     THE FIRST THREAD DOES 1 THROUGH IPIECE_SIZE, THE SECOND
C     DOES IPIECE_SIZE+1 THROUGH 2*IPIECE_SIZE, ETC. IF N IS
C     NOT EVENLY DIVISIBLE BY NUM_THREADS, THE LAST PIECE NEEDS
C     TO TAKE THIS INTO ACCOUNT, HENCE THE "MIN" EXPRESSION.
C
         DO I = K*IPIECE_SIZE - IPIECE_SIZE + 1, MIN(K*IPIECE_SIZE,N)
            PARTIAL_SUM(K) = PARTIAL_SUM(K) + A(I)
         END DO
      END DO
C
C     NOW ADD UP THE PARTIAL SUMS
C
      SUM = 0.0
      DO I = 1, NUM_THREADS
         SUM = SUM + PARTIAL_SUM(I)
      END DO

The outer K loop can be run in parallel. In this method, the array pieces for the partial sums are contiguous, resulting in good cache utilization and performance.

This is an important and common transformation, so automatic support is provided by the REDUCTION clause:

      SUM = 0.0
C$DOACROSS LOCAL (I), REDUCTION (SUM)
      DO 10 I = 1, N
         SUM = SUM + A(I)
   10 CONTINUE

This code has essentially the same meaning as the much longer and more confusing code above. It is an important example to study because the idea of adding an extra dimension to an array to permit parallel computation, and then combining the partial results, is an important technique for trying to break data dependencies. This idea occurs over and over in various contexts and disguises.

Note that reduction transformations such as this do not produce the same results as the original code. Because computer arithmetic has limited precision, when you sum the values in a different order, as was done here, the round-off errors accumulate slightly differently. It is likely that the final answer will differ slightly from that of the original loop. Both answers are equally "correct." Most of the time the difference is irrelevant, but sometimes it can be significant, so some caution is in order. If the difference is significant, neither answer is really trustworthy.

This example is a sum reduction because the operator is plus (+). The Fortran compiler supports four types of reduction operations:

1. sum: p = p + a(i)
2. product: p = p * a(i)
3. min: m = min(m, a(i))
4. max: m = max(m, a(i))

For example,

C$DOACROSS LOCAL(I), REDUCTION(BG_SUM, BG_PROD, BG_MIN, BG_MAX)
      DO I = 1, N
         BG_SUM  = BG_SUM + A(I)
         BG_PROD = BG_PROD * A(I)
         BG_MIN  = MIN(BG_MIN, A(I))
         BG_MAX  = MAX(BG_MAX, A(I))
      END DO

One further example of a reduction transformation is noteworthy. Consider the following code:

      DO I = 1, N
         TOTAL = 0.0
         DO J = 1, M
            TOTAL = TOTAL + A(J)
         END DO
         B(I) = C(I) * TOTAL
      END DO

Initially, it might look as if the inner loop should be parallelized with a REDUCTION clause. However, look at the outer I loop. Although TOTAL cannot be made a LOCAL variable in the inner loop, it fulfills the criteria for a LOCAL variable in the outer loop: the value of TOTAL in each iteration of the outer loop does not depend on the value of TOTAL in any other iteration of the outer loop. Thus, you do not have to rewrite the loop; you can parallelize this reduction on the outer I loop, making TOTAL and J local variables.
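A directive for that outer loop might look like the following sketch. It is not from the original text, but it follows directly from the discussion above:

C$DOACROSS LOCAL(I, J, TOTAL)
      DO I = 1, N
         TOTAL = 0.0
         DO J = 1, M
            TOTAL = TOTAL + A(J)
         END DO
         B(I) = C(I) * TOTAL
      END DO

Each thread keeps its own private TOTAL and J, so the inner reduction runs serially within each iteration while the outer iterations run in parallel.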
Work Quantum

A certain amount of overhead is associated with multiprocessing a loop. If the work occurring in the loop is small, the loop can actually run slower by multiprocessing than by single processing. To avoid this, make the amount of work inside the multiprocessed region as large as possible.

Example 1: Loop Interchange

      DO K = 1, N
         DO I = 1, N
            DO J = 1, N
               A(I,J) = A(I,J) + B(I,K) * C(K,J)
            END DO
         END DO
      END DO

Here you have several choices: parallelize the J loop or the I loop. You cannot parallelize the K loop, because different iterations of the K loop would all try to read and write the same values of A(I,J). Try to parallelize the outermost DO loop possible, because it encloses the most work; in this example, that is the I loop. For this example, use the technique called loop interchange. Although the parallelizable loops are not the outermost ones, you can reorder the loops to make one of them outermost. Thus, loop interchange would produce

C$DOACROSS LOCAL(I, J, K)
      DO I = 1, N
         DO K = 1, N
            DO J = 1, N
               A(I,J) = A(I,J) + B(I,K) * C(K,J)
            END DO
         END DO
      END DO

Now the parallelizable loop encloses more work and shows better performance. In practice, relatively few loops can be reordered in this way. However, it does occasionally happen that several loops in a nest of loops are candidates for parallelization. In such a case, it is usually best to parallelize the outermost one.

Occasionally, the only loop available to be parallelized has a fairly small amount of work. It may be worthwhile to force certain loops to run without parallelism or to select between a parallel version and a serial version, on the basis of the length of the loop.

Example 2: Conditional Parallelism

      J = (N/4) * 4
      DO I = J+1, N
         A(I) = A(I) + X*B(I)
      END DO
      DO I = 1, J, 4
         A(I)   = A(I)   + X*B(I)
         A(I+1) = A(I+1) + X*B(I+1)
         A(I+2) = A(I+2) + X*B(I+2)
         A(I+3) = A(I+3) + X*B(I+3)
      END DO

Here you are using loop unrolling of order four to improve speed. For the first loop, the number of iterations is always fewer than four, so it does not do enough work to justify running it in parallel. The second loop is worthwhile to parallelize if N is big enough. To overcome the parallel loop overhead, N needs to be around 500. An optimized version would use the IF clause on the DOACROSS directive:

      J = (N/4) * 4
      DO I = J+1, N
         A(I) = A(I) + X*B(I)
      END DO
C$DOACROSS IF (J.GE.500), LOCAL(I)
      DO I = 1, J, 4
         A(I)   = A(I)   + X*B(I)
         A(I+1) = A(I+1) + X*B(I+1)
         A(I+2) = A(I+2) + X*B(I+2)
         A(I+3) = A(I+3) + X*B(I+3)
      END DO

Cache Effects

It is good policy to write loops that take the effect of the cache into account, with or without parallelism. The technique for the best cache performance is simple: make the loop step through the array in the same way that the array is laid out in memory. For Fortran, this means stepping through the array without any gaps and with the leftmost subscript varying the fastest. Note that this optimization does not depend on multiprocessing, nor is it required in order for multiprocessing to work correctly. However, multiprocessing can affect how the cache is used, so it is worthwhile to understand.

Performing a Matrix Multiply

Consider the following code segment:

      DO I = 1, N
         DO K = 1, N
            DO J = 1, N
               A(I,J) = A(I,J) + B(I,K) * C(K,J)
            END DO
         END DO
      END DO

This is the same as Example 1 in "Work Quantum" on page 126 (after interchange). To get the best cache performance, the I loop should be innermost. At the same time, to get the best multiprocessing performance, the outermost loop should be parallelized. For this example, you can interchange the I and J loops and get the best of both optimizations:

C$DOACROSS LOCAL(I, J, K)
      DO J = 1, N
         DO K = 1, N
            DO I = 1, N
               A(I,J) = A(I,J) + B(I,K) * C(K,J)
            END DO
         END DO
      END DO

Understanding Trade-Offs

Sometimes you must choose between the possible optimizations and their costs. Look at the following code segment:

      DO J = 1, N
         DO I = 1, M
            A(I) = A(I) + B(J)*C(I,J)
         END DO
      END DO

This loop can be parallelized on I but not on J. You could interchange the loops to put I on the outside, thus getting a bigger work quantum:

C$DOACROSS LOCAL(I,J)
      DO I = 1, M
         DO J = 1, N
            A(I) = A(I) + B(J)*C(I,J)
         END DO
      END DO

However, putting J on the inside means that you will step through the C array in the wrong direction; the leftmost subscript should be the one that varies the fastest. It is possible to parallelize the I loop where it stands:

      DO J = 1, N
C$DOACROSS LOCAL(I)
         DO I = 1, M
            A(I) = A(I) + B(J)*C(I,J)
         END DO
      END DO

However, M needs to be large for the work quantum to show any improvement. In this example, A(I) is used to do a sum reduction, and it is possible to use the reduction techniques shown in Example 4 of "Breaking Data Dependencies" on page 120 to rewrite this in a parallel form. (Recall that there is no support for an entire array as a member of the REDUCTION clause on a DOACROSS.) However, that involves converting array A from a one-dimensional array to a two-dimensional array to hold the partial sums; this is analogous to the way the scalar summation variable was converted into an array of partial sums. If A is large, the conversion can take more memory than you can spare. It can also take extra time to initialize the expanded array and increase the memory bandwidth requirements. If the trade-off seems worthwhile, the transformation looks like this:
      NUM = MP_NUMTHREADS()
      IPIECE = (N + (NUM-1)) / NUM
C$DOACROSS LOCAL(K,J,I)
      DO K = 1, NUM
         DO J = K*IPIECE - IPIECE + 1, MIN(N, K*IPIECE)
            DO I = 1, M
               PARTIAL_A(I,K) = PARTIAL_A(I,K) + B(J)*C(I,J)
            END DO
         END DO
      END DO
C$DOACROSS LOCAL (I,K)
      DO I = 1, M
         DO K = 1, NUM
            A(I) = A(I) + PARTIAL_A(I,K)
         END DO
      END DO

You must trade off the various possible optimizations to find the combination that is right for the particular job.

Load Balancing

When the Fortran compiler divides a loop into pieces, by default it uses the simple method of separating the iterations into contiguous blocks of equal size for each process. It can happen that some iterations take significantly longer to complete than others. At the end of a parallel region, the program waits for all processes to complete their tasks. If the work is not divided evenly, time is wasted waiting for the slowest process to finish.

Example:

      DO I = 1, N
         DO J = 1, I
            A(J,I) = A(J,I) + B(J)*C(I)
         END DO
      END DO

This code segment can be parallelized on the I loop. Because the inner loop goes from 1 to I, the first block of iterations of the outer loop will end long before the last block of iterations of the outer loop. In this example, this is easy to see and predictable, so you can change the program:

      NUM_THREADS = MP_NUMTHREADS()
C$DOACROSS LOCAL(I, J, K)
      DO K = 1, NUM_THREADS
         DO I = K, N, NUM_THREADS
            DO J = 1, I
               A(J,I) = A(J,I) + B(J)*C(I)
            END DO
         END DO
      END DO

In this rewritten version, instead of breaking up the I loop into contiguous blocks, it is broken into interleaved blocks. Thus, each execution thread receives some small values of I and some large values of I, giving a better balance of work between the threads. Interleaving usually, but not always, cures a load balancing problem. You can use the MP_SCHEDTYPE clause to perform this desirable transformation automatically:

C$DOACROSS LOCAL (I,J), MP_SCHEDTYPE=INTERLEAVE
      DO 20 I = 1, N
         DO 10 J = 1, I
            A(J,I) = A(J,I) + B(J)*C(I)
   10    CONTINUE
   20 CONTINUE

This code has the same meaning as the rewritten form above. Note that interleaving can cause poor cache performance because the array is no longer stepped through at stride 1. You can improve performance somewhat by adding a CHUNK=integer_expression clause. Usually 4 or 8 is a good value for integer_expression. Each small chunk will have stride 1 to improve cache performance, while the chunks are interleaved to improve load balancing.

The way that iterations are assigned to processes is known as scheduling. Interleaving is one possible schedule. Both interleaving and the "simple" scheduling methods are examples of fixed schedules; the iterations are assigned to processes by a single decision made when the loop is entered. For more complex loops, it may be desirable to use DYNAMIC or GSS schedules. Comparing the output from pixie or from pc sampling allows you to see how well the load is being balanced, so you can compare the different methods of dividing the load. Refer to the discussion of the MP_SCHEDTYPE clause in "C$DOACROSS" on page 106 for more information.

Even when the load is perfectly balanced, iterations may still take varying amounts of time to finish because of random factors. One process may take a page fault, another may be interrupted to let a different program run, and so on. Because of these unpredictable events, the time spent waiting for all processes to complete can be several hundred cycles, even with near perfect balance.
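Before leaving load balancing, here is a sketch of requesting one of the more complex schedules mentioned above directly on the directive. This is not from the original text; the CHUNK value is illustrative:

C$DOACROSS LOCAL(I,J), MP_SCHEDTYPE=DYNAMIC, CHUNK=4
      DO I = 1, N
         DO J = 1, I
            A(J,I) = A(J,I) + B(J)*C(I)
         END DO
      END DO

With DYNAMIC scheduling, each thread takes the next available chunk of four iterations as it finishes its previous chunk, so faster threads automatically pick up more of the work at the cost of a little extra synchronization.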
Advanced Features

A number of features are provided so that sophisticated users can override the multiprocessing defaults and customize the parallelism to their particular applications. This section provides a brief explanation of these features.

mp_block and mp_unblock

mp_block puts the slave threads into a blocked state using the system call blockproc. The slave threads stay blocked until a call is made to mp_unblock. These routines are useful if the job has bursts of parallelism separated by long stretches of single processing, as with an interactive program. You can block the slave processes so they consume CPU cycles only as needed, thus freeing the machine for other users. The Fortran system automatically unblocks the slaves on entering a parallel region should you neglect to do so.

mp_setup, mp_create, and mp_destroy

The mp_setup, mp_create, and mp_destroy subroutine calls create and destroy threads of execution. This can be useful if the job has only one parallel portion or if the parallel parts are widely scattered. When you destroy the extra execution threads, they cannot consume system resources; they must be re-created when needed. Use of these routines is discouraged because they degrade performance; the mp_block and mp_unblock routines should be used in almost all cases.

mp_setup takes no arguments. It creates the default number of processes as defined by previous calls to mp_set_numthreads, by the environment variable MP_SET_NUMTHREADS (described in "Environment Variables: MP_SET_NUMTHREADS, MP_BLOCKTIME, MP_SETUP" on page 136), or by the number of CPUs on the current hardware platform. mp_setup is called automatically when the first parallel loop is entered, to initialize the slave threads.

mp_create takes a single integer argument, the total number of execution threads desired. Note that the total number of threads includes the master thread. Thus, mp_create(n) creates one thread less than the value of its argument. mp_destroy takes no arguments; it destroys all the slave execution threads, leaving the master untouched.

When the slave threads die, they generate a SIGCLD signal. If your program has changed the signal handler to catch SIGCLD, it must be prepared to deal with this signal when mp_destroy is executed. This signal also occurs when the program exits; mp_destroy is called as part of normal cleanup when a parallel Fortran job terminates.

mp_blocktime

The Fortran slave threads spin wait until there is work to do. This makes them immediately available when a parallel region is reached. However, this consumes CPU resources. After enough wait time has passed, the slaves block themselves through blockproc. Once the slaves are blocked, a system call to unblockproc is required to activate the slaves again (refer to the unblockproc(2) man page for details). This makes the response time much longer when starting up a parallel region.

This trade-off between response time and CPU usage can be adjusted with the mp_blocktime call. mp_blocktime takes a single integer argument that specifies the number of times to spin before blocking. By default, it is set to 10,000,000; this takes roughly one second. If called with an argument of 0, the slave threads will not block themselves no matter how much time has passed. Explicit calls to mp_block, however, will still block the threads. This automatic blocking is transparent to the user's program; blocked threads are automatically unblocked when a parallel region is reached.
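The following sketch shows one way mp_block and mp_unblock might be used in an interactive program. It is not from the original text, and the two routine names other than the mp_ calls are hypothetical:

C     LONG SERIAL STRETCH AHEAD; LET THE SLAVES SLEEP
      CALL MP_BLOCK
      CALL WAIT_FOR_USER_INPUT
C     PARALLEL WORK COMING; WAKE THE SLAVES UP AGAIN
      CALL MP_UNBLOCK
      CALL COMPUTE_IN_PARALLEL

While the program waits for input, the blocked slaves consume no CPU cycles; the explicit mp_unblock avoids the longer start-up delay that automatic unblocking would otherwise add to the first parallel region.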
mp_numthreads, mp_set_numthreads

Occasionally, you may want to know how many execution threads are available. mp_numthreads is a zero-argument integer function that returns the total number of execution threads for this job. The count includes the master thread.

mp_set_numthreads takes a single integer argument. It changes the default number of threads to the specified value. A subsequent call to mp_setup will use the specified value rather than the original defaults. If the slave threads have already been created, this call does not change their number; it has an effect only when mp_setup is called.

mp_my_threadnum

mp_my_threadnum is a zero-argument function that allows a thread to differentiate itself while in a parallel region. If there are n execution threads, the function call returns a value between zero and n - 1. The master thread is always thread zero. This function can be useful when parallelizing certain kinds of loops. Most of the time the loop index variable can be used for the same purpose. Occasionally, the loop index may not be accessible, as, for example, when an external routine is called from within the parallel loop. This routine provides a mechanism for those cases.
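For instance, a loop in which each iteration hands its thread number to an external routine might look like the following sketch. This is not from the original text, and the routine name WORKER is hypothetical:

C$DOACROSS LOCAL(I, ID)
      DO I = 1, N
C        THE EXTERNAL ROUTINE CANNOT SEE THE LOOP INDEX, SO
C        PASS THE THREAD NUMBER EXPLICITLY
         ID = MP_MY_THREADNUM()
         CALL WORKER(A, I, ID)
      END DO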
Environment Variables: MP_SET_NUMTHREADS, MP_BLOCKTIME, MP_SETUP

The MP_SET_NUMTHREADS, MP_BLOCKTIME, and MP_SETUP environment variables act as an implicit call to the corresponding routine of the same name at program start-up time. For example, the csh command

   % setenv MP_SET_NUMTHREADS 2

causes the program to create two threads regardless of the number of CPUs actually on the machine, just like the source statement

   CALL MP_SET_NUMTHREADS (2)

Similarly, the sh commands

   $ MP_BLOCKTIME=0
   $ export MP_BLOCKTIME

prevent the slave threads from autoblocking, just like the source statement

   call mp_blocktime (0)

For compatibility with older releases, the environment variable NUM_THREADS is supported as a synonym for MP_SET_NUMTHREADS.

To help support networks with several multiprocessors and several CPUs, the environment variable MP_SET_NUMTHREADS also accepts an expression involving integers, +, –, min, max, and the special symbol all, which stands for "the number of CPUs on the current machine." For example, the following command selects the number of threads to be two fewer than the total number of CPUs (but always at least one):

   % setenv MP_SET_NUMTHREADS max(1,all-2)

Environment Variables: MP_SUGNUMTHD, MP_SUGNUMTHD_VERBOSE, MP_SUGNUMTHD_MIN, MP_SUGNUMTHD_MAX

Prior to the current (6.02) compiler release, the number of threads used during execution of a multiprocessor job was generally constant, set for example using MP_SET_NUMTHREADS. In an environment with long-running jobs and varying workloads, it may be preferable to vary the number of threads during execution of some jobs. Setting MP_SUGNUMTHD causes the run-time library to create an additional, asynchronous process that periodically wakes up and monitors the system load. When idle processors exist, this process increases the number of threads, up to a maximum of MP_SET_NUMTHREADS. When the system load increases, it decreases the number of threads, possibly to as few as 1. When MP_SUGNUMTHD has no value, this feature is disabled and multithreading works as before.

The environment variables MP_SUGNUMTHD_MIN and MP_SUGNUMTHD_MAX are used to limit this feature as desired. When MP_SUGNUMTHD_MIN is set to an integer value between 1 and MP_SET_NUMTHREADS, the process will not decrease the number of threads below that value. When MP_SUGNUMTHD_MAX is set to an integer value between the minimum number of threads and MP_SET_NUMTHREADS, the process will not increase the number of threads above that value.

If you set any value in the environment variable MP_SUGNUMTHD_VERBOSE, informational messages are written to stderr whenever the process changes the number of threads in use.

Calls to mp_numthreads and mp_set_numthreads are taken as a sign that the application depends on the number of threads in use. The number in use is frozen upon either of these calls, and if MP_SUGNUMTHD_VERBOSE is set, a message to that effect is written to stderr.

Environment Variables: MP_SCHEDTYPE, CHUNK

These environment variables specify the type of scheduling to use on DOACROSS loops that have their scheduling type set to RUNTIME. For example, the following csh commands cause loops with the RUNTIME scheduling type to be executed as interleaved loops with a chunk size of 4:

   % setenv MP_SCHEDTYPE INTERLEAVE
   % setenv CHUNK 4

The defaults are the same as on the DOACROSS directive; if neither variable is set, SIMPLE scheduling is assumed. If MP_SCHEDTYPE is set but CHUNK is not set, a CHUNK of 1 is assumed. If CHUNK is set but MP_SCHEDTYPE is not, DYNAMIC scheduling is assumed.

mp_setlock, mp_unsetlock, mp_barrier

mp_setlock, mp_unsetlock, and mp_barrier are zero-argument subroutines that provide convenient (although limited) access to the locking and barrier functions provided by ussetlock, usunsetlock, and barrier. These subroutines are convenient because you do not need to initialize them; calls such as usconfig and usinit are done automatically. The limitation is that there is only one lock and one barrier. For most programs, this amount is sufficient. If your program requires more complex or flexible locking facilities, use the ussetlock family of subroutines directly.

Local COMMON Blocks

A special ld option allows named COMMON blocks to be local to a process. Each process in the parallel job gets its own private copy of the common block. This can be helpful in converting certain types of Fortran programs into a parallel form. The common block must be a named COMMON (blank COMMON may not be made local), and it must not be initialized by DATA statements.

To create a local COMMON block, give the special loader directive –Xlocal followed by a list of COMMON block names. Note that the external name of a COMMON block known to the loader has a trailing underscore and is not surrounded by slashes. For example, the command

   % f77 –mp a.o –Xlocal foo_

makes the COMMON block /foo/ a local COMMON block in the resulting a.out file. You can specify multiple –Xlocal options if necessary.

It is occasionally desirable to copy values from the master thread's version of the COMMON block into the slave threads' versions. The special directive C$COPYIN allows this. It has the form

   C$COPYIN item [, item ...]

Each item must be a member of a local COMMON block. It can be a variable, an array, an individual element of an array, or the entire COMMON block. For example,

   C$COPYIN x,y, /foo/, a(i)

propagates the values for x and y, all the values in the COMMON block foo, and the ith element of array a. All these items must be members of local COMMON blocks. Note that this directive is translated into executable code, so in this example i is evaluated at the time the statement is executed.
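Putting the pieces together, a hypothetical program might use a local COMMON block as in the following sketch. The block name and variables are illustrative, not from the original text, and the link line follows the –Xlocal form shown above:

      COMMON /WORKSP/ SCRATCH(1000), NITEMS
C     ... MASTER THREAD FILLS SCRATCH AND NITEMS HERE ...
C     GIVE EVERY SLAVE A COPY OF THE MASTER'S VALUES
C$COPYIN /WORKSP/

linked with

   % f77 –mp prog.f –Xlocal worksp_

After the C$COPYIN, each thread works in its own private copy of /WORKSP/ without further synchronization.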
Compatibility With sproc

The parallelism used in Fortran is implemented using the standard system call sproc. It is recommended that programs not attempt to use both C$DOACROSS loops and sproc calls. It is possible, but there are several restrictions:

•  Any threads you create may not execute C$DOACROSS loops; only the original thread is allowed to do this.

•  The calls to routines like mp_block and mp_destroy apply only to the threads created by mp_create or to those automatically created when the Fortran job starts; they have no effect on any user-defined threads.

•  Calls to routines such as m_get_numprocs do not apply to the threads created by the Fortran routines. However, the Fortran threads are ordinary subprocesses; using the routine kill with the arguments 0 and sig (for example, kill(0,sig)) to signal all members of the process group might kill the threads used to execute C$DOACROSS.

•  If you choose to intercept the SIGCLD signal, you must be prepared to receive this signal when the threads used for the C$DOACROSS loops exit; this occurs when mp_destroy is called or at program termination.

•  Note in particular that m_fork is implemented using sproc, so it is not legal to m_fork a family of processes that each subsequently executes C$DOACROSS loops. Only the original thread can execute C$DOACROSS loops.

DOACROSS Implementation

This section discusses how multiprocessing is implemented in a DOACROSS routine. This information is useful when you use a debugger or interpret the results of an execution profile.

Loop Transformation

When the Fortran compiler encounters a C$DOACROSS directive, it spools the body of the corresponding DO loop into a separate subroutine and replaces the loop with a call to a special library routine __mp_parallel_do. The newly created routine is named by appending .pregion to the name of the original routine, followed by the number of the parallel loop in the routine (where 0 is the first loop). For example, the first parallel loop in a routine named foo is named foo.pregion0, the second parallel loop is foo.pregion1, and so on. If a loop occurs in the main routine and if that routine has not been given a name by the PROGRAM statement, its name is assumed to be main.

Any variables declared to be LOCAL in the original C$DOACROSS statement are declared as local variables in the spooled routine. References to SHARE variables are resolved by referring back to the original routine.

Because the spooled routine is now just a DO loop, the routine uses subroutine arguments to specify which part of the loop a particular process is to execute. The spooled routine has three arguments: the starting value for the index, the number of times to execute the loop, and a special flag word.
As an example, the following routine

      SUBROUTINE EXAMPLE(A, B, C, N)
      REAL A(*), B(*), C(*)
C$DOACROSS LOCAL(I,X)
      DO I = 1, N
         X = A(I)*B(I)
         C(I) = X + X**2
      END DO
      C(N) = A(1) + B(2)
      RETURN
      END

produces this spooled routine to represent the loop:

      SUBROUTINE EXAMPLE.pregion0
     X ( _LOCAL_START, _LOCAL_NTRIP, _THREADINFO)
      INTEGER*4 _LOCAL_START
      INTEGER*4 _LOCAL_NTRIP
      INTEGER*4 _THREADINFO
      INTEGER*4 I
      REAL X
      INTEGER*4 _DUMMY
      I = _LOCAL_START
      DO _DUMMY = 1, _LOCAL_NTRIP
         X = A(I)*B(I)
         C(I) = X + X**2
         I = I + 1
      END DO
      END

Executing Spooled Routines

The set of processes that cooperate to execute the parallel Fortran job are members of a process share group created by the system call sproc. The process share group is created by special Fortran start-up routines that are used only when the executable is linked with the –mp option, which enables multiprocessing.

The first process is the master process. It executes all the nonparallel portions of the code. The other processes are slave processes; they are controlled by the routine mp_slave_control. When they are inactive, they wait in the special routine __mp_slave_wait_for_work.

The __mp_parallel_do routine divides the work and signals the slaves. The master process then calls the spooled routine to do its share of the work. When a slave is signaled, it wakes up from the wait loop, calculates which iterations of the spooled DO loop it is to execute, and then calls the spooled routine with the appropriate arguments. When a slave completes its execution of the spooled routine, it reports that it has finished and returns to __mp_slave_wait_for_work. When the master completes its portion of the spooled routine, it waits in the special routine mp_wait_for_loop_completion until all the slaves have completed processing. The master then returns to the main routine and continues execution.

PCF Directives

In addition to the simple loop-level parallelism offered by the C$DOACROSS directive (described in "Parallel Loops" on page 104), the compiler supports a more general model of parallelism. This model is based on the work done by the Parallel Computing Forum (PCF), which itself formed the basis for the proposed ANSI-X3H5 standard. The compiler supports this model through compiler directives, rather than extensions to the source language.

The main concept in this model is the parallel region, which can be any arbitrary section of code (not just a DO loop). Within the parallel region, there are special work-sharing constructs that can be used to divide the work among separate processes or threads. The parallel region can also contain a critical section construct, where exactly one process executes at a time.

The master thread executes the user program until it reaches a parallel region. It then spawns one or more slave threads that begin executing code at the beginning of the parallel region. Each thread executes all the code in the region until a work-sharing construct is encountered. Each thread then executes some portion of the work-sharing construct, and then resumes executing the parallel region code. At the end of the parallel region, all the threads synchronize, and the master thread continues execution of the user program.

The PCF directives, summarized in Table 7-1, implement this general model of parallelism. They look like Fortran comments, with a C in column one.
The compiler recognizes these directives when multiprocessing is enabled with the –mp option. (Multiprocessing is also enabled with the –pfa option if you have purchased Power Fortran 77.) If multiprocessing is not enabled, the compiler treats these statements as comments. Therefore, you can compile identical source with a single-processing compiler or with Fortran without the multiprocessing option. The PCF directives start with the characters C$PAR.

Table 7-1  Summary of PCF Directives

   Directive                      Description

   C$PAR BARRIER                  Ensures that each process waits until all
                                  processes reach the barrier before
                                  proceeding.

   C$PAR [END] CRITICAL SECTION   Ensures that the enclosed block of code is
                                  executed by only one process at a time by
                                  using a lock variable.

   C$PAR [END] PARALLEL           Encloses a parallel region, which includes
                                  work-sharing constructs and critical
                                  sections.

   C$PAR PARALLEL DO              Precedes a single DO loop for which separate
                                  iterations are executed by different
                                  processes. This directive is equivalent to
                                  the C$DOACROSS directive.

   C$PAR [END] PDO                Separate iterations of the enclosed loop are
                                  executed by different processes. This
                                  directive must be inside a parallel region.

   C$PAR [END] PSECTION[S]        Parcels out each block of code in turn to a
                                  process.

   C$PAR SECTION                  Signifies a starting line for an individual
                                  section within a parallel section.

   C$PAR [END] SINGLE PROCESS     Ensures that the enclosed block of code is
                                  executed by exactly one process.

   C$PAR &                        Continues a PCF directive onto multiple
                                  lines.

Parallel Region

A parallel region encloses any number of PCF constructs (described in "PCF Constructs" on page 146). It signifies the boundary within which slave threads execute. A user program can contain any number of parallel regions. The syntax of the parallel region is

C$PAR PARALLEL [clause [[,] clause]...]
      code
C$PAR END PARALLEL

where valid clauses are

   [IF ( logical_expression )]
   [{LOCAL | PRIVATE}(item [,item ...])]
   [{SHARED | SHARE}(item [,item ...])]

The IF, LOCAL, and SHARED clauses have the same meaning as in the C$DOACROSS directive (refer to "Writing Parallel Fortran" on page 105). The preferred form of the directive has no commas between the clauses. The SHARED clause is preferred over SHARE, and LOCAL is preferred over PRIVATE.

In the following code, all threads enter the parallel region and call the routine foo:

      subroutine ex1(index)
      integer i
C$PAR PARALLEL LOCAL(i)
      i = mp_my_threadnum()
      call foo(i)
C$PAR END PARALLEL
      end

PCF Constructs

The three types of PCF constructs are work-sharing constructs, critical sections, and barriers. All master and slave threads synchronize at the bottom of a work-sharing construct. None of the threads continue past the end of the construct until they all have completed execution within that construct.

The four work-sharing constructs are

•  parallel DO
•  PDO
•  parallel sections
•  single process

If specified, the PDO, parallel sections, and single process constructs must appear inside of a parallel region; the parallel DO construct cannot. Specifying a parallel DO construct inside of a parallel region produces a syntax error.

The critical section construct protects a block of code with a lock so that it is executed by only one thread at a time. Threads do not synchronize at the bottom of a critical section.

The barrier construct ensures that each process that is executing waits until all others reach the barrier before proceeding.
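As a hypothetical skeleton (not from the original text) showing how these constructs nest, a region might combine a PDO with a single process block:

C$PAR PARALLEL local(i) shared(a,n)
C$PAR PDO
      do i = 1, n
         a(i) = a(i) * 2.0
      enddo
C$PAR SINGLE PROCESS
      write (*,*) 'pass complete'
C$PAR END SINGLE PROCESS
C$PAR END PARALLEL

All threads share the iterations of the PDO, exactly one thread executes the write, and all threads synchronize again at the end of the region. The individual constructs are described in detail in the sections that follow.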
Parallel DO

The parallel DO construct is the same as the C$DOACROSS directive (described in "C$DOACROSS" on page 106) and conceptually the same as a parallel region containing exactly one PDO construct and no other code. Each thread inside the enclosing parallel region executes separate iterations of the loop within the parallel DO construct. The syntax of the parallel DO construct is

C$PAR PARALLEL DO [clause [[,] clause]...]

"C$DOACROSS" on page 106 describes valid values for clause, with the exception of the MP_SCHEDTYPE=mode clause. For the C$PAR PARALLEL DO directive, MP_SCHEDTYPE= is optional; you can simply specify mode.

PDO

Each thread inside the enclosing parallel region executes a separate iteration of the loop within the PDO construct. The syntax of the PDO construct, which can be specified only within a parallel region, is

C$PAR PDO [clause [[,] clause]...]
      code
[C$PAR END PDO [NOWAIT]]

where valid values for clause are

   [{LOCAL | PRIVATE} (item[,item ...])]
   [{LASTLOCAL | LAST LOCAL} (item[,item ...])]
   [(ORDERED)]
   [ sched ]
   [ chunk ]

LOCAL, LASTLOCAL, sched, and chunk have the same meaning as in the C$DOACROSS directive (refer to "Writing Parallel Fortran" on page 105). Note in particular that it is legal to declare a data item as LOCAL in a PDO even if it was declared as SHARED in the enclosing parallel region. The (ORDERED) clause is equivalent to a sched clause of DYNAMIC and a chunk clause of 1. The parentheses are required. LASTLOCAL is preferred over LAST LOCAL, and LOCAL is preferred over PRIVATE.

The END PDO directive is optional. If specified, this directive must appear immediately after the end of the DO loop. The optional NOWAIT clause specifies that each process should proceed directly to the code immediately following the directive. If you do not specify NOWAIT, the processes will wait until all have reached the directive before proceeding.

As an example of the PDO construct, consider the following code:

      subroutine ex2(a,n)
      real a(n)
C$PAR PARALLEL local(i) shared(a)
C$PAR PDO
      do i = 1, n
         a(i) = a(i) + 1.0
      enddo
C$PAR END PARALLEL
      end

This sample code is the same as a C$DOACROSS loop. In fact, the compiler recognizes this as a special case and generates the same (more efficient) code as for a C$DOACROSS directive.

Parallel Sections

The parallel sections construct is a parallel version of the Fortran 90 SELECT statement. Each block of code is parcelled out in turn to a separate thread. The syntax of the parallel sections construct is

C$PAR PSECTION[S] [clause [[,] clause]...]
      code
[C$PAR SECTION
      code] ...
C$PAR END PSECTION[S] [NOWAIT]

where the only valid value for clause is

   [{LOCAL | PRIVATE} (item [,item]) ]

LOCAL is preferred over PRIVATE and has the same meaning as for the C$DOACROSS directive (refer to "C$DOACROSS" on page 106). Note in particular that it is legal to declare a data item as LOCAL in a parallel sections construct even if it was declared as SHARED in the enclosing parallel region.

The optional NOWAIT clause specifies that each process should proceed directly to the code immediately following the directive. If you do not specify NOWAIT, the processes will wait until all have reached the END PSECTION directive before proceeding.

Parallel sections must appear within a parallel region.
They can contain critical section constructs (described in "Critical Section" on page 154) but cannot contain any of the following types of constructs:

•  PDO
•  parallel DO or C$DOACROSS
•  single process

Each code block is executed in parallel (depending on the number of processes available). The code blocks are assigned to threads one at a time, in the order specified. Each code block is executed by only one thread. For example, consider the following code:

      subroutine ex3(a,n1,b,n2,c,n3)
      real a(n1), b(n2), c(n3)
C$PAR PARALLEL local(i) shared(a,b,c)
C$PAR PSECTIONS
C$PAR SECTION
      do i = 1, n1
         a(i) = 0.0
      enddo
C$PAR SECTION
      do i = 1, n2
         b(i) = 0.5
      enddo
C$PAR SECTION
      call normalize(c,n3)
      do i = 1, n3
         c(i) = c(i) + 1.0
      enddo
C$PAR END PSECTION
C$PAR END PARALLEL
      end

The first thread to enter the parallel sections construct executes the first block, the second thread executes the second block, and so on. This example has only three sections, so if more than three threads are in the parallel region, the fourth and higher threads wait at the C$PAR END PSECTION directive until all threads are finished. If the parallel region is being executed by only two threads, whichever thread finishes its block first continues and executes the remaining block.

This example uses DO loops, but a parallel section can be any arbitrary block of code. Be aware of the significant overhead of a parallel construct. Make sure the amount of work performed is enough to outweigh the extra overhead.

The sections within a parallel sections construct are assigned to threads one at a time, from the top down. There is no other implied ordering to the operations within the sections. In particular, a later section cannot depend on the results of an earlier section, unless some form of explicit synchronization is used. If there is such explicit synchronization, you must be sure that the lexical ordering of the blocks is a legal order of execution.

Single Process

The single process construct, which can be specified only within a parallel region, ensures that a block of code is executed by exactly one process. The syntax of the single process construct is

C$PAR SINGLE PROCESS [clause [[,] clause]...]
      code
C$PAR END SINGLE PROCESS [NOWAIT]

where the only valid value for clause is

   [{LOCAL | PRIVATE} (item [,item]) ]

LOCAL is preferred over PRIVATE and has the same meaning as for the C$DOACROSS directive (refer to "C$DOACROSS" on page 106). Note in particular that it is legal to declare a data item as LOCAL in a single process construct even if it was declared as SHARED in the enclosing parallel region.

The optional NOWAIT clause specifies that each process should proceed directly to the code immediately following the directive. If you do not specify NOWAIT, the processes will wait until all have reached the directive before proceeding.

This construct is semantically equivalent to a parallel sections construct with only one section. The single process construct provides a more descriptive syntax. For example, consider the following code:
      real function ex4(a,n, big_max, bmax_x, bmax_y)
      real a(n,n), big_max
      integer bmax_x, bmax_y
C$    volatile big_max, bmax_x, bmax_y
C$    volatile cur_max, index_x, index_y
      index_x = 0
      index_y = 0
      cur_max = 0.0
C$PAR PARALLEL local(i,j)
C$PAR& shared(a,n,index_x,index_y,cur_max,
C$PAR& big_max,bmax_x,bmax_y)
C$PAR PDO
      do j = 1, n
         do i = 1, n
            if (a(i,j) .gt. cur_max) then
C$PAR CRITICAL SECTION
               if (a(i,j) .gt. cur_max) then
                  index_x = i
                  index_y = j
                  cur_max = a(i,j)
               endif
C$PAR END CRITICAL SECTION
            endif
         enddo
      enddo
C$PAR SINGLE PROCESS
      if (cur_max .gt. big_max) then
         big_max = (big_max + cur_max) / 2.0
         bmax_x = index_x
         bmax_y = index_y
      endif
C$PAR END SINGLE PROCESS
C$PAR PDO
      do j = 1, n
         do i = 1, n
            a(i,j) = a(i,j)/big_max
         enddo
      enddo
C$PAR END PARALLEL
      ex4 = cur_max
      end

The first thread to reach the single process section executes the code in that block. All other threads wait at the end of the block until the code has been executed.

This example contains a number of interesting points to be examined. First, note the use of the VOLATILE declaration. Any data item that might be written by one thread and then read by a different thread must be marked as VOLATILE. Making a variable VOLATILE can reduce opportunities for optimization, so the declarations are prefixed by C$ to prevent the single-processor version of the code from being penalized. Refer to the MIPSpro Fortran 77 Language Reference Manual for more information about the VOLATILE statement.

Second, note the use of the odd-looking repetition of the IF test in the first parallel loop:

      if (a(i,j) .gt. cur_max) then
C$PAR CRITICAL SECTION
         if (a(i,j) .gt. cur_max) then

This practice is usually called test&test&set. It is a multiprocessing optimization. Note that the following straightforward code segment is incorrect:

      do i = 1, n
         if (a(i,j) .gt. cur_max) then
C$PAR CRITICAL SECTION
            index_x = i
            index_y = j
            cur_max = a(i,j)
C$PAR END CRITICAL SECTION
         endif
      enddo

Because many threads execute the loop in parallel, there is no guarantee that once inside the critical section, cur_max still has the same value it did in the IF test outside the critical section (some other thread may have updated it). In particular, cur_max may now have a value that is larger than a(i,j). Therefore, the critical section must be locked before testing the value of cur_max. Changing the previous code into the equally straightforward

      do i = 1, n
C$PAR CRITICAL SECTION
         if (a(i,j) .gt. cur_max) then
            index_x = i
            index_y = j
            cur_max = a(i,j)
         endif
C$PAR END CRITICAL SECTION
      enddo

works correctly, but suffers from a serious performance penalty: the critical section lock must be acquired and released (an expensive operation) for each element of the array. Because the values are rarely updated, this process involves a lot of wasted effort. It is almost certainly slower than just executing the loop serially.

Combining the two methods, as in the original example, produces code that is both fast and correct. If the IF test outside of the critical section fails, you can be certain that the values will not be updated, and you can proceed. You can expect that the outside IF test will account for the majority of cases. If the outer IF test passes, then the values might be updated, but you cannot be certain. To ensure correctness, you must perform the test again after acquiring the critical section lock. You can prefix one of the two identical IF tests with C$ to reduce overhead in the non-multiprocessed case.

Lastly, note the difference between the single process and critical section constructs. If several processes arrive at a critical section construct, they execute the code one at a time. However, they will all execute the code. If several processes arrive at a single process construct, only one process executes the code.
The other processes bypass the code and wait at the end of the construct for the chosen process to finish.

Critical Section

The critical section construct restricts execution of a block of code so that only one process can execute it at a time. Another process attempting to gain entry to the critical section must wait until the previous process has exited. The critical section construct can appear anywhere in a program, including inside and outside a parallel region and within a C$DOACROSS loop. The syntax of the critical section construct is

C$PAR CRITICAL SECTION [ ( lock_variable ) ]
      code
C$PAR END CRITICAL SECTION

The lock_variable is an optional integer variable that must be initialized to zero. The parentheses are required. If you do not specify lock_variable, the compiler automatically supplies one. Multiple critical section constructs inside the same parallel region are considered to be independent of each other unless they use the same explicit lock_variable.

Consider the following code:

      integer function num_exceptions(a,n,biggest_allowed)
      double precision a(n,n,n), biggest_allowed
      integer count
      integer lock_var
      volatile count
      count = 0
      lock_var = 0
C$PAR PARALLEL local(i,j,k) shared(count,lock_var)
C$PAR PDO
      do 10 k = 1,n
         do 10 j = 1,n
            do 10 i = 1,n
               if (a(i,j,k) .gt. biggest_allowed) then
C$PAR CRITICAL SECTION (lock_var)
                  count = count + 1
C$PAR END CRITICAL SECTION (lock_var)
               else
                  call transform(a(i,j,k))
                  if (a(i,j,k) .gt. biggest_allowed) then
C$PAR CRITICAL SECTION (lock_var)
                     count = count + 1
C$PAR END CRITICAL SECTION (lock_var)
                  endif
               endif
   10 continue
C$PAR END PARALLEL
      num_exceptions = count
      return
      end

This example demonstrates the use of the lock variable (lock_var). A C$PAR CRITICAL SECTION directive ensures that no more than one process executes the enclosed block of code at a time. However, if there are multiple critical sections, different processes can be in different critical sections at the same time. This example does not allow different processes to be in different critical sections at the same time, because both critical sections control access to the same variable (count). Specifying the same lock variable for both critical sections ensures that no more than one process is executing either of the critical sections that use that lock variable. Note that lock_var must be SHARED (so that all processes use the same lock), and that count must be volatile (because other processes might change its value).

Barrier Constructs

A barrier construct ensures that each process waits until all processes reach the barrier before proceeding. The syntax of the barrier construct is

C$PAR BARRIER
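As an illustrative sketch (assumed code, not from the original text), a barrier can restore synchronization after a PDO that ends with NOWAIT. Here, every element of a must be doubled before any thread starts the reversal pass:

C$PAR PARALLEL local(i) shared(a,b,n)
C$PAR PDO
      do i = 1, n
         a(i) = 2.0 * a(i)
      enddo
C$PAR END PDO NOWAIT
C$PAR BARRIER
C$PAR PDO
      do i = 1, n
         b(i) = a(n+1-i)
      enddo
C$PAR END PARALLEL

Without the barrier, a thread that finished its share of the first loop early could read elements of a that another thread had not yet doubled.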
C$PAR &

Occasionally, the clauses in PCF directives are longer than one line. You can use the C$PAR & directive to continue a directive onto multiple lines. For example,

C$PAR PARALLEL local(i,j)
C$PAR& shared(a,n,index_x,index_y,cur_max,
C$PAR& big_max,bmax_x,bmax_y)

Restrictions

The three work-sharing constructs, PDO, PSECTION, and SINGLE PROCESS, must be executed by all the threads executing in the parallel region (or by none of the threads). The following is illegal:

C$PAR PARALLEL
      if (mp_my_threadnum() .gt. 5) then
C$PAR SINGLE PROCESS
         many_processes = .true.
C$PAR END SINGLE PROCESS
      endif

This code will hang forever when run with enough processes. One or more processes will be stuck at the C$PAR END SINGLE PROCESS directive waiting for all the threads to arrive. Because some of the threads never took the appropriate branch, they will never encounter the construct. However, the following kind of simple looping is supported:

      code
C$PAR PARALLEL local(i,j) shared(a)
      do i = 1, n
C$PAR PDO
         do j = 2, n
            code

The distinction here is that all of the threads encounter the work-sharing construct, they all complete it, and they all loop around and encounter it again. Note that this restriction does not apply to the critical section construct, which operates on one thread at a time without regard to any other threads.

Parallel regions cannot be lexically nested inside of other parallel regions, nor can work-sharing constructs be nested. However, as an aid to writing library code, you can call an external routine that contains a parallel region even from within a parallel region. In this case, only the first region is actually run in parallel. Therefore, you can create a parallelized routine without accounting for whether it will be called from within an already parallelized routine.

A Few Words About Efficiency

The more general PCF constructs are typically slower than the special-case parallelism offered by the C$DOACROSS directive. They are slower because of the extra synchronization required. When a C$DOACROSS loop executes, there is a synchronization point at entry and another at exit. When a parallel region executes, there is a synchronization point at entry to the region, another at each entry to a work-sharing construct, another at each exit from a work-sharing construct, and one at exit from the region. Thus, several separate C$DOACROSS loops typically execute faster than a single parallel region with several PDO constructs. Limit your use of the parallel region construct to those few cases that actually need it.

8. Compiling and Debugging Parallel Fortran

This chapter gives instructions on how to compile and debug a parallel Fortran program. It contains the following sections:

•  "Compiling and Running" explains how to compile and run a parallel Fortran program.
•  "Profiling a Parallel Fortran Program" describes how to use the system profiler, prof, to examine execution profiles.
•  "Debugging Parallel Fortran" presents some standard techniques for debugging a parallel Fortran program.

This chapter assumes you have read Chapter 7, "Fortran Enhancements for Multiprocessors," and have reviewed the techniques and vocabulary for parallel processing in the IRIX environment.

Compiling and Running

After you have written a program for parallel processing, you should debug your program in a single-processor environment by calling the Fortran compiler with the f77 command. You can also debug your program using the WorkShop Pro MPF debugger, which is sold as a separate product. After your program has executed successfully on a single processor, you can compile it for multiprocessing. Check the f77(1) manual page for multiprocessing options.

To turn on multiprocessing, add –mp to the f77 command line. This option causes the Fortran compiler to generate multiprocessing code for the particular files being compiled. When linking, you can specify both object files produced with the –mp option and object files produced without it. If any or all of the files are compiled with –mp, the executable must be linked with –mp so that the correct libraries are used.
Using the –static Option

A few words of caution about the –static compiler option: the multiprocessing implementation demands some use of the stack to allow multiple threads of execution to execute the same code simultaneously. Therefore, the parallel DO loops themselves are compiled with the –automatic option, even if the routine enclosing them is compiled with –static. This means that SHARE variables in a parallel loop behave correctly according to the –static semantics, but LOCAL variables in a parallel loop do not (see "Debugging Parallel Fortran" on page 162 for a description of SHARE and LOCAL variables). Finally, if the parallel loop calls an external routine, that external routine cannot be compiled with –static. You can mix static and multiprocessed object files in the same executable; the restriction is that a static routine cannot be called from within a parallel loop.

Examples of Compiling

This section steps you through a few examples of compiling code using –mp. The following command line

   % f77 –mp foo.f

compiles and links the Fortran program foo.f into a multiprocessor executable. In this example,

   % f77 –c –mp –O2 snark.f

the Fortran routines in the file snark.f are compiled with multiprocess code generation enabled. The optimizer is also used. A standard snark.o binary is produced, which must be linked:

   % f77 –mp –o boojum snark.o bellman.o

Here, the –mp option signals the linker to use the Fortran multiprocessing library. The file bellman.o need not have been compiled with the –mp option (although it could have been). After linking, the resulting executable can be run like any standard executable. Creating multiple execution threads, running and synchronizing them, and terminating tasks are all handled automatically.

When an executable has been linked with –mp, the Fortran initialization routines determine how many parallel threads of execution to create. This determination occurs each time the task starts; the number of threads is not compiled into the code. The default is to use the smaller of two values: 4, or the number of processors on the machine (the value returned by the system call sysmp(MP_NAPROCS); see the sysmp(2) man page). You can override the default by setting the shell environment variable MP_SET_NUMTHREADS. If it is set, Fortran tasks use the specified number of execution threads regardless of the number of processors physically present on the machine. MP_SET_NUMTHREADS can be from 1 to 64.

Profiling a Parallel Fortran Program

After converting a program, you need to examine execution profiles to judge the effectiveness of the transformation. Good execution profiles of the program are crucial to help you focus on the loops consuming the most time. IRIX provides profiling tools that can be used on Fortran parallel programs. Both pixie(1) and pc-sample profiling can be used. On jobs that use multiple threads, both these methods create multiple profile data files (one for each thread). You can use the standard profile analyzer prof(1) to examine this output. (Refer to the MIPS Compiling and Performance Tuning Guide for details about using prof.)

The profile of a Fortran parallel job is different from a standard profile. As mentioned in "Analyzing Data Dependencies for Multiprocessing" on page 114, to produce a parallel program, the compiler pulls the parallel DO loops out into separate subroutines, one routine for each loop.
Each of these loops is shown as a separate procedure in the profile. Comparing the amount of time spent in each loop by the various threads shows how well the workload is balanced.

In addition to the loops, the profile shows the special routines that actually do the multiprocessing. The __mp_parallel_do routine is the synchronizer and controller. Slave threads wait for work in the routine __mp_slave_wait_for_work. The less time they wait, the more time they work. This gives a rough estimate of how parallel the program is.

Debugging Parallel Fortran

This section presents some standard techniques to assist in debugging a parallel program.

General Debugging Hints

•  Debugging a multiprocessed program is much more difficult than debugging a single-processor program. Therefore, do as much debugging as possible on the single-processor version.

•  Try to isolate the problem as much as possible. Ideally, try to reduce the problem to a single C$DOACROSS loop.

•  Before debugging a multiprocessed program, change the order of the iterations on the parallel DO loop in a single-processor version. If the loop can be multiprocessed, then the iterations can execute in any order and produce the same answer. If the loop cannot be multiprocessed, changing the order frequently causes the single-processor version to fail, and standard single-process debugging techniques can be used to find the problem.

Example: Erroneous C$DOACROSS

In this example, the bug is that the two references to a have the indexes in reverse order. If the indexes were in the same order (if both were a(i,j) or both were a(j,i)), the loop could be multiprocessed. As written, there is a data dependence, so the C$DOACROSS is a mistake.

c$doacross local(i,j)
      do i = 1, n
         do j = 1, n
            a(i,j) = a(j,i) + x*b(i)
         end do
      end do

Because a (correct) multiprocessed loop can execute its iterations in any order, you could rewrite this as:

c$doacross local(i,j)
      do i = n, 1, –1
         do j = 1, n
            a(i,j) = a(j,i) + x*b(i)
         end do
      end do

This loop no longer gives the same answer as the original, even when compiled without the –mp option. This reduces the problem to a normal debugging problem.

When a multiprocessed loop is giving the wrong answer, make the following checks:

•  Check the LOCAL variables when the code runs correctly as a single process but fails when multiprocessed. Carefully check any scalar variables that appear on the left-hand side of an assignment statement in the loop to be sure they are all declared LOCAL. Be sure to include the index of any loop nested inside the parallel loop.

   A related problem occurs when you need the final value of a variable but the variable is declared LOCAL rather than LASTLOCAL. If the use of the final value happens several hundred lines farther down, or if the variable is in a COMMON block and the final value is used in a completely separate routine, a variable can look as if it is LOCAL when in fact it should be LASTLOCAL. To combat this problem, simply declare all the LOCAL variables LASTLOCAL when debugging a loop.
9. Fine-Tuning Program Execution

This chapter contains the following sections:

• "Overview" explains the concepts of directives and assertions.
• "Fine-Tuning Scalar Optimizations" describes how you can use directives to fine-tune scalar optimizations.
• "Fine-Tuning Inlining and IPA" explains how you can use directives to fine-tune inlining and IPA.
• "Using Equivalenced Variables" explains how you can inform the compiler that your code uses or does not use equivalenced variables.
• "Using Assertions" explains how you can enable or disable compiler recognition of assertions.
• "Using Aliasing" explains the assertions that enable or disable types of aliasing.
• "Fine-Tuning Global Assumptions" describes how to use assertions to fine-tune global assumptions.
• "Ignoring Data Dependencies" explains how to instruct the compiler to ignore data dependencies.

Overview

After running a Fortran source program through the compiler's scalar optimizations once, you can use directives and assertions to fine-tune program execution by forcing the compiler to execute portions of code in various ways. By default, the compiler recognizes all Silicon Graphics directives and assertions. You can use the –WK,–directives command line option to selectively enable or disable certain directives and assertions. Refer to "Recognizing Directives" in Chapter 5 for information about the –directives option.

Directives

Directives enable, disable, or modify a feature of the compiler, in addition to or instead of the corresponding command line options. Essentially, directives are command line options specified within the input file rather than on the command line. Unlike command line options, directives have no default setting; to invoke a directive, you must either toggle it on or set a desired value for its level.

Directives placed on the first line of the input file are called global directives. The compiler interprets them as if they appeared at the top of each program unit in the file. Use global directives to ensure that the program is compiled with the correct command line options.
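For example, a directive on the first line of a source file applies to every program unit in that file. In the following sketch (SUB1 and SUB2 are hypothetical routine names), both routines are compiled as if C*$*OPTIMIZE(3) appeared at the top of each:

C*$*OPTIMIZE(3)
      SUBROUTINE SUB1
      ...
      END
      SUBROUTINE SUB2
      ...
      END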
Directives appearing anywhere else in the file apply only until the end of the current program unit. The compiler resets the value of the directive to the global value at the start of the next program unit. (Set the global value using a command line option or a global directive.)

Some command line options act like global directives; other command line options override directives. Many directives have corresponding command line options. If you specify conflicting settings on the command line and in a directive, the compiler chooses the more restrictive setting. For Boolean options, if either the directive or the command line has the option turned off, it is considered off. For options that require a numeric value, the compiler uses the minimum of the command line setting and the directive setting. For example, if the command line specifies optimization level 5 and a program unit contains C*$* OPTIMIZE(3), that unit is compiled at level 3.

Table 9-1 lists the directives supported by the compiler. In addition to the standard Silicon Graphics directives, the compiler supports the Cray™ and VAST™ directives listed in the table. The compiler maps these directives to corresponding Silicon Graphics assertions; refer to "Assertions" on page 168 for details.

Table 9-1 Directives Summary

Directive                               Compatibility
C*$* ARCLIMIT(n)                        Silicon Graphics
C*$* [NO]ASSERTIONS                     Silicon Graphics
C*$* EACH_INVARIANT_IF_GROWTH(n)        Silicon Graphics
C*$* [NO]INLINE                         Silicon Graphics
C*$* [NO]IPA                            Silicon Graphics
C*$* MAX_INVARIANT_IF_GROWTH(n)         Silicon Graphics
C*$* OPTIMIZE(n)                        Silicon Graphics
C*$* ROUNDOFF(n)                        Silicon Graphics
C*$* SCALAR OPTIMIZE(n)                 Silicon Graphics
C*$* UNROLL(integer[,weight])           Silicon Graphics
CDIR$ NO RECURRENCE                     Cray
CVD$ [NO]DEPCHK                         VAST
CVD$ [NO]LSTVAL                         VAST

Assertions

Assertions provide the compiler with additional information about the source program. Sometimes assertions can improve optimization results; use them only when speed is essential. Assertions can be unsafe because the compiler cannot verify the accuracy of the information they provide. If you specify an incorrect assertion, the compiler-generated code might produce different results than the original serial program. If you suspect unsafe assertions are causing problems, use the –WK,–nodirectives command line option or the C*$* NO ASSERTIONS directive to tell the compiler to ignore all assertions.

Table 9-2 lists the supported assertions and their durations.

Table 9-2 Assertions and Their Duration

Assertion                                               Duration
C*$* ASSERT [NO] ARGUMENT ALIASING                      Until reset
C*$* ASSERT [NO] BOUNDS VIOLATIONS                      Until reset
C*$* ASSERT [NO] EQUIVALENCE HAZARD                     Until reset
C*$* ASSERT NO RECURRENCE                               Next loop
C*$* ASSERT RELATION (name.xx.name)                     Next loop
C*$* ASSERT [NO] TEMPORARIES FOR CONSTANT ARGUMENTS     Next loop

As with a directive, the compiler treats an assertion as a global assertion if it comes before all comments and statements in the file; that is, the compiler treats the assertion as if it were repeated at the top of each program unit in the file. Some assertions (such as C*$* ASSERT RELATION) include variable names.
If you specify such assertions globally, a program uses them only when those variable names appear in COMMON blocks or are dummy argument names to the subprogram. You cannot use global assertions to make relational assertions about variables that are local to a subprogram.

Many assertions, like directives, are active until the end of the program unit (or file) or until you reset them. Other assertions are active throughout a program unit, regardless of where they appear in that program unit.

Certain Cray and VAST directives function like Silicon Graphics assertions. The compiler maps these directives to the corresponding Silicon Graphics assertions; they are described along with the related assertions later in this chapter.

There is no guarantee that a specified assertion will have an effect. The compiler notes the information provided by the assertion and uses it if it helps.

To understand the process the compiler uses in interpreting assertions, you must understand the concept of assumed dependences. The following loop contains two types of dependences:

      DO 10 i = 1, n
10       X(i) = X(i-1) + X(m)

X is an array, n and m are scalars, and nothing is known about the relationship between n and m. Between X(i) and X(i-1) there is a forward dependence with a distance of one. Between X(i) and X(m), the compiler tries to find a relation but cannot, because it does not know the value of m in relation to n. This second dependence is called an assumed dependence, because it is assumed but cannot be proven to exist.

Fine-Tuning Scalar Optimizations

The compiler supports several directives that allow you to fine-tune the scalar optimizations described in "Controlling Scalar Optimizations" in Chapter 5.

Controlling Internal Table Size

The C*$* ARCLIMIT(integer) directive sets the minimum size of the internal table that the compiler uses for data dependence analysis. The greater the value of integer, the more information the compiler can keep on complex loop nests. The maximum value, and the default, for integer is 5000. When you specify this directive globally, it has the same effect as the –arclimit command line option (refer to "Controlling Internal Table Size" in Chapter 5 for details).

Setting Invariant IF Floating Limits

The C*$* EACH_INVARIANT_IF_GROWTH and C*$* MAX_INVARIANT_IF_GROWTH directives control the limits on invariant IF floating. This process generally involves duplicating the body of the loop, which can increase the amount of code considerably. Refer to "Setting Invariant IF Floating Limits" in Chapter 5 for details about invariant IF floating.

The C*$* EACH_INVARIANT_IF_GROWTH(integer) directive limits the total number of additional lines of code generated through invariant IF floating in a loop. You can control this limit globally with the –each_invariant_if_growth command line option (see "Setting Invariant IF Floating Limits" in Chapter 5).

You can limit the maximum amount of additional code generated in a program unit through invariant IF floating with the C*$* MAX_INVARIANT_IF_GROWTH(integer) directive. Use the –max_invariant_if_growth command line option to control this limit globally (see "Setting Invariant IF Floating Limits" in Chapter 5).

These directives are in effect until the end of the routine or until reset by a succeeding directive of the same type.
Example

Consider the following code:

C*$*EACH_INVARIANT_IF_GROWTH(integer)
C*$*MAX_INVARIANT_IF_GROWTH(integer)
      DO I = ...
C*$*EACH_INVARIANT_IF_GROWTH(integer)
C*$*MAX_INVARIANT_IF_GROWTH(integer)
         DO J = ...
C*$*EACH_INVARIANT_IF_GROWTH(integer)
C*$*MAX_INVARIANT_IF_GROWTH(integer)
            DO K = ...
               section-1
               IF ( ) THEN
                  section-2
               ELSE
                  section-3
               ENDIF
               section-4
            ENDDO
         ENDDO
      ENDDO

In floating the invariant IF out of the loop nest, the compiler honors the constraints set by the innermost directives first. If those constraints are satisfied, the invariant IF is floated from the inner loop. The middle pair of directives is then tested, and the invariant IF is floated from the middle loop as long as the restrictions established by those directives are not violated. The process of floating continues outward as long as the directive constraints are satisfied.

Optimization Level

The C*$* OPTIMIZE(integer) directive sets the optimization level in the same way as the –optimize command line option. As you increase integer, the compiler performs more optimizations and therefore takes longer to compile. Valid values for integer are:

0  Disables optimization.
1  Performs only simple optimizations. Enables induction variable recognition.
2  Performs lifetime analysis to determine when last-value assignment of scalars is necessary.
3  Recognizes triangular loops and attempts loop interchanging to improve memory referencing. Uses special-case data dependence tests. Also recognizes special index sets called wrap-around variables.
4  Generates two versions of a loop, if necessary, to break a data dependence arc.
5  Enables array expansion and loop fusion.

Refer to "Controlling Scalar Optimizations" in Chapter 5 for examples.

Variations in Round Off

The C*$* ROUNDOFF(integer) directive controls the amount of variation in round-off error produced by optimization, in the same way as the –roundoff command line option. Valid values for integer are:

0  Suppresses any transformations that change round-off error.
1  Performs expression simplification (which might generate various overflow or underflow errors) for expressions with operands between binary and unary operators, for expressions inside trigonometric intrinsic functions returning integer values, and after forward substitution. Enables strength reduction. Performs intrinsic function simplification for max and min. Enables code floating if –scalaropt is at least 1. Allows loop interchanging around serial arithmetic reductions if –optimize is at least 4. Allows loop rerolling if –scalaropt is at least 2.
2  Allows loop interchanging around arithmetic reductions if –optimize is at least 4. For example, the floating point expression A/B/C may be computed as A/(B*C).
3  Recognizes REAL (float) induction variables if –scalaropt is greater than 2 or –optimize is at least 1. Enables sum reductions. Enables memory management optimizations if –scalaropt=3 (see "Performing Memory Management Transformations" in Chapter 5 for details about memory management transformations).
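For example, you might compile a file at a high round-off level but protect one numerically sensitive routine with a local directive. Because a directive inside a program unit is reset to the global value at the next program unit, and the compiler takes the more restrictive of the directive and command line settings, only this routine is compiled without round-off-changing transformations (a minimal sketch; RESID is a hypothetical routine name):

      SUBROUTINE RESID(A, B, N)
C*$*ROUNDOFF(0)
      REAL A(N), B(N)
      ...
      END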
Controlling Scalar Optimizations

The C*$* SCALAR OPTIMIZE(integer) directive controls the amount of standard scalar optimization that the compiler performs. Unlike the –WK,–scalaropt command line option, the C*$* SCALAR OPTIMIZE directive sets the level of loop-based optimizations (such as loop fusion) only, not straight-code optimizations (such as dead-code elimination). Valid values for integer are:

0  Disables all scalar optimizations.
1  Enables simple, loop-based scalar optimizations: changing IF loops to DO loops, simple code floating out of loops, and forward substitution of variables.
2  Enables the full range of loop-based scalar optimizations: induction variable recognition, loop rerolling, loop unrolling, loop fusion, and array expansion.
3  Enables memory management transformations if –roundoff=3. Refer to "Performing Memory Management Transformations" in Chapter 5 for details.

Enabling Loop Unrolling

The C*$* UNROLL(integer[,weight]) directive controls how the compiler unrolls scalar loops. Loops that cannot be optimized for concurrent execution usually execute more efficiently when they are unrolled. This directive is recognized only when you specify –WK,–scalaropt=2. The compiler unrolls the loop following the C*$* UNROLL directive until either the number of operations in the loop equals the weight parameter or the number of iterations reaches the integer parameter, whichever occurs first. The –unroll and –unroll2 command line options act like a global C*$* UNROLL directive. See "Enabling Loop Unrolling" in Chapter 5 for detailed examples. Unlike other directives, the C*$* UNROLL directive is in effect only for the loop immediately following it.

Fine-Tuning Inlining and IPA

Chapter 6, "Inlining and Interprocedural Analysis," explains how to use inlining and IPA on an entire program. You can fine-tune inlining and IPA using the C*$* [NO]INLINE and C*$* [NO]IPA directives. The compiler ignores these directives by default; they are enabled when you specify any inlining or IPA command line option, respectively. The –inline_manual and –ipa_manual command line options enable these directives without activating the automatic inlining or IPA algorithms.

The C*$* [NO]INLINE directive behaves like the –inline command line option, but allows you to specify which occurrences of a routine are actually inlined. The format for this directive is

C*$*[NO]INLINE [(name[,name ... ])] [HERE|ROUTINE|GLOBAL]

where:

name      Specifies the routines to be inlined. If you do not specify a name, the directive affects all routines in the program.
HERE      Applies the INLINE directive only to the next line; occurrences of the named routines on that next line are inlined.
ROUTINE   Inlines the named routines everywhere they appear in the current routine.
GLOBAL    Inlines the named routines throughout the source file.

If you do not specify HERE, ROUTINE, or GLOBAL, the directive applies only to the next statement. The C*$*NOINLINE form overrides the –inline command line option, which allows you to selectively disable inlining of the named routines at specific points.

Example

In the following code fragment, the C*$*INLINE directive inlines the first call to beta but not the second:

      do i = 1, n
C*$*INLINE (beta) HERE
         call beta (i,1)
      enddo
      call beta (n, 2)

Using the specifier ROUTINE rather than HERE inlines both calls. This routine must be compiled with the –inline_manual command line option for the compiler to recognize the C*$* INLINE directive.
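Conversely, when automatic inlining is enabled for the whole file, you can exempt a single call site with the NOINLINE form (a hypothetical fragment reusing the routine beta; HERE again applies only to the next line):

C*$*NOINLINE (beta) HERE
      call beta (i,1)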
The C*$* [NO]IPA directive is the analogous directive for interprocedural analysis. The format for this directive is

C*$*[NO]IPA [(name [,name...])] [HERE|ROUTINE|GLOBAL]

Using Equivalenced Variables

The C*$* ASSERT [NO] EQUIVALENCE HAZARD assertion tells the compiler whether your code uses equivalenced variables to refer to the same memory location inside one loop nest; the NO form asserts that it does not. Normally, EQUIVALENCE statements allow your code to use different variable names to refer to the same storage location. The –WK,–assume=e command line option acts like the global C*$* ASSERT EQUIVALENCE HAZARD assertion (see "Controlling Global Assumptions" on page 71 in Chapter 5). The C*$* ASSERT EQUIVALENCE HAZARD assertion is active until you reset it or until the end of the program.

Using Assertions

The C*$*[NO]ASSERTIONS directive instructs the compiler to accept or ignore assertions. The C*$* NO ASSERTIONS version is in effect until the next C*$* ASSERTIONS directive or until the end of the program unit. If you specify the –directives command line option without the assertions parameter (that is, without a), the compiler ignores assertions regardless of whether the file contains the C*$* ASSERTIONS directive. Refer to "Recognizing Directives" in Chapter 5 for details on the –directives command line option.

Using Aliasing

The compiler recognizes two assertions for use with aliasing.

C*$* ASSERT [NO] ARGUMENT ALIASING

The C*$* ASSERT [NO] ARGUMENT ALIASING assertion controls the assumptions the compiler may make about subprogram arguments. According to the Fortran 77 standard, you can alias a variable only if you do not modify (that is, write to) the aliased variable. The following subroutine violates the standard, because variable A is aliased and modified in the subroutine (through C and D), and variable X is aliased and modified as well (through X and E):

      COMMON X,Y
      REAL A,B
      CALL SUB (A, A, X)
      ...
      SUBROUTINE SUB(C,D,E)
      COMMON X,Y
      X = ...
      C = ...
      ...

The command line option –assume=a acts like a global C*$* ASSERT ARGUMENT ALIASING assertion (see "Controlling Global Assumptions" in Chapter 5). A C*$* ASSERT ARGUMENT ALIASING assertion is active until it is reset or until the next routine begins.
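Conversely, if you know your program never aliases modified arguments, the NO form should let the compiler optimize subprogram bodies without allowing for such overlaps. A minimal sketch of my reading of the assertion (placed before all comments and statements, it acts globally):

C*$* ASSERT NO ARGUMENT ALIASING
      SUBROUTINE SUB(C,D,E)
      ...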
C*$* ASSERT RELATION

The C*$* ASSERT RELATION(name.xx.name) assertion indicates the relationship between two variables or between a variable and a constant. name is a variable or constant, and xx is one of GT, GE, EQ, NE, LT, or LE. This assertion applies only to the next DO statement.

Because the C*$* ASSERT RELATION assertion includes variable names, when specified globally it is used only when those names appear in COMMON blocks or are dummy arguments to a subprogram. You cannot use global assertions to make relational assertions about variables that are local to a subprogram.

As an example of the use of the C*$* ASSERT RELATION assertion, consider the following code:

      DO 100 I = 1, N
         A(I) = A(I+M) + B(I)
100   CONTINUE

If you know that M is greater than N, use the following assertion to give this information to the compiler:

C*$* ASSERT RELATION (M .GT. N)
      DO 100 I = 1, N
         A(I) = A(I+M) + B(I)
100   CONTINUE

Knowing that M is greater than N, the compiler can generate parallel code for this loop. If M is less than N at run time, the answers produced by the code run in parallel could differ from the answers produced by the original code run serially.

Note: Many relationships of this type can be tested cheaply at run time. The compiler attempts to answer such questions by generating an IF statement that explicitly tests the relationship at run time. Occasionally the compiler needs assistance, or you might want to squeeze the last bit of performance out of some critical loop by asserting a relationship rather than repeatedly checking it at run time.

Fine-Tuning Global Assumptions

You can use the assertions described in this section to fine-tune the global assumptions discussed in "Controlling Global Assumptions" in Chapter 5.

C*$* ASSERT [NO] BOUNDS VIOLATIONS

The C*$* ASSERT BOUNDS VIOLATIONS assertion indicates that array subscript bounds may be violated during execution. If your program does not violate array subscript bounds, do not specify this assertion. When specified, this assertion is active until reset or until the end of the program. For formal parameters, the compiler treats a declared last dimension of (1) the same as (*). The –WK,–assert=b command line option acts like a global C*$* ASSERT BOUNDS VIOLATIONS assertion.

In the following example, the compiler assumes the first loop nest is standard-conforming and can therefore optimize both loops; the loops can be interchanged to improve memory referencing because no A(I,J) will overwrite an A(I',J+1). In the second nest, the assertion warns the compiler that the loop limit of the first array index (I) might violate the declared array bounds. The compiler plays it safe and optimizes only the right array index.

Note: The compiler always assumes that array references remain within the array itself, so the rightmost index can still be concurrentized.

      DO 100 I = 1,M
         DO 100 J = 1,N
            A(I,J) = A(I,J) + B(I,J)
100   CONTINUE
C
C*$*ASSERT BOUNDS VIOLATIONS
      DO 200 I = 1,M
         DO 200 J = 1,N
            A(I,J) = A(I,J) + B(I,J)
200   CONTINUE

becomes

C$DOACROSS SHARE(N,M,A,B),LOCAL(J,I)
      DO 2 J=1,N
         DO 2 I=1,M
            A(I,J) = A(I,J) + B(I,J)
2     CONTINUE
C
C*$*ASSERT BOUNDS VIOLATIONS
      DO 4 I=1,M
C$DOACROSS SHARE(N,I,A,B),LOCAL(J)
         DO 3 J=1,N
            A(I,J) = A(I,J) + B(I,J)
3        CONTINUE
4     CONTINUE

C*$* ASSERT NO EQUIVALENCE HAZARD

The C*$* ASSERT NO EQUIVALENCE HAZARD assertion tells the compiler that equivalenced variables are not used to refer to the same memory location inside any one DO loop nest. Normally, EQUIVALENCE statements allow different variable names to refer to the same storage location. The –WK,–assume=e command line option acts like a global C*$* ASSERT NO EQUIVALENCE HAZARD assertion. This assertion is active until reset or until the end of the program.

In the following example, arrays E and F are equivalenced. If you know that the overlapping sections will not be referenced in this loop, then C*$* ASSERT NO EQUIVALENCE HAZARD allows the compiler to concurrentize the loop:

      EQUIVALENCE ( E(1), F(101) )
C*$* ASSERT NO EQUIVALENCE HAZARD
      DO 10 I = 1,N
         E(I+1) = B(I)
         C(I) = F(I)
10    CONTINUE

becomes

      EQUIVALENCE (E(1), F(101))
C*$* ASSERT NO EQUIVALENCE HAZARD
C$DOACROSS SHARE(N,E,B,C,F),LOCAL(I)
      DO 10 I=1,N
         E(I+1) = B(I)
         C(I) = F(I)
10    CONTINUE
C*$* ASSERT [NO] TEMPORARIES FOR CONSTANT ARGUMENTS

Sometimes the compiler does not perform certain transformations when their effects on the rest of the program are unclear. For example, the IF-to-intrinsic transformation usually changes the following code:

      SUBROUTINE X(I,N)
      IF (I .LT. N) I = N
      END

into

      SUBROUTINE X(I,N)
      I = MAX(I,N)
      END

But if the actual argument for I were a constant, as in

      CALL X(1,N)

it would appear that the value of the constant 1 was being reassigned. Without additional information, the compiler does not perform the transformation within the subroutine. Most compilers automatically put constant actual arguments into temporary variables to protect against this case. The C*$* ASSERT TEMPORARIES FOR CONSTANT ARGUMENTS assertion, or the –WK,–assume=c command line option (the default), informs the compiler that constant arguments are protected in this way. The NO version directs the compiler to avoid transformations that might change the values of constant arguments.

Ignoring Data Dependencies

The C*$* ASSERT NO RECURRENCE(variable) assertion tells the compiler to ignore all data dependence conflicts caused by variable in the DO loop that follows it. For example, the following code tells the compiler to ignore all dependence arcs caused by the variable X in the loop:

C*$* ASSERT NO RECURRENCE (X)
      DO 10 i = 1, m, 5
10       X(k) = X(k) + X(i)

Not only does the compiler ignore the assumed dependence, it also ignores the real dependence caused by X(k) appearing on both sides of the assignment.

The C*$* ASSERT NO RECURRENCE assertion applies only to the next DO loop; it cannot be specified as a global assertion. In addition to the C*$* ASSERT NO RECURRENCE assertion, the compiler supports the Cray CDIR$ NORECURRENCE assertion and the VAST CVD$ NODEPCHK directive, which perform the same function.

A. Run-Time Error Messages

Table A-1 lists possible Fortran run-time I/O errors. Other errors given by the operating system may also occur (refer to the intro(2) and perror(3F) reference pages for details).

Each error is listed on the screen alone or with one of the following phrases appended to it:

apparent state: unit num named user filename
last format: string
lately (reading, writing) (sequential, direct, indexed) formatted, unformatted (external, internal) IO

When the Fortran run-time system detects an error, the following actions take place:

• A message describing the error is written to the standard error unit (unit 0).
• A core file, which can be used with dbx (the debugger) to inspect the state of the program at termination, is produced if the f77_dump_flag environment variable is defined and set to y.

When a run-time error occurs, the program terminates with one of the error messages shown in Table A-1. All of the errors in the table are output in the format user filename : message.
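For example, to capture a core file for post-mortem inspection with dbx, you can set the flag before running the program (csh syntax; a minimal illustration, with a.out standing in for your program's name):

% setenv f77_dump_flag y
% a.out
...
% dbx a.out core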
Table A-1 Run-Time Error Messages

Number  Message/Cause

100  error in format
     Illegal characters are encountered in a FORMAT statement.
101  out of space for I/O unit table
     Out of virtual space that can be allocated for the I/O unit table.
102  formatted io not allowed
     Cannot do formatted I/O on logical units opened for unformatted I/O.
103  unformatted io not allowed
     Cannot do unformatted I/O on logical units opened for formatted I/O.
104  direct io not allowed
     Cannot do direct I/O on a sequential file.
106  can't backspace file
     Cannot perform BACKSPACE/REWIND on the file.
107  null file name
     The filename specification in an OPEN statement is null.
108  can't stat file
     The directory information for the file is not accessible.
109  file already connected
     The specified filename has already been opened as a different logical unit.
110  off end of record
     Attempt to do I/O beyond the end of the record.
112  incomprehensible list input
     Input data for a list-directed read contains an invalid character for its data type.
113  out of free space
     Cannot allocate virtual memory space on the system.
114  unit not connected
     Attempt to do I/O on a unit that has not been opened or cannot be opened.
115  read unexpected character
     Unexpected character encountered in a formatted or directed read.
116  blank logical input field
     Invalid character encountered for a logical value.
117  bad variable type
     The specified type for the namelist element is invalid. This error is most likely caused by incompatible versions of the front end and the run-time I/O library.
118  bad namelist name
     The specified namelist name cannot be found in the input data file.
119  variable not in namelist
     The namelist variable name in the input data file does not belong to the specified namelist.
120  no end record
     $END is not found at the end of the namelist input data file.
121  namelist subscript out of range
     The array subscript or character substring value in the input data file exceeds the range for that array or character string.
122  negative repeat count
     The repeat count in the input data file is less than or equal to zero.
123  illegal operation for unit
     You cannot set your own buffer on direct unformatted files.
124  off beginning of record
     A format edit descriptor caused positioning to go off the beginning of the record.
125  no * after repeat count
     An asterisk (*) is expected after an integer repeat count.
126  'new' file exists
     The file is opened as new but already exists.
127  can't find 'old' file
     The file is opened as old but does not exist.
128  unknown system error
     An unexpected error was returned by IRIX.
129  requires seek ability
     The file is on a device that cannot do direct access.
130  illegal argument
     Invalid value in the I/O control list.
131  duplicate key value on write
     Cannot write a key that already exists.
132  indexed file not open
     Cannot perform indexed I/O on an unopened file.
133  bad isam argument
     The indexed I/O library function received a bad argument because of a corrupted index file or bad run-time I/O libraries.
134  bad key description
     The key description is invalid.
135  too many open indexed files
     Cannot have more than 32 open indexed files.
136  corrupted isam file
     The indexed file format is not recognizable. This error is usually caused by a corrupted file.
137  isam file not opened for exclusive access
     Cannot obtain a lock on the indexed file.
138  record locked
     The record has already been locked by another process.
139  key already exists
     The key specification in the OPEN statement has already been specified.
140  cannot delete primary key
     DELETE cannot be executed on a primary key.
141  beginning or end of file reached
     The index for the specified key points beyond the length of the indexed data file. This error is probably caused by corrupted ISAM files or a bad indexed I/O run-time library.
142  cannot find request record
     The requested key for indexed READ does not exist.
143  current record not defined
     Cannot execute REWRITE, UNLOCK, or DELETE before doing a READ to define the current record.
144  isam file is exclusively locked
     The indexed file has been exclusively locked by another process.
145  filename too long
     The indexed filename exceeds 128 characters.
148  key structure does not match file structure
     Mismatch between the key specifications in the OPEN statement and the indexed file.
149  direct access on an indexed file not allowed
     Cannot do direct-access I/O on an indexed file.
150  keyed access on a sequential file not allowed
     Cannot specify keyed access together with sequential organization.
151  keyed access on a relative file not allowed
     Cannot specify keyed access together with relative organization.
152  append access on an indexed file not allowed
     Cannot specify append access together with indexed organization.
153  must specify record length
     A record length specification is required when opening a direct or keyed access file.
154  key field value type does not match key type
     The type of the given key value does not match the type specified in the OPEN statement for that key.
155  character key field value length too long
     The length of the character key value exceeds the length specification for that key.
156  fixed record on sequential file not allowed
     RECORDTYPE='fixed' cannot be used with a sequential file.
157  variable records allowed only on unformatted sequential file
     RECORDTYPE='variable' can be used only with an unformatted sequential file.
158  stream records allowed only on formatted sequential file
     RECORDTYPE='stream_lf' can be used only with a formatted sequential file.
159  maximum number of records in direct access file exceeded
     The specified record is bigger than the MAXREC= value used in the OPEN statement.
160  attempt to create or write to a read-only file
     User does not have write permission on the file.
161  must specify key descriptions
     Must specify all the keys when opening an indexed file.
162  carriage control not allowed for unformatted units
     The CARRIAGECONTROL specifier can be used only on a formatted file.
163  indexed files only
     Indexed I/O can be done only on logical units that have been opened for indexed (keyed) access.
164  cannot use on indexed file
     Illegal I/O operation on an indexed (keyed) file.
165  cannot use on indexed or append file
     Illegal I/O operation on an indexed (keyed) or append file.
167  invalid code in format specification
     An unknown code is encountered in the format specification.
168  invalid record number in direct access file
     The specified record number is less than 1.
169  cannot have endfile record on non-sequential file
     Cannot have an endfile on a direct- or keyed-access file.
170  cannot position within current file
     Cannot perform fseek() on a file opened for sequential unformatted I/O.
171  cannot have sequential records on direct access file
     Cannot do sequential formatted I/O on a file opened for direct access.
173  cannot read from stdout
     Attempt to read from stdout.
174  cannot write to stdin
     Attempt to write to stdin.
175  stat call failed in f77inode
     The directory information for the file is unreadable.
176  illegal specifier
     The I/O control list contains an invalid value for one of the I/O specifiers. For example, ACCESS='INDEXED'.
180  attempt to read from a writeonly file
     User does not have read permission on the file.
181  direct unformatted io not allowed
     A direct unformatted file cannot be used with this I/O operation.
182  cannot open a directory
     The name specified in FILE= must be the name of a file, not a directory.
183  subscript out of bounds
     The exit status returned when a program compiled with the –C option has an array subscript that is out of range.
184  function not declared as varargs
     A variable argument routine was called in a subroutine that has not been declared in a $VARARGS directive.
185  internal error
     Internal run-time library error.

Index

A
–aggressive option, 82 –align16 compiler option, 26 –align8 compiler option, 26 alignment, 24, 25 of COMMON blocks, 82 ANSI Fortran data alignment, 25 ANSI-X3H5 standard, 105, 143 archiver, ar, 15 –arclimit option, 83 argument aliasing, 71 arrays declaring, 24 assembly language routines, 19 assertions C*$* ASSERT ARGUMENT ALIASING, 177 C*$* ASSERT NO ARGUMENT ALIASING, 177 C*$* ASSERT NO RECURRENCE, 182 C*$* ASSERT RELATION, 178 C*$* ASSERT TEMPORARIES FOR CONSTANT ARGUMENTS, 181 enabling recognition of, 88 overview, 168 –assume option, 71, 176 assumed dependences, 169 assumptions controlling globally, 71 –automatic compiler option, 160

B
barrier construct, 146, 156 barrier function, 138 –bestG compiler option, 13 blocking slave threads, 133

C
C$, 112 –C compiler option, 164 –c compiler option, 4 C macro preprocessor, 3 C$&, 112 C*$* ARCLIMIT, 170 C*$* ASSERT ARGUMENT ALIASING, 177 C*$* ASSERT NO ARGUMENT ALIASING, 177 C*$* ASSERT NO RECURRENCE, 182 C*$* ASSERT RELATION, 178 C*$* ASSERT TEMPORARIES FOR CONSTANT ARGUMENTS, 181 C*$* EACH_INVARIANT_IF_GROWTH, 170 C*$* INLINE, 175 C*$* MAX_INVARIANT_IF_GROWTH, 170 C*$* NOINLINE, 175 C*$* NOIPA, 176 C*$* OPTIMIZE, 172 C*$* ROUNDOFF, 173 C*$* SCALAROPTIMIZE, 174 C-style comments accepting in Hollerith strings, 3 cache, 128 setting up page mapping, 85 specifying size, 85 specifying width of memory channel, 85 –cacheline option, 85 –cachesize option, 85 C$CHUNK, 113 C$COPYIN, 139 CDIR$ NORECURRENCE, 182 C$DOACROSS, 106 and REDUCTION, 107 continuing with C$&, 112 IF clause, 106 LASTLOCAL clause, 107 loop naming convention, 140 nesting, 114 CHUNK, 109, 132, 138 –chunk compiler option, 113 C$MP_SCHEDTYPE, 113 comments, 3 COMMON blocks, 107, 164 aligning, 82 making local to a process, 138 shared, 24 compilation, 2 compiler options, 7 –align16, 24, 26 –align8, 24, 26 –automatic, 160 –bestG, 13 –C, 164 –c, 4 –chunk, 113 –G, 13 –jmopt, 13 –l, 6 –mp, 142, 143, 159, 164 –mp_schedtype, 113 –nocpp, 3 –pfa, 143 –static, 117, 160, 164 –WK, 69 COMPLEX, 24 COMPLEX*16, 24 COMPLEX*32, 24 constructs work-sharing, 146 core files, 19 producing, 183 C$PAR & directive, 156 C$PAR BARRIER, 156 C$PAR CRITICAL SECTION, 154 C$PAR PARALLEL, 145 C$PAR PARALLEL DO, 146 C$PAR PDO, 147 C$PAR PSECTIONS, 148 C$PAR SINGLE PROCESS, 150 cpp, 3 Cray assertions CDIR$ NORECURRENCE, 182 critical section, 146 and SHARED, 156 PCF construct, 154 critical section construct, 143 differences between single process, 154 CVD$ NODEPCHK, 182

D
data dependencies, 116 analyzing for multiprocessing, 114 breaking, 120 complicated, 118 inconsequential, 119 rewritable, 118 data independence, 114 data types alignment, 24, 25 DATE, 64 dbx, 183 debugging parallel Fortran programs, 162 dependences assumed, 169 direct files, 17 directives C$, 112 C$&, 112 C*$* ARCLIMIT, 170 C*$* EACH_INVARIANT_IF_GROWTH, 170 C*$* INLINE, 175 C*$* MAX_INVARIANT_IF_GROWTH, 170 C*$* NOINLINE, 175 C*$* NOIPA, 176 C*$* OPTIMIZE, 172 C*$* ROUNDOFF, 173 C*$* SCALAROPTIMIZE, 174 C$CHUNK, 113 C$DOACROSS, 106 C$MP_SCHEDTYPE, 113 enabling recognition of, 88 list of, 105 overview, 166 see also PCF directives –directives option, 88 dis object file tool, 14 DO loops, 104, 115, 126, 164
DOACROSS, 113 and multiprocessing, 140 double precision registers, 86 –dpregisters option, 86 driver options, 7 drivers, 2 dump object file tool, 14 dynamic scheduling, 109 E –each_invariant_if_growth option, 72 environment variables, 161 CHUNK, 138 f77_dump_flag, 19, 183 MP_BLOCKTIME, 136 MP_SCHEDTYPE, 138 MP_SET_NUMTHREADS, 136 MP_SETUP, 136 equivalence statements, 164 error handling, 19 error messages run-time, 183 ERRSNS, 64 executable object, 4 EXIT, 65 external files, 17 F f77 as driver, 2 supported file formats, 17 syntax, 2 f77_dump_flag, 19, 183 file, object file tool, 14 files direct, 17 external, 17 position when opened, 18 preconnected, 18 193 Index sequential unformatted, 17 supported formats, 17 UNKNOWN status, 19 fine-tuning inlining and IPA, 175 floating point registers, 86 formats files, 17 Fortran ANSI, 25 –fpregisters option, 86 functions in parallel loops, 117 intrinsic, 67, 117 SECNDS, 67 library, 55, 117 RAN, 67 side effects, 117 –fuse option, 71 G –G compiler option, 13 global assumptions controlling, 71 global data area reducing, 13 guided self-scheduling, 109 H handle_sigfpes, 20 Hollerith strings and C-style comments, 3 194 I IDATE, 64 IF clause and C$DOACROSS, 106 IGCLD signal intercepting, 140 –inline_and_copy option, 93 –inline_create option, 98 –inline_from_files option, 97 –inline_from_libraries option, 97 inlining, 91 enabling with options, 92 fine-tuning, 175 specifying routines, 93 interleave scheduling, 109 interleaving, 132 internal table size controlling, 83 interprocedural analysis performing with options, 92 interprocedural analysis (IPA), 91 fine-tuning, 175 specifying routines, 93 intrinsic subroutines DATE, 64 ERRSNS, 64 EXIT, 65 IDATE, 64 MVBITS, 66 TIME, 65 invariant IF floating, 72, 170 –ipa_create option, 99 –ipa_from_files option, 97 –ipa_from_libraries option, 97 J –jmpopt compiler option, 13 L –l compiler option, 6 LASTLOCAL, 106, 115 LASTLOCAL clause, 107 libfpe.a, 20 libraries link, 6 specifying, 7 library functions, 55 link libraries, 6 linking, 5 load balancing, 131 LOCAL, 107, 115 LOGICAL, 24 loop blocking, 84 loop fusion, 71 loop interchange, 127 loop unrolling, 84 enabling, 86 loops, 104 data dependencies, 115 tranformation, 140 M makefiles, 53 master processes, 105, 142 –max_invariant_if_growth option, 72 memory channel specifying width, 85 memory management transformations, 84 options, 84 techniques, 84 m_fork and multiprocessing, 140 misaligned data, 25 –mp compiler option, 142, 143, 159, 164 mp_barrier, 138 mp_block, 133 mp_blocktime, 135 MP_BLOCKTIME environment variable, 136 mp_create, 134 mp_destroy, 134 mp_my_threadnum, 135 mp_numthreads, 135 __mp_parallel_do, 162 MP_SCHEDTYPE, 108, 113, 138 –mp_schedtype compiler option, 113 mp_setlock, 138 MP_SET_NUMTHREADS, 136 mp_set_numthreads, 135 and MP_SET_NUMTHREADS, 136 MP_SETUP, 136 mp_setup, 134 mp_simple_sched and loop transformations, 140 tasks executed, 142 mp_slave_control, 142 __mp_slave_wait_for_work, 162 mp_unblock, 133 mp_unsetlock, 138 multi-language programs, 4 multiprocessing and DOACROSS, 140 and load balancing, 131 associated overhead, 126 enabling, 159 enabling directives, 142 195 Index MVBITS, 66 N nm, object file tool, 14 –noassume option, 72 –nocpp compiler option, 3 NOWAIT clause, 147, 148, 150 NUM_THREADS, 136 O object files, 4 tools for interpreting, 14 object module, 4 objects linking, 5 optimizations aggressive, 82 changing levels, 172 controlling internal table size, 83 controlling levels, 74 invariant IF floating, 72 loop blocking, 84 loop fusion, 71 loop 
unrolling, 84, 86 memory management transformations, 84 recursion, 89 scalar, 174 –optimize option, 74 and –O compiler option, 75 optimizing inlining and IPA, 91 196 P parallel DO construct, 146 parallel Fortran directives, 105 parallel region, 131, 143, 145 and SHARED, 145 efficiency of, 158 restrictions, 157 parallel sections construct, 148 assignment of processes, 150 PCF constructs and efficiency, 158 barrier, 146, 156 critical section, 146, 154 differences between single process and critical section, 154 NOWAIT, 147, 148, 150 parallel DO, 146 parallel regions, 145, 157 parallel sections, 148 PDO, 147 restrictions, 157 single process, 150 types of, 146 PCF directives C$PAR &, 156 C$PAR BARRIER, 156 C$PAR CRITICAL SECTION, 154 C$PAR PARALLEL, 145 C$PAR PARALLEL DO, 146 C$PAR PDO, 147 C$PAR PSECTIONS, 148 C$PAR SINGLE PROCESS, 150 enabling, 143 overview, 143 PCF standard, 105 PDO construct, 147 quad-precision operations, 19 recurrence and data dependency, 123 recursion enabling, 89 –recursion option, 89 reduction and data dependency, 123 listing associated variables, 107 sum, 125 REDUCTION clause and C$DOACROSS, 107 registers double precision, 86 floating point, 86 round off controlling from command line, 76 round-to-nearest mode, 19 –roundoff option, 76 and –O compiler option, 76 run-time error handling, 19 run-time scheduling, 109 R S RAN, 67 rand and multiprocessing, 117 REAL*16 range, 23 REAL*4 range, 23 REAL*8 alignment, 24 range, 23 records, 17 scalar optimizations controlling levels, 78 controlling with directives, 174 fine tuning, 170 –scalaropt option, 78 and –O compiler option, 78 scheduling methods, 108, 133, 140 dynamic, 109 guided self-scheduling, 109 interleave, 109 run-time, 109 simple, 108 performance improving, 13 –pfa compiler option, 143 Power Fortran, 115 preconnected files, 18 preprocessor cpp, 3 processes master, 104, 105, 142 slave, 104, 105, 142 prof and parallel Fortran, 161 profiling parallel Fortran program, 161 programs multi-language, 4 Q 197 Index SECNDS, 67 self-scheduling, 109 sequential unformatted files, 17 –setassociativity option, 85 SHARE, 106, 107, 115 SHARED and critical section, 156 and parallel region, 145 SIGCLD, 134 simple scheduling, 108 single process PCF construct, 150 single process construct, 150 differences between critical section, 154 size, object file tool, 14 slave threads, 105, 142 blocking, 133, 134 source files, 3 spooled routines, 140 sproc and multiprocessing, 139 associated processes, 142 –static compiler option, 117, 160, 164 strip, object file tool, 14 subroutines intrinsic, 117 system DATE, 64 ERRSNS, 64 EXIT, 65 IDATE, 64 MVBITS, 66 sum reduction, example, 125 sychronizer, 162 198 symbol table information producing, 14 syntax conventions, xvii system interface, 55 T test&test&set, 153 thread master, 104 slave, 104 tiling, 84 TIME, 65 trap handling, 20 U –unroll option, 86 –unroll2 option, 86 ussetlock, 138 usunsetlock, 138 V variables in parallel loops, 115 local, 117 VAST directives CVD$ NODEPCHK, 182 VOLATILE and critical section, 156 and multiple threads, 152 W –WK option –aggressive, 82 and scalar optimizations, 69 –arclimit, 83 –assume, 71, 176 –cacheline, 85 –cachesize, 85 –directives, 88 –dpregisters, 86 –each_invariant_if_growth, 72 –fpregisters, 86 –fuse, 71 –inline_create, 98 –inline_from_files, 97 –inline_from_libraries, 97 –ipa_create, 99 –ipa_from_files, 97 –ipa_from_libraries, 97 –max_invariant_if_growth, 72 –optimize, 74 –recursion, 89 –roundoff, 76 –scalaropt, 78 –setassociativity, 85 –unroll, 86 –unroll2, 86 
work quantum, 126 work-sharing constructs, 143 restrictions, 157 types of, 146 X –Xlocaldata loader directive, 138 199 Tell Us About This Manual As a user of Silicon Graphics products, you can help us to better understand your needs and to improve the quality of our documentation. Any information that you provide will be useful. Here is a list of suggested topics: • General impression of the document • Omission of material that you expected to find • Technical errors • Relevance of the material to the job you had to do • Quality of the printing and binding Please send the title and part number of the document with your comments. The part number for this document is 007-2361-002. Thank you! Three Ways to Reach Us • To send your comments by electronic mail, use either of these addresses: – On the Internet: techpubs@sgi.com – For UUCP mail (through any backbone site): [your_site]!sgi!techpubs • To fax your comments (or annotated copies of manual pages), use this fax number: 650-932-0801 • To send your comments by traditional mail, use this address: Technical Publications Silicon Graphics, Inc. 2011 North Shoreline Boulevard, M/S 535 Mountain View, California 94043-1389