SpeedShop User’s Guide

Document Number 007-3311-001

CONTRIBUTORS

Written and Illustrated by Janet Home-Lorenzin, Wendy Ferguson, and Dany Galgani.
Production by Ruth Christian.
Engineering contributions by Marty Itzkowitz, Pete Orelup, Jun Yu, and Marco Zagha.

© Copyright 1996 Silicon Graphics, Inc.—All Rights Reserved
The contents of this document may not be copied or duplicated in any form, in whole or in part, without the prior written permission of Silicon Graphics, Inc.

RESTRICTED RIGHTS LEGEND

Use, duplication, or disclosure of the technical data contained in this document by the Government is subject to restrictions as set forth in subdivision (c) (1) (ii) of the Rights in Technical Data and Computer Software clause at DFARS 52.227-7013 and/or in similar or successor clauses in the FAR, or in the DOD or NASA FAR Supplement. Unpublished rights reserved under the Copyright Laws of the United States. Contractor/manufacturer is Silicon Graphics, Inc., 2011 N. Shoreline Blvd., Mountain View, CA 94039-7311.

Silicon Graphics is a registered trademark of Silicon Graphics, Inc. UNIX is a registered trademark of UNIX System Laboratories. X Window System is a trademark of the Massachusetts Institute of Technology. OSF/Motif is a trademark of the Open Software Foundation. Purify is a registered trademark of Pure Software, Inc.

Contents

List of Figures ix
List of Tables xi
About This Guide xiii
1. Introduction to Performance Analysis 1
     Sources of Performance Problems 1
     CPU-bound Processes 2
     I/O-bound Processes 2
     Memory-bound Processes 2
     Bugs 2
     Fixing Performance Problems 3
     SpeedShop Tools 3
     Main Commands 3
     Additional Commands 4
     Experiment Types 5
     Libraries 6
     API 6
     Supported Programs and Languages 6
     Using SpeedShop Tools for Performance Analysis 7
     Using ssusage 8
     Using ssrun and prof 8
     Collecting Data for Part of a Program 10

2. Tutorial for C Users 11
     Tutorial Overview 11
     Tutorial Setup 13
     Analyzing Performance Data 13
     A usertime Experiment 14
     A pcsamp Experiment 16
     A Hardware Counter Experiment 18
     An ideal Experiment 20
     An fpe Trace 23

3. Tutorial for Fortran Users 25
     Tutorial Overview 25
     Tutorial Setup 26
     Analyzing Performance Data 27
     A usertime Experiment 27
     A pcsamp Experiment 30
     A Hardware Counter Experiment 31
     An ideal Experiment 33

4. Experiment Types 37
     Selecting an Experiment 37
     usertime Experiment 38
     pcsamp Experiment 39
     Hardware Counter Experiments 40
     [f]gi_hwc 41
     [f]cy_hwc 41
     [f]ic_hwc 41
     [f]isc_hwc 42
     [f]dc_hwc 42
     [f]dsc_hwc 42
     [f]tlb_hwc 42
     [f]gfp_hwc 43
     prof_hwc 43
     Hardware Counter Numbers 43
     ideal Experiment 45
     Inclusive Basic Block Counting 46
     fpe Trace 47

5. Collecting Data on Machine Resource Usage 49
     ssusage Syntax 49
     ssusage Results 49

6. Setting Up and Running Experiments: ssrun 51
     Building your Executable 51
     Setting Up Output Directories and Files 53
     Running Experiments 53
     ssrun Syntax 53
     ssrun Examples 55
     Example Using the pcsampx Experiment 55
     Example Using the -v Option 56
     Using ssrun with a Debugger 57
     Running Experiments on MPI Programs 57
     Using Calipers 58
     Setting Calipers with ssrt_caliper_point 59
     Setting Calipers with Signals 60
     Setting Calipers with a Debugger 60
     Effects of ssrun 61
     Effects of ssrun -ideal 61
     Customizing Your Environment 61
     Setting Environment Variables for Spawned Processes 62
     Setting General Environment Variables 62
7. Analyzing Experiment Results: prof 65
     Using prof to Generate Performance Reports 65
     prof Syntax 66
     prof Options 66
     prof Output 68
     Using prof with ssrun 69
     usertime Experiment Reports 69
     pcsamp Experiment Reports 71
     Hardware Counter Experiment Reports 72
     ideal Experiment Reports 73
     FPE Trace Reports 75
     Using prof Options 76
     Using the -dis Option 76
     Using the -S Option 77
     Using the -calipers Option 79
     Using the -gprof Option 79
     Generating Reports for Different Machine Types 82
     Generating Reports for Multiprocessed Executables 83
     Generating Compiler Feedback Files 83
     Interpreting Reports 83

8. Using SpeedShop in Expert Mode: pixie 85
     Using pixie 85
     pixie Syntax 86
     pixie Options 86
     pixie Output 87
     Obtaining Basic Block Counts 87
     Examples of Basic Block Counting 91
     Example Using prof -invocations 91
     Example Using prof -heavy 93
     Example Using prof -quit 94
     Obtaining Inclusive Basic Block Counts 95
     Example of prof -gprof 98

9. Miscellaneous Commands 99
     Using the thrash Command 99
     thrash Syntax 99
     Effects of thrash 100
     Using the squeeze Command 100
     squeeze Syntax 100
     Effects of squeeze 101
     Calculating the Working Set of a Program 101
     Dumping Performance Data Files 103
     ssdump Syntax 103
     Experiment File Format 104
     Dumping Compiler Feedback Files 106
     fbdump Syntax 106

Index 109

List of Figures

Figure 8-1    How Basic Block Counting Works 90
Figure 8-2    How Inclusive Basic Block Counting Works 97

List of Tables

Table 1-1    Choosing an Experiment Type 9
Table 4-1    Summary of Experiments 38
Table 4-2    Basic Block Counts and PC Profile Counts Compared 40
Table 4-3    Hardware Counter Numbers 43
Table 6-1    Flags for ssrun 54
Table 6-2    Environment Variables for Spawned Processes 62
Table 6-3    Environment Variables for ssrun 62
Table 7-1    Options for prof 66
Table 8-1    Options for pixie 86
Table 9-1    Options for fbdump 107

About This Guide

This manual is a user’s guide for the SpeedShop performance tools, Release 1.0.
It contains the following chapters:

• Chapter 1, “Introduction to Performance Analysis” provides a general introduction to performance analysis concepts and techniques, plus an overview of the SpeedShop tools.

• Chapter 2, “Tutorial for C Users” provides a tutorial on how to collect performance data and generate reports for a C program.

• Chapter 3, “Tutorial for Fortran Users” provides a tutorial on how to collect performance data and generate reports for Fortran programs running on single-processor machines.

• Chapter 4, “Experiment Types” describes the types of experiments that can be performed using SpeedShop tools.

• Chapter 5, “Collecting Data on Machine Resource Usage” describes how to use the ssusage command to collect information about a program’s machine resource usage.

• Chapter 6, “Setting Up and Running Experiments: ssrun” explains in detail how to set up and run experiments using ssrun, and explains how to use caliper points to generate reports for part of a program.

• Chapter 7, “Analyzing Experiment Results: prof” explains how to generate reports from performance data using prof.

• Chapter 8, “Using SpeedShop in Expert Mode: pixie” explains how to use pixie and prof directly, without invoking ssrun.

• Chapter 9, “Miscellaneous Commands” explains how to use the thrash and squeeze commands to determine the memory usage, or working set, of your application. It also covers commands to print performance data files.

1. Introduction to Performance Analysis

This chapter provides a brief introduction to performance analysis techniques for SGI systems and describes how to use them to solve performance problems.
It includes the following sections:

• “Sources of Performance Problems”
• “Fixing Performance Problems”
• “SpeedShop Tools”
• “Using SpeedShop Tools for Performance Analysis”
• “Collecting Data for Part of a Program”

Sources of Performance Problems

To tune a program’s performance, you need to determine its consumption of machine resources. At any point (or phase) in a process, there is one limiting resource controlling the speed of execution. Processes can be slowed down by any of the following:

• CPU speed and availability
• I/O processing
• Memory size and availability

Performance problems might span the entire run of a process, or they might occur in just a small portion of the program. For example, a function that performs large amounts of I/O processing might be called regularly as the program runs, or a particularly CPU-intensive calculation might occur in just one portion of the program. When there are performance problems in a small portion of the program, it is useful to be able to collect data for just that part of the program.

Because programs exhibit different behavior during different phases of operation, you need to identify the limiting resource during each phase. A program can be I/O-bound while it reads in data, CPU-bound while it performs computation, and I/O-bound again in its final stage while it writes out data. Once you’ve identified the limiting resource in a phase, you can perform an in-depth analysis to find the problem. And after you have solved that problem, you can check for other problems within the same or other phases—performance analysis is an iterative process.

CPU-bound Processes

A CPU-bound process spends its time in the CPU and is limited by CPU speed and availability. To improve performance on CPU-bound processes, you may need to streamline your code.
This can entail modifying algorithms, reordering code to avoid interlocks, removing nonessential steps, blocking to keep data in cache and registers, or using alternative algorithms.

I/O-bound Processes

An I/O-bound process has to wait for I/O to complete and may be limited by disk access speeds or memory caching. To improve the performance of I/O-bound processes, you can try one of the following techniques:

• Improve the overlap of I/O with computation
• Optimize data usage to minimize disk access
• Use data compression

Memory-bound Processes

A memory-bound process is one that continuously needs to swap out pages of memory. Page thrashing is often due to accessing virtual memory on a haphazard rather than strategic basis. On a system that pages to a local disk, one telltale indication of a page-thrashing condition is the noise of frequent disk accesses. To fix a memory-bound process, you can try to improve the memory reference patterns or, if possible, decrease the memory used by the program.

Bugs

You may find that a bug is causing the performance problem. For example, you might find that you are unnecessarily reading from the same file twice in different parts of the program, that floating point exceptions are slowing down your program, that old code has not been completely removed, or that you are leaking memory (making malloc calls without the corresponding calls to free).

Fixing Performance Problems

The SpeedShop performance tools described in this manual can help you identify the specific performance problems described earlier in this chapter. However, the techniques described in this manual are only one part of performance tuning. Other areas that you can tune, but that are outside the scope of this document, include graphics, I/O, the kernel, system parameters, memory, and real-time system calls.
Although it may be possible to obtain short-term speed increases by relying on unsupported or undocumented quirks of the compiler system, it’s a bad idea to do so. Any such “features” may break in future releases of the compiler system. The best way to produce efficient code that can be trusted to remain efficient is to follow good programming practices. In particular, choose good algorithms and leave the details to the compiler.

SpeedShop Tools

The SpeedShop tools allow you to run experiments and generate reports to track down the sources of performance problems. SpeedShop consists of a set of commands that can be run in a shell, an API, and a number of libraries to support the commands.

Main Commands

• ssusage
  The ssusage command allows you to collect information about your program’s use of machine resources. Output from ssusage can be used to determine where most resources are being spent.

• ssrun
  The ssrun command allows you to run experiments on a program to collect performance data. It establishes the environment to capture performance data for an executable, creates a process from the executable (or from an instrumented version of the executable), and runs it. Input to ssrun consists of an experiment type, control flags, the name of the target, and the arguments to be used in executing the target.

• prof
  The prof command analyzes the performance data you have recorded using ssrun and provides formatted reports. prof detects the type of experiment you have run, and generates a report specific to the experiment type.

Additional Commands

• pixie
  The pixie command instruments an executable to enable basic block counting experiments to be performed. If you use ssrun, you will not normally need to call this program directly.
  pixie reads an executable program, partitions it into basic blocks, and writes an equivalent program (with a .pixie extension by default) containing additional code that counts the execution of each basic block.

• fbdump
  The fbdump command prints out the formatted contents of compiler feedback files generated by prof.

• squeeze
  The squeeze command allocates a region of virtual memory and locks it down into real memory, making that memory unavailable to other processes.

• thrash
  The thrash command allows you to explore paging behavior by allocating a block of memory and then accessing the allocated memory.

• ssdump
  The ssdump program prints out formatted performance data that was collected while running ssrun. This program is included for SpeedShop debugging purposes. You will not normally need to use it.

Experiment Types

You can conduct the following types of experiments using the ssrun command:

• Statistical PC sampling with pcsamp experiments
  Data is measured by periodically sampling the Program Counter (PC) of the target executable when it is in the CPU. The PC shows the address of the currently executing instruction in the program. The data obtained from the samples is translated to a time displayed at the function, source line, and machine instruction levels. The actual CPU time is calculated by multiplying the number of times a specific address is found in the PC by the amount of time between samples.

• Statistical hardware counter sampling with _hwc experiments
  Hardware counter experiments are available on R10000 systems that have built-in hardware counters. Data is measured by collecting information each time the specified hardware counter overflows. You can specify the hardware counter and the overflow interval you want to use.

• Statistical call stack profiling with usertime
  Data is measured by periodically sampling the call stack.
  The program’s callstack data is used to attribute exclusive user time to the function at the bottom of each callstack (that is, the function being executed at the time of the sample), and to attribute inclusive user time to all the functions above the one currently being executed.

• Basic block counting with ideal
  Data is measured by counting basic blocks and calculating an ideal CPU time for each function. This involves instrumenting the program to divide the code into basic blocks, which are sets of instructions with a single entry point, a single exit point, and no branches into or out of the set. Instrumentation also permits a count of all dynamic (function-pointer) calls to be recorded.

• Floating point exception trace with fpe
  A floating point exception trace collects each floating point exception, with the exception type and the callstack at the time of the exception. prof generates a report showing inclusive and exclusive floating-point exception counts.

Libraries

Versions of the SpeedShop libraries libss.so and libssrt.so are available to support applications built using shared libraries (DSOs) only, and the 32-bit, n32, or 64-bit ABIs.

• libss.so
  libss.so is a shared library (DSO) that supports libssrt.so. libss.so data normally appears in experiment results generated with prof.

• libssrt.so
  libssrt.so is a shared library (DSO) that gets linked in to the program you specify when you run an experiment. All collection of performance data with the SpeedShop system is done within the target process(es), by exercising various pieces of functionality using libssrt. Data from libssrt.so does not normally appear in performance data reports generated with prof.

• libfpe_ss.so
  Replaces the standard libfpe.so for the purposes of collecting floating point exception data. See the fpe_ss reference page.
• libmalloc_ss.so
  Inserts versions of the malloc routines from libc.so.1 that allow tracing of all calls to malloc, free, realloc, memalign, and valloc. See the malloc_ss reference page.

API

The API is primarily available to allow you to use ssrt_caliper_point to set caliper points in your source code. See “Using Calipers” in Chapter 6 for information on using caliper points. For information on other API functions, see the ssapi reference page.

Supported Programs and Languages

The SpeedShop tools support programs with the following characteristics:

• Shared libraries (DSOs)
• Non-stripped executables
• Executables containing fork, sproc, system, or exec commands
• Executables using supported techniques for opening, closing, and/or delay-loading DSOs
• C, C++, Fortran (Fortran-77, Fortran-90, and High-Performance Fortran), or Ada-95 source code
• Power Fortran and Power C source code; prof understands the syntax and semantics of the MP runtime, and displays the data accordingly
• pthreads: currently supported only with data on a per-process basis, not per-thread. The behavior of the pthreads library itself is monitored just like any other user-level code. Future releases of the SpeedShop tools will provide per-thread support for pthreads.
• MPI or other message-passing paradigms: currently supported by providing data on the behavior of each process. The behavior of the MPI library itself is monitored just like any other user-level code.

Using SpeedShop Tools for Performance Analysis

Performance tuning typically consists of examining machine resource usage, breaking down the process into phases, identifying the resource bottleneck within each phase, and correcting the cause of the bottleneck. Generally, you run the first experiment to break your program down into phases and run subsequent experiments to examine each phase individually. After you have solved a problem in a phase, you should re-examine machine resource usage to see if there is further opportunity for performance improvement.

The general steps for a performance analysis cycle are:

1. Build the application.
2. Run experiments on the application to collect performance data.
3. Examine the performance data.
4. Generate an improved version of the program.
5. Repeat as needed.

To accomplish these steps using the SpeedShop tools:

• Use ssusage to capture information on your program’s use of machine resources.
• Use ssrun to capture different types of performance data over either your entire program or parts of the program. ssrun can be used in conjunction with the dbx or WorkShop debuggers.
• Use prof to analyze the data and generate reports.

Using ssusage

To determine overall resource usage by your program, run the program with ssusage. The results of this command allow you to identify high user CPU time, high system CPU time, high I/O time, and a high degree of paging.

   ssusage program_name

From the ssusage output, you can decide which experiments to run to collect data for further study. For more information on ssusage, see Chapter 5, “Collecting Data on Machine Resource Usage,” or the ssusage reference page.

Using ssrun and prof

This section describes the steps involved in a performance analysis cycle when using the main interface to the SpeedShop tools: the ssrun command. You can also call the commands individually. For example, if you are planning to perform basic block counting experiments, which involve instrumenting the executable, you can either call ssrun with the appropriate experiment type, or set up your environment to call pixie directly to instrument your executable. Information on setting up your environment and running pixie directly can be found in Chapter 8, “Using SpeedShop in Expert Mode: pixie.”

1. Build the executable.
   In most circumstances, you can build the executable as you would normally. See Chapter 6, “Building your Executable” for all the information you need on how to build the executable.

2. Specify caliper points if you want to collect data for only a portion of your program.

   See “Collecting Data for Part of a Program” for more information on running experiments over part of a program.

3. To collect performance data, call ssrun with the parameters below. Use the information in Table 1-1 to determine which experiments to run:

   ssrun flags exp_type prog_name prog_args

   flags        One or more valid flags. For a complete list of flags, see the ssrun reference page.
   exp_type     Experiment name.
   prog_name    Executable name.
   prog_args    Arguments to the executable.

   Table 1-1    Choosing an Experiment Type

   Performance Problem      Experiment(s) to Run
   High user CPU time       usertime, pcsamp, *_hwc experiments, ideal
   High system CPU time     If floating point exceptions are suspected: fpe
   High I/O time            ideal, then examine counts of I/O routines
   High paging (majf)       ideal, then prof -feedback and cord to rearrange procedures

   If inefficient heap usage is suspected, use WorkShop’s Performance Analyzer to gather information.

   For each run of the executable, the experiment data is stored in a file with a name of the format:

   prog_name.exp_type.pid

   For more information on the ssrun command, see Chapter 6, “Setting Up and Running Experiments: ssrun,” or the ssrun reference page.

4. To generate a report of the experiment, call prof with the following parameters:

   prof flags data_file

   flags        One or more valid flags. For a complete list of flags, see the prof reference page.
   data_file    The name of the file in which the experiment data was recorded.

   For more information on using prof, see Chapter 7, “Analyzing Experiment Results: prof,” or the prof reference page.
Collecting Data for Part of a Program

If you have a performance problem in only one part of your program, you may want to collect performance data for just that part. You can do this by setting caliper points around the problem area when running an experiment, then using the prof -calipers option to generate a report for the problem area. You can set caliper points using one of the following approaches:

• The SpeedShop API
• The caliper signal environment
• A debugger, such as the ProDev WorkShop debugger

For more information on using calipers, see Chapter 6, “Setting Up and Running Experiments: ssrun.”

2. Tutorial for C Users

This chapter presents a tutorial for using the SpeedShop tools to gather and analyze performance data in a C program, and covers these topics:

• “Tutorial Overview”
• “Tutorial Setup”
• “Analyzing Performance Data”

Note: Because of inherent differences between systems, and because of concurrent processes that may be running on your system, your experiment will produce different results from the one in this tutorial. However, the basic form of the results should be the same.

Tutorial Overview

This tutorial is based on a sample program called generic. There are three versions of the program: the generic directory contains files for the 32-bit ABI, the generic64 directory contains files for the 64-bit ABI, and the genericn32 directory contains files for the n32 ABI. When you do the tutorial, choose the version of generic most appropriate for your system. A good guideline is to choose whichever version corresponds to the way you expect to develop your programs. This tutorial was written and tested using the version of generic in the genericn32 directory.

generic was designed as a test and demonstration application. It contains code to run scenarios that each test a different area of SpeedShop.
The version of generic that you will be using in this tutorial performs scenarios that:

• Build a linked list of structures
• Use a lot of usertime
• Scan a directory and run the stat command on each file
• Perform file I/O
• Generate a number of floating point exceptions
• Link and call a routine in a DSO

Output from the program looks like the following:

0:00:00.000 ======== (24479) Begin script Fri 03 May 96 10:17:13.
begin script `ll.u.d.i.f.dso’
0:00:00.032 ======== (24479) start of linklist Fri 03 May 96 10:17:13.
linklist completed.
0:00:00.002 ======== (24479) start of usrtime Fri 03 May 96 10:17:13.
usertime completed.
0:00:10.824 ======== (24479) start of dirstat Fri 03 May 96 10:17:24.
dirstat of /usr/include completed, 242 files.
0:00:10.844 ======== (24479) start of iofile Fri 03 May 96 10:17:24.
iofile on /unix completed, 4221656 chars.
0:00:11.036 ======== (24479) start of fpetraps Fri 03 May 96 10:17:24.
fpetraps completed.
0:00:11.038 ======== (24479) start of libdso Fri 03 May 96 10:17:24.
dlslave_init executed
dlslave_routine executed
slaveusertime completed.
libdso: dynamic routine returned 13
end of script `u.d.i.f.dso’
0:00:11.930 ======== (24479) End script Fri 03 May 96 10:17:25.

The tutorial shows you how to perform the following experiments using ssrun, and how to interpret the experiment-specific reports generated by prof:

• usertime
• pcsamp
• dsc_hwc
• ideal
• fpe

Tutorial Setup

You need to copy the program to a directory where you have write permission and compile it so that you can use it in the tutorial.

1. Change to the /usr/demos/SpeedShop directory.

2. Copy the appropriate generic directory and its contents to a directory where you have write permission:

   cp -r generic

3. Change to the directory you just created:

   cd /generic

4. Compile the program by typing:

   make all

   This provides an executable for the experiment.
5. Set the library path so that the program can find shared libraries in the generic directory:

   setenv LD_LIBRARY_PATH /generic

Analyzing Performance Data

This section provides steps on how to run the following experiments on the generic program, and how to generate and interpret the results:

• “A usertime Experiment”
• “A pcsamp Experiment”
• “A Hardware Counter Experiment”
• “An ideal Experiment”
• “An fpe Trace”

You can follow the tutorial from start to finish, or you can follow the steps for just the experiment(s) you want.

A usertime Experiment

This section takes you through the steps to perform a usertime experiment. The usertime experiment allows you to gather data on the amount of user time spent in each function in your program. For more information on usertime, see the “usertime Experiment” section in Chapter 4, “Experiment Types.”

1. From the command line, type:

   ssrun -usertime generic

   This starts the experiment. Output from generic and from ssrun is printed to stdout, as shown in the example below. A data file is also generated, with a name formed by concatenating the process name (generic), the experiment type (usertime), and the process ID. In this example, the filename is generic.usertime.24479.

   0:00:00.000 ======== (24479) Begin script Fri 03 May 96 10:17:13.
   begin script `ll.u.d.i.f.dso’
   0:00:00.032 ======== (24479) start of linklist Fri 03 May 96 10:17:13.
   linklist completed.
   0:00:00.002 ======== (24479) start of usrtime Fri 03 May 96 10:17:13.
   usertime completed.
   0:00:10.824 ======== (24479) start of dirstat Fri 03 May 96 10:17:24.
   dirstat of /usr/include completed, 242 files.
   0:00:10.844 ======== (24479) start of iofile Fri 03 May 96 10:17:24.
   iofile on /unix completed, 4221656 chars.
   0:00:11.036 ======== (24479) start of fpetraps Fri 03 May 96 10:17:24.
   fpetraps completed.
   0:00:11.038 ======== (24479) start of libdso Fri 03 May 96 10:17:24.
   dlslave_init executed
   dlslave_routine executed
   slaveusertime completed.
   libdso: dynamic routine returned 13
   end of script `u.d.i.f.dso’
   0:00:11.930 ======== (24479) End script Fri 03 May 96 10:17:25.

2. To generate a report on the data collected, type the following at the command line:

   prof

   prof prints results to stdout. Note that the prof output below is a partial listing.

   -------------------------------------------------------------------------------
   Profile listing generated Sun May 19 16:32:23 1996
       with:  prof generic.usertime.14427
   -------------------------------------------------------------------------------
       Total Time (secs)      : 21.51
       Total Samples          : 717
       Stack backtrace failed : 0
       Sample interval (ms)   : 30
       CPU                    : R4000
       FPU                    : R4010
       Clock                  : 150.0MHz
       Number of CPUs         : 1
   -------------------------------------------------------------------------------
                                       caller/total     parents
   index  %time   self  descendents   total (self)     name
                                      callee/descend   children
   -------------------------------------------------------------------------------
   [1]   100.0%   0.00  21.51         717 (0)          __start [1]
                  0.00  21.48         716/717          0x10001b44 main [2]
                  0.03  0.00          1/717            0x10001b04 __readenv_sigfpe [20]
   -------------------------------------------------------------------------------
                  0.00  21.48         716/716          0x10001b44 __start [1]
   [2]    99.9%   0.00  21.48         716 (0)          main [2]
                  0.00  21.48         716/716          0x10001bd0 Scriptstring [3]
   -------------------------------------------------------------------------------
                  0.00  21.48         716/716          0x10001bd0 main [2]
   [3]    99.9%   0.00  21.48         716 (0)          Scriptstring [3]
                  0.00  18.69         623/716          0x10001f64 usrtime [4]
                  0.00  1.50          50/716           0x10001f64 iofile [6]
                  0.00  1.08          36/716           0x10001f64 libdso [10]
                  0.00  0.15          5/716            0x10001e78 genLog [13]
                  0.00  0.03          1/716            0x10001f64 dirstat [30]
                  0.00  0.03          1/716            0x1000202c genLog [13]
   -------------------------------------------------------------------------------
                  0.00  18.69         623/623          0x10001f64 Scriptstring [3]
   [4]    86.9%   0.00  18.69         623 (0)          usrtime [4]
                  18.66 0.03          623/623          0x10004edc anneal [5]
   -------------------------------------------------------------------------------
                  18.66 0.03          623/623          0x10004edc usrtime [4]
   [5]    86.9%   18.66 0.03          623 (622)        anneal [5]
                  0.00  0.03          1/1              0x10005ad0 init2da [27]
   -------------------------------------------------------------------------------
                  0.00  1.50          50/50            0x10001f64 Scriptstring [3]
   [6]     7.0%   0.00  1.50          50 (0)           iofile [6]
                  0.00  1.50          50/50            0x100025f4 fread [7]
   -------------------------------------------------------------------------------
                  0.00  1.50          50/50            0x100025f4 iofile [6]
   [7]     7.0%   0.00  1.50          50 (0)           fread [7]
                  0.00  1.44          48/50            0xfac80ec __filbuf [8]
                  0.00  0.03          1/50             0xfac8060 _findbuf [33]
------------------------------------------------------------------------------18.66 0.03 623/623 0x10004edc usrtime [4] [5] 86.9% 18.66 0.03 623 (622) anneal [5] 0.00 0.03 1/1 0x10005ad0 init2da [27] ------------------------------------------------------------------------------0.00 1.50 50/50 0x10001f64 Scriptstring [3] [6] 7.0% 0.00 1.50 50 (0) iofile [6] 0.00 1.50 50/50 0x100025f4 fread [7] ------------------------------------------------------------------------------0.00 1.50 50/50 0x100025f4 iofile [6] [7] 7.0% 0.00 1.50 50 (0) fread [7] 0.00 1.44 48/50 0xfac80ec __filbuf [8] 0.00 0.03 1/50 0xfac8060 _findbuf [33] 15 Chapter 2: Tutorial for C Users 0.03 0.00 1/50 0xfac8188 memcpy [18] ... The report shows information for each function, its callers and its callees. The function names are show in the right-hand column of the report. The function that is being reported is shown outdented from its caller and callee(s). For example, the first function shown in this report is __start which has no callers and two callees. The remaining columns are described below. • The index column provides an index number for reference. • The %time column shows the cumulative percentage of inclusive time spent in each function and its descendents. For example, 99.9% of the time was spent in Scriptstring and all functions listed below it. • The self column shows how much time, in seconds, was spent in the function itself (exclusive time). For example, less than one hundredth of a second was spent in __start, but 0.03 of a second was spent in __readenv_sigfpe. • The descendents columns shows how much time, in seconds, was spent in callees of the function. For example, 21.48 seconds were spent in main. • The caller/total, total (self), callee/descend column provides information on the number of cycles out of the total spent on the function, its callers and its callees. 
For example, the anneal function (index number 5) shows 623/623 for its caller (usrtime), 623(622) for itself, and 1/1 for its callee (init2da.)

A pcsamp Experiment

This section takes you through the steps to perform a pcsamp experiment. The pcsamp experiment allows you to gather information on actual CPU time for each source code line, machine instruction and function in your program. For more information on pcsamp, see the “pcsamp Experiment” section in Chapter 4, “Experiment Types.”

1. From the command line, type

ssrun -fpcsamp generic

This starts the experiment. The f option is used with pcsamp for this program because the program runs quickly and does not gather much data using the default pcsamp experiment. Output from generic and from ssrun will be printed to stdout as shown in the example below. A data file with a name generated by concatenating the process name, generic, the experiment type, fpcsamp, and the process ID is also generated. In this example, the filename is generic.fpcsamp.14480.

0:00:00.000 ======== (14480) Begin script              Sun 19 May 96 17:18:33.
begin script `ll.u.d.i.f.dso’
0:00:00.074 ======== (14480) start of linklist         Sun 19 May 96 17:18:33.
linklist completed.
0:00:00.085 ======== (14480) start of usrtime          Sun 19 May 96 17:18:33.
usertime completed.
0:00:17.985 ======== (14480) start of dirstat          Sun 19 May 96 17:18:51.
dirstat of /usr/include completed, 230 files.
0:00:18.008 ======== (14480) start of iofile           Sun 19 May 96 17:18:51.
iofile on /unix completed, 4221656 chars.
0:00:20.321 ======== (14480) start of fpetraps         Sun 19 May 96 17:18:54.
fpetraps completed.
0:00:20.323 ======== (14480) start of libdso           Sun 19 May 96 17:18:54.
dlslave_init executed
dlslave_routine executed
slaveusertime completed.
libdso: dynamic routine returned 13
end of script `ll.u.d.i.f.dso’
0:00:21.394 ======== (14480) End script                Sun 19 May 96 17:18:55.

2.
To generate a report on the data collected and redirect the output to a file, type the following at the command line: prof > pcsamp.results Output similar to the following is generated. ------------------------------------------------------------------------------Profile listing generated Sun May 19 17:21:27 1996 with: prof generic.fpcsamp.14480 ------------------------------------------------------------------------------samples time CPU FPU Clock N-cpu S-interval Countsize 19077 19s R4000 R4010 150.0MHz 1 1.0ms 2(bytes) Each sample covers 4 bytes for every 1.0ms ( 0.01% of 19.0770s) -------------------------------------------------------------------------------p[rocedures] using pc-sampling. Sorted in descending order by the number of samples in each procedure. Unexecuted procedures are excluded. ------------------------------------------------------------------------------samples time(%) cum time(%) procedure (dso:file) 17794 18s( 93.3) 18s( 93.3) anneal (/usr/demos/SShop/genericn32/generic:/usr/demos/SShop/genericn32/generic.c) 1046 1s( 5.5) 19s( 98.8) slaveusrtime (/usr/demos/SShop/genericn32/dlslave.so:/usr/demos/SShop/genericn32/dlslave.c) 17 Chapter 2: Tutorial for C Users 163 0.16s( 0.9) 19s( 99.6) _read (/usr/lib32/libc.so.1:/work/irix/lib/libc/libc_n32_M3/sys/read.s) 34 0.034s( 0.2) 19s( 99.8) memcpy (/usr/lib32/libc.so.1:/work/irix/lib/libc/libc_n32_M3/strings/bcopy.s) 20 0.02s( 0.1) 19s( 99.9) _xstat (/usr/lib32/libc.so.1:/work/irix/lib/libc/libc_n32_M3/sys/xstat.s) 8 0.008s( 0.0) 19s( 99.9) fread (/usr/lib32/libc.so.1:/work/irix/lib/libc/libc_n32_M3/stdio/fread.c) 3 0.003s( 0.0) 19s(100.0) iofile (/usr/demos/SShop/genericn32/generic:/usr/demos/SShop/genericn32/generic.c) 3 0.003s( 0.0) 19s(100.0) _doprnt (/usr/lib32/libc.so.1:/work/irix/lib/libc/libc_n32_M3/print/doprnt.c) 1 0.001s( 0.0) 19s(100.0) __sinf (/usr/lib32/libm.so:/work/cmplrs/libm/fsin.c) 1 0.001s( 0.0) 19s(100.0) init2da 
(/usr/demos/SShop/genericn32/generic:/usr/demos/SShop/genericn32/generic.c)
    1  0.001s(  0.0)  19s(100.0)  _write (/usr/lib32/libc.so.1:/work/irix/lib/libc/libc_n32_M3/sys/write.s)
    1  0.001s(  0.0)  19s(100.0)  _drand48 (/usr/lib32/libc.so.1:/work/irix/lib/libc/libc_n32_M3/math/drand48.c)
    1  0.001s(  0.0)  19s(100.0)  _morecore (/usr/lib32/libc.so.1:/work/irix/lib/libc/libc_n32_M3/gen/malloc.c)
    1  0.001s(  0.0)  19s(100.0)  fwrite (/usr/lib32/libc.so.1:/work/irix/lib/libc/libc_n32_M3/stdio/fwrite.c)
19077    19s(100.0)  19s(100.0)  TOTAL

• The samples column shows how many samples were taken when the process was executing in the function.

• The time(%) column shows the amount of time, and the percentage of that time over the total time, that was spent in the function.

• The cum time(%) column shows how much time has been spent in the functions up to and including the procedure in the list.

• The procedure (dso:file) column lists the procedure, its DSO name and file name. For example, the first line reports statistics for the procedure anneal in the file generic.c in the generic executable.

A Hardware Counter Experiment

Note: This experiment can be performed only on systems that have built-in hardware counters (the R10000 class of machines.)

This section takes you through the steps to perform a hardware counter experiment. There are a number of hardware counter experiments, but this tutorial describes the steps involved in performing the dsc_hwc experiment. This experiment allows you to capture information about secondary data cache misses. For more information on hardware counter experiments, see the “Hardware Counter Experiments” section in Chapter 4, “Experiment Types.”

1. From the command line, type

ssrun -dsc_hwc generic

This starts the experiment. Output from generic and from ssrun will be printed to stdout.
A data file with a name generated by concatenating the process name, generic, the experiment type, dsc_hwc, and the process ID is also generated. In this example, the filename is generic.dsc_hwc.5999.

2. To generate a report on the data collected and redirect the output to a file, type the following at the command line:

prof > dsc_hwc.results

The report should look similar to the following partial listing.

-------------------------------------------------------------------------------
Profile listing generated Sun May 19 17:35:21 1996 with:
   prof generic.dsc_hwc.5999
-------------------------------------------------------------------------------
Counter               : Sec cache D misses
Counter overflow value: 131
Total number of ovfls : 10
CPU                   : R10000
FPU                   : R10010
Clock                 : 196.0MHz
Number of CPUs        : 1
-------------------------------------------------------------------------------
-p[rocedures] using counter overflow.
Sorted in descending order by the number of overflows in each procedure.
Unexecuted procedures are excluded.
-------------------------------------------------------------------------------
overflows(%)  cum overflows(%)  procedure (dso:file)

  4( 40.0)    4( 40.0)  memcpy (/usr/lib64/libc.so.1:/work/irix/lib/libc/libc_64_M4/strings/bcopy.s)
  2( 20.0)    6( 60.0)  anneal (/usr/people/larry/generic64/generic:/usr/people/larry/generic64/generic.c)
  1( 10.0)    7( 70.0)  _findiop (/usr/lib64/libc.so.1:/work/irix/lib/libc/libc_64_M4/stdio/flush.c)
  1( 10.0)    8( 80.0)  init2da (/usr/people/larry/generic64/generic:/usr/people/larry/generic64/generic.c)
  1( 10.0)    9( 90.0)  libss_timer_unset (/usr/lib64/libssrt.so:../../ssrt/lib/sstimer.c)
  1( 10.0)   10(100.0)  _doprnt (/usr/lib64/libc.so.1:/work/irix/lib/libc/libc_64_M4/print/doprnt.c)
 10(100.0)              TOTAL

• The overflows(%) column shows the number of overflows caused by the function, and the percentage of that number over the total number of overflows in the program.
• The cum overflows(%) column shows a cumulative number and percentage of overflows. For example, the anneal function shows two overflows, but the cumulative number of overflows is 6: 2 from anneal and 4 from memcpy. • The procedure (dso:file) column shows the procedure name and the DSO and filename that contain the procedure. An ideal Experiment This section takes you through the steps to perform an ideal experiment. For more information on ideal, see the “ideal Experiment” section in Chapter 4, “Experiment Types.” 1. From the command line, type ssrun -ideal generic This starts the experiment. First the executable and libraries are instrumented using pixie. This entails making copies of the libraries and executables, which are given a .pixie extension, and inserting information into the copies. Output from generic and from ssrun will be printed to stdout as shown in the example below. A data file with a name generated by concatenating the process name, generic, the experiment type, ideal, and the process ID is also generated. In this example, the filename is generic.ideal.14517 Beginning libraries /usr/lib32/libssrt.so /usr/lib32/libss.so /usr/lib32/libm.so /usr/lib32/libc.so.1 Ending libraries, beginning “generic” ... 20 Analyzing Performance Data Beginning libraries Ending libraries, beginning “dlslave.so” ... The output section that starts with “Beginning libraries” and ends with “Ending libraries, beginning “generic”,” tells you that ssrun is instrumenting first libraries, then the generic executable, and any DSOs that generic calls. 2. To generate a report on the data collected, type the following at the command line: prof > ideal.results This command redirects output to a file called ideal.results. 
The file contains results that look similar to the following partial listing: Prof run at: Sun May 19 17:49:10 1996 Command line: prof generic.ideal.14517 2662778531: Total number of cycles 17.75186s: Total execution time 1875323907: Total number of instructions executed 1.420: Ratio of cycles / instruction 150: Clock rate in MHz R4000: Target processor modelled --------------------------------------------------------Procedures sorted in descending order of cycles executed. Unexecuted procedures are not listed. Procedures beginning with *DF* are dummy functions and represent init, fini and stub sections. --------------------------------------------------------cycles(%) cum % secs instrns calls procedure(dso:file) 2524610038(94.81) 94.81 16.83 1797940023 1 anneal(generic:/usr/demos/SShop/genericn32/generic.c) 135001332( 5.07) 99.88 0.90 75000822 1 slaveusrtime(./dlslave.so:/usr/demos/SShop/genericn32/dlslave.c) 1593422( 0.06) 99.94 0.01 1378737 4382 memcpy(./libc.so.1:/work/irix/lib/libc/libc_n32_M3/strings/bcopy.s) 735797( 0.03) 99.97 0.00 506627 4123 fread(./libc.so.1:/work/irix/lib/libc/libc_n32_M3/stdio/fread.c) 187200( 0.01) 99.98 0.00 124800 1600 next(./libc.so.1:/work/irix/lib/libc/libc_n32_M3/math/drand48.c) 136116( 0.01) 99.98 0.00 82498 1 iofile(generic:/usr/demos/SShop/genericn32/generic.c) 91200( 0.00) 99.98 0.00 62400 1600 _drand48(./libc.so.1:/work/irix/lib/libc/libc_n32_M3/math/drand48.c) 21 Chapter 2: Tutorial for C Users 85497( 0.00) 99.99 0.00 56518 1 init2da(generic:/usr/demos/SShop/genericn32/generic.c) 74095( 0.00) 99.99 0.00 28063 628 __sinf(./libm.so:/work/cmplrs/libm/fsin.c) 56192( 0.00) 99.99 0.00 9360 16 offtime(./libc.so.1:/work/irix/lib/libc/libc_n32_M3/gen/time_comm.c) 51431( 0.00) 99.99 0.00 36405 35 _doprnt(./libc.so.1:/work/irix/lib/libc/libc_n32_M3/print/doprnt.c) 27951( 0.00) 100.00 0.00 19670 259 __filbuf(./libc.so.1:/work/irix/lib/libc/libc_n32_M3/stdio/_filbuf.c) 21392( 0.00) 100.00 0.00 10136 58 
fwrite(./libc.so.1:/work/irix/lib/libc/libc_n32_M3/stdio/fwrite.c) 12744( 0.00) 100.00 0.00 9497 231 _readdir(./libc.so.1:/work/irix/lib/libc/libc_n32_M3/gen/readdir.c) 9960( 0.00) 100.00 0.00 7536 96 _xflsbuf(./libc.so.1:/work/irix/lib/libc/libc_n32_M3/stdio/flush.c) 7211( 0.00) 100.00 0.00 3959 1 dirstat(generic:/usr/demos/SShop/genericn32/generic.c) ... 22 • The cycles(%) column reports the number and percentage of machine cycles used for the procedure. For example, 2524610038 cycles, or 94.81% of cycles were spent in the anneal procedure. • The cum% column shows the cumulative percentage of calls. For example, 99.88% of all calls were spent between the top two functions in the listing: anneal and slaveusrtime. • The secs column shows the number of seconds spent in the procedure. For example, 16.83 seconds were spent in the anneal procedure. The time represents an idealized computation based on modelling the machine. It ignores potential floating point interlocks and memory latency time (cache misses and memory bus contention.) • The instrns column shows the number of instructions executed for a procedure. For example, there were 1797940023 instructions devoted to the anneal procedure. • The calls column reports the number of calls to the procedure. For example, there was just one call to the anneal procedure. • The procedure (dso:file) column lists the procedure, its DSO name and file name. For example, the first line reports statistics for the procedure anneal in the file generic.c in the generic executable. Analyzing Performance Data An fpe Trace This section takes you through the steps to perform an fpe trace. For more information on the fpe trace, see the “fpe Trace” section in Chapter 4, “Experiment Types.” 1. From the command line, type ssrun -fpe generic This starts the experiment. Output from generic and from ssrun will be printed to stdout as shown in the example below. 
A data file with a name generated by concatenating the process name, generic, the experiment type, fpe, and the process ID is also generated. In this example, the filename is generic.fpe.2334.

2. To generate a report on the data collected, type the following at the command line:

prof

The report should look similar to the following partial listing.

-------------------------------------------------------------------------------
Profile listing generated Tue May 7 19:21:30 1996 with:
   prof generic.fpe.2334
-------------------------------------------------------------------------------
Total FPEs             : 4
Stack backtrace failed : 0
CPU                    : R4000
FPU                    : R4010
Clock                  : 150.0MHz
Number of CPUs         : 1
-------------------------------------------------------------------------------
                                caller/total    parents
index  %FPEs   self  descendents   total (self)  name
                                callee/descend  children
-------------------------------------------------------------------------------
[1]   100.0%      0            4       4 (0)     __start [1]
                  0            4     4/4         0x10001aa0 main [2]
-------------------------------------------------------------------------------
                  0            4     4/4         0x10001aa0 __start [1]
[2]   100.0%      0            4       4 (0)     main [2]
                  0            4     4/4         0x10001b2c Scriptstring [3]
-------------------------------------------------------------------------------
                  0            4     4/4         0x10001b2c main [2]
[3]   100.0%      0            4       4 (0)     Scriptstring [3]
                  4            0     4/4         0x10001ec0 fpetraps [4]
-------------------------------------------------------------------------------
                  4            0     4/4         0x10001ec0 Scriptstring [3]
[4]   100.0%      4            0       4 (4)     fpetraps [4]

The report shows information for each function, its callers and its callees. The function names are shown in the right-hand column of the report. The function that is being reported is shown outdented from its caller and callee(s). For example, the first function shown in this report is __start, which has no callers and one callee. The remaining columns are described below.

• The index column provides an index number for reference.
• The %FPEs column shows the percentage of the total number of floating point exceptions that were found in the function.

• The self column shows how many floating point exceptions were found in the function. For example, 0 floating point exceptions were found in __start.

• The descendents column shows how many floating point exceptions were found in the descendents of the function. For example, 4 floating point exceptions were found in the descendents of main.

• The caller/total, total (self), callee/descend column provides information on the number of floating point exceptions out of the total that were found.

• The parents, name, children column shows the function names, as described above.

This concludes the tutorial.

Chapter 3

3. Tutorial for Fortran Users

This chapter presents a tutorial for using the SpeedShop tools to gather and analyze performance data in a Fortran program, and covers these topics:
• “Tutorial Overview”
• “Tutorial Setup”
• “Analyzing Performance Data”

Note: Because of inherent differences between systems and also due to concurrent processes that may be running on your system, your experiment will produce different results from the one in this tutorial. However, the basic form of the results should be the same.

Tutorial Overview

This tutorial is based on a standard benchmark program called linpackup. There are three versions of the program: the linpack directory contains files for the 32-bit ABI, the linpack64 directory contains files for the 64-bit ABI, and the linpackn32 directory contains files for the N32 ABI. Each linpack directory contains versions of the program for a single processor (linpackup) and for multiple processors (linpackd). When you do the tutorial, choose the version of the program that is most appropriate for your system. A good guideline is to choose whichever version corresponds to the way you expect to develop your programs.
This tutorial was written and tested using the single-processor version of the program (linpackup) in the linpackn32 directory. The linpack program is a standard benchmark designed to measure CPU performance in solving dense linear equations. The program focuses primarily on floating point performance. 25 Chapter 3: Tutorial for Fortran Users Output from the linpackup program looks like the following: ... norm. resid resid machep x(1) x(n) 5.35882395E+00 7.13873405E-13 2.22044605E-16 1.00000000E+00 1.00000000E+00 times are reported for matrices of order 300 dgefa dgesl total mflops unit times for array with leading dimension of 301 1.180E+00 1.000E-02 1.190E+00 1.528E+01 1.309E-01 1.180E+00 1.000E-02 1.190E+00 1.528E+01 1.309E-01 1.180E+00 1.000E-02 1.190E+00 1.528E+01 1.309E-01 1.180E+00 1.000E-02 1.190E+00 1.528E+01 1.309E-01 2.125E+01 2.125E+01 2.125E+01 2.125E+01 times for array with leading dimension of 300 1.180E+00 1.000E-02 1.190E+00 1.528E+01 1.309E-01 1.180E+00 2.000E-02 1.200E+00 1.515E+01 1.320E-01 1.180E+00 1.000E-02 1.190E+00 1.528E+01 1.309E-01 1.181E+00 1.200E-02 1.193E+00 1.524E+01 1.312E-01 2.125E+01 2.143E+01 2.125E+01 2.130E+01 ratio The tutorial shows you how to perform the following experiments using ssrun, and how to interpret experiment-specific reports generated by prof: • usertime • pcsamp • dsc_hwc • ideal Tutorial Setup You need to copy the program to a directory where you have write permission and compile it so that you can use it in the tutorial. 1. Change to the /usr/demos/SpeedShop directory. 2. Copy the appropriate linpack directory and its contents to a directory where you have write permission: cp -r linpack_version your_dir 3. Change to the directory you just created: cd your_dir/linpack_version 26 Analyzing Performance Data 4. Compile the program by typing make all This provides an executable for the experiment. 
Analyzing Performance Data

This section provides steps on how to run the following experiments on the linpackup program and generate and interpret the results:
• “A usertime Experiment”
• “A pcsamp Experiment”
• “A Hardware Counter Experiment”
• “An ideal Experiment”

You can follow the tutorial from start to finish, or you can follow steps for just the experiment(s) you want.

A usertime Experiment

This section takes you through the steps to perform a usertime experiment. The usertime experiment allows you to gather data on the amount of user time spent in each function in your program. For more information on usertime, see the “usertime Experiment” section in Chapter 4, “Experiment Types.”

1. From the command line, type

ssrun -v -usertime linpackup

This starts the experiment. The -v flag tells ssrun to print a log to stderr. Output from linpackup and from ssrun will be printed to stdout as shown in the example below. A data file with a name generated by concatenating the process name, linpackup, the experiment type, usertime, and the process ID is also generated. In this example, the filename is linpackup.usertime.17566.

ssrun: setenv _SPEEDSHOP_MARCHING_ORDERS ut:cu
ssrun: setenv _SPEEDSHOP_EXPERIMENT_TYPE usertime
ssrun: setenv _SPEEDSHOP_TARGET_FILE linpackup
ssrun: setenv _RLD_LIST libss.so:libssrt.so:DEFAULT

Please send the results of this run to:

Jack J. Dongarra
Mathematics and Computer Science Division
Argonne National Laboratory
Argonne, Illinois 60439
Telephone: 312-972-7246
ARPAnet: DONGARRA@ANL-MCS

norm.
resid resid machep x(1) x(n) 5.35882395E+00 7.13873405E-13 2.22044605E-16 1.00000000E+00 1.00000000E+00 times are reported for matrices of order 300 dgefa dgesl total mflops unit times for array with leading dimension of 301 3.050E+00 3.000E-02 3.080E+00 5.903E+00 3.388E-01 3.030E+00 3.000E-02 3.060E+00 5.941E+00 3.366E-01 3.030E+00 3.000E-02 3.060E+00 5.941E+00 3.366E-01 3.030E+00 3.000E-02 3.060E+00 5.941E+00 3.366E-01 5.500E+01 5.464E+01 5.464E+01 5.464E+01 times for array with leading dimension of 300 3.030E+00 3.000E-02 3.060E+00 5.941E+00 3.366E-01 3.040E+00 3.000E-02 3.070E+00 5.922E+00 3.377E-01 3.040E+00 3.000E-02 3.070E+00 5.922E+00 3.377E-01 3.034E+00 3.000E-02 3.064E+00 5.933E+00 3.371E-01 5.464E+01 5.482E+01 5.482E+01 5.471E+01 ratio 2. To generate a report on the data collected, type the following at the command line: prof prof interprets the type of experiment you have performed and prints results to stdout. The report below shows partial prof output. ------------------------------------------------------------------------------Profile listing generated Sun May 19 18:31:51 1996 with: prof linpackup.usertime.17566 ------------------------------------------------------------------------------Total Time (secs) : 55.59 Total Samples : 1853 Stack backtrace failed: 0 Sample interval (ms) : 30 CPU : R8000 FPU : R8010 Clock : 75.0MHz Number of CPUs : 1 28 Analyzing Performance Data ------------------------------------------------------------------------------caller/total parents index %time self descendents total (self) name callee/descend children ------------------------------------------------------------------------------[1] 99.9% 0.00 55.56 1852 (0) __start [1] 0.00 55.56 1852/1852 0x10000ea8 main [2] ------------------------------------------------------------------------------0.00 55.56 1852/1852 0x10000ea8 __start [1] [2] 99.9% 0.00 55.56 1852 (0) main [2] 0.00 55.56 1852/1852 0xae9c218 linp [3] 
-------------------------------------------------------------------------------
                0.00    55.56    1852/1852   0xae9c218  main [2]
[3]   99.9%     0.00    55.56    1852 (0)               linp [3]
                0.44    30.40    1028/1852   0x10002488 dgefa [4]
                0.04     3.08     104/1852   0x10001020 dgefa [4]
                0.04     3.05     103/1852   0x1000185c dgefa [4]
                0.04     3.05     103/1852   0x10001688 dgefa [4]
                0.04     3.05     103/1852   0x10001a94 dgefa [4]
                0.04     3.05     103/1852   0x1000207c dgefa [4]
                0.04     3.05     103/1852   0x10001ea8 dgefa [4]
                0.04     3.02     102/1852   0x10002250 dgefa [4]
                1.41     0.00      47/1852   0x10002434 matgen [6]
                0.00     0.30      10/1852   0x1000255c dgesl [7]
                0.15     0.00       5/1852   0x10002210 matgen [6]
                0.15     0.00       5/1852   0x1000181c matgen [6]
                0.15     0.00       5/1852   0x10001e68 matgen [6]
                0.12     0.00       4/1852   0x10001194 matgen [6]
...

The report shows information for each function, its callers and its callees. The function names are shown in the right-hand column of the report. The function that is being reported is shown outdented from its caller and callee(s). For example, the first function shown in this report is __start, which has no callers and one callee. The remaining columns are described below.

• The index column provides an index number for reference.

• The %time column shows the cumulative percentage of inclusive time spent in each function and its descendents. For example, in the third row, 99.9% of the time was spent in linp and all functions listed below it.

• The self column shows how much time, in seconds, was spent in the function itself (exclusive time). For example, less than one hundredth of a second was spent in linp, but 0.44 of a second was spent in the first call to dgefa.

• The descendents column shows how much time, in seconds, was spent in callees of the function. For example, in the third row, 55.56 seconds were spent in callees of linp.

• The caller/total, total (self), callee/descend column provides information on the number of samples attributed to the function, its callers and its callees, out of the total.
For example, the linp function (index number 3) shows 1852/1852 for its caller (main), 1852(0) for itself, and 1028/1852 for its first callee (dgefa.) A pcsamp Experiment This section takes you through the steps to perform a pcsamp experiment. The pcsamp experiment allows you to gather information on actual CPU time for each source code line, machine line and function in your program. For more information on pcsamp, see the “pcsamp Experiment” section in Chapter 4, “Experiment Types.” 1. From the command line, type ssrun -pcsamp linpackup This starts the experiment. Output from linpackup and from ssrun will be printed to stdout as shown in the example below. A data file with a name generated by concatenating the process name, linpackup, the experiment type, pcsamp, and the process ID is also generated. In this example, the filename is linpackup.pcsamp.17576. ... norm. resid resid machep x(1) x(n) 5.35882395E+00 7.13873405E-13 2.22044605E-16 1.00000000E+00 1.00000000E+00 ... 2. To generate a report on the data collected, type the following at the command line: prof > pcsamp.results ------------------------------------------------------------------------------Profile listing generated Sun May 19 18:38:50 1996 with: prof linpackup.pcsamp.17576 ------------------------------------------------------------------------------samples time CPU FPU Clock N-cpu S-interval Countsize 5421 54s R8000 R8010 75.0MHz 1 10.0ms 2(bytes) Each sample covers 4 bytes for every 10.0ms ( 0.02% of 54.2100s) -------------------------------------------------------------------------------p[rocedures] using pc-sampling. Sorted in descending order by the number of samples in each procedure. Unexecuted procedures are excluded. 
-------------------------------------------------------------------------------
samples  time(%)       cum time(%)   procedure (dso:file)

 5064    51s( 93.4)    51s( 93.4)    daxpy (linpackup:linpackup.f)
  240   2.4s(  4.4)    53s( 97.8)    matgen (linpackup:linpackup.f)
   76  0.76s(  1.4)    54s( 99.2)    dgefa (linpackup:linpackup.f)
   19  0.19s(  0.4)    54s( 99.6)    dscal (linpackup:linpackup.f)
   17  0.17s(  0.3)    54s( 99.9)    idamax (linpackup:linpackup.f)
    4  0.04s(  0.1)    54s(100.0)    dmxpy (linpackup:linpackup.f)
    1  0.01s(  0.0)    54s(100.0)    _ioctl (/usr/lib32/libc.so.1:/work/irix/lib/libc/libc_n32_M4/sys/ioctl.s)
 5421    54s(100.0)    54s(100.0)    TOTAL

• The samples column shows how many samples were taken when the process was executing in the function.

• The time(%) column shows the amount of time, and the percentage of that time over the total time, that was spent in the function.

• The cum time(%) column shows how much time has been spent in the functions up to and including the procedure in the list.

• The procedure (dso:file) column lists the procedure, its DSO name and file name. For example, the first line reports statistics for the procedure daxpy in the file linpackup.f in the linpackup executable.

A Hardware Counter Experiment

Note: This experiment can be performed only on systems that have built-in hardware counters (the R10000 class of machines).

This section takes you through the steps to perform a hardware counter experiment. There are a number of hardware counter experiments, but this tutorial describes the steps involved in performing the dsc_hwc experiment. This experiment allows you to capture information about secondary data cache misses. For more information on hardware counter experiments, see the “Hardware Counter Experiments” section in Chapter 4, “Experiment Types.”

1. From the command line, type

ssrun -dsc_hwc linpackup

This starts the experiment.
Output from linpackup and from ssrun will be printed to stdout. A data file with a name generated by concatenating the process name, linpackup, the experiment type, dsc_hwc, and the process ID is also generated. In this example, the filename is linpackup.dsc_hwc.6180.

2. To generate a report on the data collected, type the following at the command line:

prof > dsc_hwc.results

Output similar to the following is generated.

-------------------------------------------------------------------------------
Profile listing generated Sun May 19 18:20:14 1996 with:
   prof linpackup.dsc_hwc.6180
-------------------------------------------------------------------------------
Counter               : Sec cache D misses
Counter overflow value: 131
Total number of ovfls : 2737
CPU                   : R10000
FPU                   : R10010
Clock                 : 196.0MHz
Number of CPUs        : 1
-------------------------------------------------------------------------------
-p[rocedures] using counter overflow.
Sorted in descending order by the number of overflows in each procedure.
Unexecuted procedures are excluded.
------------------------------------------------------------------------------overflows(%) cum overflows(%) procedure (dso:file) 2133( 77.9) 2133( 77.9) DAXPY (linpackup:/usr/people/larry/linpack64/linpackup.f) 307( 11.2) 2440( 89.1) MATGEN (linpackup:/usr/people/larry/linpack64/linpackup.f) 275( 10.0) 2715( 99.2) DGEFA (linpackup:/usr/people/larry/linpack64/linpackup.f) 11( 0.4) 2726( 99.6) IDAMAX (linpackup:/usr/people/larry/linpack64/linpackup.f) 3( 0.1) 2729( 99.7) DMXPY (linpackup:/usr/people/larry/linpack64/linpackup.f) 3( 0.1) 2732( 99.8) DGESL (linpackup:/usr/people/larry/linpack64/linpackup.f) 1( 0.0) 2733( 99.9) memset (/usr/lib64/libc.so.1:/work/irix/lib/libc/libc_64_M4/strings/bzero.s) 1( 0.0) 2734( 99.9) fflush (/usr/lib64/libc.so.1:/work/irix/lib/libc/libc_64_M4/stdio/flush.c) 32 Analyzing Performance Data 1( 0.0) 2735( 99.9) _mixed_dtoa (/usr/lib64/libc.so.1:/work/irix/lib/libc/libc_64_M4/math/mixed_dtoa.c) 1( 0.0) 2736(100.0) wsfe (/usr/lib64/libftn.so:/work/cmplrs/libI77/wsfe.c) 1( 0.0) 2737(100.0) f_exit (/usr/lib64/libftn.so:/work/cmplrs/libI77/close.c) 2737(100.0) TOTAL • The overflows(%) column shows the number of overflows caused by the function, and the percentage of that number over the total number of overflows in the program. • The cum overflows(%) column shows a cumulative number and percentage of overflows. For example, the MATGEN function shows 307 overflows, but the cumulative number of overflows is 2440. • The procedure (dso:file) column shows the procedure name and the DSO and filename that contain the procedure. An ideal Experiment This section takes you through the steps to perform an ideal experiment. For more information on collecting ideal-time data, and basic block counting, see the “ideal Experiment” section in Chapter 4, “Experiment Types.” 1. From the command line, type ssrun -ideal linpackup This starts the experiment. First the executable and libraries are instrumented using pixie. 
This entails making copies of the libraries and executables, which are given a .pixie extension, and inserting information into the copies. Output from linpackup and from ssrun will be printed to stdout as shown in the example below. A data file with a name generated by concatenating the process name, linpackup, the experiment type, ideal, and the process ID is also generated. In this example, the filename is linpackup.ideal.17580. Beginning libraries /usr/lib32/libssrt.so /usr/lib32/libss.so /usr/lib32/libftn.so /usr/lib32/libm.so /usr/lib32/libc.so.1 33 Chapter 3: Tutorial for Fortran Users Ending libraries, beginning “linpackup” ... 2. To generate a report on the data collected, type the following at the command line: prof > ideal.results This command redirects output to a file called ideal.results. The file should contain results that look something like the following: Prof run at: Sun May 19 18:46:10 1996 Command line: prof linpackup.ideal.17580 5722510379: Total number of cycles 76.30014s: Total execution time 4906763725: Total number of instructions executed 1.166: Ratio of cycles / instruction 75: Clock rate in MHz R8000: Target processor modelled --------------------------------------------------------Procedures sorted in descending order of cycles executed. Unexecuted procedures are not listed. Procedures beginning with *DF* are dummy functions and represent init, fini and stub sections. 
---------------------------------------------------------
cycles(%)          cum %   secs   instrns     calls   procedure (dso:file)

5404032607(94.43)  94.43   72.05  4639092022  772633  daxpy(linpackup:linpackup.f)
 207582228( 3.63)  98.06    2.77   157405518      18  matgen(linpackup:linpackup.f)
  67844858( 1.19)  99.25    0.90    72325769      17  dgefa(linpackup:linpackup.f)
  19920277( 0.35)  99.60    0.27    17658342    5083  dscal(linpackup:linpackup.f)
  18115251( 0.32)  99.91    0.24    15675343    5083  idamax(linpackup:linpackup.f)
   4053920( 0.07)  99.98    0.05     3605124       1  dmxpy(linpackup:linpackup.f)
    786709( 0.01) 100.00    0.01      695776      17  dgesl(linpackup:linpackup.f)
     41357( 0.00) 100.00    0.00       83826    1116  __flsbuf(./libc.so.1:/work/irix/lib/libc/libc_n32_M4/stdio/_flsbuf.c)
     30294( 0.00) 100.00    0.00       29094       1  linp(linpackup:linpackup.f)
     17330( 0.00) 100.00    0.00       39823     867  x_putc(./libftn.so:/lv7/mtibuild/nodebug/workarea/mongoose/libI77/wsfe.c)
     12294( 0.00) 100.00    0.00       25795      28  x_wEND(./libftn.so:/lv7/mtibuild/nodebug/workarea/mongoose/libI77/wsfe.c)
     10620( 0.00) 100.00    0.00       14340      53  wrt_E(./libftn.so:/lv7/mtibuild/nodebug/workarea/mongoose/libI77/wrtfmt.c)
      9617( 0.00) 100.00    0.00       14889      71  do_fio64_mp(./libftn.so:/lv7/mtibuild/nodebug/workarea/mongoose/libI77/fmt.c)
      4940( 0.00) 100.00    0.00        7917     380

• The cycles(%) column reports the number and percentage of machine cycles used by the procedure. For example, 5404032607 cycles, or 94.43% of all cycles, were spent in the daxpy procedure.
• The cum % column shows the cumulative percentage of cycles. For example, 98.06% of all cycles were spent in the top two functions in the listing, daxpy and matgen.
• The secs column shows the number of seconds spent in the procedure. For example, 72.05 seconds were spent in the daxpy procedure. The time represents an idealized computation based on modelling the machine. It ignores potential floating point interlocks and memory latency time (cache misses and memory bus contention).
• The instrns column shows the number of instructions executed for a procedure. For example, 4639092022 instructions were executed in the daxpy procedure.
• The calls column reports the number of calls to the procedure. For example, there were 772633 calls to the daxpy procedure.
• The procedure (dso:file) column lists the procedure, its DSO name and file name. For example, the first line reports statistics for the procedure daxpy in the file linpackup.f in the linpackup executable.

This concludes the tutorial.

Chapter 4: Experiment Types

This chapter provides detailed information on each of the experiment types available within SpeedShop. It contains the following sections:
• “Selecting an Experiment”
• “usertime Experiment”
• “pcsamp Experiment”
• “Hardware Counter Experiments”
• “ideal Experiment”
• “fpe Trace”
For information on how to run the experiments described in this chapter, see Chapter 6, “Setting Up and Running Experiments: ssrun.”

Selecting an Experiment

Table 4-1 shows the experiments you can perform using the SpeedShop tools and the reasons why you might choose a specific experiment. The Clues column shows when you might use an experiment; the Data Collected column indicates the performance data collected by the experiment. For detailed information on the experiments listed, see the sections below.

Table 4-1 Summary of Experiments

usertime
   Clues: Slow program, nothing else known. Not CPU-bound.
   Data Collected: Inclusive and exclusive user time for each function, by sampling the callstack at 30-millisecond intervals.

ideal
   Clues: CPU-bound.
   Data Collected: Ideal CPU time at the function, source line and machine instruction levels, using instrumentation for basic block counting.

fpe trace
   Clues: High system time. Presence of floating point operations.
   Data Collected: All floating point exceptions, with the exception type and the callstack at the time of the exception.

pcsamp
   Clues: High user CPU time.
   Data Collected: Actual CPU time at the source line, machine instruction and function levels, by sampling the program counter at 10- or 1-millisecond intervals.

hwc
   Clues: High user CPU time.
   Data Collected: On R10000 class machines, exclusive counts at the source line, machine instruction and function levels for overflows of the following counters: Clock Cycle, Graduated Instructions, Primary Instruction-cache Misses, Secondary Instruction-cache Misses, Primary Data-cache Misses, Secondary Data-cache Misses, TLB Misses, Graduated Floating-point Instructions.

usertime Experiment

Note: For this experiment, o32 executables must explicitly link with -lexc.

The usertime experiment uses statistical call stack profiling, based on wall clock time, to measure inclusive and exclusive user time spent in each function while your program runs. This experiment uses a sampling interval of 30 milliseconds. Data is measured by periodically sampling the callstack. The program’s callstack data is used to attribute exclusive user time to the function at the bottom of each callstack (i.e., the function being executed at the time of the sample), and to attribute inclusive user time to all the functions above the one currently being executed. The time spent in a procedure is determined by multiplying the number of times an instruction for that procedure appears in the stack by the average time interval between call stacks. Call stacks are gathered whether the program is running or blocked; hence, the time computed represents the total time, both within and outside the CPU. If the target process was blocked for a long time as a result of an instruction, that instruction will show up as having a high time. usertime runs should incur a slowdown of execution of the program of no more than 15%. Data from a usertime experiment is statistical in nature and will show some variance from run to run.
pcsamp Experiment

The pcsamp experiment uses statistical PC sampling to estimate actual CPU time for each source code line, machine instruction and function in your program. The prof listing of this experiment shows exclusive PC-sampling time. This experiment is a lightweight, high-speed operation done with kernel support. The actual CPU time is calculated by multiplying the number of times an instruction appears in the PC samples by the interval specified for the experiment (either 1 or 10 milliseconds). To collect the data, the kernel regularly stops the process if it is in the CPU, increments a counter for the current value of the PC, and resumes the process.

The default sample interval is 10 milliseconds. If you specify the optional f prefix to the experiment, a sample interval of 1 millisecond is used.

The experiment uses 16-bit bins, based on user and system time. If the optional x suffix is used, 32-bit bins are used instead. A 16-bit bin allows a maximum of about 65,000 counts; a 32-bit bin allows about 65,000 times that many. Using 32-bit bins provides more accurate information, but requires additional disk space.

PC-sampling runs should incur a slowdown of execution of the program of no more than 5%. The measurements are statistical in nature, and will exhibit variance inversely proportional to the running time.

This experiment can be used together with the ideal experiment to compare actual and ideal times spent in the CPU. A major discrepancy between PC-sample CPU time and ideal CPU time indicates:
• Cache misses and floating point interlocks in a single process application
• Secondary cache invalidations in an application with multiple processes that is run on a multiprocessor
A comparison between basic block counts (ideal experiment) and PC profile counts (pcsamp experiment) is shown in Table 4-2.
Table 4-2 Basic Block Counts and PC Profile Counts Compared

Basic Block Counts                        PC Profile Counts
Used to compute ideal CPU time            Used to estimate actual CPU time
Data collection by instrumenting          Data collection done with the kernel
Slows program down by factor of three     Has minimal impact on program speed
Generates an exact count                  Statistical counts

Hardware Counter Experiments

The experiments described in this section are available for systems that have hardware counters (R10000 class machines). Hardware counters allow you to count various types of events, such as cache misses and counts of issued and graduated instructions.

A hardware counter works as follows: on each processor clock cycle, when events for which there are hardware counters occur, each event is counted by incrementing the appropriate hardware counter. For example, when a floating point instruction is graduated in a cycle, the Graduated Floating-point Instruction counter is incremented by one.

There are two tools that allow you to access hardware counter data: perfex, a command-line interface that provides program-level event information, and SpeedShop, which allows you to perform the hardware counter experiments described below. For more information on perfex and on hardware counters, see the perfex and r10k_counters reference pages.

In SpeedShop hardware counter experiments, overflows of a particular hardware counter are recorded. Each hardware counter is configured to count from zero to a number designated as the overflow value. When the counter reaches the overflow value, it is reset to zero, and a count of how many overflows have occurred at the present program instruction address is incremented. Each experiment provides two possible overflow values; the values are prime numbers, so any profiles that seem the same for both overflow values should be statistically valid.
The hardware counter experiments show where the overflows are being triggered in the program, at the function, source-line, and individual instruction level. When you run prof on the data collected during the experiment, the overflow counts are multiplied by the overflow value to compute the total number of events. These numbers are statistical. The generated reports show exclusive hardware counts; that is, information about where the program counter was, not the callstack used to get there.

Hardware counter overflow profiling experiments should incur a slowdown of execution of the program of no more than 5%. Count data is kept as 32-bit integers only.

[f]gi_hwc

The [f]gi_hwc experiment counts overflows of the Graduated Instruction counter. The Graduated Instruction counter is incremented by the number of instructions that were graduated on the previous cycle. The experiment uses statistical PC sampling based on overflows of the counter, at an overflow interval of 32771. If the optional f prefix is used, the overflow interval is 6553.

[f]cy_hwc

The [f]cy_hwc experiment counts overflows of the Cycle counter. The Cycle counter is incremented on each clock cycle. The experiment uses statistical PC sampling based on overflows of the counter, at an overflow interval of 16411. If the optional f prefix is used, the overflow interval is 3779.

[f]ic_hwc

The [f]ic_hwc experiment counts overflows of the Primary Instruction-cache Miss counter. The Primary Instruction-cache Miss counter is incremented one cycle after an instruction fetch request is entered into the Miss Handling Table. The experiment uses statistical PC sampling based on the overflow of the counter at an overflow interval of 2053. If the optional f prefix is used, the overflow interval is 419.

[f]isc_hwc

The [f]isc_hwc experiment counts overflows of the Secondary Instruction-cache Miss counter.
The Secondary Instruction-cache Miss counter is incremented after the last 16-byte block of a 64-byte primary instruction cache line is written into the instruction cache. The experiment uses statistical PC sampling based on the overflow of the counter at an overflow interval of 131. If the optional f prefix is used, the overflow interval is 29.

[f]dc_hwc

The [f]dc_hwc experiment counts overflows of the Primary Data-cache Miss counter. The Primary Data-cache Miss counter is incremented on the cycle after a primary cache data refill is begun. The experiment uses statistical PC sampling based on the overflow of the counter at an overflow interval of 2053. If the optional f prefix is used, the overflow interval is 419.

[f]dsc_hwc

The [f]dsc_hwc experiment counts overflows of the Secondary Data-cache Miss counter. The Secondary Data-cache Miss counter is incremented on the cycle after the second 16-byte block of a primary data cache line is written into the data cache. The experiment uses statistical PC sampling based on the overflow of the counter at an overflow interval of 131. If the optional f prefix is used, the overflow interval is 29.

[f]tlb_hwc

The [f]tlb_hwc experiment counts overflows of the TLB (translation lookaside buffer) counter. The TLB counter is incremented on the cycle after the TLB miss handler is invoked. The experiment uses statistical PC sampling based on the overflow of the counter at an overflow interval of 257. If the optional f prefix is used, the overflow interval is 53.

[f]gfp_hwc

The [f]gfp_hwc experiment counts overflows of the Graduated Floating-point Instruction counter. The Graduated Floating-point Instruction counter is incremented by the number of floating point instructions which graduated on the previous cycle. The experiment uses statistical PC sampling based on overflows of the counter, at an overflow interval of 32771. If the optional f prefix is used, the overflow interval is 6553.
prof_hwc

The prof_hwc experiment allows you to select the hardware counter to use in the experiment, and to set the counter overflow interval, using the following environment variables:

_SPEEDSHOP_HWC_COUNTER_NUMBER
   The value of this variable may be any number between 0 and 31. Hardware counters are described in the MIPS R10000 Microprocessor User’s Manual, Chapter 14, and on the r10k_counters reference page. The hardware counter numbers are provided in Table 4-3.

_SPEEDSHOP_HWC_COUNTER_OVERFLOW
   The value of this variable may be any number greater than 0. Some numbers may produce data that is not statistically random, but rather reflects a correlation between the overflow interval and cyclic behavior in the application. You may want to do two or more runs with different overflow values.

The experiment uses statistical PC sampling based on the overflow of the specified counter, at the specified interval. Note that these environment variables cannot be used for other hardware counter experiments; they are examined only when the prof_hwc experiment is specified.
Hardware Counter Numbers

The possible numeric values for the _SPEEDSHOP_HWC_COUNTER_NUMBER variable are:

Table 4-3 Hardware Counter Numbers

 0  Cycles
 1  Issued instructions
 2  Issued loads
 3  Issued stores
 4  Issued store conditionals
 5  Failed store conditionals
 6  Decoded branches
 7  Quadwords written back from secondary cache
 8  Correctable secondary cache data array ECC errors
 9  Primary instruction-cache misses
10  Secondary instruction-cache misses
11  Instruction misprediction from secondary cache way prediction table
12  External interventions
13  External invalidations
14  Virtual coherency conditions (or functional unit completions, depending on hardware version)
15  Graduated instructions
16  Cycles
17  Graduated instructions
18  Graduated loads
19  Graduated stores
20  Graduated store conditionals
21  Graduated floating point instructions
22  Quadwords written back from primary data cache
23  TLB misses
24  Mispredicted branches
25  Primary data-cache misses
26  Secondary data-cache misses
27  Data misprediction from secondary cache way prediction table
28  External intervention hits in secondary cache
29  External invalidation hits in secondary cache
30  Store/prefetch exclusive to clean block in secondary cache
31  Store/prefetch exclusive to shared block in secondary cache

ideal Experiment

The ideal experiment instruments the executables and any DSOs to permit basic block counting and counting of all dynamic (function-pointer) calls. This involves dividing the code into basic blocks, which are sets of instructions with a single entry point, a single exit point, and no branches into or out of the set. Counter code is inserted at the beginning of each basic block to increment a counter every time that basic block is executed.
The target executable and all the DSOs it uses are instrumented, including libc.so.1, libexc.so, libm.so, libss.so, and libssrt.so. Instrumented files with a .pixie extension are written to the current working directory. After the transformations are complete, the program’s symbol table and translation table are updated so that debuggers can map between transformed addresses and the original program’s addresses, and can reference the measured performance data to the untransformed code.

After instrumentation, ssrun executes the instrumented program. Data is generated provided the process either exits normally or receives a fatal signal that the program does not handle.

prof uses a machine model to convert the block execution counts into an idealized exclusive time at the function, source line or machine instruction level. By default, the machine model corresponds to the machine on which the target was run; you can specify a different machine model for the analysis. The calculation of ideal CPU time assumes that each instruction takes exactly one cycle, and it ignores potential floating point interlocks and memory latency time (cache misses and memory bus contention). Each system call is also assumed to take one cycle. The computed ideal time is therefore always less than the real time that any run would take. See Table 4-2 for a comparison of running a pcsamp experiment, which generates estimated actual CPU time, and running an ideal experiment.

Note that the execution time of an instrumented program is three to six times longer than that of an uninstrumented one. This timing change may alter the behavior of a program that deals with a graphical user interface (GUI), or that depends on events such as SIGALRM that are based on an external clock. Also, during analysis, the instrumented executable might appear to be CPU-bound, whereas the original executable was I/O-bound.
Basic block counts are translated to ideal CPU time displayed at the function, source line and machine line levels.

Inclusive Basic Block Counting

The basic block counting explained in the previous section allows you to measure ideal time spent in each procedure, but doesn’t propagate the time up to the caller of that procedure. For example, basic block counting may tell you that procedure sin(x) took the most time, but significant performance improvement can only be obtained by optimizing the callers of sin(x). Inclusive basic block counting solves this problem.

Inclusive basic block counting calculates cycles just like regular basic block counting, and then propagates them proportionately to the callers. The cycles of a procedure obtained using regular basic block counting (called exclusive cycles) are divided up among its callers in proportion to the number of times they called the procedure. For example, if sin(x) takes 1000 cycles, and its callers, procedures foo() and bar(), call sin(x) 25 and 75 times respectively, 250 cycles are attributed to foo() and 750 to bar(). By propagating cycles this way, __start ends up with all the cycles counted in the program.

Note: The assumption made in propagating times from a callee to a caller is that all calls are equivalent, so the time attributed is divided equally among all calls. For some functions (sin(), for example), this assumption is plausible. For others (matrix multiply, for example), the assumption can be very misleading. If foo() calls matmult() 99 times for 2x2 matrices, while bar() calls it once for 100x100 matrices, the inclusive time report will attribute 99% of matmult()’s time to foo(), but in fact almost all of the time derives from bar()’s one call.

To generate a report that shows inclusive time, specify the -gprof flag to prof.

fpe Trace

A floating point exception trace collects each floating point exception, with the exception type and the callstack at the time of the exception.
Floating-point exception tracing experiments should incur a slowdown of execution of the program of no more than 15%. These measurements are exact, not statistical. prof generates a report that shows inclusive and exclusive floating-point exception counts.

Chapter 5: Collecting Data on Machine Resource Usage

This chapter describes how to collect machine resource usage data using SpeedShop’s ssusage command. Finding out the machine resources that your program uses can help you identify performance bottlenecks and the performance experiments you need to run. You can use Table 1-1 to identify which experiments to run, based on the results of running ssusage on your program.

ssusage Syntax

ssusage prog_name [prog_args]

prog_name  The name of the executable for which you want to collect machine resource usage data.
prog_args  Arguments to your executable, if any.

ssusage Results

ssusage prints its output to stderr. For example, the command:

ssusage generic

provides output similar to the following:

...
22.03 real, 18.18 user, 0.21 sys, 7 majf, 120 minf, 0 sw, 241 rb, 0 wb, 135 vcx, 648 icx

The last two lines of the output are the machine resource usage information that ssusage provides. Each field in the report is described below.

real   The elapsed time during the command, in seconds.
user   The user CPU time, in seconds.
sys    The system CPU time, in seconds.
majf   Major page faults that cause physical I/O.
minf   Minor page faults that require mapping only.
sw     Process swaps.
rb/wb  Physical blocks read/written. Note that these are attributed to the process that first requests a block, and do not necessarily directly correlate with the process’ own I/O operations.
vcx    Voluntary context switches, that is, those caused by the process’ own actions.
icx    Involuntary context switches, that is, those caused by the scheduler.

If the program terminates abnormally, a message is printed before the usage line.

Chapter 6:
Setting Up and Running Experiments: ssrun

This chapter provides information on how to set up and run performance analysis experiments using the ssrun command. It consists of the following sections:
• “Building your Executable”
• “Setting Up Output Directories and Files”
• “Running Experiments”
• “Running Experiments on MPI Programs”
• “Using Calipers”
• “Effects of ssrun”
• “Customizing Your Environment”

Building your Executable

The ssrun command is designed to be used with normally built executables and default environment settings. However, there are some cases where you need to change the way you build your executable or set certain environment variables. This section lists the conditions under which you may need to change the way you build your executable program. For information on setting environment variables, see the section “Customizing Your Environment.”

• If you have used the ssrt_caliper_point function provided in the SpeedShop libraries, you need to explicitly link in the SpeedShop libraries libss.so and libssrt.so. For more information on setting caliper points, see the section “Using Calipers.”
• If you are planning to build your executable using the -32 option to cc, and you want to run the usertime experiment, you must add -lexc to the link line. For more information on cc -32, see the cc reference page.
• If you have built a stripped executable, you need to rebuild a non-stripped version to use with SpeedShop. For example, if you are using ld to link your C program, do not use the -s option, since this strips debugging information from the program object and makes the program unusable for performance analysis.
• If you have used compiler optimization level 3 and you are performing experiments that report function-level information, the procedure inlining the compiler performs can result in extremely misleading profiles: the time spent in an inlined procedure shows up in the profile as time spent in the procedure into which it was inlined. It is generally better to use compiler optimization level 2 or less when gathering an execution profile.
• If you are compiling MP Fortran programs, you may encounter anomalies in the displayed data:
  – For all Fortran MP compilations, parallel loops within the program are represented as subroutines with names relating to the source routine in which they are embedded. The naming conventions for these subroutines are different for 32-bit and 64-bit compilations. For example, in linpack, most of the time is spent in the routine DAXPY, which can be parallelized. In a 64-bit MP version, the routine has the name “DAXPY”, but most of that work is done in the MP routine named “DAXPY.PREGION1”. In a 32-bit version, the DAXPY routine is named “daxpy_”, and the MP routine “_daxpy_519_aaab_”.
  – In both 32-bit and 64-bit compilations with the -g option, for an ideal experiment, the source annotations behave differently, and incorrectly in most cases. In 64-bit source annotations, the exclusive time is correctly shown for each line, but the inclusive time for the first line of the loop (the do statement) includes the time spent in the loop body. This same time appears on the lines comprising the loop’s body, in effect representing a double-counting. In 32-bit source annotations, the exclusive time is incorrectly shown for the line comprising the loop’s body. The line-level data for the loop-body routine (“_daxpy_519_aaab_”) does not refer to proper lines.
If the program was compiled with the -mp_keep flag, the line-level data should refer to the temporary files that are saved from the compilation, but the temporary files do not contain that information, so no source or disassembly data can be shown. The disassembly data for the main routine does not show the times for the loop body.

If the 32-bit program was compiled without the -mp_keep flag, the line-level data for the loop-body routine is incorrect. Most lines refer to line 0 of the file, and the rest to other lines at seemingly random places in the file. Consequently, spurious annotations will appear on these other lines. Disassembly correctly shows the instructions and their data, but the line numbers are wrong. This reflects essentially the same double-counting problem as seen in 64-bit compilations, but the extra counts go to other places in the file, rather than to the first line of the loop.

Setting Up Output Directories and Files

When you run an experiment, performance data files are written to the current working directory by default, and they are named using the following convention:

prog_name.exp_type.pid

In a single-process application, ssrun generates a single performance data file; in a multi-process application, it generates a performance data file for each process. You can change the default filename or directory for performance data files using environment variables:
• Set _SPEEDSHOP_OUTPUT_DIRECTORY to the directory you want to use if you want to generate performance data files in a directory other than the current working directory.
• Set _SPEEDSHOP_OUTPUT_FD to the number of the file descriptor to be used for writing the output file if you want to specify a file to which to write the performance data.
• Set _SPEEDSHOP_OUTPUT_FILENAME to the filename you want to use for recording performance data. If _SPEEDSHOP_OUTPUT_DIRECTORY is also specified, it is prepended to the filename you specify.
Running Experiments

This section describes how to use ssrun to perform experiments. For information on using pixie directly, see Chapter 8, “Using SpeedShop in Expert Mode: pixie.”

ssrun Syntax

ssrun [flags] -exp_type prog_name [prog_args]

flags      Zero or more of the flags described in Table 6-1 that control the data collection, the treatment of descendent processes or programs, and how the data is to be externalized.
-exp_type  The experiment type. Experiments are described in detail in Chapter 4, “Experiment Types.”
prog_name  The name of the program on which you want to run an experiment.
prog_args  Arguments to your program, if any.

ssrun generates a performance data file that is named as described in the section “Setting Up Output Directories and Files.”

Table 6-1 Flags for ssrun

Name                 Result
-hang                Specifies that the process should be left waiting just before executing its first instruction. This allows you to attach the process to a debugger.
-mo marching_orders  Allows you to specify marching orders; if this option is used, the environment variable _SSRUNTIME_MARCHING_ORDERS is not examined.
-name target_name    Specifies that the target should be run with argv[0] set to target_name.
-purify              Can be used only when the Purify product is installed. Specifies that purify should be run on the target, and then runs the resulting “purified” executable. Note that -purify and SpeedShop performance experiments cannot be combined.
-v                   Prints a log of the operation of ssrun to stderr. The same behavior occurs if the environment variable _SPEEDSHOP_VERBOSE is set to an empty string.
-V                   Prints a detailed log of the operation of ssrun to stderr. The same behavior occurs if the environment variable _SPEEDSHOP_VERBOSE is set to a non-zero-length string. This option can be used to see how to set the various environment variables, and how to invoke instrumentation when necessary.
ssrun Examples

This section provides examples of using ssrun with options and experiment types. For additional examples, see Chapter 2, “Tutorial for C Users,” or Chapter 3, “Tutorial for Fortran Users.”

Example Using the pcsampx Experiment

The pcsampx experiment collects data to estimate the actual CPU time for each source code line, machine instruction and function in your program. The optional x suffix causes a 32-bit bin size to be used, allowing a larger number of counts to be recorded. For a more detailed description of the pcsamp experiment, see the “pcsamp Experiment” section in Chapter 4, “Experiment Types.”

ssrun -pcsampx generic

To see the performance data that has been generated, run prof on the performance data file, generic.pcsampx.16064:

prof generic.pcsampx.16064

The report is printed to stdout. (The layout of this report has been altered slightly to accommodate presentation needs.) For more information on prof and the reports generated by prof, see Chapter 7, “Analyzing Experiment Results: prof.”

-------------------------------------------------------------------------------
Profile listing generated Thu May 23 10:30:40 1996 with:
   prof generic.pcsampx.16064
-------------------------------------------------------------------------------
samples  time  CPU    FPU    Clock     N-cpu  S-interval  Countsize
   2058  21s   R4000  R4010  150.0MHz  1      10.0ms      4(bytes)

Each sample covers 4 bytes for every 10.0ms ( 0.05% of 20.5800s)
-------------------------------------------------------------------------------
  -p[rocedures] using pc-sampling.
  Sorted in descending order by the number of samples in each procedure.
  Unexecuted procedures are excluded.
------------------------------------------------------------------------------samples time(%) cum time(%) procedure (dso:file) 1926 19s( 93.6) 19s( 93.6) anneal (generic:/usr/demos/SpeedShop/genericn32/generic.c) 111 1.1s( 5.4) 20s( 99.0) slaveusrtime (/usr/demos/SS/genericn32/dlslave.so:/usr/demos/SS/genericn32/dlslave.c) 55 Chapter 6: Setting Up and Running Experiments: ssrun 15 0.15s( 0.7) 21s( 99.7) _read (/usr/lib32/libc.so.1:/work/irix/lib/libc/libc_n32_M3/sys/read.s) 2 0.02s( 0.1) 21s( 99.8) memcpy (/usr/lib32/libc.so.1:/work/irix/lib/libc/libc_n32_M3/strings/bcopy.s) 1 0.01s( 0.0) 21s( 99.9) _xstat (/usr/lib32/libc.so.1:/work/irix/lib/libc/libc_n32_M3/sys/xstat.s) 1 0.01s( 0.0) 21s( 99.9) _ltzset (/usr/lib32/libc.so.1:/work/irix/lib/libc/libc_n32_M3/gen/time_comm.c) 1 0.01s( 0.0) 21s(100.0) __sinf (/usr/lib32/libm.so:/work/cmplrs/libm/fsin.c) 1 0.01s( 0.0) 21s(100.0) _write (/usr/lib32/libc.so.1:/work/irix/lib/libc/libc_n32_M3/sys/write.s) 2058 21s(100.0) 21s(100.0) TOTAL Example Using the -v Option To get information about how a SpeedShop experiment is set up and performed, you can supply the -v option to ssrun. This example performs a pcsampx experiment on the generic executable: ssrun -v -pcsampx generic The ssrun command writes the following output to stderr. It displays information as the command line is parsed and shows the environment variables that ssrun sets. fraser 75% ssrun -v -pcsampx generic ssrun: ssrun: ssrun: ssrun: ... 56 setenv setenv setenv setenv _SPEEDSHOP_MARCHING_ORDERS pc,4,10000,0:cu _SPEEDSHOP_EXPERIMENT_TYPE pcsampx _SPEEDSHOP_TARGET_FILE generic _RLD_LIST libss.so:libssrt.so:DEFAULT Running Experiments on MPI Programs Using ssrun with a Debugger To use the ssrun command in conjunction with a debugger such as dbx or the ProDev WorkShop debugger, you need to call ssrun with the -hang option and your program. The following steps assume you want to run the FPE trace experiment on generic, and then run generic in a debugger. 1. 
Call ssrun as follows:

    ssrun -hang -fpe generic

ssrun parses the command line, sets up the environment for the experiment, calls the target process using exec, and hangs the target process on exiting from the call to exec.

2. Get the process ID of the call to ssrun, using a command such as ps.

3. Start your debugging session.

4. Attach the process to the debugger.

5. Run the process from the debugger.

You can also invoke ssrun from within a debugger. In this case, ssrun leaves the target hung on exiting the call to exec, and informs the debugger of that fact.

You can also use either dbx or the WorkShop debugger to set calipers to record performance data for a part of your program. See the section “Using Calipers” for more information on setting calipers.

Running Experiments on MPI Programs

The Message Passing Interface (MPI) is a library specification for message passing, proposed as a standard by a committee of vendors, implementors, and users. It allows processes to communicate by “mailing” data “messages” to other processes, even those running on distant computers.

If your program uses MPI, you need to set up SpeedShop experiments a little differently:

1. Set up a shell script that contains the call to ssrun and the experiment you want to run.

For example, if you have a program called testit, and you want to run the pcsampx experiment, a script named exp_script might look like the following:

    #!/bin/sh
    ssrun -pcsampx testit

2. Call mpirun with the script name:

    mpirun -np 6 exp_script

Using Calipers

In some cases, you may want to generate performance data reports for only a part of your program. You can do this by setting caliper points to identify the area or areas for which you want to see performance data. When you run prof, you can specify a region for which to generate a report by supplying the -calipers option and the appropriate caliper numbers.
For more information on prof -calipers, see “Using the -calipers Option” in Chapter 7, “Analyzing Experiment Results: prof.”

You can set caliper points in three different ways:

• You can explicitly link with the SpeedShop runtime and call ssrt_caliper_point to record a caliper sample. This is useful when you want to set a caliper point at a specific location in a file.

• You can define a signal to be used to record a caliper sample by specifying the signal as the value of the environment variable _SPEEDSHOP_CALIPER_POINT_SIG and then sending the target the given signal. This is useful if you want to be able to set a caliper point as your program is running.

• You can set a caliper sample trap in dbx or the WorkShop debugger. This is done by setting a breakpoint and, when the process stops, evaluating the expression libss_caliper_point(1). This is useful if you are working with a debugger in conjunction with SpeedShop.

An implicit caliper point is always present at the start of execution of the process, and a final caliper point is recorded when the process calls _exit. The implicit caliper point at the beginning of the program is numbered 0, the first caliper point recorded is numbered 1, and any additional caliper points are numbered sequentially.

In addition, caliper points are automatically recorded under the following circumstances, to ensure that at least one valid set of data is recorded:

• When a fatal signal is received, such as SIGQUIT, SIGILL, SIGTRAP, SIGABRT, SIGEMT, SIGFPE, SIGBUS, SIGSEGV, SIGSYS, SIGXCPU, or SIGXFSZ. Note that this list does not include SIGKILL.

• When the program calls an exec function, such as execve or execvp.

• When the program closes a DSO by calling dlclose.

• When an exit signal is received, such as SIGHUP, SIGINT, SIGPIPE, SIGALRM, SIGTERM, SIGUSR1, SIGUSR2, SIGPOLL, SIGIO, SIGRTMIN, or SIGRTMAX.

Setting Calipers with ssrt_caliper_point

1. Insert calls to ssrt_caliper_point in your source code.

You can insert one or more calls at any point in your code. The function call should be given the argument 1 (True).

    ...
    ssrt_caliper_point(1);
    ...

2. Link the SpeedShop libraries libss.so and libssrt.so into your application. The libraries can be placed in any order on the link line.

3. Run your program with ssrun and the desired experiment type. For example, if you want to run the ideal experiment on generic:

    ssrun -ideal generic

The caliper points you have set in the source file are recorded in the performance data file that is generated by ssrun.

Setting Calipers with Signals

1. Set the _SPEEDSHOP_CALIPER_POINT_SIG variable to the signal number you want to use.

You must choose a signal that does not terminate the program. The signal should also not be caught by the target program, as this would interfere with its use for triggering a caliper point. The following are good choices because they do not have any semantics already associated with them:

    SIGUSR1   16   /* user defined signal 1 */
    SIGUSR2   17   /* user defined signal 2 */

2. Run ssrun with your program.

3. Run a command such as ps or top to determine the process ID of ssrun. This is also the process ID of the program you are working on.

4. Send the signal you used in step 1 to the process, using the kill command:

    kill -sig_num pid

A caliper point is set at the point in the program where the signal was received by the SpeedShop runtime.

Setting Calipers with a Debugger

From either dbx or the WorkShop debugger, you can set a caliper point anywhere it is possible to set a breakpoint: function entry or exit, line numbers, execution addresses, watchpoints, or pollpoints (timer-based). You can also attach conditions and/or cycle counts.

1. Set a breakpoint in your program at the point at which you want to set a caliper point.

2. When the process stops, evaluate the expression libss_caliper_point(1).
The evaluation of the expression always returns zero, but a side effect of the evaluation is the recording of the appropriate data.

3. Resume execution of the process.

Effects of ssrun

When you call ssrun, the following operations are executed for all experiments:

• The following environment variables are set: _SPEEDSHOP_MARCHING_ORDERS, _SPEEDSHOP_EXPERIMENT_TYPE, and _SPEEDSHOP_TARGET_FILE. For more information on these variables, see “Customizing Your Environment.”

• The SpeedShop libraries libss.so and libssrt.so are inserted as part of your executable, using the environment variable _RLD_LIST.

• The target process is invoked by calling exec.

• The SpeedShop runtime library writes the appropriate experiment data to the output file.

Effects of ssrun -ideal

When you run an ideal experiment, the following additional operations occur:

• libssrt.so is inserted immediately in front of libc.so.1 in the executable’s library list.

• ssrun generates .pixie versions of the executable and of all the libraries that the program uses. The .pixie files are written to the current working directory, and include code that allows performance data to be collected for each function and basic block.

For more information on the ideal experiment, see the “ideal Experiment” section in Chapter 4, “Experiment Types.”

Customizing Your Environment

This section provides information on environment variables that you can use to customize your environment for ssrun:

• “Setting Environment Variables for Spawned Processes”
• “Setting General Environment Variables”

Setting Environment Variables for Spawned Processes

If your program spawns processes using fork, exec, or sproc, you can use the environment variables described in Table 6-2 to control whether performance data is collected for the spawned processes.
Table 6-2   Environment Variables for Spawned Processes

_SPEEDSHOP_TRACE_FORK
    When set to True, specifies that processes spawned by calls to fork are monitored. It is set to True by default.

_SPEEDSHOP_TRACE_EXEC
    When set to True, specifies that processes spawned by calls to exec are monitored. It is set to True by default.

_SPEEDSHOP_TRACE_SPROC
    When set to True, specifies that processes spawned by calls to sproc are monitored. It is set to True by default.

Setting General Environment Variables

You can set any of the following environment variables to customize the way that ssrun works. For a complete list of SpeedShop environment variables, see the speedshop reference page.

Table 6-3   Environment Variables for ssrun

_SPEEDSHOP_VERBOSE
    Writes a log of each program’s operation to stderr. If it is set to an empty string, only major events are logged; if it is set to a non-empty string, more detailed events are logged.

_SPEEDSHOP_SILENT
    When set to an empty or non-empty string, suppresses all output other than fatal error messages from SpeedShop. Overrides _SPEEDSHOP_VERBOSE.

_SPEEDSHOP_SAMPLING_MODE
    Used for PC-sampling and hardware-counter profiling. If set to 1, data is generated for the base executable only. If it is not set, or is set to anything other than 1, data is generated for the executable and all DSOs.

_SPEEDSHOP_INIT_DEFERRED_SIGNAL
    When set to a signal number, the experiment is not initialized when the target process starts; initialization is delayed until the specified signal is sent to the process. A handler for the signal is installed when the process starts. You must ensure that the handler is not overwritten by the target code.

_SPEEDSHOP_SBRK_BUFFER_LENGTH
    Defines the maximum size of the internal malloc arena used. This arena is completely separate from the user’s arena and has a default size of 0x400000.

_SPEEDSHOP_FILE_BUFFER_LENGTH
    Defines the size of the buffer used for writing experiment files. The default length is 8 KB. The buffer is used for writing small records to the file; large records are written directly, to avoid the buffering overhead.

There are two additional environment variables: _SPEEDSHOP_EXPERIMENT_TYPE and _SPEEDSHOP_MARCHING_ORDERS. These variables are set by ssrun when you run an experiment; if you are using ssrun, you do not need to set them. _SPEEDSHOP_EXPERIMENT_TYPE passes the name of the experiment to the runtime. ssrun parses the command line and generates marching orders for the experiment; the _SPEEDSHOP_MARCHING_ORDERS variable passes those marching orders to the runtime. For examples of these variables, see “Example Using the -v Option.”

Chapter 7: Analyzing Experiment Results: prof

This chapter provides information on how to view and analyze experiment results. It consists of the following sections:

• “Using prof to Generate Performance Reports”
• “Using prof with ssrun”
• “Using prof Options”
• “Generating Reports For Different Machine Types”
• “Generating Reports for Multiprocessed Executables”
• “Generating Compiler Feedback Files”
• “Interpreting Reports”

Using prof to Generate Performance Reports

Performance data is examined using prof, a text-based report generator that prints to stdout. The prof command can be used in two modes:

• To generate a report from performance data gathered during experiments recorded by ssrun:

    prof ...

This chapter focuses on the use of prof to generate reports from ssrun experiments.
• To generate a report from data files produced by running a program that has been instrumented by pixie:

    prof executable_name [options] [pixie counts file]

You can find information on this use of prof in Chapter 8, “Using SpeedShop in Expert Mode: pixie.”

prof Syntax

The syntax for prof when using it with data files from ssrun is:

    prof options data_file data_file ...

options       Zero or more of the options described in Table 7-1.

data_file     One or more names of performance data files generated by ssrun. These files usually have the format prog_name.exp_type.pid.

prof Options

Table 7-1 lists prof options. For more information, see the prof reference page.

Table 7-1   Options for prof

-calipers n1 n2
    Restricts analysis to a segment of program execution. This option works only for SpeedShop experiments. It causes prof to compute the data between caliper points n1 and n2, rather than for the entire experiment. If n1 >= n2, an error is reported. If n1 is negative, it is set to the beginning of the experiment. If n2 is greater than the maximum number of caliper points recorded, it is set to the maximum. If n1 is omitted, zero (the beginning of the program) is assumed.

-c[lock] n
    Lists the number of seconds spent in each routine, based on the CPU clock frequency n, expressed in megahertz. This option is useful when generating reports for ideal experiments, or for basic block counting data obtained with pixie. The default is to use the clock frequency of the machine where the performance data was collected.

-cycle n
    Sets the cycle time to n nanoseconds.

-den[sity]
    Prints a list of procedures with non-zero instruction cycles, sorted by instruction density, which is the number of cycles per instruction.

-dis[assemble]
    Disassembles and annotates the analyzed object code with cycle times if you have run an ideal experiment or collected data using pixie, or with the number of PC samples if you have run a pcsamp experiment.

-dso [dso_name]
    Generates a report only for the named DSO. If you don’t specify dso_name, prof prints a list of applicable DSO names. Only the basename of the DSO needs to be specified.

-dsolist
    Lists all the DSOs in the program and their start and end text addresses.

-e[xclude] proc1...procN
    Excludes information on the procedures specified. If you specify uppercase -E, prof also omits the specified procedures from the base upon which it calculates percentages.

-feedback
    Produces files with information that can be used to (a) arrange procedures in the binary in an optimal ordering using cord, and (b) tell the compiler how to optimize compilation of the program using cc -fb filename.cfb. This option can be used when generating reports for ideal experiments, or for basic block counting data obtained with pixie. cord feedback files are named program.fb or libso.fb; compiler feedback files are named program.cfb or libso.cfb. These are binary files and may be dumped using the fbdump command. Procedures are normally ordered by their measured invocation counts; if -gprof is also specified, procedures are ordered using call graph counts rather than invocation counts.

-gprof
    Calculates cycles and propagates basic block counts to a procedure’s callers proportionately. This option can be used when generating reports for ideal experiments, or for basic block counting data obtained with pixie.

-h[eavy]
    Lists the most heavily used lines of source code in descending order of use, sorting lines by their frequency of use. This option can be used when generating reports for ideal experiments, or for basic block counting data obtained with pixie.

-i[nvocations]
    Lists the number of times each procedure is invoked.

-l[ines]
    Lists the most heavily used lines of source code in descending order of use, but lists lines grouped by procedure, sorted by cycles executed per procedure.

-nocounts
    Analyzes an executable or a .o file using the pixie machine model, assuming each instruction is executed once. This analysis cannot match any possible real run of an executable that contains one or more conditional branch instructions.

-o[nly] proc1...procN
    Reports information on only the procedures specified. If you specify uppercase -O, prof uses only those procedures, rather than the entire program, as the base upon which it calculates percentages.

-p[rocedures]
    Lists the time spent in each procedure.

-q[uit] n
    Condenses output listings by truncating the -p[rocedures], -h[eavy], -l[ines], and -gprof listings. You can specify n in three ways: an integer n truncates everything after n lines; an integer followed by a percent sign (n%) truncates everything after the line containing n% calls in the %calls column; and an integer followed by cum% (ncum%) truncates everything after the line containing ncum% calls in the cum% column, that is, after the last procedure that brings the cumulative total to n%. If -gprof is also specified, it behaves the same as -q n%. For example, -q 15 truncates each part of the report after 15 lines of text; -q 15% truncates each part of the report after the first line that represents less than 15% of the whole; and -q 15cum% truncates each part of the report after the line that brings the cumulative percentage above 15%.

-r10000|-r8000|-r5000|-r4000|-r3000
    Overrides the default processor scheduling model that prof uses to generate a report. If this option is not specified, prof uses the scheduling model for the processor on which the experiment was run.

-S (-source)
    Disassembles and annotates the analyzed object code with cycle times, or PC samples, and source code.

-z[ero]
    Lists the procedures that are never invoked. This option can be used when generating reports for ideal experiments, or for basic block counting data obtained with pixie.

prof Output

prof generates a performance report that is printed to stdout. Warnings and fatal errors are printed to stderr.

Note: Fortran alternate entry point times are attributed to the main function or subroutine, since there is no general way for prof to separate the times for the alternate entries.

Using prof with ssrun

When you call prof with one or more SpeedShop performance data files, it collects the data from all the output files and produces a listing that depends on the experiment type. The prof command detects which experiment was run and generates an appropriate report; it provides reports for all experiment types. When prof accepts more than one data file as input, it sums up the results. The multiple input data files must be generated from the same executable, using the same experiment type.

prof may report times for procedures named with a prefix of *DF*, for example *DF*_hello.init_2. DF stands for “Dummy Function,” and indicates cycles spent in parts of text that are not in any function: init, fini, and MIPS.stubs sections, for example.

The types of reports that prof generates are described in the following sections:

• “usertime Experiment Reports”
• “pcsamp Experiment Reports”
• “Hardware Counter Experiment Reports”
• “ideal Experiment Reports”
• “FPE Trace Reports”

usertime Experiment Reports

For usertime experiments, prof generates a list of the callers and callees of each function, with information on how much time was spent in the function, its callers, and its callees.
The function names are shown in the right-hand column of the report, with the function being reported outdented from its caller and callee(s). For example, the first function shown in this report is __start, which has no callers and two callees. The remaining columns are described below.

• The index column provides an index number for reference.

• The %time column shows the cumulative percentage of time spent in each function. For example, 99.9% of the time was spent in Scriptstring and all functions listed below it.

• The self column shows how much time, in seconds, was spent in the function itself. For example, less than one hundredth of a second was spent in __start, but 0.03 of a second was spent in __readenv_sigfpe.

• The descendents column shows how much time, in seconds, was spent in callees of the function. For example, 21.48 seconds were spent in the callees of main.

• The caller/total, total (self), callee/descend column provides information on the number of cycles, out of the total, spent in the function, its callers, and its callees. For example, the anneal function (index number 5) shows 623/623 for its caller (usrtime), 623(622) for itself, and 1/1 for its callee (init2da).

-------------------------------------------------------------------------------
Profile listing generated Sun May 19 16:32:23 1996
    with:  prof generic.usertime.14427
-------------------------------------------------------------------------------
Total Time (secs)      : 21.51
Total Samples          : 717
Stack backtrace failed : 0
Sample interval (ms)   : 30
CPU                    : R4000
FPU                    : R4010
Clock                  : 150.0MHz
Number of CPUs         : 1
-------------------------------------------------------------------------------
                                     caller/total     parents
index  %time   self  descendents     total (self)     name     index
                                     callee/descend   children
-------------------------------------------------------------------------------
[1]    100.0%  0.00     21.51          717 (0)                   __start [1]
               0.00     21.48          716/717     0x10001b44    main [2]
               0.03      0.00            1/717     0x10001b04    __readenv_sigfpe [20]

pcsamp Experiment Reports

For [f]pcsamp[x] experiments, prof generates a function list annotated with the number of samples taken for each function, and the estimated time spent in the function.

• The samples column shows how many samples of the function were taken.

• The time(%) column shows the amount of time, and the percentage of that time over the total time, that was spent in the function.

• The cum time(%) column shows how much time has been spent up to and including the procedure being examined.

• The procedure (dso:file) column lists the procedure, its DSO name, and its file name. For example, the first line reports statistics for the procedure anneal in the file generic.c in the generic executable.
-------------------------------------------------------------------------------
Profile listing generated Sun May 19 17:21:27 1996
    with:  prof generic.fpcsamp.14480
-------------------------------------------------------------------------------
samples   time   CPU    FPU    Clock     N-cpu  S-interval  Countsize
  19077   19s    R4000  R4010  150.0MHz  1      1.0ms       2(bytes)

Each sample covers 4 bytes for every 1.0ms ( 0.01% of 19.0770s)

-------------------------------------------------------------------------------
  -p[rocedures] using pc-sampling.
  Sorted in descending order by the number of samples in each procedure.
  Unexecuted procedures are excluded.
-------------------------------------------------------------------------------
samples   time(%)      cum time(%)    procedure (dso:file)

  17794   18s( 93.3)   18s( 93.3)     anneal (/usr/demos/SShop/genericn32/generic:/usr/demos/SShop/genericn32/generic.c)

Hardware Counter Experiment Reports

For the various hwc experiments, prof generates a function list annotated with the number of overflows generated by each function.

• The overflows(%) column shows the number of overflows caused by the function, and the percentage of that number over the total number of overflows in the program.

• The cum overflows(%) column shows a cumulative number and percentage of overflows. For example, the anneal function shows two overflows, but the cumulative number of overflows is 6: 2 from anneal and 4 from memcpy.

• The procedure (dso:file) column shows the procedure name, and the DSO and file name that contain the procedure.
-------------------------------------------------------------------------------
Profile listing generated Sun May 19 17:35:21 1996
    with:  prof generic.dsc_hwc.5999
-------------------------------------------------------------------------------
Counter                : Sec cache D misses
Counter overflow value : 131
Total number of ovfls  : 10
CPU                    : R10000
FPU                    : R10010
Clock                  : 196.0MHz
Number of CPUs         : 1
-------------------------------------------------------------------------------
  -p[rocedures] using counter overflow.
  Sorted in descending order by the number of overflows in each procedure.
  Unexecuted procedures are excluded.
-------------------------------------------------------------------------------
overflows(%)   cum overflows(%)   procedure (dso:file)

    4( 40.0)       4( 40.0)       memcpy (/usr/lib64/libc.so.1:/work/irix/lib/libc/libc_64_M4/strings/bcopy.s)

ideal Experiment Reports

For ideal experiments, prof generates a function list annotated with the number of cycles and instructions attributed to each function, and the estimated time spent in the function. prof does not take into account interactions between basic blocks: within a single basic block, prof computes the cycles for one execution and multiplies that by the number of times the basic block is executed.

If any of the object files linked into the application have been stripped of line-number information (with ld -x, for example), prof warns about the affected procedures. The instruction counts for such procedures are shown as a procedure total, not on a per-basic-block basis. In a report on a function without line numbers, question marks appear where a line number would normally appear.

• The cycles(%) column reports the number and percentage of machine cycles used by the procedure. For example, 2524610038 cycles, or 94.81% of all cycles, were spent in the anneal procedure.

• The cum% column shows the cumulative percentage of cycles. For example, 99.88% of all cycles were spent in the top two functions in the listing, anneal and slaveusrtime.

• The secs column shows the number of seconds spent in the procedure. For example, 16.83 seconds were spent in the anneal procedure. The time represents an idealized computation based on modelling the machine; it ignores potential floating point interlocks and memory latency (cache misses and memory bus contention).

• The instrns column shows the number of instructions executed for a procedure. For example, 1797940023 instructions were executed in the anneal procedure.

• The calls column reports the number of calls to the procedure. For example, there was just one call to the anneal procedure.

• The procedure (dso:file) column lists the procedure, its DSO name, and its file name. For example, the first line reports statistics for the procedure anneal in the file generic.c in the generic executable.

Prof run at: Sun May 19 17:49:10 1996
Command line: prof generic.ideal.14517

2662778531: Total number of cycles
 17.75186s: Total execution time
1875323907: Total number of instructions executed
     1.420: Ratio of cycles / instruction
       150: Clock rate in MHz
     R4000: Target processor modelled

---------------------------------------------------------
Procedures sorted in descending order of cycles executed.
Unexecuted procedures are not listed. Procedures
beginning with *DF* are dummy functions and represent
init, fini and stub sections.
---------------------------------------------------------
     cycles(%)     cum %   secs     instrns    calls  procedure(dso:file)

2524610038(94.81)  94.81   16.83  1797940023       1  anneal(generic:/usr/demos/SShop/genericn32/generic.c)

If the -gprof flag is added to prof, a list of the callers and callees of each function is provided:

                            self           kids       called/total     parents
index  cycles(%)            self(%)        kids(%)    called+self      name      index
                            self           kids       called/total     children

[1]    2661528037(99.95%)     71( 0.00%)   2661527966(100.00%)   0     __start [1]
                  44          2661527913       1/1                     main [2]
                   5                   0       1/1                     __istart [107]
                   4                   0       1/1                     __readenv_sigfpe [108]
--------------------------------------------------------------------------------
                  44          2661527913       1/1                     __start [1]
[2]    2661527957(99.95%)     44( 0.00%)   2661527913(100.00%)   1     main [2]
                2152          2661524760       1/1                     Scriptstring [3]
                  67                 934       1/1                     exit [55]
--------------------------------------------------------------------------------
                2152          2661524760       1/1                     main [2]
[3]    2661526912(99.95%)   2152( 0.00%)   2661524760(100.00%)   1     Scriptstring [3]
                  40          2525080081       1/1                     usrtime [4]
                  82           135044460       1/1                     libdso [6]
               68058             1148856       1/2                     iofile [10]
                 124               52933       2/8                     genLog [16]
                7211               45001       1/1                     dirstat [27]
                1438               32051       1/1                     linklist [31]
                 632               32051       1/1                     fpetraps [32]
                 124               10922       2/19                    fprintf [20]
                 696                   0       45/45                   strcmp [61]
--------------------------------------------------------------------------------
                  40          2525080081       1/1                     Scriptstring [3]
[4]    2525080121(94.83%)     40( 0.00%)   2525080081(100.00%)   1     usrtime [4]
          2524610038              437992       1/1                     anneal [5]
                  62               26466       1/8                     genLog [16]
                  62                5461       1/19                    fprintf [20]

FPE Trace Reports

For fpe experiments, the report shows information for each function, its callers, and its callees. The function names are shown in the right-hand column of the report, with the function being reported outdented from its caller and callee(s). For example, the first function shown in this report is __start, which has no callers and one callee. The remaining columns are described below.

• The index column provides an index number for reference.

• The %FPEs column shows the percentage of the total number of floating point exceptions that occurred in the function.

• The self column shows how many floating point exceptions occurred in the function itself. For example, 0 floating point exceptions occurred in __start.

• The descendents column shows how many floating point exceptions occurred in the descendents of the function. For example, 4 floating point exceptions occurred in the descendents of main.

• The caller/total, total (self), callee/descend column provides information on the number of floating point exceptions, out of the total, that occurred in the function, its callers, and its callees.

• The parents, name, children column shows the function names, as described above.

-------------------------------------------------------------------------------
Profile listing generated Tue May 7 19:21:30 1996
    with:  prof generic.fpe.2334
-------------------------------------------------------------------------------
Total FPEs             : 4
Stack backtrace failed : 0
CPU                    : R4000
FPU                    : R4010
Clock                  : 150.0MHz
Number of CPUs         : 1
-------------------------------------------------------------------------------
                                     caller/total     parents
index  %FPEs   self  descendents     total (self)     name     index
                                     callee/descend   children
-------------------------------------------------------------------------------
[1]    100.0%  0     4                 4 (0)                     __start [1]
               0     4                 4/4         0x10001aa0    main [2]
-------------------------------------------------------------------------------

Using prof Options

This section shows the output from calling prof with some of the options available for prof.

Using the -dis Option

For pcsamp and ideal experiments, the -dis option to prof can be used to obtain machine instruction information. prof provides the standard report and then appends the machine instruction information to the end of the report. The examples below show partial output from prof: just the machine instruction report.
This example shows output from calling prof in the following way: prof -dis generic.pcsamp.PID ------------------------------------------------------------------------------* -dis[assemble] listing annotated pc-samples * * Procedures with zero samples are excluded. * ------------------------------------------------------------------------------/usr/demos/SpeedShop/genericn32/generic.c anneal: <0x10005aa0-0x10005d78> 1752 total samples(93.19%) [1448] 0x10005aa0 0x27bdff90 addiu sp,sp,-112 [1448] 0x10005aa4 0xffbf0020 sd ra,32(sp) [1448] 0x10005aa8 0xffbc0018 sd gp,24(sp) [1448] 0x10005aac 0x3c010002 lui at,0x2 # 4 [1448] 0x10005ab0 0x2421acb8 addiu at,at,-21320 [1448] 0x10005ab4 0x0039e021 addu gp,at,t9 [1450] 0x10005ab8 0xd7808040 ldc1 $f0,-32704(gp) <2 cycle stall for following instruction> [1450] 0x10005abc 0xf7a00010 sdc1 $f0,16(sp) [1452] 0x10005ac0 0x9f8281b0 lwu v0,-32336(gp) [1452] 0x10005ac4 0x24010001 li at,1 # 12 <1 cycle stall for following instruction> 76 # 1 # 2 # 3 # 5 # 6 # 7 # 10 # 11 Using prof Options [1452] 0x10005ac8 0xac410000 sw at,0(v0) [1453] 0x10005acc 0x9f998184 lwu t9,-32380(gp) <2 cycle stall for following instruction> [1453] 0x10005ad0 0x0320f809 jalr ra,t9 # 18 [1453] 0x10005ad4 0000000000 nop # 19 <2 cycle stall for following instruction> [1461] 0x10005ad8 0xafa00008 sw zero,8(sp) [1461] 0x10005adc 0x8fa10008 lw at,8(sp) ... # 14 # 15 # 22 # 23 The listing shows statistics about the procedure anneal in the file generic.c and lists the beginning and ending addresses of anneal: <0x1001c90c-0x1001e238> • The first column lists the line number of the instruction: [1448] • The second column lists the beginning address of the instruction: 0x10005aa0. • The third column shows the instruction in hexadecimal: 0x27bdff90. • The next column reports the assembler form (mnemonic) of the instruction: addiu sp,sp,-112. 
• The last column reports the cycle in which the instruction executed: # 1.

Other information includes:
• The number of times the branch above was executed and taken: Preceding branch executed 1 times, taken 0 times
• The total number of cycles in a basic block, the percentage of the total cycles for that basic block, the number of times the branch terminating that basic block was executed, and the number of cycles for one execution of that basic block: 2152 total cycles(0.00%) invoked 1 times, average 2152 cycles/invocation
• Any cycle stalls (cycles that were wasted).

Using the -S Option

For ideal experiments, the -S option to prof can be used to obtain source line information. prof provides the standard report and then appends the source line information to the end of the report. The examples below show partial output from prof, showing just the source line report.

This example shows output from calling prof in the following way:

prof -S generic.ideal.PID

...
----------------------
disassembly listing
----------------------
...
<2 cycle stall for following instruction>
^--7 total cycles(0.00%) executed 1 times, average 7 cycles.---^

/usr/demos/SpeedShop/genericn32/generic.c

dirstat: <0x100022a8-0x100023a8>
7211 total cycles(0.00%) invoked 1 times, average 7211 cycles/invocation

225: ^L
226: /* Simple routines to execute various types of behaviors */
227:
228: /*=======================================================*/
229: /* adddso -- add a DSO using sgidladd function */
230: #ifndef NONSHARED
231:
232: static void *dl_object = NULL;
233:
234: int
235: adddso()
236: {
237:     int i;
238:
239:     /* see if already linked */
240:     if(dl_object != NULL) {
241:         fprintf(stderr, "libdso: dl_object already linked\n");
242:         return 0;
243:     }
244:
245:     /* Log the event */
246:     genLog("start of adddso");
247:
248:     /* open the dynamic shared object */
249:     dl_object = sgidladd(DYNSONAME, RTLD_LAZY);
250:     if(dl_object == NULL) {
251:         fprintf(stderr, "adddso: sgidladd of %s failed--%s\n",
252:             DYNSONAME, dlerror());
253:         return 0;
254:     }
255:
256:     /* invoke the routine */
257:     i = dlslave_routine();
258:     fprintf(stderr, "\tadddso: dynamic routine returned %d\n",
259:         i);
260:     return 0;
261: }
262: #endif
...

Using the -calipers Option

When you run prof on the output of an experiment in which you have recorded caliper points, you can use the -calipers option to specify the area of the program for which you want to generate a performance report.
For example, if you set just one caliper point in the middle of your program, prof can provide a report from the beginning of the program up to the first caliper point using the following command:

prof -calipers 0 1

prof can also provide a report from the caliper point to the end of the program using the following command:

prof -calipers 1 2

If you set two caliper points, prof can generate a report from the first to the second caliper point:

prof -calipers 1 2

Using the -gprof Option

For ideal experiments, the -gprof option to prof can be used to obtain inclusive basic block counting information. prof provides the standard report and then appends the inclusive function counts information to the end of the report. The example below shows partial output from prof, showing just the inclusive function counts report.

With inclusive cycle counting, prof prints a list of functions at the end which are called but not defined. This list includes functions starting with _rld because rld is not instrumented.

prof fails to list cycles of a procedure in the inclusive listing for the following reasons:
• init & fini sections, and MIPS stubs, are not part of any procedure.
• Calls to procedures that don’t use a “jump and link” are not recognized as procedure calls.
• When execution of global procedures with the same name occurs in different DSOs, only one of them is listed.

These exceptions are listed at the end of the report.

This example shows output from calling prof in the following way:

prof -gprof generic.ideal.14641

...
call graph profile:
    The sum of self and descendents is the major sort for this listing.

function entries:

index            the index of the function in the call graph listing, as an aid to locating it.
cycles(%cycles)  the total cycles (percentage of total) of the program accounted for by this function and its descendents.
self(%)          cycles (percent of total) spent in this function itself.
kids(%)          cycles (percent of total) spent in the descendents of this function on behalf of this function.
called           the number of times this function is called (other than recursive calls).
self             the number of times this function calls itself recursively.
name             the name of the function, with an indication of its membership in a cycle, if any.
index            the index of the function in the call graph listing, as an aid to locating it.

parent listings:

self*            cycles of this function’s self time which is due to calls from this parent.
kids*            cycles of this function’s descendent time which is due to calls from this parent.
called**         the number of times this function is called by this parent. This is the numerator of the fraction which divides up the function’s time to its parents.
total*           the number of times this function was called by all of its parents. This is the denominator of the propagation fraction.
parents          the name of this parent, with an indication of the parent’s membership in a cycle, if any.
index            the index of this parent in the call graph listing, as an aid in locating it.

children listings:

self*            cycles of this child’s self time which is due to being called by this function.
kids*            cycles of this child’s descendent time which is due to being called by this function.
called**         the number of times this child is called by this function. This is the numerator of the propagation fraction for this child.
total*           the number of times this child is called by all functions. This is the denominator of the propagation fraction.
children         the name of this child, and an indication of its membership in a cycle, if any.
index            the index of this child in the call graph listing, as an aid to locating it.

*  these fields are omitted for parents (or children) in the same cycle as the function. If the function (or child) is a member of a cycle, the propagated times and propagation denominator represent the self time and descendent time of the cycle as a whole.
** static-only parents and children are indicated by a call count of 0.

cycle listings:
    the cycle as a whole is listed with the same fields as a function entry. Below it are listed the members of the cycle, and their contributions to the time and call counts of the cycle.

All times are in milliseconds.
--------------------------------------------------------------------------------
NOTE: any functions which are not part of the call graph are listed at the end of the gprof listing
--------------------------------------------------------------------------------
                                              self        kids  called/total  parents
index  cycles(%)          self(%)   kids(%)   called+self name  index
                                              self        kids  called/total  children

[1]  2661528080(99.95%)  71( 0.00%)  2661528009(100.00%)  0  __start [1]
              44  2661527956      1/1      main [2]
               5           0      1/1      __istart [107]
               4           0      1/1      __readenv_sigfpe [108]
--------------------------------------------------------------------------------
              44  2661527956      1/1      __start [1]
[2]  2661528000(99.95%)  44( 0.00%)  2661527956(100.00%)  1  main [2]
            2152  2661524803      1/1      Scriptstring [3]
              67         934      1/1      exit [55]

Generating Reports For Different Machine Types

If you need to generate a report for a machine model that is different from the one on which the experiment was performed, you can use several of the prof options to specify a machine model. For example, if you record an ideal experiment on an R4000 processor with a clock frequency of 100 megahertz, but you want to generate a report for an R10000 processor, the prof command would be:

prof -r10000 -clock 196 generic.ideal.4561

Generating Reports for Multiprocessed Executables

You can gather data from executables that use the sproc and sprocsp system calls, such as those executables generated by POWER Fortran and POWER C. Prepare and run the job using the same method as for uniprocessed executables.
For multiprocessed executables, each thread of execution writes its own separate data file. View these data files with prof like any other data files. The only difference between multiprocessed and regular executables is the way in which the data files are named. The data files are named prog_name.Counts.process_id. This naming convention avoids the potential conflict of multiple threads attempting to write simultaneously to the same file.

Generating Compiler Feedback Files

If you run an ideal experiment, run prof with the -feedback option to generate a feedback file that can be used to arrange procedures more efficiently on the next recompile. You can rearrange procedures using the -fb flag to cc, or using the cord command. For more information, see the cc and cord reference pages.

Interpreting Reports

If the target process was blocked for a long time as a result of an instruction, that instruction will show up as having a low or zero CPU time. On the other hand, CPU-intensive instructions will show up as having a high CPU time.

One way to sanity-check inclusive cycle counts is to look at the percentage of cycles for __start. If the value is anything less than 98-99%, the inclusive report is suspect. Look for other warnings that prof didn’t take into account certain procedures.

Chapter 8
8. Using SpeedShop in Expert Mode: pixie

This chapter provides information on how to run pixie and prof without invoking ssrun. By calling pixie directly, you can generate the following performance data:
• An exact count of the number of times each basic block in your program is executed. A basic block is a sequence of instructions that is entered only at the beginning and exited only at the end.
• Counts for callers of a routine as well as counts for callees. prof can provide inclusive basic block counting by propagating regular counts to callers of a routine.
For more information on basic block counting and inclusive basic block counting, see Chapter 7, “Analyzing Experiment Results: prof.”

This chapter contains the following sections:
• “Using pixie”
• “Obtaining Basic Block Counts”
• “Obtaining Inclusive Basic Block Counts”

Using pixie

Use pixie to measure the frequency of code execution. pixie reads an executable program, partitions it into basic blocks, and writes (instruments) an equivalent program containing additional code that counts the execution of each basic block.

Note that the execution time of an instrumented program is two to five times longer than that of an uninstrumented one. This timing change may alter the behavior of a program that deals with a graphical user interface (GUI), or depends on events such as SIGALRM that are based on an external clock.

pixie Syntax

The syntax for pixie is:

pixie prog_name [options]

prog_name    Name of the input program.
options      Zero or more of the keywords listed in Table 8-1.

pixie Options

Table 8-1 lists pixie options. For a complete list of options, see the pixie reference page.

Table 8-1 Options for pixie

-addlibs lib1.so:...libN.so
    Adds lib1.so:...libN.so to the library list of the executable. No libraries are added by default.
-copy
    Produces a copy of the target with function list (map) and arc list (graph) sections but does not instrument the target.
-directory dir_name
    Writes output files to dir_name. Files are written to the current directory by default.
-fcncounts
    Produces an instrumented executable that counts function calls and arc calls, but not basic-block or branch counts.
-[no]autopixie
    Permits or prevents a recursive instrumenting of all dynamic shared libraries used by the input file during run time. pixie keeps the timestamp and checksum from the original executable.
    Thus, before instrumenting a shared library, pixie checks any lib.pixie files that it finds matching the library it is to instrument. If the fields match, the library is not instrumented again. pixie cannot detect shared libraries opened with dlopen (and hence does not instrument them). All used DSOs need to be instrumented for the .pixie executable to work. The default behavior with shared libraries is -noautopixie. The default behavior with an executable is -autopixie.

Table 8-1 (continued) Options for pixie

-[no]longbranch
    During instrumentation, some transformations can push a branch offset beyond its legal range, and pixie generates warnings about branch offsets being out of range. This option causes pixie to transform these instructions into jumps. The default is -nolongbranch.
-[no]verbose
    Prints or suppresses messages summarizing the binary-to-binary translation process. The default is -noverbose.
-suffix .suffix
    Appends .suffix to the pixified executable and DSOs. The default suffix is .pixie.

pixie Output

The pixie command generates a set of files with a .pixie extension. These files are essentially copies of your original executable and any DSOs you specified in the call to pixie, with code inserted to enable the collection of performance data when the .pixie version of your program is run.

If you use the -verbose flag with pixie, it reports the size of the old and new code. The new code size is the size of the code pixie will actually execute. It does not count read-only data (including a copy of the original text and another data block the same size as the original text) put into the text section. Calling size on the .pixie file reports a much larger text size than pixie -verbose, because size also counts everything in the text segment.

When you run the .pixie version of your program, one or more .Counts files are generated.
The name of an output .Counts file is that of the original program with any leading directory names removed and .Counts appended. If the program executes calls to sproc, sprocsp or fork, multiple .Counts files are generated—one for each process in the share group. In this case, each file will have the process ID appended to its name.

Obtaining Basic Block Counts

Use this procedure to obtain basic block counts. Also refer to Figure 8-1, which illustrates how basic block counting works.

1. Compile and link your program. The following example uses the input file myprog.c.

   % cc -o myprog myprog.c

   The cc compiler compiles myprog.c into an executable called myprog.

2. Run pixie to generate the equivalent program containing basic-block-counting code.

   % pixie myprog

   pixie takes myprog and writes an equivalent program, myprog.pixie, containing additional code that counts the execution of each basic block. pixie also writes an equivalent program for each shared object used by the program (in the form libname.so.pixie), containing additional code that counts the execution of each basic block. For example, if myprog uses libc.so.1, pixie generates libc.so.1.pixie.

3. Set the path for your .pixie files. pixie uses the rld search path for libraries (see rld(1) for the default paths). If the .pixie files are in your local directory, set the path as:

   % setenv LD_LIBRARY_PATH .

4. Execute the file(s) generated by pixie (myprog.pixie) in the same way you executed the original program.

   % myprog.pixie

   This program generates a list of basic block counts in files named myprog.Counts. If the program executes fork or sproc, a process ID is appended to the end of the filename (for example, myprog.Counts.345) for each process.

Note: Your program may not run as you expect when you invoke it with a .pixie extension. Some programs, uncompress and vi for example, treat their arguments differently when the name of the program changes.
You may need to rename the .pixie version of your program back to its original name.

Note: To generate a valid .Counts file, your program must terminate normally or with a call to exit. If it terminates with a signal such as SIGINT, the program must use a signal handler and leave the program through exit.

5. Run the profile formatting program prof, specifying the name of the original program and the .Counts file for the program.

   % prof myprog myprog.Counts

   prof extracts information from myprog.Counts and prints it in an easily readable format. If multiple .Counts files exist, you can use the wildcard character (*) to specify all of the files.

   % prof myprog myprog.Counts*

You can run the program several times, altering the input data, to create multiple profile data files.

The time computation assumes a “best case” execution; actual execution takes longer. This is because the time includes predicted stalls within a basic block, but not actual stalls that may occur entering a basic block. Also, it assumes that all instructions and data are in cache (for example, it excludes the delays due to cache misses and memory fetches and stores).

The complete output of the –pixie option is often extremely large. Use the –quit option with prof to restrict the size of the report. Refer to Chapter 7, “Analyzing Experiment Results: prof,” for details about prof options.

[Figure 8-1 is a flowchart: compile prog_name.c to create prog_name; execute pixie to create the equivalent new program prog_name.pixie; execute the new program to collect data in prog_name.Counts files; execute prof prog_name.Counts to format a listing of profile statistics.]

Figure 8-1 How Basic Block Counting Works

Examples of Basic Block Counting

The examples in this section illustrate how to use prof to obtain basic block counting information from a C program, generic.
Example Using prof –invocations

The partial listing below illustrates the report generated for basic block counts in generic. prof first provides a standard report of basic block counts, then provides a report reflecting any options provided to prof.

% prof -i generic generic.Counts

Prof run at: Fri May 17 12:39:22 1996
Command line: prof -i generic generic.Counts

2662778530: Total number of cycles
 17.75186s: Total execution time
1875323864: Total number of instructions executed
     1.420: Ratio of cycles / instruction
       150: Clock rate in MHz
     R4000: Target processor modelled

---------------------------------------------------------
Procedures sorted in descending order of cycles executed.
Unexecuted procedures are not listed.
Procedures beginning with *DF* are dummy functions and
represent init, fini and stub sections.
---------------------------------------------------------
   cycles(%)  cum %   secs    instrns   calls procedure(dso:file)

2524610038(94.81) 94.81 16.83 1797940023    1 anneal(generic:/usr/demos/SS/genericn32/generic.c)
135001332( 5.07)  99.88  0.90   75000822    1 slaveusrtime(./dlslave.so:/usr/demos/SS/genericn32/dlslave.c)
1593518( 0.06)    99.94  0.01    1378788 4385 memcpy(/usr/lib32/libc.so.1:/work/irix/lib/libc/libc_n32_M3/strings/bcopy.s)
735797( 0.03)     99.97  0.00     506627 4123 fread(/usr/lib32/libc.so.1:/work/irix/lib/libc/libc_n32_M3/stdio/fread.c)
187200( 0.01)     99.98  0.00     124800 1600 next(/usr/lib32/libc.so.1:/work/irix/lib/libc/libc_n32_M3/math/drand48.c)
136116( 0.01)     99.98  0.00      82498    1 iofile(generic:/usr/demos/SS/genericn32/generic.c)
91200( 0.00)      99.98  0.00      62400 1600 _drand48(/usr/lib32/libc.so.1:/work/irix/lib/libc/libc_n32_M3/math/drand48.c)
...

• The cycles(%) column reports the number and percentage of machine cycles used for the procedure. For example, 2524610038 cycles, or 94.81% of cycles, were spent in the anneal procedure.
• The cum % column shows the cumulative percentage of cycles.
For example, 99.88% of all cycles were spent in the top two functions in the listing: anneal and slaveusrtime.
• The secs column shows the number of seconds spent in the procedure. For example, 16.83 seconds were spent in the anneal procedure. The time represents an idealized computation based on modelling the machine. It ignores potential floating point interlocks and memory latency time (cache misses and memory bus contention).
• The instrns column shows the number of instructions executed for a procedure. For example, there were 1797940023 instructions devoted to the anneal procedure.
• The calls column reports the number of calls to each procedure. For example, there was just one call to the anneal procedure.
• The procedure (dso:file) column lists the procedure, its DSO name and file name. For example, the first line reports statistics for the procedure anneal in the file generic.c in the generic executable.

The partial listing below illustrates the use of the –i[nvocations] option. For each procedure, prof reports the number of times it was invoked from each of its possible callers and lists the procedure(s) that called it.

---------------------------------------------------------
Procedures sorted in descending order of times invoked.
Unexecuted procedures are not listed.
---------------------------------------------------------
Total number of procedure invocations: 15114

 calls(%)   cum%  size(bytes) procedure (dso:file)

4385(29.01) 29.01  3416 memcpy (/usr/lib32/libc.so.1:/work/irix/lib/libc/libc_n32_M3/strings/bcopy.s)
4123(27.28) 56.29  1304 fread (/usr/lib32/libc.so.1:/work/irix/lib/libc/libc_n32_M3/stdio/fread.c)
1600(10.59) 66.88   312 next (/usr/lib32/libc.so.1:/work/irix/lib/libc/libc_n32_M3/math/drand48.c)
1600(10.59) 77.46   180 _drand48 (/usr/lib32/libc.so.1:/work/irix/lib/libc/libc_n32_M3/math/drand48.c)
628( 4.16)  81.62   368 __sinf (/usr/lib32/libm.so:/work/cmplrs/libm/fsin.c)
259( 1.71)  83.33   524 __filbuf (/usr/lib32/libc.so.1:/work/irix/lib/libc/libc_n32_M3/stdio/_filbuf.c)

The above listing shows the total number of procedure invocations (calls) during the run: 15114.
• The calls(%) column reports the number of calls (and the percentage of total calls) per procedure. For example, there were 4385 calls (or 29.01% of the total) to the procedure memcpy.
• The cum% column shows the cumulative percentage of calls. For example, 56.29% of all calls were accounted for by memcpy and fread.
• The size(bytes) column shows the total byte size of a procedure. For example, the procedure memcpy is 3416 bytes.
• The procedure (dso:file) column lists the procedure, its DSO name and its filename. For example, the first line reports statistics for the procedure memcpy in the file bcopy.s in libc.so.

Example Using prof –heavy

The following partial listing, produced with the –heavy option, shows the source code lines responsible for the largest portion of execution time.

% prof -heavy generic generic.Counts

The partial listing below shows basic block counts sorted in descending order of cycles used.
The fields in the report are described in the “ideal Experiment Reports” section in Chapter 7, “Analyzing Experiment Results: prof.”

---------------------------------------------------------
Lines listed in descending order of cycle counts.
---------------------------------------------------------
    cycles(%)       cum %    times  line procedure (dso:file)

2309934120(86.75%) 86.75% 14440000 1465 anneal (generic:/usr/demos/SpeedShop/genericn32/generic.c)
207945880( 7.81%)  94.56% 14440000 1464 anneal (generic:/usr/demos/SpeedShop/genericn32/generic.c)
81000506( 3.04%)   97.60%  5000000   29 slaveusrtime (dlslave.so:/usr/demos/SpeedShop/genericn32/dlslave.c)
54000000( 2.03%)   99.63%  5000000   30 slaveusrtime (dlslave.so:/usr/demos/SpeedShop/genericn32/dlslave.c)
6600000( 0.25%)    99.88%   380000 1463 anneal (generic:/usr/demos/SpeedShop/genericn32/generic.c)
418380( 0.02%)     99.89%    32981  493 memcpy (/usr/lib32/libc.so.1:/work/irix/lib/libc/libc_n32_M3/strings/bcopy.s)
418380( 0.02%)     99.91%    32981  494 memcpy (/usr/lib32/libc.so.1:/work/irix/lib/libc/libc_n32_M3/strings/bcopy.s)
139482( 0.01%)     99.91%    32981  496 memcpy (/usr/lib32/libc.so.1:/work/irix/lib/libc/libc_n32_M3/strings/bcopy.s)
139460( 0.01%)     99.92%    32981  495 memcpy (/usr/lib32/libc.so.1:/work/irix/lib/libc/libc_n32_M3/strings/bcopy.s)
130009( 0.00%)     99.92%    10000 1461 anneal (generic:/usr/demos/SpeedShop/genericn32/generic.c)

Example Using prof –quit

You can limit the output of prof to information on only the most time-consuming parts of the program by specifying the –quit option. You can instruct prof to quit after a particular number of lines of output, after listing the elements consuming more than a certain percentage of the total, or after the portion of each listing whose cumulative use is a certain amount.
Consider the following sample listing:

% prof -quit 4 generic generic.Counts

Prof run at: Fri May 17 14:09:12 1996
Command line: prof -quit 4 generic generic.Counts

2662778530: Total number of cycles
 17.75186s: Total execution time
1875323864: Total number of instructions executed
     1.420: Ratio of cycles / instruction
       150: Clock rate in MHz
     R4000: Target processor modelled

---------------------------------------------------------
Procedures sorted in descending order of cycles executed.
Unexecuted procedures are not listed.
Procedures beginning with *DF* are dummy functions and
represent init, fini and stub sections.
---------------------------------------------------------
   cycles(%)  cum %   secs    instrns   calls procedure(dso:file)

2524610038(94.81) 94.81 16.83 1797940023    1 anneal(generic:/usr/demos/SpeedShop/genericn32/generic.c)
135001332( 5.07)  99.88  0.90   75000822    1 slaveusrtime(./dlslave.so:/usr/demos/SpeedShop/genericn32/dlslave.c)
1593518( 0.06)    99.94  0.01    1378788 4385 memcpy(/usr/lib32/libc.so.1:/work/irix/lib/libc/libc_n32_M3/strings/bcopy.s)
735797( 0.03)     99.97  0.00     506627 4123 fread(/usr/lib32/libc.so.1:/work/irix/lib/libc/libc_n32_M3/stdio/fread.c)

Obtaining Inclusive Basic Block Counts

Inclusive basic block counting counts basic blocks and generates a call graph. By propagating regular counts to callers of a routine, prof provides inclusive basic block counting. For more information on inclusive basic block counting, see the “ideal Experiment” section in Chapter 4, “Experiment Types.”

Use the following procedure to obtain inclusive basic block counts. Also refer to Figure 8-2, which illustrates how inclusive basic block counting works.

1. Compile and link your program. The following example uses the input file myprog.c.

   % cc -o myprog myprog.c

   The cc compiler compiles myprog.c into an executable called myprog.

2. Run pixie to generate the equivalent program containing basic-block-counting code.
   % pixie myprog

   pixie takes myprog and writes an equivalent program, myprog.pixie, containing additional code that counts the execution of each basic block. pixie also writes an equivalent program for each shared object used by the program (in the form libname.so.pixie), containing additional code that counts the execution of each basic block. For example, if myprog uses libc.so.1, pixie generates libc.so.1.pixie.

3. Set the path for your .pixie files. pixie uses the rld search path for libraries (see rld(1) for the default paths). If the .pixie files are in your local directory, set the path as:

   % setenv LD_LIBRARY_PATH .

4. Execute the file(s) generated by pixie (myprog.pixie) in the same way you executed the original program.

   % myprog.pixie

   This program generates a list of basic block counts in files named myprog.Counts. If the program executes fork or sproc, a process ID is appended to the end of the filename (for example, myprog.Counts.345) for each process. The –gprof information is bundled in the .Counts file.

Note: Your program may not run as you expect when you invoke it with a .pixie extension. Some programs, uncompress and vi for example, treat their arguments differently when the name of the program changes. You may need to rename the .pixie version of your program back to its original name.

Note: To generate a valid .Counts file, your program must terminate normally or with a call to exit. If it terminates with a signal such as SIGINT, the program must use a signal handler and leave the program through exit.

5. Run the profile formatting program prof, specifying the name of the original program, the -gprof flag, and the .Counts file for the program.

   % prof -gprof myprog myprog.Counts

   prof extracts information from myprog.Counts and prints it in an easily readable format. If multiple .Counts files exist, you can use the wildcard character (*) to specify all of the files.
   % prof -gprof myprog myprog.Counts*

[Figure 8-2 is a flowchart: compile prog_name.c to create prog_name; execute pixie to create the equivalent new program prog_name.pixie; execute the new program to collect data in prog_name.Counts files; execute prof -gprof prog_name.Counts to format a listing of inclusive counts.]

Figure 8-2 How Inclusive Basic Block Counting Works

Example of prof –gprof

Here’s part of a sample output obtained by using the –gprof option. The fields in the report are explained in detail in the report itself, but that explanation is not reproduced in this example. For more information on the -gprof option, see Chapter 7, “Analyzing Experiment Results: prof.” (The format of the output has been adjusted slightly.)

% prof -gprof generic generic.Counts
...
Prof run at: Fri May 17 14:42:25 1996
Command line: prof -gprof generic generic.Counts
...
                                              self        kids  called/total  parents
index  cycles(%)          self(%)   kids(%)   called+self name  index
                                              self        kids  called/total  children

[1]  2662767961(100.00%)  71( 0.00%)  2662767890(100.00%)  0  __start [1]
              44  2662767837      1/1      main [2]
               5           0      1/1      __istart [111]
               4           0      1/1      __readenv_sigfpe [112]
--------------------------------------------------------------------------------
              44  2662767837      1/1      __start [1]
[2]  2662767881(100.00%)  44( 0.00%)  2662767837(100.00%)  1  main [2]
            2152  2662764245      1/1      Scriptstring [3]
              67         926      1/1      exit [58]
              96         309      1/1      _sigset [67]
              32          10      1/9      _gettimeofday [68]
--------------------------------------------------------------------------------
...

Chapter 9
9. Miscellaneous Commands

This chapter describes SpeedShop commands for exploring memory usage and paging, and for printing data files generated by SpeedShop tools.
It contains the following sections:
• “Using the thrash Command”
• “Using the squeeze Command”
• “Calculating the Working Set of a Program”
• “Dumping Performance Data Files”
• “Dumping Compiler Feedback Files”

Using the thrash Command

The thrash command allows you to explore system paging behavior by allocating a region of virtual memory and either randomly or sequentially accessing that memory.

thrash Syntax

thrash [args]

args    One or more of the following flags:

-k N      The amount of memory to access in kilobytes, where N is the number of kilobytes.
-m N      The amount of memory to access in megabytes, where N is the number of megabytes.
-n count  The number of references to make before exiting. The default is 10,000.
-p N      The amount of memory to access in pages, where N is the number of pages.
-s        Sequential thrashing. The default is random.
-w time   The amount of time thrash should sleep after thrashing but before exiting.

Effects of thrash

Once the memory is allocated, thrash prints a message on stdout saying how much memory it is using and then proceeds to thrash over it. Here’s an example:

fraser 82% thrash -m 4
thrashing randomly: 4.00 MB (= 0x00400000 = 4194304 bytes = 1024 pages)
10000 iterations

You can use thrash in conjunction with ssusage and squeeze to determine the approximate available working memory on a system, as described in the section “Calculating the Working Set of a Program” below.

Using the squeeze Command

The squeeze command allows you to specify an amount of virtual memory to lock down into real memory, thus making it unavailable to other processes. This command can only be used by the superuser.

squeeze Syntax

squeeze [flag] amount

flag    One of the following flags. If no flag is specified, the default is megabytes.

-k      Kilobytes
-m      Megabytes
-p      Pages
-%      A percentage of the installed memory

amount  The amount of memory to be locked.
Effects of squeeze

squeeze performs the following operations:
• Locks down the amount of virtual memory you supply as an argument to the command.
• Prints a message to stdout saying how much memory has been locked, and how much working memory is available.
• Sleeps indefinitely, or until interrupted by SIGINT or SIGTERM, at which time it frees the memory and exits with an exit message. You should wait until after the exit message is printed before doing any experiments.

Here's an example:

fraser 1# squeeze 4
squeeze: leaving 60.00 MB ( = 0x03c01000 = 62918656 ) available memory;
pinned 4.00 MB ( = 0x00400000 = 4194304 ) at address 0x1000e000;
from 64.00 MB ( = 0x04001000 = 67112960 ) installed memory.

Use Ctrl-c to exit squeeze. The following message is printed:

squeeze exiting

Calculating the Working Set of a Program

You can use the thrash, squeeze, and ssusage commands together to determine the approximate working set of a program, as follows. For all practical purposes, the working set of your program is the size of memory allocated.

1. Choose a machine that has a large amount of physical memory (enough to allow your target application to run without any paging other than at start-up).

2. Make sure that the machine is running a minimal number of applications that will remain fairly consistent for the duration of these steps.

3. Run thrash with ssusage to determine the working set of the kernel and any other applications you have running. In this example, the thrash command uses 4 MB of memory.

ssusage thrash -m 4

When the thrash command completes, ssusage prints the resource usage of thrash; the value labelled majf gives the number of major page faults (that is, the number of faults that required a physical read). When run on a machine with a large amount of physical memory, this value is the number of faults needed to start the program, which is the minimum number for any run.
For more information on ssusage, see Chapter 5, "Collecting Data on Machine Resource Usage."

4. As superuser, in a separate window, run the squeeze command to lock down an amount of memory.

5. Rerun thrash with ssusage.

ssusage thrash -m 4

6. Repeat steps 4 and 5, increasing the amount of memory for squeeze, until the majf number begins to rise. The amount of available working memory reported by squeeze at the point at which page faults begin to rise for thrash tells you the combined working set of thrash (~4 MB), the kernel, and any other applications you have running.

7. Deduct the 4 MB that thrash uses from the amount of working memory reported by squeeze at the point the page faults began to rise, to find the approximate working set of the kernel and any other applications that are running on the machine. This number will be needed when you reach step 15.

8. Make sure the applications that the machine is running remain consistent with the setup from step 2.

9. Run ssusage with your program to ensure that the machine has the amount of memory your program needs.

ssusage prog_name

When your program exits, ssusage prints the application's resource usage; the majf field gives the number of major page faults. When run on a machine with a large amount of physical memory, this value is the number of faults needed to start the program, which is the minimum number for any run.

10. Switch to superuser.

11. Run squeeze to lock down an amount of memory. The following example locks down 15 megabytes of memory.

squeeze 15

12. Rerun your program with ssusage.

13. Repeat steps 11 and 12 until the majf number begins to rise.

14. Deduct the amount squeezed at the point at which the application begins to page fault from the total amount of physical memory in the system to find the combined working set of your program, the kernel, and any other applications you have running.

15. Deduct the amount of working memory calculated in step 7 from the total amount of physical memory in the system to find the approximate working set of your program.

Dumping Performance Data Files

All the performance data for a single process is in one file. The file begins with a prologue, and continues with a mixture of performance data, sample records, and control records.

The ssdump command can be used to print performance data files. It provides a formatted ASCII dump of one or more performance experiment data files. This command is most likely to be useful in verifying performance data that does not seem accurate when reported through prof.

ssdump Syntax

ssdump [options] {datafile1 ... datafileN} ...

options    Zero or more of the following print options:

-d          Prints detailed information for each bead.
-h          Prints the hex contents of the body of each bead.
-i index    Prints only one bead, at index in the file.
-q          Suppresses printing of those fields that normally change from run to run, such as process IDs and time stamps. This option is useful for QA work, to enable automatic comparisons of recorded experiments.
-s offset   Prints only one bead, at offset into the file.

datafile1 ... datafileN    The data files to dump. If any of the named files contain beads indicating additional data files, from descendant processes of the original run, those files will also be dumped.

Experiment File Format

The file is written as a string of "beads," each of which is a record with a 32-bit type, a 32-bit byte count, and a body whose length is given by the byte count, rounded up to a double-word boundary. The file prologue starts with a file-identifier bead, which acts as a magic number indicating that the file is a SpeedShop data file.
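The bead layout just described can be walked with a few lines of code. The following Python sketch is illustrative only (ssdump is the real tool for this job); the big-endian byte order is an assumption based on the MIPS platform, and the sample bytes are fabricated rather than taken from a real experiment file:

```python
import struct

def read_beads(data):
    """Yield (type, body) pairs from a bead-formatted byte string.

    Assumed layout, per the format description above: a 32-bit type,
    a 32-bit byte count, then a body padded out to an 8-byte
    (double-word) boundary. Big-endian order is an assumption.
    """
    offset = 0
    while offset + 8 <= len(data):
        btype, count = struct.unpack_from(">II", data, offset)
        body = data[offset + 8 : offset + 8 + count]
        # Advance past the header plus the body rounded up to 8 bytes.
        offset += 8 + ((count + 7) & ~7)
        yield btype, body

# A fabricated two-bead file: type 1 with a 5-byte body (padded to 8),
# followed by type 3 with an empty body.
blob = struct.pack(">II5s3x", 1, 5, b"magic") + struct.pack(">II", 3, 0)
print(list(read_beads(blob)))  # [(1, b'magic'), (3, b'')]
```

The rounding expression `(count + 7) & ~7` is the standard idiom for aligning a length up to the next multiple of 8, matching the double-word padding the format requires.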
Other beads in the prologue give the machine and executable name, the hardware inventory describing the machine, the machine page size, O/S revision, date and checksum information about the executable, the target name (the target is the executable after instrumentation), the arguments with which the target was invoked, the instrumentation performed, and the types of performance data that are to be recorded in the remainder of the file.

The following example calls ssdump on performance data for an ideal experiment:

ssdump generic.ideal.847

Here's some partial output from ssdump. The format has been adjusted slightly to meet presentation needs.

Printing experiment record file "generic.ideal.847" (394024 bytes),
last written on Thu 16 May 96 23:09:55

SpeedShop File Preface                 1
version 3
created: Thu 16 May 96 23:09:28.520

Hardware Inventory                     2, offset 32 = 0x00000020
hardware inventory: 26 items
  class 1, type 1, contrlr 75, unit 255, state 13
  class 1, type 2, contrlr 0, unit 0, state 4130
...
Experiment name                        3, offset 464 = 0x000001d0
ideal

Experiment marching orders             4, offset 480 = 0x000001e0
it

Capture module symbol                  5, offset 496 = 0x000001f0
it,4

Executable file                        6, offset 512 = 0x00000200
generic

Target file                            7, offset 528 = 0x00000210
generic.pixie

Target arguments                       8, offset 552 = 0x00000228
Time: Thu 16 May 96 23:09:28.520, process pid = 847
arguments: ""

Target begin                           9, offset 592 = 0x00000250
process # 76283912, pid = 847, event # 0

Program Object List                   10, offset 632 = 0x00000278
process # 0, pid = 847, event # 0, -- 5 DSOs
Program Object 0, Named `generic'
  Link Time Address: 0x0000000000405a98
  Run Time Address:  0x0000000000405a98
  Size: 0x000000000000c6a0 (50848)
  Base Pointer: 0x0000000000000000
Program Object 1, Named `./libss.so.pixie'
  Link Time Address: 0x000000003f854b90
  Run Time Address:  0x000000003f854b90
  Size: 0x00000000000012a0 (4768)
  Base Pointer: 0x0000000000000000
Program Object 2, Named `./libssrt.so.pixie'
  Link Time Address: 0x000000003f564b94
  Run Time Address:  0x000000003f564b94
  Size: 0x0000000000136010 (1269776)
  Base Pointer: 0x0000000000000000
...

Target DSO open                       11, offset 936 = 0x000003a8
process # 0, pid = 847, event # 0 at time = Thu 16 May 96 23:09:52.888
fname = dlslave.so

Program Object List                   12, offset 992 = 0x000003e0
process # 0, pid = 847, event # 0, -- 6 DSOs
Program Object 0, Named `generic'
  Link Time Address: 0x0000000000405a98
  Run Time Address:  0x0000000000405a98
  Size: 0x000000000000c6a0 (50848)
  Base Pointer: 0x0000000000000000
...

Sample event trigger                  13, offset 1360 = 0x00000550
process 00007, trap index #0 at time = Thu 16 May 96 23:09:54.476, #0

Event count array (32-bit)            14, offset 1408 = 0x00000580
integer array, dso index = 0, array size = 921
...
Function pointer trace                20, offset 393592 = 0x00060178
dso index = 0, 7 function pointer arcs

Function pointer trace                21, offset 393776 = 0x00060230
dso index = 2, 6 function pointer arcs

Sample data end marker                22, offset 393936 = 0x000602d0

Target termination                    23, offset 393984 = 0x00060300
process # 0, pid = 847, event # 0 at time = Thu 16 May 96 23:09:55.084

** End-of-File                        24, offset 394024 = 0x00060328 ****

End of experiment record file "generic.ideal.847" at time = Thu 16 May 96 23:09:28.520

Dumping Compiler Feedback Files

The fbdump command can be used to print the compiler feedback files generated by running prof -feedback. For more information on using compiler feedback files, see the cord or cc reference pages.

fbdump Syntax

fbdump options filename

options     Zero or more of the options described in Table 9-1.
filename    The feedback filename. This file has a .fb extension.

Table 9-1    Options for fbdump

Option      Result

-all        Prints feedback using all options. This is the default.

-ascii      Prints feedback information in the same style as earlier versions of the feedback dump program.

-bb         Prints the feedback per-basic-block table as described in "cmplrs/fb.h". If -verbose is specified, all basic blocks are printed, even those with zero execution counts. If -verbose is not specified, fbdump prints only the basic blocks that have non-zero execution counts.

-call       Prints the feedback call table as described in "cmplrs/fb.h". If -verbose is specified, all the points of call are printed, even if they have not been called. If -verbose is not specified, fbdump prints only the relevant information on the calls.

-header     Prints the feedback file header as described in "cmplrs/fb.h".

-proc       Prints the feedback procedure table as described in "cmplrs/fb.h". If -verbose is specified, all procedures will be printed, even if they are not invoked.
If -verbose is not specified, fbdump prints only the relevant information on the procedures that have been invoked.

-sections   Prints the feedback file section headers table as described in "cmplrs/fb.h".

-str        Prints the feedback string table.

-verbose    Prints all the information in verbose mode, including a table with all-zero entries.

Index

Symbols

_RLD_LIST variable, 61
_SPEEDSHOP_CALIPER_POINT_SIG variable, 58, 60
_SPEEDSHOP_EXPERIMENT_TYPE variable, 61, 63
_SPEEDSHOP_FILE_BUFFER_LENGTH variable, 63
_SPEEDSHOP_HWC_COUNTER_NUMBER variable, 43
_SPEEDSHOP_HWC_COUNTER_OVERFLOW variable, 43
_SPEEDSHOP_INIT_DEFERRED_SIGNAL variable, 63
_SPEEDSHOP_MARCHING_ORDERS variable, 61, 63
_SPEEDSHOP_OUTPUT_DIRECTORY variable, 53
_SPEEDSHOP_OUTPUT_FD variable, 53
_SPEEDSHOP_OUTPUT_FILENAME variable, 53
_SPEEDSHOP_SAMPLING_MODE variable, 63
_SPEEDSHOP_SBRK_BUFFER_LENGTH variable, 63
_SPEEDSHOP_SILENT variable, 62
_SPEEDSHOP_TARGET_FILE variable, 61
_SPEEDSHOP_TRACE_EXEC variable, 62
_SPEEDSHOP_TRACE_FORK variable, 62
_SPEEDSHOP_TRACE_SPROC variable, 62
_SPEEDSHOP_VERBOSE variable, 62

A

API, 6
  setting calipers, 10

B

basic block counting, 20, 33, 45
  overview, 5

C

C examples, 11
calipers, 10, 51, 58
  and prof, 79
  automatic, 59
  sample traps, 58, 60
  ssrt_caliper_point, 58, 59
  using signals, 58, 60
  using the debugger, 58, 60
-calipers option, 10
call stack profiling, 14, 27, 38
compiler feedback files, 83
compiler optimization restrictions, 52
cord, 83, 106
.Counts file, 88, 96
CPU-bound processes, 2
cy_hwc experiment, 41

D

data display anomalies, 52
dc_hwc experiment, 42
debugger
  setting calipers, 10, 58, 60
  using ssrun, 57
dsc_hwc experiment, 42
DSOs, 7

E

environment variables
  __SPEEDSHOP_FILE_BUFFER_LENGTH, 63
  __SPEEDSHOP_INIT_DEFERRED_SIGNAL, 63
  __SPEEDSHOP_SAMPLING_MODE, 63
  __SPEEDSHOP_SBRK_BUFFER_LENGTH, 63
  __SPEEDSHOP_TRACE_SPROC, 62
  __SPEEDSHOP_VERBOSE, 62
  _RLD_LIST, 61
  _SPEEDSHOP_CALIPER_POINT_SIG, 58, 60
  _SPEEDSHOP_EXPERIMENT_TYPE, 61, 63
  _SPEEDSHOP_HWC_COUNTER_NUMBER, 43
  _SPEEDSHOP_HWC_COUNTER_OVERFLOW, 43
  _SPEEDSHOP_MARCHING_ORDERS, 61, 63
  _SPEEDSHOP_OUTPUT_DIRECTORY, 53
  _SPEEDSHOP_OUTPUT_FD, 53
  _SPEEDSHOP_OUTPUT_FILENAME, 53
  _SPEEDSHOP_SILENT, 62
  _SPEEDSHOP_TARGET_FILE, 61
  _SPEEDSHOP_TRACE_FORK, 62
  LD_LIBRARY_PATH, 13
  SPEEDSHOP_TRACE_EXEC, 62
examples
  C, 11
  Fortran, 25
exec, 6
executables
  calculating a working set, 101
  stripped, 51
experiment data, 9
  controlling output file, 53
  file format, 104
  filenames, 53
experiments
  choosing, 9, 37
  cy_hwc, 41
  dc_hwc, 42
  dsc_hwc, 42
  fpe, 75
  fpe trace, 9, 23, 47
  gfp_hwc, 43
  gi_hwc, 41
  hardware counter, 19, 31, 40, 72
  hardware counters, 9
  ic_hwc, 41
  ideal, 20, 33, 45, 73
  isc_hwc, 42
  pcsamp, 39
  pcsamp, 9, 16, 30, 71
  prof_hwc, 43
  tlb_hwc, 42
  usertime, 9, 14, 27, 38, 69

F

fbdump, 106
  overview, 4
files
  compiler feedback, 106
  performance data, 9, 103
  format, 104
floating-point exceptions, 23, 47
floating-point exception trace, 9
  overview, 5
fork, 6
Fortran
  examples, 25
  limitations, 52
fpcsampx experiment, 39
fpe trace experiment, 9, 47, 75
  overview, 5
  tutorial, 23

G

generic program, 11
gfp_hwc experiment, 43
gi_hwc experiment, 41
-gprof example, 98

H

hardware counter experiment, 72
hardware counter experiments, 9, 40
  overview, 5
  tutorial, 19, 31
hardware counter numbers, 43
hardware counter overflows, 19, 31, 40
hwc experiments, 9, 40
  overview, 5

I

ic_hwc experiment, 41
ideal experiment, 45, 73
  effects, 61
  overview, 5
  tutorial, 20, 33
I/O-bound processes, 2
isc_hwc experiment, 42

L

LD_LIBRARY_PATH, 95
LD_LIBRARY_PATH variable, 13, 88
libfpe_ss.so, 6
libmalloc_ss.so, 6
libraries
  libfpe_ss.so, 6
  libmalloc_ss.so, 6
  libssrt.so, 6, 61
  libss.so, 6, 61
  linking in SpeedShop, 59
libssrt.so, 6, 59, 61
libss.so, 6, 59, 61
linpack benchmark, 25
locking memory, 100

M

machine resource usage, 49
memory
  locking, 100
memory-bound processes, 2
message-passing paradigms, 7
MP Fortran
  limitations, 52
MPI, 7
  with ssrun, 57
multi-processor executables, 7, 52
  profiling, 83

P
paging behavior, 99
pcsamp experiment, 9, 39, 71
  example, 55
  overview, 5
  tutorial, 16, 30
PC sampling, 39
  tutorial, 16, 30
perfex, 40
performance analysis
  phases, 7
  theory, 1
performance data files
  dumping, 103
performance problems, 1, 9
  Bugs, 2
  CPU, 2
  I/O, 2
  memory, 2
pixie, 45, 85
  and prof -heavy example, 93
  and prof -i example, 92
  -autopixie option, 86
  command option, 85
  command syntax, 85
  .Counts file, 88, 96
  examples, 87
  output size, 89
  overview, 4
  restricting output, 89
  setting search path, 88, 95
  -verbose option, 87
processes
  forking, 6
prof
  Also see profiling
  -calipers example, 79
  -calipers option, 10
  compiler feedback, 106
  -dis example, 76
  -gprof example, 74, 79
  -heavy example, 93
  -invocations example, 92
  options, 66
  output, 68
  overview, 4, 8
  -S example, 77
  steps, 8
  syntax, 66
  using with pixie, 65
  using with ssrun, 65
prof_hwc experiment, 43
profiles
  interpreting, 83
profiling
  -calipers option, 79
  -clock option, 66
  command syntax, 66
  -dis option, 66, 76
  -dsolist option, 67
  -dso option, 66
  -exclude option, 67
  -feedback option, 67
  fpe trace experiment, 75
  -gprof option, 67, 79
  hardware counter experiments, 72
  -heavy option, 67
    example, 93
  ideal experiment, 73
  inclusive basic block counts, 74
  -invocations option, 67
    example, 92
  -lines option, 67
  machine scheduler option, 82
  multiprocessor executables, 83
  -only option, 67
  pcsamp experiment, 71
  procedure invocation example, 91
  -procedures option, 67
  processor scheduler option, 68
  -quit option, 68, 89, 94
  -S option, 68, 77
  usertime experiment, 69
  -zero option, 68
program counter sampling, 39
programs
  calculating a working set, 101
  stripped, 51
pthreads, 7

R

rearranging procedures, 83
reports
  for different machine models, 82
  fpe trace experiment, 75
  hardware counter experiments, 72
  ideal experiment, 73
  interpreting, 83
  pcsamp experiment, 71
  usertime experiment, 69
  using calipers, 79
rld search path, 88, 95

S

search path
  rld, 88, 95
setting calipers, 10, 58
shared libraries, 7
signals
  setting calipers, 10, 58, 60
SpeedShop
  overview, 3
SpeedShop API, 6
SpeedShop demo
  generic, 11
  linpack, 25
SpeedShop libraries, 61
  libfpe_ss.so, 6
  libmalloc_ss.so, 6
  libssrt.so, 6
  libss.so, 6
  linking, 59
sproc, 6
squeeze, 100
  calculating a working set, 101
  overview, 4
ssdump, 103
ssrt_caliper_point, 6, 58, 59
  executable requirements, 51
ssrun
  effects, 61
  examples, 55
  flags, 54
  MPI programs, 57
  overview, 3, 8
  restrictions, 51
  setup, 51
  steps, 8
  syntax, 54
  using a debugger, 57
  -v option example, 56
ssusage
  calculating a working set, 101
  overview, 3
statistical call stack profiling
  overview, 5
statistical hardware counter sampling
  overview, 5
statistical PC sampling
  overview, 5
stripped executables, 51
system, 6

T

thrash, 99
  calculating a working set, 101
  overview, 4
tlb_hwc experiment, 42
tracing floating-point exceptions, 9
tutorial
  C, 11
  Fortran, 25

U

usertime experiment, 9, 38, 69
  overview, 5
  restrictions, 51
  tutorial, 14, 27

W

working set, 101

Tell Us About This Manual

As a user of Silicon Graphics documentation, your comments are important to us. They help us to better understand your needs and to improve the quality of our documentation.

Any information that you provide will be useful. Here is a list of suggested topics to comment on:
• General impression of the document
• Omission of material that you expected to find
• Technical errors
• Relevance of the material to the job you had to do
• Quality of the printing and binding

Important Note

Please include the title and part number of the document you are commenting on. The part number for this document is 007-3311-001.

Thank you!

Three Ways to Reach Us

[The facing page is a postage-paid Business Reply Mail postcard addressed to Silicon Graphics, Inc., 2011 N. Shoreline Blvd., Mountain View, CA 94043.]

The postcard opposite this page has space for your comments. Write your comments on the postage-paid card for your country, then detach and mail it.
If your country is not listed, either use the international card and apply the necessary postage, or use electronic mail or FAX for your reply.

If electronic mail is available to you, write your comments in an e-mail message and mail it to either of these addresses:
• If you are on the Internet, use this address: techpubs@sgi.com
• For UUCP mail, use this address through any backbone site: [your_site]!sgi!techpubs

You can forward your comments (or annotated copies of pages from the manual) to Technical Publications at this FAX number: 415 965-0964