ADIOS Users Manual 1.2.1
Page Count: 94
ORNL/TM-2009/100

ADIOS 1.2.1 User’s Manual

August 2010

DOCUMENT AVAILABILITY

Reports produced after January 1, 1996, are generally available free via the U.S. Department of Energy (DOE) Information Bridge:

Web site: http://www.osti.gov/bridge

Reports produced before January 1, 1996, may be purchased by members of the public from the following source:

National Technical Information Service
5285 Port Royal Road
Springfield, VA 22161
Telephone: 703-605-6000 (1-800-553-6847)
TDD: 703-487-4639
Fax: 703-605-6900
E-mail: info@ntis.fedworld.gov
Web site: http://www.ntis.gov/support/ordernowabout.htm

Reports are available to DOE employees, DOE contractors, Energy Technology Data Exchange (ETDE) representatives, and International Nuclear Information System (INIS) representatives from the following source:

Office of Scientific and Technical Information
P.O. Box 62
Oak Ridge, TN 37831
Telephone: 865-576-8401
Fax: 865-576-5728
E-mail: reports@adonis.osti.gov
Web site: http://www.osti.gov/contact.html

This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.

Prepared for the Office of Science
U.S. Department of Energy

S. Hodson, S. Klasky, Q. Liu, J. Lofstead, N. Podhorszki, F. Zheng, M. Wolf, T. Kordenbrock, H. Abbasi, N. Samatova

Aug. 2010

Prepared by
OAK RIDGE NATIONAL LABORATORY
Oak Ridge, Tennessee 37831-6070
managed by UT-BATTELLE, LLC
for the U.S. DEPARTMENT OF ENERGY
under contract DE-AC05-00OR22725

Contents

1 Introduction
  1.1 Goals
  1.2 What Is ADIOS?
  1.3 The Basic ADIOS Group Concept
  1.4 Other Interesting Features of ADIOS
  1.5 Future ADIOS 2.0 Goals
2 Installation
  2.1 Obtaining ADIOS
  2.2 Quick Installation
    2.2.1 Linux cluster
    2.2.2 Cray XT5
    2.2.3 Support for Matlab
  2.3 ADIOS Dependencies
    2.3.1 Mini-XML parser (required)
    2.3.2 MPI and MPI-IO (required)
    2.3.3 Fortran90 compiler (optional)
    2.3.4 Serial NetCDF-3 (optional)
    2.3.5 Serial HDF5 (optional)
    2.3.6 PHDF5 (optional)
    2.3.7 NetCDF-4 Parallel
    2.3.8 Read-only installation
  2.4 Full Installation
  2.5 Compiling applications using ADIOS
3 ADIOS Write API
  3.1 Write API Description
    3.1.1 Introduction
    3.1.2 ADIOS-required functions
    3.1.3 Nonblocking functions
    3.1.4 Other function
    3.1.5 Create a first ADIOS program
4 XML Config File Format
  4.1 Overview
  4.2 adios-group
    4.2.1 Declaration
    4.2.2 Variables
    4.2.3 Attributes
    4.2.4 Gwrite/src
    4.2.5 Global arrays
    4.2.6 Time-index
    4.2.7 Declaration
    4.2.8 Methods list
  4.3 Buffer specification
    4.3.1 Declaration
  4.4 Enabling Histogram
    4.4.1 Declaration
  4.5 An Example XML file
5 Transport methods
  5.1 Synchronous methods
    5.1.1 NULL
    5.1.2 POSIX
    5.1.3 MPI
    5.1.4 MPI_LUSTRE
    5.1.5 MPI_AMR
    5.1.6 PHDF5
    5.1.7 NetCDF4
    5.1.8 Other methods
  5.2 Asynchronous methods
    5.2.1 Network Scalable Service Interface (NSSI)
    5.2.2 DataTap
    5.2.3 Decoupled and Asynchronous Remote Transfers (DART)
    5.2.4 (DIMES)
  5.3 Other research methods at ORNL
    5.3.1 MPI-CIO
    5.3.2 MPI-AIO
6 ADIOS Read API
  6.1 Introduction
  6.2 Read C API description
    6.2.1 adios_errmsg / adios_errno
    6.2.2 adios_fopen
    6.2.3 adios_fclose
    6.2.4 adios_gopen / adios_gopen_byid
    6.2.5 adios_gclose
    6.2.6 adios_inq_var / adios_inq_var_byid
    6.2.7 adios_free_varinfo
    6.2.8 adios_read_var / adios_read_var_byid
    6.2.9 adios_get_attr / adios_get_attr_byid
    6.2.10 adios_type_to_string
    6.2.11 adios_type_size
  6.3 Time series analysis API Description
    6.3.1 adios_stat_cor / adios_stat_cov
  6.4 Read Fortran API description
  6.5 Compiling and linking applications
7 BP file format
  7.1 Introduction
  7.2 Footer
    7.2.1 Version
    7.2.2 Offsets of indices
    7.2.3 Indices
  7.3 Process Groups
    7.3.1 PG header
    7.3.2 Vars list
    7.3.3 Attributes list
8 Utilities
  8.1 adios_lint
  8.2 bpls
  8.3 bpdump
9 Converters
  9.1 bp2h5
  9.2 bp2ncd
  9.3 bp2ascii
  9.4 Parallel Converter Tools
10 Group read/write process
  10.1 Gwrite/gread/read
  10.2 Add conditional expression
  10.3 Dependency in Makefile
11 C Programming with ADIOS
  11.1 Non-ADIOS Program
  11.2 Construct an XML File
  11.3 Generate .ch file(s)
  11.4 POSIX transport method (P writers, P subfiles + 1 metadata file)
  11.5 MPI-IO transport method (P writers, 1 file)
  11.6 Reading data from the same number of processors
  11.7 Writing to Shared Files (P writers, N files)
  11.8 Global Arrays
    11.8.1 MPI-IO transport method (P writers, 1 file)
    11.8.2 POSIX transport method (P writers, P Subfiles + 1 Metadata file)
  11.9 Writing Time-Index into a Variable
  11.10 Reading statistics
12 Developer Manual
  12.1 Create New Transport Methods
    12.1.1 Add the new method macros in adios_transport_hooks.h
    12.1.2 Create adios_abc.c
    12.1.3 A walk-through example
  12.2 Profiling the Application and ADIOS
    12.2.1 Use profiling API in source code
    12.2.2 Use wrapper library
13 Appendix

Figures

Figure 1. ADIOS programming example
Figure 2. Example XML configuration
Figure 3. Example XML file for time allocation
Figure 4. Server-friendly metadata approach: offset the create/open in time
Figure 5. Example XML
Figure 6. Example C source
Figure 7. Example Original Client XML
Figure 8. Example NSSI Client XML
Figure 9. Example NSSI Staging Service XML
Figure 10. Example PBS script with NSSI Staging Service
Figure 11. DataTap architecture
Figure 12. Select DART as a transport method in the configuration file example
Figure 13. Start the server component in a job file first
Figure 14. Wait for server start-up completion and export the configuration to environment variables
Figure 15. BP file structure
Figure 16. Group index table
Figure 17. Variables index table
Figure 18. Process group structure
Figure 19. Attribute entry structure
Figure 20. bpls utility
Figure 21. bpdump utility
Figure 22. Original program (examples/C/manual/1_nonadios_example.c)
Figure 23. Example config.xml file
Figure 24. Example gwrite_temperature.ch file
Figure 25. Example adios program to write P files from P processors (examples/C/manual/2_adios_write.c)
Figure 26. Read in data generated by 2_adios_write using gread_temperature.ch (examples/C/manual/3_adios_read.c)
Figure 27. Example of a generated gread_temperature.ch file
Figure 28. Example ADIOS program writing N files from P processors (N)
Figure 29. Config.xml for a global array (examples/C/global-array/adios_global.xml)
Figure 30. gwrite header file generated from config.xml
Figure 31. Config.xml for a global array with time (examples/C/global-array-time/adios_globaltime.xml)
Figure 32. Config.xml for creating histogram for an array variable (examples/C/stat/stat.xml)
Abbreviations

ADIOS   Adaptive Input/Output System
API     Application Program Interface
DART    Decoupled and Asynchronous Remote Transfers
GTC     Gyrokinetic Turbulence Code
HPC     high-performance computing
I/O     input/output
MDS     metadata server
MPI     Message-Passing Interface
NCCS    National Center for Computational Sciences
ORNL    Oak Ridge National Laboratory
OS      operating system
PG      process group
POSIX   Portable Operating System Interface
RDMA    remote direct memory access
XML     Extensible Markup Language

Acknowledgments

The Adaptive Input/Output (I/O) system (ADIOS) is a joint product of the National Center for Computational Sciences (NCCS) at Oak Ridge National Laboratory (ORNL) and the Center for Experimental Research in Computer Systems at the Georgia Institute of Technology. This work is being led by Scott Klasky (ORNL); Jay Lofstead (Georgia Tech, funded by Sandia Labs) is the main contributor. ADIOS has greatly benefited from the efforts of the following ORNL staff: Steve Hodson, who gave tremendous input and guidance; Chen Jin, who integrated ADIOS routines into multiple scientific applications; and Norbert Podhorszki, who integrated ADIOS with the Kepler workflow system and worked with Qing Gary Liu on the read API. ADIOS also benefited from the efforts of the Georgia Tech team, including Prof. Karsten Schwan, Prof. Matt Wolf, Hasan Abbasi, and Fang Zheng. Wei Keng Liao, Northwestern University, and Wang Di, SUN, have also been invaluable in our coding efforts of ADIOS, writing several important parts of the code.

Essentially, ADIOS is a componentization of I/O transport methods. Among the suite of transport methods, Decoupled and Asynchronous Remote Transfers (DART) was developed by Prof. Manish Parashar and his student Ciprian Docan of Rutgers University.

Without scientific applications, ADIOS would not have come this far. Special thanks go to Stephane Ethier at the Princeton Plasma Physics Laboratory (GTS); Researcher Yong Xiao and Prof. Zhihong Lin from the University of California, Irvine (GTC); Julian Cummings at the California Institute of Technology; Seung-Hoe and Prof. C. S. Chang at New York University (XGC); Jackie Chen and Ray Grout at Sandia (S3D); and Luis Chacon at ORNL (Pixie3D).

This project is sponsored by ORNL, Georgia Tech, The Scientific Data Management Center (SDM) at Lawrence Berkeley National Laboratory, and the U.S. Department of Defense.

ADIOS contributors

ANL: Rob Ross
Georgia Tech: Hasan Abbasi, Jay Lofstead, Karsten Schwan, Fang Zheng
NCSU: Xiaosong Ma, Sriram Lakshminarasimhan, Abhijit Sachidananda, Michael Warren
Northwestern University: Alok Choudhary, Wei Keng Liao, Chen Jin
ORNL: Steve Hodson, Scott Klasky, Qing Gary Liu, Norbert Podhorszki, Steve Poole, Nagiza Samatova, Matthew Wolf
Rutgers University: Ciprian Docan, Fan Zhang, Manish Parashar
Sandia: Todd Kordenbrock
SUN: Wang Di

1 Introduction

1.1 Goals

As computational power has increased dramatically with the increase in the number of processors, input/output (I/O) performance has become one of the most significant bottlenecks in today’s high-performance computing (HPC) applications. With this in mind, ORNL and the Georgia Institute of Technology’s Center for Experimental Research in Computer Systems have teamed up to design the Adaptive I/O System (ADIOS) as a componentization of the I/O layer that is scalable, portable, and efficient on different clusters and supercomputer platforms. We also provide easy-to-use, high-level application program interfaces (APIs) so that application scientists can easily adopt the ADIOS library and produce science without diving too deeply into computer configuration and skills.

1.2 What Is ADIOS?

ADIOS is a state-of-the-art componentization of the I/O system that has demonstrated impressive I/O performance results on leadership-class machines and clusters, sometimes showing an improvement of more than 1000 times over well-known parallel file formats.
ADIOS is essentially a componentization of different I/O transport methods. This allows application scientists to adopt the best I/O method for each computer infrastructure with very little modification of their scientific applications. ADIOS has a suite of simple, easy-to-use APIs. Instead of being passed as API arguments, all the required metadata are stored in an external Extensible Markup Language (XML) configuration file, which is readable, editable, and portable across most machines.

1.3 The Basic ADIOS Group Concept

The ADIOS “group” is a concept in which input variables are tagged according to the functionality of their respective output files. For example, a common scientific application has checkpoint files prefixed with restart and monitoring files prefixed with diagnostics. In the XML configuration file, the user can define two separate groups with tag names of adios-group as “restart” and “diagnostic.” Each group contains a set of variables and attributes that need to be written into their respective output files. Each group can choose a different I/O transport method, whichever is optimal for its I/O pattern.

1.4 Other Interesting Features of ADIOS

ADIOS contains a new self-describing file format, BP. The BP file format was specifically designed to support delayed consistency, lightweight data characterization, and resilience. ADIOS also contains Python scripts that allow users to easily write entire “groups” by adding a single include statement to their Fortran/C code. Another interesting feature of ADIOS is that it allows users to use multiple I/O methods for a single group. This is especially useful if users want to write data out to the file system while simultaneously capturing the metadata in a database method and visualizing with a visualization method.
The read API enables reading arbitrary subarrays of variables in a BP file; thus, variables written out from N processors can be read in on an arbitrary number of processors. ADIOS also takes care of the endianness problem by converting to the reader’s architecture automatically at read time. A Matlab reader is included in the release, and the VisIt parallel interactive visualization software can also read BP files (from version 2.0). ADIOS is fully supported on Cray XT and IBM BlueGene/P computers as well as on Linux clusters and Mac OS X.

1.5 Future ADIOS 2.0 Goals

One of the main goals for ADIOS 2.0 is to produce faster reads via indexing methods. Another goal is to provide more advanced data types via XML in ADIOS so that it will be compatible with F90/C/C++ structures/objects. We will also work on the following advanced topics for ADIOS 2.0:

• A link to an external database for provenance recording.
• Autonomics through a feedback mechanism from the file system to optimize I/O performance. For instance, ADIOS can be adaptively changed from a synchronous to an asynchronous method or can decide when to write restart to improve I/O performance.
• A staging area for data querying, analysis, and in situ visualization.

2 Installation

2.1 Obtaining ADIOS

You can download the latest version from the following website:

http://www.nccs.gov/user-support/adios

2.2 Quick Installation

To get started with ADIOS, the following steps can be used to configure, build, test, and install the ADIOS library, header files, and support programs.

    cd trunk/
    ./configure --prefix= --with-mxml=
    make
    make install

Note: There is a runconf batch script in the trunk set up for our machines. Studying it can help you set up the appropriate environment variables and configure options for your system.
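The quick-installation steps above can be wrapped in a small shell function. The following is a minimal sketch and not part of the ADIOS distribution: the function name adios_quick_install, its three arguments, and the RUN dry-run variable are assumptions introduced here for illustration, and the directories passed to it are placeholders to be replaced with the locations on your system.

```shell
#!/bin/sh
# Sketch only: adios_quick_install and RUN are illustrative names, not ADIOS tools.
# Usage: adios_quick_install SRC_DIR INSTALL_DIR MXML_DIR
#   SRC_DIR      top of the ADIOS source tree (the "trunk" directory)
#   INSTALL_DIR  value passed to --prefix
#   MXML_DIR     value passed to --with-mxml
# Set RUN=echo to print the commands instead of executing them.
adios_quick_install() {
    src_dir=$1
    install_dir=$2
    mxml_dir=$3

    # Run in a subshell so the caller's working directory is unchanged.
    (
        cd "$src_dir" || exit 1
        ${RUN:-} ./configure --prefix="$install_dir" --with-mxml="$mxml_dir" || exit 1
        ${RUN:-} make || exit 1
        ${RUN:-} make install
    )
}

# Example call (all three paths are placeholders):
#   adios_quick_install trunk "$HOME/adios" "$HOME/mxml"
```

Stopping at the first failing step avoids installing a partially built library. Setting RUN=echo before the call gives a dry run that only prints the three commands, which can be useful for checking the configure options first.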
2.2.1 Linux cluster

The following is a snapshot of the batch scripts on Ewok, an Intel-based InfiniBand cluster running Linux:

    export MPICC=mpicc
    export MPIFC=mpif90
    export CC=pgcc
    export FC=pgf90
    export CFLAGS="-fPIC"
    ./configure --prefix= --with-mxml= --with-hdf5= --with-netcdf=

The compiler pointed to by MPICC is used to build all the parallel codes and tools using MPI, while the compiler pointed to by CC is used to build the sequential tools. In practice, mpicc uses the compiler pointed to by CC and adds the MPI library automatically. On clusters, this makes no real difference, but on BlueGene or Cray XT, parallel codes are built for compute nodes, while the sequential tools are built for the login nodes. The -fPIC compiler flag is needed only if you build the Matlab tools.

2.2.2 Cray XT5

To install ADIOS on a Cray XT5, the right compiler commands and configure flags need to be set. The required commands for ADIOS installation on Jaguar are as follows:

    export CC=cc
    export FC=ftn
    ./configure --prefix= --with-mxml= --with-hdf5= --with-netcdf=

2.2.3 Support for Matlab

Matlab requires that ADIOS be built with the GNU C compiler. It also requires relocatable code, so you need to add the -fPIC flag to CFLAGS before configuring ADIOS. The Matlab reader is not built automatically by make and is not installed with ADIOS. You need to compile it with Matlab’s MEX compiler after the make and copy the files manually to a location where Matlab can see them.

    cd tools/matlab
    make matlab

2.3 ADIOS Dependencies

2.3.1 Mini-XML parser (required)

The Mini-XML library is used to parse XML configuration files. Mini-XML can be downloaded from http://www.minixml.org/software.php

2.3.2 MPI and MPI-IO (required)

MPI and MPI-IO are required for the ADIOS 1.2 release. Currently, most large-scale scientific applications rely on the Message Passing Interface (MPI) library to implement communication among processes.
For instance, when the Portable Operating System Interface (POSIX) is used as the transport method, the rank of each process in the same communication group, which must be retrieved through MPI calls, is commonly used in defining the output files. MPI-IO can also be considered the most generic I/O library on large-scale platforms.

2.3.3 Fortran90 compiler (optional)

The Fortran 90 interface and example codes are compiled only if an f90 compiler is available. By default it is required, but you can disable it with the option --disable-fortran.

2.3.4 Serial NetCDF-3 (optional)

The bp2ncd converter utility to NetCDF format is built only if NetCDF is available. Currently ADIOS uses the NetCDF-3 library. Use the option --with-netcdf= or ensure that the NETCDF_DIR environment variable is set before configuring ADIOS.

2.3.5 Serial HDF5 (optional)

The bp2h5 converter utility to HDF5 format is built only if an HDF5 library is available. Currently ADIOS uses the 1.6 version of the HDF5 API, but it can also be built and used with the 1.8.x version of the HDF5 library. Use the option --with-hdf5= when configuring ADIOS.

2.3.6 PHDF5 (optional)

The transport method writing files in the Parallel HDF5 format is built only if a parallel version of the HDF5 library is (also) available. You need to use the option --with-phdf5= to build this transport method. If you define Parallel HDF5 and do not define serial HDF5, then bp2h5 will be built with the parallel library.

Note that if you build this transport method, ADIOS will depend on PHDF5 when you link any application with ADIOS, even if the application does not intend to use this method.

If you have problems compiling ADIOS with PHDF5 due to missing flags or libraries, you can define them using --with-phdf5-incdir= , --with-phdf5-libdir= and --with-phdf5-libs=

2.3.7 NetCDF-4 Parallel

The NC4 transport method writes files using the NetCDF-4 library, which in turn is based on the parallel HDF5 library.
You need to use the option --with-nc4par= to build this transport method. You also need the parallel HDF5 library.

2.3.8 Read-only installation

If you just want the read API to be compiled for reading BP files, use the --disable-write option.

2.4 Full Installation

The following list is the complete set of options that can be used with configure to build ADIOS and its support utilities:

    --help                   print the usage of the ./configure command
    --with-tags[=TAGS]       include additional configurations [automatic]
    --with-mxml=DIR          location of the Mini-XML library
    --with-hdf5=
    --with-hdf5-incdir=
    --with-hdf5-libdir=
    --with-phdf5=
    --with-phdf5-incdir=
    --with-phdf5-libdir=
    --with-netcdf=
    --with-netcdf-incdir=
    --with-netcdf-libdir=
    --with-nc4par=
    --with-nc4par-incdir=
    --with-nc4par-libdir=
    --with-nc4par-libs=      e.g. lnetcdf

Some influential environment variables are listed below:

    CC          C compiler command
    CFLAGS      C compiler flags
    LDFLAGS     linker flags, e.g. -L if you have libraries in a nonstandard directory
    CPPFLAGS    C/C++ preprocessor flags, e.g. -I if you have headers in a nonstandard directory
    CPP         C preprocessor
    CXX         C++ compiler command
    CXXFLAGS    C++ compiler flags
    FC          Fortran compiler command
    FCFLAGS     Fortran compiler flags
    CXXCPP      C++ preprocessor
    F77         Fortran 77 compiler command
    FFLAGS      Fortran 77 compiler flags
    MPICC       MPI C compiler command
    MPIFC       MPI Fortran compiler command

2.5 Compiling applications using ADIOS

ADIOS configuration creates a text file that contains the flags and library dependencies that should be used when compiling and linking user applications that use ADIOS. This file is installed as bin/adios_config.flags under the installation directory by make install. A script named adios_config is also installed; it can print out selected flags. Moreover, if you copy the adios_config.flags file and remove all " characters from it, you can include that file in your Makefile and use the flags.
3 ADIOS Write API

As mentioned earlier, ADIOS writing comprises two parts: the XML configuration file and the APIs. In this section, we explain the functionality of the write API in detail and how its routines are applied in a program.

3.1 Write API Description

3.1.1 Introduction

ADIOS provides both Fortran and C routines. All ADIOS routines and constants begin with the prefix "adios_". For the remainder of this section, only the C versions of ADIOS APIs are presented. The primary difference between the C and Fortran routines is that error codes are returned in a separate argument for Fortran as opposed to the return value for C routines.

A unique feature of ADIOS is the group implementation: a group consists of a list of variables and is associated with individual transport methods. This flexibility allows applications to make the best use of the file system according to their own I/O patterns.

3.1.2 ADIOS-required functions

This section contains the basic functions needed to integrate ADIOS into scientific applications. ADIOS is a lightweight I/O library, and there are only seven required functions from which users can write scalable, portable programs with flexible I/O implementation on supported platforms:

    adios_init          initialize ADIOS and load the configuration file
    adios_open          open the group associated with the file
    adios_group_size    pass the group size to allocate the memory
    adios_write         write the data either to internal buffer or disk
    adios_read          associate the buffer space for data read into
    adios_close         commit write/read operation and close the data
    adios_finalize      terminate ADIOS

You can add functions to your working knowledge incrementally without having to learn everything at once. For example, you can achieve better I/O performance on some platforms by simply adding the asynchronous functions adios_start_calculation, adios_end_calculation, and adios_end_iteration to your repertoire.
These functions will be detailed below in addition to the seven indispensable functions.

The following provides detailed descriptions of the required APIs for using ADIOS in Fortran or C applications.

3.1.2.1 adios_init

This API is required only once in the program. It loads the XML configuration file and establishes the execution environment. adios_init must be called before any other ADIOS operation starts, to create internal representations of the various data types and to define the transport methods used for writing.

    int adios_init (const char * xml_fname)

Input:
    xml_fname - string containing the name of the XML configuration file

Fortran example:

    call adios_init ("config.xml", ierr)

3.1.2.2 adios_open

This API is called whenever a new output file is opened. adios_open, corresponding to fopen (not surprisingly), opens an adios-group given by group_name and associates it with one or a list of transport methods, which can be identified in future operations by the File structure whose pointer is returned as fd_p. The group name should match the one defined in the XML file. The I/O handle fd_p prepares the data types for the subsequent calls to write data using the io_handle. The third argument, file_name, is a string representing the name of the file. The fourth argument, mode, is a string containing a file access mode. It can be any of these three mode specifiers: "r," "w," or "a." Currently, ADIOS supports three access modes: "write or create if file does not exist," "read," and "append file."

The call opens the file only if no coordination is needed among processes for the transport methods that the users have chosen for this adios_group, such as the POSIX method. Otherwise, the actual file will be opened in adios_group_size based on the provided argument comm, which will be examined in Sect. 3.1.2.3. As the last argument, we pass the pointer of the coordination communicator down to the transport method layer in ADIOS.
This communicator is required in MPI-IO-based methods such as collective and independent MPI-IO.

    int adios_open (int64_t * fd_p, const char * group_name, const char * file_name, const char * mode, void * comm)

Input:
    fd_p - pointer to the internal file structure
    group_name - string containing the name of the group
    file_name - string containing the name of the file to be opened
    mode - string containing a file access mode
    comm - communicator for multi-process coordination

Fortran example:

    call adios_open (handle, "restart", "restart.bp", "w", comm, ierr)

3.1.2.3 adios_group_size

This function passes the size of the group to the internal ADIOS transport structure to facilitate the internal buffer management and to construct the group index table. The first argument is the file handle. The second argument is the size of the payload for the group opened in the adios_open routine. This value can be calculated manually or through our python script. It does not affect read operations because the size of the data can be retrieved from the file itself. The third argument is the returned value for the total size of this group, including payload size and the metadata overhead. The value can be used for performance benchmarks, such as I/O speed.

    int adios_group_size (int64_t * fd_p, uint64_t group_size, uint64_t * total_size)

Input:
    fd_p - pointer to the internal file structure
    group_size - size of data payload in bytes to be written out. If there is an integer 2 × 3 array, the payload size is 4*2*3 (4 is the size of integer)

Output:
    total_size - the total sum of payload and overhead, which includes name, data type, dimensions, and other metadata

Fortran example:

    call adios_group_size (handle, groupsize, totalsize, ierr)

3.1.2.4 adios_write

The adios_write routine submits a data element var for writing and associates it with the given var_name, which has been defined in the adios group opened by adios_open.
If the ADIOS buffer is big enough to hold all the data that the adios group needs to write, this API only copies the data to the buffer. Otherwise, adios_write will write to disk without buffering. Currently, adios_write supports only the address of a contiguous block of memory to be written. In the case of a noncontiguous array comprising a series of contiguous memory blocks, var should be given separately for each piece.

In the next XML section, we will further explain that var_name is the value of attribute "name" while var is the value of attribute "gwrite," both of which are defined in the corresponding element inside adios_group in the XML file. By default, it will be the same as the value of attribute "name" if "gwrite" is not defined.

    int adios_write (int64_t fd_p, const char * var_name, void * var)

Input:
    fd_p - pointer to the internal file structure
    var_name - string containing the annotation name of the scalar or vector in the XML file
    var - the address of the data element that needs to be written

Fortran example:

    call adios_write (handle, "myvar", v, ierr)

3.1.2.5 adios_read

The write API contains a read function (historically, the first one) that uses the same transport method and the XML config file to read in data. It works only on the same number of processes as the data was written out from. Typically, checkpoint/restart files are written and read on the same number of processors, and this function is the simplest way to read in data. However, if you need to read in on a different number of processors, or you do not want to carry the XML config file with the reading application, you should use the newer and more generic read API discussed in Section 6.

Similar to adios_write, adios_read submits a buffer space var for reading a data element into. This does NOT actually perform the read. Actual population of the buffer space will happen on the call to adios_close. In other words, the value(s) of var can only be utilized after adios_close is performed.
Here, var_name corresponds to the value of attribute "gread" in the element declaration while var is mapped to the value of attribute "name." By default, it will be the same as the value of attribute "name" if "gread" is not defined.

    int adios_read (int64_t fd_p, const char * var_name, uint64_t read_size, void * var)

Input:
    fd_p - pointer to the internal file structure
    var_name - the name of the variable recorded in the file
    var - the address of the variable defined in source code
    read_size - size in bytes of the data to be read in

Fortran example:

    call adios_read (handle, "myvar", 8, v, ierr)

3.1.2.6 adios_close

The adios_close routine commits the writing buffer to disk, closes the file, and releases the handle. At that point, all of the data that have been copied during adios_write will be sent as-is downstream. If the handle was opened for read, adios_close fetches the data, parses it, and populates the provided buffers. This is currently hard-coded to use POSIX I/O calls.

    int adios_close (int64_t * fd_p);

Input:
    fd_p - pointer to the internal file structure

Fortran example:

    call adios_close (handle, ierr)

3.1.2.7 adios_finalize

The adios_finalize routine releases all the resources allocated by ADIOS and guarantees that all remaining ADIOS operations are finished before the code exits. The ADIOS execution environment is terminated once the routine completes. The proc_id parameter gives users the opportunity to customize special operations on proc_id, usually the ID of the head node.

    int adios_finalize (int proc_id)

Input:
    proc_id - the rank of the process in the communicator or the user-defined coordination variable

Fortran example:

    call adios_finalize (rank, ierr)

3.1.3 Nonblocking functions

3.1.3.1 adios_end_iteration

The adios_end_iteration routine provides the pacing indicator. Based on the entry in the XML file, it will tell the transport method how much time has elapsed in a transfer.
3.1.3.2 adios_start_calculation/adios_end_calculation

Together, adios_start_calculation and adios_end_calculation let the scientific code indicate when nonblocking methods should focus on engaging their I/O communication efforts because the process is mainly performing intense, stand-alone computation. Otherwise, the code is deemed likely to be communicating heavily for computation coordination. Any attempts to write or read during those times will negatively impact both the asynchronous I/O performance and the interprocess messaging.

3.1.4 Other function

One of our design goals is to keep ADIOS APIs as simple as possible. In addition to the basic I/O functions, we provide another routine listed below.

3.1.4.1 adios_get_write_buffer

The adios_get_write_buffer function returns the buffer space allocated from the ADIOS buffer domain. In other words, instead of allocating memory from free memory space, users can directly use the allocated ADIOS buffer area and thus avoid copying memory from the ADIOS buffer to a user-defined buffer.

    int adios_get_write_buffer (int64_t fd_p, const char * var_name, uint64_t * size, void ** buffer)

Input:
    fd_p - pointer to the internal File structure
    var_name - name of the variable that will be read
    size - size of the buffer to request

Output:
    buffer - initial address of read-in buffer for storing the data of var_name

3.1.5 Create a first ADIOS program

Figure 1 is a programming example that illustrates how to write a double-precision array t of size NX into a file called "test.bp," which is organized in BP, our native tagged binary file format. This format allows users to include rich metadata associated with the block of binary data as well as the indexing mechanism for different blocks of data (see Chap. 5).
    /* example of parallel MPI write into a single file */
    #include
    // ADIOS header file required
    #include "adios.h"

    int main (int argc, char *argv[])
    {
        int i, rank, NX;
        double t [NX];
        // ADIOS variables declaration
        int64_t handle;
        uint64_t total_size;
        MPI_Comm comm = MPI_COMM_WORLD;

        MPI_Init (&argc, &argv);
        MPI_Comm_rank (comm, &rank);
        // data initialization
        for ( i=0; i

Figure 2. Example XML configuration

4.2 adios-group

The adios-group element represents a container for a list of variables that share a common I/O pattern, as stated in the basic concepts of ADIOS in the first chapter. In this case, the group domain division logically corresponds to the different functions of output in scientific applications, such as restart, diagnosis, and snapshot. Depending on the application, adios-group can occur as many times as needed.

4.2.1 Declaration

The following example illustrates how to declare an adios group inside an XML file. First we start with adios-group as our tag name, which is case insensitive. It has an indispensable attribute called "name," whose value is usually defined as a descriptive string indicating the function of the group. In this case, the string is called "restart," because the files into which this group is written are used as checkpoints. The second attribute, "host-language," indicates the language in which this group's I/O operations are written. The value of attribute "coordination-communicator" is used to coordinate the operations on a shared file accessed by multiple processes in the same communicator domain. "Coordination-var" provides the ability to use a user-defined variable, for example mype, rather than an MPI communicator for file coordination.
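As a minimal sketch of such a declaration, the fragment below uses only the attribute names discussed in this section. The communicator name comm and the choice of Fortran as host language are illustrative assumptions, not values taken from a shipped configuration; the inner comment marks where the group's nested elements would go.

```xml
<!-- Illustrative sketch only: "comm" is an assumed name for the MPI
     communicator variable in the application source. -->
<adios-group name="restart" host-language="Fortran"
             coordination-communicator="comm">
  <!-- var, attribute, and gwrite elements for this group go here -->
</adios-group>
```

A group that coordinates through a user-defined variable such as mype would presumably set coordination-var in place of coordination-communicator.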
Required:
• name - a descriptive string to name the group

Optional:
• host-language - language in which the source code for the group is written
• coordination-communicator - MPI-IO writing to a shared file
• coordination-var - coordination variables for non-MPI methods, such as the Datatap method
• time-index - time attribute variable

4.2.2 Variables

The nested variable element "var" of adios_group can be either an array or a primitive data type, depending on the dimension attribute provided.

4.2.2.1 Declaration

The following is an example showing how to define a variable in the XML file.

4.2.2.2 Attribute list

The attributes associated with the var element are as follows:

Required:
• name - the string name of the variable stored in the output file
• type - the data type of the variable

Optional:
• gwrite - the value will be used in the python scripts to generate adios_write routines; the default value will be the same as attribute name if gwrite is not defined.
• gread - the value will be used in the python scripts to generate adios_read routines; the default value will be the same as attribute name if gread is not defined.
• path - HDF-5-style path for the element or path to the HDF-5 group or data item to which this attribute is attached. The default value is "/".
• dimensions - a comma-separated list of numbers and/or names that correspond to integer var elements and determine the size of this item. If not specified, the variable is scalar.
• read - value is either yes or no; in the case of no, the adios_read routine will not be generated for this var entry. If undefined, the default value will be yes.

4.2.3 Attributes

The attribute element for adios_group provides users with the ability to specify more descriptive information about the variables or group. Attributes can be defined in either a static or a dynamic fashion.
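The two styles can be sketched as follows. The attribute names, path, and value shown are placeholders chosen for illustration; the fragment assumes the keywords name, path, value, type, and var listed in Section 4.2.3.2, and it assumes that a variable called time of type integer is declared in the same adios-group.

```xml
<adios-group name="restart">
  <!-- Assumed variable backing the dynamic attribute below -->
  <var name="time" type="integer"/>

  <!-- Static attribute: the value and its type are fixed in the XML -->
  <attribute name="description" path="/restart"
             value="checkpoint data" type="string"/>

  <!-- Dynamic attribute: the value is taken from var "time" at run time -->
  <attribute name="sim-time" path="/restart" var="time"/>
</adios-group>
```

Note that the static form carries value and type, while the dynamic form carries only var, since the two are mutually exclusive.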
4.2.3.1 Declaration

The static type of attributes can be defined as follows:

If an attribute has a dynamic value that is determined by the runtime execution of the program, it can be specified as follows:

where var "time" needs to be defined in the same adios-group.

4.2.3.2 Attribute list

Required:
• name - name of the attribute
• path - hierarchical path inside the file for the attribute
• value - the static value of the attribute; mutually exclusive with the attribute var
• type - string or numeric type, paired with the attribute value; in other words, also mutually exclusive with the attribute var
• var - the attribute has a dynamic value that is defined by the variable named in var

4.2.4 Gwrite/src

The gwrite element is unlike var or attribute, which are parsed and stored in the internal file structure in ADIOS. The gwrite element only affects the execution of the python scripts (see Chap. 10). Any content (usually comments, conditional statements, or loop statements) in the value of attribute "src" is copied identically into the generated pre-processing files.

Declaration

Required:
• src - any statement that needs to be added into the source code. This code will be inserted into the source code and must compile in the host language, C or Fortran.

4.2.5 Global arrays

The global-bounds element is an optional nested element for the adios-group. It specifies the global space and offsets within that space for the enclosed variable elements. In the case of writing to a shared file, the global-bounds information is recorded in the BP file and can be interpreted by converters or other postprocessing tools or used to write out either HDF5 or NetCDF files by using PHDF5 or the PnetCDF method.

4.2.6 Time-index

ADIOS allows a dataset to be expanded in the space domain given by global bounds and in the time domain. It is very common for scientific applications to write out a monitoring file at regular intervals.
The file usually contains a group of time-based variables that have an undetermined dimensional value on the time axis. ADIOS is similar to NetCDF in that it accumulates the time-index in terms of the number of records, which theoretically can be added to infinity. If any of the variables in an adios group are time based, they can be marked by adding the time-index variable as another dimension value.

4.2.6.1 Declaration

… variable declarations …

Required:
• dimensions - the dimension of global space
• offsets - the offset of the data set in global space

Any variables used in the global-bounds element for dimensions or offsets declaration need to be defined in the same adios-group as either variables or attributes. For detailed global arrays use, see the example illustrated in Section 11.8.

Changing I/O Without Changing Source:

The method element provides the hook between the adios-group and the transport methods. The user employs a different transport method simply by changing the method attribute of the method element. If more than one method element is provided for a given group, each element will be invoked in the order specified. This neatly gives triggering opportunities for workflows. To trigger a workflow once the analysis data set has been written to disk, the user makes two method element entries for the analysis adios-group. The first indicates how to write to disk, and the second performs the trigger for the workflow system. No recompilation, relinking, or any other code changes are required for any of these changes to the XML file.

4.2.7 Declaration

The transport element is used to specify the mapping of an I/O transport method, including optional initialization parameters, to the respective adios-group. There are two major attributes required for the method element:

Required:
• group - corresponds to an adios-group specified earlier in the file.
• method - a string indicating a transport method to use with the associated adios-group

Optional:
• priority - a numeric priority for the I/O method to better schedule this write with others that may be pending currently
• base-path - the root directory to use when writing to disk or for similar purposes
• iterations - a number of iterations between writes of this group, used to gauge how quickly this data should be evacuated from the compute node

4.2.8 Methods list

As the componentization of the I/O substrate, ADIOS supports a list of transport methods, described in Section 5:

• NULL
• POSIX
• MPI
• MPI-LUSTRE
• MPI-AMR
• PHDF5
• NC4 (NETCDF4)
• NSSI
• DATATAP
• DART
• DIMES
• MPI-CIO (as research method, not published in 1.2)
• ADAPTIVE (as research method, not published in 1.2)

4.3 Buffer specification

The buffer element defines the attributes for internal buffer size and creation time that apply to the whole application (Figure 3). The attribute allocate-time is either "now" or "oncall" and indicates when the buffer should be allocated. An "oncall" attribute waits until the programmer decides that all memory needed for calculation has been allocated. The program then calls upon ADIOS to allocate the buffer. There are two alternative attributes for users to define the buffer size: size-MB and free-memory-percentage.

4.3.1 Declaration

Required:
• size-MB - the user-defined size of the buffer in megabytes. ADIOS can allocate at most the memory available on the compute nodes. It is exclusive with free-memory-percentage.
• free-memory-percentage - the user-defined percentage, from 0 to 100%, of the free memory available on the machine. It is exclusive with size-MB.
• allocate-time - indicates when the buffer should be allocated

4.4 Enabling Histogram

ADIOS 1.2 has the ability to compute a histogram of the given variable's data values at write time. This is specified via a tag in the XML file.
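As a hedged sketch, two equivalent histogram declarations might look like the following. The element name analysis is an assumption that should be checked against the ADIOS release in use, as are the group name temperature and the variable name temp; the attribute names are the ones listed in Section 4.4.1. The second form assumes count is read as the number of equal-width intervals between min and max.

```xml
<!-- Form 1: explicit break points, sorted in ascending order -->
<analysis adios-group="temperature" var="temp"
          break-points="0, 100, 200, 300"/>

<!-- Form 2: range plus number of intervals; under the stated assumption
     this is intended to yield the same break points as Form 1 -->
<analysis adios-group="temperature" var="temp"
          min="0" max="300" count="3"/>
```

Only one of the two forms would be needed for a given variable; they are shown together here for comparison.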
The parameters "adios-group" and "var" specify the variable for which the histogram is to be computed. "var" is the name of the variable, and "adios-group" is the name of the adios group to which the variable belongs.

4.4.1 Declaration

The histogram binning intervals can be input in two ways via the XML file:

• By listing the break points as a list of comma-separated values in the parameter "break-points"
• By specifying the boundaries of the breaks and the number of intervals between the variable's min and max values

With break points at 0, 100, 200, and 300, both inputs create the bins (-Inf, 0), [0, 100), [100, 200), [200, 300), [300, Inf). For this example, the final set of frequencies for these 5 binning intervals will be calculated.

Required:
• adios-group - corresponds to an adios-group specified earlier in the file.
• var - corresponds to a variable in adios-group specified earlier in the file.

Optional:
• break-points - list of comma-separated values sorted in ascending order
• min - minimum value of the binning boundary
• max - maximum value of the binning boundary (it should be greater than min)
• count - number of break points between the min and max values

A valid set of binning intervals must be provided either by specifying the "min," "max," and "count" parameters or by providing the "break-points." If "min," "max," and "count" as well as "break-points" are provided, the intervals given under "break-points" take precedence when calculating the histogram intervals.

4.5 An Example XML file

Figure 3. Example XML file for time allocation.

5 Transport methods

Because of the time it can take to move data from one process to another or to write and read data to and from a disk, it is often advantageous to arrange the program so that some work can be done while the messages are in transit. So far, we have used non-blocking operations to avoid waiting. Here we describe some details for arranging a program so that computation and I/O can take place simultaneously.
5.1 Synchronous methods

5.1.1 NULL

The ADIOS NULL method allows users to quickly comment out an ADIOS group by changing the transport method to "NULL." Users can then test the speed of the routine by timing the output against no I/O. This is especially useful when working with asynchronous methods, which take an indeterminate amount of time. Another useful feature of this method is that it quickly allows users to test the system and determine whether bugs are caused by the I/O system or by other places in the code.

5.1.2 POSIX

The simplest method provided in ADIOS just does binary POSIX I/O operations. Currently, it does not support shared file writing or reading and has limited additional functionality. The main purpose of the POSIX I/O method is to provide a simple way to migrate a one-file-per-process I/O routine to ADIOS and to test the results without introducing any complexity from MPI-IO or other I/O methods. Performance gains from using this transport method alone are likely due to our aggressive buffering for better streaming performance to storage. The buffering method writes out files in BP format, which is a compact, self-describing format.

Additional features may be added to the ADIOS POSIX transport method over time. A new transport method with a related name, such as POSIX-ASCII, may be provided to perform I/O with additional features. The POSIX-ASCII example would write out a text version of the data formatted nicely according to some parameters provided in the XML file.

5.1.3 MPI

Many large-scale scientific simulations generate a large amount of data, spanning thousands of files or datasets. The use of MPI-IO reduces the number of files and thus is helpful for data management, storage, and access.

The original MPI method was developed based on our experiments with generating better MPI-IO performance on the ORNL Jaguar machine.
Many of the insights have helped us achieve excellent performance on both the Jaguar XT4 machine and on the other clusters. Some of the key insights we have taken advantage of include artificially serialized MPI_File_open calls and additional timing delays that reduce delays due to metadata server (MDS) conflicts on the attached Lustre storage system.

The adapted code takes full advantage of NxM grouping through the coordination-communicator. This grouping generates one file per coordination-communicator with the data stored sequentially based on the process rank within the communicator. Figure 4 presents the example of the GTC code, in which 32 processes in the same toroidal zone write to one integrated file. Additional serialization of the MPI_File_open calls is done using this communicator as well because each process may have a different size data payload. Rank 0 calculates the size that it will write, calls MPI_File_open, and then sends its size to rank 1. Rank 1 listens for the offset to start from, adds its calculated size, does an MPI_File_open, and sends the new offset to rank 2. This continues for all processes within the communicator. Additional artificial delays, based on the number of processes in the communicator and the projected load on the Lustre MDS, can be introduced; they ultimately reduce the amount of time the MPI_File_open calls take by reducing the bottleneck at the MDS. An important fact to note is that individual file pointers are retrieved by MPI_File_open so that each process has its own file pointer for file seek and other I/O operations.

Figure 4. Server-friendly metadata approach: offset the create/open in time

We built the MPI transport method mainly with Lustre in mind because it has been the primary parallel storage service we have available. However, other file-system-specific tunings are certainly possible and fully planned as part of this transport method system.
For each new file system we encounter, a new transport method implementation tuned for that file system, and potentially that platform, can be developed without impacting any of the scientific code.

The MPI transport method is the most mature, fully featured, and well-tested method in ADIOS. We recommend that anyone creating a new transport method study it as a model of full functionality and of some of the advantages that can be gained through careful management of the storage resources.

5.1.4 MPI_LUSTRE

The MPI_LUSTRE method is the MPI method with stripe alignment to achieve even greater write performance on the Lustre file system. Each writing process's data is aligned to Lustre stripes. This results in better parallelization of the storage elements. The drawback of using this method is that empty chunks are created between the data sets of the separate processes in the output file, and thus the file size is larger than with the MPI method. The size of an empty space is the difference between the size of the output data of one writing process and the total size of the Lustre stripes that can hold that amount of data, so that the next writing process's output starts aligned with another stripe. Therefore, choose the stripe size for the output file carefully to make the empty space as small as possible.

The following XML snippet shows how to use the MPI_LUSTRE method in ADIOS.

    stripe_count=16,stripe_size=4194304,block_size=4194304

There are three key parameters used in this method.

• stripe_count specifies how many storage targets to use for the whole output file. If not set, the default value is 4.
• stripe_size specifies the Lustre stripe size in bytes. If not set, the default value is 1048576 (i.e. 1 MB).
• block_size specifies the size of each I/O write request.
As an example, if the total data size to be written from one process is 800 MB at a time, and you want ADIOS to issue twenty I/O write requests from that process to Lustre during the writing, then the block_size should be 40 MB.

5.1.5 MPI_AMR

The MPI_AMR method is designed to maximize write performance for applications such as adaptive mesh refinement (AMR) on the Lustre file system. In AMR-like applications, each processor outputs a varying amount of data, and some output very little data. Building on the MPI_LUSTRE method, MPI_AMR further improves the write speed by

1. aggregating data from multiple MPI processors into large chunks. This effectively increases the size of each request and reduces the number of I/O requests.

2. threading the metadata operations such as file open. Users are encouraged to call the adios_open and adios_group_size APIs as early as possible. If the Lustre MDS suffers a performance hit, the overall metadata performance won't be affected. The following code snippet shows a typical way of using this method to improve metadata performance.

    adios_open(...);
    adios_group_size(...);
    ......
    // do your computation
    ......
    adios_write(..);
    adios_write(..);
    adios_close(..);

3. further removing communication and wide striping overhead by writing out subfiles. Please refer to the POSIX method on how to read data from subfiles.

The following XML snippet shows how to use the MPI_AMR method in ADIOS.

    stripe_count=1;stripe_size=10485760;block_size=10485760;num_aggregators=2400;merging_pgs=0

There are five key parameters used in this method.

• stripe_count specifies how many storage targets to stripe across for each subfile. If not set, the default value is Lustre's default value (i.e., 4). It is recommended that this value be set to 1 in the ADIOS 1.2 release.
• stripe_size specifies the Lustre stripe size in bytes. If not set, the default value is 1048576 (i.e., 1 MB).
• block_size specifies the size of each I/O write request.
As an example, if block_size is 4 MB and the total data to write out is 8 MB, two I/O write requests will be issued.
• num_aggregators specifies the number of aggregators to use.
• merging_pgs is a flag that specifies whether ADIOS process groups are merged during the aggregation operation. It is recommended that this flag be set to 0 in the ADIOS 1.2 release.

To select the num_aggregators parameter, suppose you have an MPI job with 120,000 processors and the number of aggregators is set to 2400. Then each aggregator will aggregate the data from 120,000/2400 = 50 processors. Note that setting num_aggregators too small can cause out-of-memory errors, since each aggregator must then hold the data of more processors.

5.1.6 PHDF5

HDF5, as a hierarchical file structure, has been widely adopted for data storage in various scientific research fields. Parallel HDF5 (PHDF5) provides a series of APIs to perform I/O operations in parallel from multiple processors, which dramatically improves I/O performance compared with the sequential approach to reading or writing an HDF5 file. To make the difference in transport methods and file formats transparent to end users, we provide a mechanism that writes and reads an HDF5 file with the same schema, keeping the same common ADIOS routines and changing only one entry in the XML file. This method lets users write out exactly the same HDF5 files as those generated by their original PHDF5 routines, so the same analysis tool chain can be used to analyze the data.

Currently, HDF5 supports two I/O modes, independent and collective read or write, which can use either the MPI or the POSIX driver by specifying the dataset transfer property list in H5Dwrite function calls. In this release, only the MPI driver is supported in ADIOS; later on, both I/O drivers will be supported by changing the attribute information for PHDF5 method elements in the XML.
5.1.7 NetCDF4

Another widely accepted standard file format is NetCDF, which is the most frequently used file format in the climate and weather research communities. Beginning with the NetCDF 4.0 release, NetCDF has added PHDF5 as a new option for data storage called the “netcdf-4 format”. When a NetCDF4 file is opened in this new format, NetCDF4 inherits PHDF5's parallel I/O capabilities.

The NetCDF4 method creates a single shared file in the “netcdf-4 format” and uses the parallel I/O features. The NetCDF4 method supports multiple open files. To select the NetCDF4 method, use “NC4” as the method name in the XML file.

Restrictions: Due to the collective nature of the NetCDF4 API, there are some legal XML files that will not work with the NetCDF4 method. The most notable incompatibility is an XML fragment that creates an array variable without a surrounding global-bounds. Within the application, a call to adios_set_path() is used to add a unique prefix to the variable name. A rank-based prefix is an example.

Figure 5. Example XML

    char path[1024];
    adios_init ("config.xml");
    adios_open (&adios_handle, "atoms", filename, "w", &comm);
    /* build a rank-based prefix, e.g. "node_3_" */
    sprintf (path, "node_%d_", myrank);
    adios_set_path (adios_handle, path);
    #include "gwrite_atoms.ch"
    adios_close (adios_handle);
    adios_finalize (myrank);

Figure 6. Example C source

This technique is an optimization that allows each rank to create a variable of the exact dimensions of the data being written. In this example, each rank may be tracking a different number of atoms. The NetCDF4 collective API expects each rank to write the same variable with the same dimensions. The example violates both of these expectations.

Note: NetCDF4 files created in the new “netcdf-4 format” cannot be opened with existing tools linked with NetCDF 3.x. However, NetCDF4 provides a backward compatibility API, so these tools can be relinked with NetCDF4. After relinking, these tools can open files in the “netcdf-4 format”.
5.1.8 Other methods

ADIOS provides an easy plug-in mechanism for users or developers to design their own transport method. Step-by-step instructions for inserting a new I/O method are given in Section 12.1. Users can choose the best method for their platforms from among the supported or customized methods, thus avoiding the need to verify their source code again when switching I/O methods.

5.2 Asynchronous methods

5.2.1 Network Scalable Service Interface (NSSI)

The Network Scalable Service Interface (NSSI) is a client-server development framework for large-scale HPC systems. NSSI was originally developed out of necessity for the Lightweight File Systems (LWFS) project, a joint effort between researchers at Sandia National Laboratories and the University of New Mexico. The LWFS approach was to provide a core set of fundamental capabilities for security, data movement, and storage, and to allow extensibility through the development of additional services. The NSSI framework was designed to be the vehicle for the rapid development of such services.

The NSSI method is composed of two components: a client method and a staging service. The client method does not perform any file I/O. Instead, all ADIOS operations become requests to the staging service. The staging service is an ADIOS application, which allows the user to select any ADIOS method for output.

Client requests fall into two categories: pass-through and cached. Pass-through requests are synchronous on the staging service and return an error immediately on failure. adios_open() is an example of a pass-through request. Cached requests are asynchronous on the staging service and return an error at a later time on failure. adios_write() is an example of a cached request. All data cached for a particular file is aggregated and flushed when the client calls adios_close().

Each component requires its own XML config file.
The client method can be selected in the client XML config using “NSSI” as the method. The service XML config must be the same as the client XML config except that the method is “NSSI_FILTER”. When the NSSI_FILTER method is selected, the “submethod” parameter is required. The “submethod” parameter specifies the ADIOS method that the staging service will use for output.

Converting an existing XML config file for use with NSSI is illustrated in the following three figures.

    max_storage_targets=160

Figure 7. Example Original Client XML

Figure 8. Example NSSI Client XML

    submethod="MPI";subparameters="max_storage_targets=160"

Figure 9. Example NSSI Staging Service XML

After creating the new config files, the application's PBS script (or other runtime script) must be modified to start the staging service before the application launches and to stop it after the application terminates. The ADIOS distribution includes three scripts to help with these tasks.

The start.nssi.staging.sh script launches the staging service. It takes two arguments: the number of staging services and an XML config file.

The create.nssi.config.sh script creates an XML file that the NSSI method uses to locate the staging services. It takes two arguments: the name of the output config file and the name of the file containing a list of service contact info. The service contact file is created by the staging service at startup. The staging service uses the ADIOS_NSSI_CONTACT_INFO environment variable to determine the pathname of the contact file.

The kill.nssi.staging.sh script sends a kill request to the staging service. It takes one argument: the name of the file containing a list of service contact info (ADIOS_NSSI_CONTACT_INFO). The staging service will then terminate gracefully.
    #!/bin/bash
    #PBS -l walltime=01:00:00,size=128

    export RUNTIME_PATH=/tmp/work/$USER/genarray3d.$PBS_JOBID
    mkdir -p $RUNTIME_PATH
    cd $RUNTIME_PATH

    export ADIOS_NSSI_CONTACT_INFO=$RUNTIME_PATH/nssi_contact.xml
    export ADIOS_NSSI_CONFIG_FILE=$RUNTIME_PATH/nssi_config.xml

    $ADIOS_DIR/scripts/start.nssi.staging.sh 4 $RUNTIME_PATH/genarray3d.server.xml >server.log 2>&1 &
    sleep 3
    $ADIOS_DIR/scripts/create.nssi.config.sh $ADIOS_NSSI_CONFIG_FILE $ADIOS_NSSI_CONTACT_INFO

    aprun -n 64 $ADIOS_SRC_PATH/tests/genarray/genarray $RUNTIME_PATH/test.output 4 4 4 128 128 80 >runlog

    $ADIOS_DIR/scripts/kill.nssi.staging.sh $ADIOS_NSSI_CONTACT_INFO

Figure 10. Example PBS script with NSSI Staging Service

Figure 10 is an example PBS script that highlights the changes required to launch the NSSI staging service.

Required Environment Variables. The NSSI staging service requires that the ADIOS_NSSI_CONTACT_INFO variable be set. This variable specifies the full pathname of the file that the service uses to save its contact information. Depending on the platform, the contact information is a NID/PID pair or a hostname/port pair. Rank 0 is responsible for gathering the contact information from all members of the job and writing the contact file. The NSSI method requires that the ADIOS_NSSI_CONFIG_FILE variable be set. This variable specifies the full pathname of the file that contains the complete configuration information for the NSSI method. A configuration file with contact information and reasonable defaults for everything else can be created with the create.nssi.config.sh script.

Calculating the Number of Staging Services Required. Remember that all adios_write() operations are cached requests. This implies that the staging service must have enough RAM available to cache all data written by its clients between adios_open() and adios_close(). In addition, the current aggregation algorithm requires a buffer, equal in size to the cached data, into which the data is aggregated.
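The two memory requirements above (the cached data plus an aggregation buffer of the same size) mean that a staging service needs roughly twice the volume its clients write between adios_open() and adios_close(). The following C sketch works through that arithmetic under stated assumptions: every client writes the same number of bytes, and the usable RAM per service is supplied by the caller. The function names are hypothetical and are not part of ADIOS or its scripts.

```c
#include <stdint.h>

/* RAM one staging service needs for its clients: the cached data
 * itself plus an aggregation buffer of the same size. */
static uint64_t staging_ram_needed(uint64_t bytes_per_client,
                                   uint64_t clients_per_service)
{
    return 2 * bytes_per_client * clients_per_service;
}

/* Smallest number of staging services such that each service's share
 * of the clients fits into usable_ram bytes. Returns 0 when even a
 * single client does not fit or when bytes_per_client is 0. */
static uint64_t min_staging_services(uint64_t bytes_per_client,
                                     uint64_t nclients,
                                     uint64_t usable_ram)
{
    if (bytes_per_client == 0)
        return 0;
    uint64_t max_clients = usable_ram / (2 * bytes_per_client);
    if (max_clients == 0)
        return 0;
    return (nclients + max_clients - 1) / max_clients; /* ceiling division */
}
```

The result is a lower bound only. The count may need to be rounded up further so that the number of clients is evenly divisible by the number of services.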
The start.nssi.staging.sh script launches a single service per node, so the largest amount of data that can be cached per service is 50% of the node's memory after subtracting system overhead. System overhead can be estimated at 500 MB. If a node has 16 GB of memory, the amount of data that can be cached is 7.75 GB ((16 GB - 500 MB)/2). To balance the load on the staging services, the number of clients should be evenly divisible by the number of staging services.

Calculating the Number of Additional Cores Required for Staging. The NSSI staging services run on compute nodes, so additional resources are required to run the job. For each staging service required, add the number of cores per node to the size of the job. If each node has 12 cores and the job requires 16 staging services, add 192 cores to the job.

The NSSI transport method is experimental and is not included with the public version of the ADIOS source code in this release; however, it is available for use on the XT4 and XT5 machines at ORNL.

5.2.2 DataTap

DataTap is an asynchronous data transport method built to ensure very high levels of scalability through server-directed I/O. It is implemented as a request-read service designed to bridge the order-of-magnitude difference between the memory available on the I/O partition and that on the compute partition. We assume the existence of a large number of compute nodes producing data (we refer to them as “DataTap clients”) and a smaller number of I/O nodes receiving the data (we refer to them as “DataTap servers”) (see Figure 11).

Figure 11. DataTap architecture

Upon application request, the compute node marks up the data in PBIO format and issues a request for a data transfer to the server. The server queues the request until sufficient receive buffer space is available. The major cost associated with setting up the transfer is the cost of allocating the data buffer and copying the data.
However, this overhead is small enough to have little impact on the overall application runtime. When the server has sufficient buffer space, a remote direct memory access (RDMA) read request is issued to the client to read the remote data into a local buffer. The data are then written out to disk or transmitted over the network as input for further processing in the I/O Graph.

We used the Gyrokinetic Turbulence Code (GTC) as an experimental testbed for the DataTap transport. GTC is a particle-in-cell code for simulating fusion within tokamaks, and it is able to scale to multiple thousands of processors. In its default I/O pattern, the dominant I/O cost comes from each processor writing out its local particle array into a file. Asynchronous I/O reduces this cost to just a local memory copy, thereby reducing the overhead of I/O in the application.

The DataTap transport method is experimental and is not included with the public version of the ADIOS source code in this release; however, it is available for use on the XT4 and XT5 machines at ORNL.

5.2.3 Decoupled and Asynchronous Remote Transfers (DART)

DART is an asynchronous I/O transfer method within ADIOS that enables low-overhead, high-throughput data extraction from a running simulation. DART consists of two main components: (1) a DARTClient module and (2) a DARTServer module. Internally, DART uses RDMA to implement the communication, coordination, and data transport between the DARTClient and the DARTServer modules.

The DARTClient module is a light library that provides the asynchronous I/O API. It integrates with the ADIOS layer by extending the generic ADIOS data transport hooks. It uses the ADIOS layer features to collect and encode the data written by the application into a local transport buffer. Once it has collected data from a simulation, DARTClient notifies the DARTServer through a coordination channel that it has data available to send out.
DARTClient then returns and allows the application to continue its computations while data are asynchronously extracted by the DARTServer.

The DARTServer module is a stand-alone service that runs independently of a simulation on a set of dedicated nodes in the staging area. It transfers data from the DARTClient and can save it to a local storage system (e.g., the Lustre file system), stream it to remote sites (e.g., the Ewok cluster), or serve it directly from the staging area to other applications. One instance of the DARTServer can service multiple DARTClient instances in parallel. Further, the server can run in cooperative mode (i.e., multiple instances of the server cooperate to service the clients in parallel and to balance load). The DARTServer receives notification messages from the clients, schedules the requests, and initiates the data transfers from the clients in parallel. The server schedules and prioritizes the data transfers while the simulation is computing in order to overlap data transfers with computations, to maximize data throughput, and to minimize the overhead on the simulation.

DART is an asynchronous method available in ADIOS that can be selected by specifying the transport method in the external ADIOS XML configuration file as “DART”.

Figure 12. Select DART as a transport method in the configuration file example.

To make use of the DART transport, an application job also needs to run the DARTServer component together with the application. The server should be configured and started before the application as a separate job in the system. For example:

    aprun -n $SPROC ./dart_server -s $SPROC -c $PROC &> log.server &

Figure 13. Start the server component in a job file first.

The variable $SPROC represents the number of server instances to run, and the variable $PROC represents the number of application processes.
For example, if the job script runs a coupling scenario with two applications that run on 128 and 432 processors respectively, then the value of $PROC is 560. The ‘&’ character at the end of the line places the ‘aprun’ command in the background and allows the job script to continue and run the other applications. The server processes produce a configuration file, i.e., ‘conf’, that is used by the DARTClient component to connect to the servers. This file contains the ‘nid’ (network identifier) and ‘pid’ (process identifier) of the master server, which coordinates the client registration and discovery process. The job script should wait for the servers to start up and produce the ‘conf’ file, whose contents it can then export to environment variables, e.g., P2TNID and P2TPID. The clients can use these variables to connect to the server. Exporting the master server identifier through environment variables prevents the larger number of clients from accessing the file system at once.

    while [ ! -f conf ]; do
        echo "Waiting for servers to start-up"
        sleep 2s
    done

    while read line; do
        export set "${line}"
    done < conf

Figure 14. Wait for server start-up completion and export the configuration to environment variables.

The server component terminates automatically when the applications finish. The DARTClient components send an unregister message to the server before they finish execution, and the servers exit after they receive $PROC unregister messages.

The DART transport method is experimental and is not included with the public version of the ADIOS source code in this release; however, it is available for use on the XT4 and XT5 machines at ORNL.

5.2.4 (DIMES)

5.3 Other research methods at ORNL

5.3.1 MPI-CIO

MPI-IO defines a set of portable programming interfaces that enable multiple processes to have concurrent access to shared files [1]. It is often used to store and retrieve structured data in their canonical order.
The interfaces are split into two types: collective I/O and independent I/O. Collective functions require all processes to participate. Independent I/O, in contrast, requires no process synchronization. Collective I/O enables process collaboration to rearrange I/O requests for better performance [2,3].

The collective I/O method in ADIOS first defines MPI fileviews for all processes based on the data partitioning information provided in the XML configuration file. ADIOS also generates MPI-IO hints, such as data sieving and I/O aggregators, based on the access pattern and the underlying file system configuration. The hints are supplied to the MPI-IO library for further performance enhancement.

The syntax to describe the data-partitioning pattern in the XML file uses the global-bounds tag, which defines the global array size and the offsets of local subarrays in the global space. The global-bounds element contains one or more nested var elements, each specifying a local array that exists within the described dimensions and offset. Multiple global-bounds elements are permitted, and strictly local arrays can be specified outside the context of the global-bounds element.

As with other data elements, each of the attributes of the global-bounds element is provided by the adios_write call. The dimensions attribute is specified by all participating processes and defines how big the total global space is. This value must agree for all nodes. The offset attribute specifies the offset into this global space at which the local values are placed. The actual size of the local element is specified in the nested var element(s). For example, if the global-bounds dimension were 50 and the offset were 10, then the var(s) nested within the global-bounds would all be declared in a global array of 50 elements, with each local array starting at an offset of 10 from the start of the array.
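The relationship between the global dimension, the offset, and the size of a nested var can be made concrete with a small C sketch. It is illustrative only: the block_1d type and both functions are hypothetical names, and the sketch covers just the one-dimensional case from the example above.

```c
#include <stdint.h>

/* Hypothetical description of one local block inside a 1-D global array. */
typedef struct {
    uint64_t global_dim; /* "dimensions" attribute of global-bounds   */
    uint64_t offset;     /* "offset" attribute of global-bounds       */
    uint64_t local_dim;  /* size declared by the nested var element   */
} block_1d;

/* Returns 1 if the local block lies entirely inside the global array. */
static int block_fits(const block_1d *b)
{
    return b->offset <= b->global_dim &&
           b->local_dim <= b->global_dim - b->offset;
}

/* Global index of local element i, or -1 if the block does not fit
 * or i lies outside the local block. */
static int64_t global_index(const block_1d *b, uint64_t i)
{
    if (!block_fits(b) || i >= b->local_dim)
        return -1;
    return (int64_t)(b->offset + i);
}
```

For a global dimension of 50, an offset of 10 and a local size of 20, local element 0 maps to global index 10 and local element 19 maps to global index 29.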
If more than one var is nested within the global-bounds, they share the declaration of the bounds but are treated individually and independently for data storage purposes.

This research method is installed only on Jaguar at ORNL and is not part of the public release.

5.3.2 MPI-AIO

The initial implementation of the asynchronous MPI-IO method (MPI-AIO) is patterned after the MPI-IO method. Scheduled metadata commands are performed with the same serialization of MPI_File_open calls as shown in Figure 4.

The degree of I/O synchronicity depends on several factors. First, the ADIOS library must be built with a version of MPI that provides asynchronous I/O support through the MPI_File_iwrite, MPI_File_iread, and MPI_Wait calls. If asynchronous I/O is not available, the calls revert to synchronous (i.e., blocking) behavior identical to the MPI-IO method described in the previous section.

Another important factor is the amount of available ADIOS buffer space. In the MPI-IO method, data are transported and the ADIOS buffer allocation is reclaimed for subsequent use with calls to adios_close(). In the MPI-AIO method, the “close” process can be deferred until buffer allocation is needed for new data. However, if the buffer allocation is exceeded, the data must be synchronously transported before the application can proceed.

The deferral of data transport is key to effectively scheduling asynchronous I/O with a computation. In ADIOS version 1.2, the application explicitly signals when data transport must be complete through the placement of the adios_close() call. Later versions of ADIOS will perform I/O between adios_begin_calculation and adios_end_calculation calls, and complete I/O on adios_end_iteration calls.

This research module is not released in ADIOS 1.2.

6 ADIOS Read API

6.1 Introduction

With the read API, we can read in any variable, any sub-array of a variable, and the attributes.
There were three design choices when creating this API:

1. Groups in the BP files are handled separately

Most BP files contain a single group, and the variables and attributes in that group have paths, so they appear to be organized into a hierarchy. If a BP file contains more than one group, the second group can have a variable with the same path and name as a variable in the first group. We chose not to add the name of the group to the root of all paths because that would be inconvenient for the majority of BP files, which contain a single group.

2. Dimensions of arrays are reported differently for C and Fortran

When reading from a different language than the one used for writing, the storage order of the dimensions is the opposite. Instead of transposing multidimensional arrays in memory to order the data correctly at read time, the dimensions are simply reported in reverse order.

3. The C API returns structures filled with information while the Fortran API returns information in individual arguments

Since the BP file format is metadata rich, and the metadata is immediately accessible in the footer of the file, we can have an easy-to-use API with few functions. The open function returns information on the number of elements and timesteps and the list of groups in the file. The group open returns the list of variables and attributes in the group. The inquiry of a variable returns not just the type and dimensionality of a variable but also its global minimum and maximum, without reading in the content of the variable from the file.

The read API library has two versions. The MPI version should be used in parallel programs. Only the rank=0 process reads the footer of the file and broadcasts it to the other processes in adios_fopen(). File access is handled through MPI-IO functions. Sequential programs can use either of the two versions, but if you do not want a dependency on MPI, link your program with the non-MPI version, which uses POSIX I/O functions.
In this case, you need to compile your code with the -D_NOMPI option. There is no difference in performance or functionality between the two versions (in sequential applications).

Note that the write API contains the adios_read() function, which is useful for reading in data from the same number of processors as the data was written from, for example when handling checkpoint/restart data (see Section 3.1.2.5). However, if you need to read in from a different number of processors or to read in only a subset of an array variable, you need to use this read API.

6.2 Read C API description

Note: for Fortran, please read Section 6.4.

The sequence of reading in a variable from the BP file is

- open file
- open a group
- inquire the variable to get type and dimensions
- allocate memory for the variable
- read in variable (whole or part of it)
- free varinfo data structure
- close group
- close file

Example codes using the C API are

- examples/C/read_all/read_all.c
- examples/C/global-array/adios_read_global

6.2.1 adios_errmsg / adios_errno

    int adios_errno
    char * adios_errmsg()

If an error occurs during the call of a C API function, the function returns either NULL (instead of a pointer to an allocated structure) or a negative number. It also sets the integer adios_errno variable (the negative return value is actually -1 times the errno value). Moreover, it prints the error message into an internal buffer, which can be retrieved by adios_errmsg().

Note that adios_errmsg() returns the pointer to the internal buffer instead of duplicating the string, so refrain from writing anything into it. Also, only the last error message is available at any time.
6.2.2 adios_fopen

    ADIOS_FILE * adios_fopen (const char * fname, MPI_Comm comm)

ADIOS_FILE is a struct of

    uint64_t fh;              File handler
    int      groups_count;    Number of adios groups in file
    int      vars_count;      Number of variables in all groups
    int      attrs_count;     Number of attributes in all groups
    int      tidx_start;      First timestep in file, usually 1
    int      ntimesteps;      Number of timesteps in file. There is always at least one timestep
    int      version;         ADIOS BP version of file format
    uint64_t file_size;       Size of file in bytes
    int      endianness;      0: little endian, 1: big endian. You do not need to care about this.
    char **  group_namelist;  Names of the adios groups in the file (cf. groups_count)

The array for the list of group names is allocated in the function and is freed in the close function.

If you use the MPI version of the library, pass the communicator, which is the communicator of all processes that call the open function. The rank=0 process broadcasts the metadata to the other processes so that we avoid opening the file from many processes at once.

If you use the non-MPI version of the library, just pass an arbitrary integer value, which is not used at all.

6.2.3 adios_fclose

    int adios_fclose (ADIOS_FILE *fp)

You are expected to close a file when you do not need it anymore. This function releases a lot of internal memory structures.

6.2.4 adios_gopen / adios_gopen_byid

    ADIOS_GROUP * adios_gopen (ADIOS_FILE *fp, const char * grpname)
    ADIOS_GROUP * adios_gopen_byid (ADIOS_FILE *fp, int grpid)

You need to open a group to get access to its variables and attributes. You can open a group either by its name, as returned in the ADIOS_FILE struct's group_namelist list of strings, or by its index, which is the index of its name in this list of names.

You can have several groups open at the same time.
ADIOS_GROUP is a struct of

    uint64_t     gh;             Group handler
    int          grpid;          group index (0..ADIOS_FILE.groups_count-1)
    int          vars_count;     Number of variables in this adios group
    char **      var_namelist;   Variable names in a char* array
    int          attrs_count;    Number of attributes in this adios group
    char **      attr_namelist;  Attribute names in a char* array
    ADIOS_FILE * fp;             pointer to the parent ADIOS_FILE struct

The arrays for the lists of variable names and attribute names are allocated in the function and are freed in the group close function.

Note that one can modify the ADIOS_GROUP's namelists because they are not used in the discovery of the variables. However, in the index-based queries below, the index of a variable is its position in the original order of the list. If one sorts this list for ordered printouts, one needs either to remember the original indices of the variables or to identify the variables by name.

6.2.5 adios_gclose

    int adios_gclose (ADIOS_GROUP *gp)

You need to close the group when you do not need it anymore.

6.2.6 adios_inq_var / adios_inq_var_byid

    ADIOS_VARINFO * adios_inq_var (ADIOS_GROUP *gp, const char * varname)
    ADIOS_VARINFO * adios_inq_var_byid (ADIOS_GROUP *gp, int varid)

This function should be used if you want to discover the type and dimensionality of a variable or want to get the minimum/maximum/average/standard_deviation values without reading in the data. You can refer to the variable by its name (full path) in the ADIOS_GROUP struct's var_namelist or by its index in that list.

The ADIOS_VARINFO structure is allocated in the function, but there is no corresponding closing function. Therefore, you have to free the ADIOS_VARINFO* pointer yourself, using the adios_free_varinfo() function, when you do not need it anymore.
ADIOS_VARINFO is a struct of

    int        grpid;      group index (0..ADIOS_FILE.groups_count-1)
    int        varid;      variable index (0..ADIOS_GROUP.vars_count-1)
    enum ADIOS_DATATYPES type;   type of variable
    int        ndim;       number of dimensions, 0 for scalars
    uint64_t * dims;       size of each dimension
    int        timedim;    -1: variable has no timesteps in file, >=0: which dimension is time
    void *     value;      value of a scalar variable, NULL for array
    void *     gmin;       minimum value in an array variable
    void *     gmax;       maximum value of an array variable
    void *     gavg;       average value of an array variable
    void *     gstd_dev;   standard deviation value of an array variable
                           (over all timesteps, for scalars they are = value)
    void *     mins;       minimum per each timestep
    void *     maxs;       maximum per each timestep
    void *     avgs;       average per each timestep
    void *     std_dev;    standard deviation per each timestep
                           (array of timestep elements)
    struct ADIOS_HIST {
        uint32_t    num_breaks;    number of break points of the histogram
        double      min;           minimum of binning boundary
        double      max;           maximum of binning boundary
        double *    breaks;        break points of the histogram
        uint32_t ** frequencies;   histogram values per timestep
        uint32_t *  gfrequencies;  histogram values for all timesteps
    } *hist;               NULL if histogram binning interval was not formed correctly at write time

For complex numbers, the statistics in ADIOS_VARINFO, like gmin, gavg, std_devs etc., are of base type double. They also have an additional dimension that stores the statistics for the magnitude, the real part, and the imaginary part of the complex number individually. For example, gmin[0] holds the overall minimum value of the magnitude of the complex numbers; gmin[1] and gmin[2] contain the global minimums for the real and the imaginary parts, respectively.

6.2.7 adios_free_varinfo

    void adios_free_varinfo (ADIOS_VARINFO *cp)

Frees up the ADIOS_VARINFO* structure returned by adios_inq_var().
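The ndim and dims fields of ADIOS_VARINFO (Section 6.2.6) are enough to size a receive buffer for a whole variable before it is read. The following C sketch shows one way to do that arithmetic. The helper names are hypothetical and not part of the ADIOS API, and the element size is passed in as a plain number; in a real program it would presumably come from adios_type_size().

```c
#include <stdint.h>

/* Number of elements described by ndim and dims.
 * A scalar (ndim == 0) counts as one element; dims is not read then. */
static uint64_t element_count(int ndim, const uint64_t *dims)
{
    uint64_t n = 1;
    for (int i = 0; i < ndim; i++)
        n *= dims[i];
    return n;
}

/* Bytes needed to hold the whole variable, given the size in bytes
 * of one element of its type. */
static uint64_t buffer_bytes(int ndim, const uint64_t *dims,
                             uint64_t elem_size)
{
    return element_count(ndim, dims) * elem_size;
}
```

For a 4 x 5 x 6 array of 8-byte elements this gives 120 elements and a 960-byte buffer.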
6.2.8 adios_read_var / adios_read_var_byid

    int64_t adios_read_var (ADIOS_GROUP    * gp,
                            const char     * varname,
                            const uint64_t * start,
                            const uint64_t * count,
                            void           * data)

    int64_t adios_read_var_byid (ADIOS_GROUP    * gp,
                                 int              varid,
                                 const uint64_t * start,
                                 const uint64_t * count,
                                 void           * data)

This function is used to read in the content of a variable, or a subset of it. You need to allocate memory for receiving the data before calling this function.

The subset (or the entire set) is defined by the start and count in each dimension. The start and count arrays must have as many elements as the variable has dimensions (i.e., ADIOS_VARINFO.ndim). Start contains the starting offset for each dimension, and count contains the number of elements to read in that dimension. If you want to read in the entire variable, start should be an array of zeros and count should equal the dimensions of the variable.

Note that start and count refer to the number of elements in each dimension, not the number of bytes needed for storage. When allocating the data array, multiply the total number of elements by the size of one element. If you need to be generic in this calculation, you can use the adios_type_size() function to get the size of one element of a given type (cf. ADIOS_VARINFO.type).

6.2.9 adios_get_attr / adios_get_attr_byid

    int adios_get_attr (ADIOS_GROUP           * gp,
                        const char            * attrname,
                        enum ADIOS_DATATYPES  * type,
                        int                   * size,
                        void                 ** data)

    int adios_get_attr_byid (ADIOS_GROUP           * gp,
                             int                     attrid,
                             enum ADIOS_DATATYPES  * type,
                             int                   * size,
                             void                 ** data)

This function retrieves an attribute, including its type, memory size, and value. An attribute can only be a scalar value or a string. Memory is allocated in the function to store the value. The allocated size is returned in the size argument. This function usually does not read the file. The attribute's value is stored in the footer and is already in memory after the file is opened.
However, an attribute can refer to a scalar (or string) variable too. In this case, this function calls adios_read_var internally, so the file will be accessed to read in that scalar.

6.2.10 adios_type_to_string

    const char * adios_type_to_string (enum ADIOS_DATATYPES type)

This function returns the name of a given type.

6.2.11 adios_type_size

    int adios_type_size (enum ADIOS_DATATYPES type, void *data)

This function returns the memory size of one data element of an ADIOS type. If the type is adios_string and the second argument is the string itself, it returns strlen(data)+1. For other types, data is not used and the function returns the size occupied by one element.

6.3 Time series analysis API

Description: ADIOS provides APIs to perform time-series analysis like correlation and covariance on statistics collected in the BP file. As described in Section 6.2.6, adios_inq_var populates characteristics, such as the minimum, maximum, average, and standard deviation values of an array for each timestep. The following analysis functions can be used with ADIOS_VARINFO objects previously defined. The analysis can be performed only for a variable that has a time index.

6.3.1 adios_stat_cor / adios_stat_cov

This function calculates Pearson correlation/covariance of the characteristic data of vix and characteristic data of viy.
    double adios_stat_cor (ADIOS_VARINFO * vix,
                           ADIOS_VARINFO * viy,
                           char          * characteristic,
                           uint32_t        time_start,
                           uint32_t        time_end,
                           uint32_t        lag)

    double adios_stat_cov (ADIOS_VARINFO * vix,
                           ADIOS_VARINFO * viy,
                           char          * characteristic,
                           uint32_t        time_start,
                           uint32_t        time_end,
                           uint32_t        lag)

Required:
• vix - an ADIOS_VARINFO object

Optional:
• viy - either an ADIOS_VARINFO object or NULL
• characteristic - can be any of the following pre-computed statistics: "minimum", "maximum", "average", or "standard deviation" (alternatively, "min", "max", "avg", or "std_dev" can be given)
• time_start - specifies the start time from which correlation/covariance should be performed
• time_end - specifies the end time up to which correlation/covariance should be performed. time_start and time_end should be within the time bounds of vix and viy, with time_start < time_end. If time_start and time_end are both 0, the entire range of timesteps is considered. In this case, vix and viy should have the same number of timesteps.
• lag - if viy is NULL and lag is given, correlation is performed between the data specified by vix and vix shifted by 'lag' timesteps. If viy is not NULL, lag is ignored.

6.4 Read Fortran API description

The Fortran API does not deal with the structures of the C API; rather, it requires several arguments in the function calls. The functions are all implemented as subroutines, like the write Fortran API, and the last argument is an integer variable that stores the error code output of each function (0 meaning successful operation).

An example code can be found in the source distribution as tests/bp_read/bp_read_f.F90.

The most important thing to note is that some functions need integer*8 (scalar or array) arguments. Passing an integer*4 array from your code leads to fatal errors. Please double check the arguments of the function calls.
Due to the lack of structures, and because the Fortran API does not allocate memory for them, you have to inquire about the file after opening it and inquire about the group after opening it. You also have to inquire about an attribute to determine its memory size and allocate space for it before retrieving it. Where the API function returns a list of names (inquiry file or inquiry group), you have to provide enough space for them using the counts returned by the preceding open call.

Here is the list of the Fortran subroutines. The word GENERIC indicates that you can use that function with any data type at the indicated argument. Since Fortran 90 does not allow defining functions that can take any type of argument, we do not provide an F90 module for this API. The functions are actually defined in C, and due to the lack of compiler checking, you can pass any type of array or variable where a GENERIC array is denoted.

subroutine adios_errmsg (msg)
    character(*), intent(out) :: msg
end subroutine

subroutine adios_fopen (fp, fname, comm, groups_count, err)
    integer*8,    intent(out) :: fp
    character(*), intent(in)  :: fname
    integer,      intent(in)  :: comm
    integer,      intent(out) :: groups_count
    integer,      intent(out) :: err
end subroutine

subroutine adios_fclose (fp, err)
    integer*8, intent(in)  :: fp
    integer,   intent(out) :: err
end subroutine

subroutine adios_inq_file (fp, vars_count, attrs_count, tstart, ntsteps, gnamelist, err)
    integer*8, intent(in)  :: fp
    integer,   intent(out) :: vars_count
    integer,   intent(out) :: attrs_count
    integer,   intent(out) :: tstart
    integer,   intent(out) :: ntsteps
    character(*), dimension(*), intent(inout) :: gnamelist
    integer,   intent(out) :: err
end subroutine

subroutine adios_gopen (fp, gp, grpname, vars_count, attrs_count, err)
    integer*8,    intent(in)  :: fp
    integer*8,    intent(out) :: gp
    character(*), intent(in)  :: grpname
    integer,      intent(out) :: vars_count
    integer,      intent(out) :: attrs_count
    integer,      intent(out) :: err
end subroutine

subroutine adios_gclose (gp, err)
    integer*8, intent(in)  :: gp
    integer,   intent(out) :: err
end subroutine

subroutine adios_inq_group (gp, vnamelist, anamelist, err)
    integer*8, intent(in) :: gp
    character(*), dimension(*), intent(inout) :: vnamelist
    character(*), dimension(*), intent(inout) :: anamelist
    integer, intent(out) :: err
end subroutine

subroutine adios_inq_var (gp, varname, vartype, ndim, dims, timedim, err)
    integer*8,    intent(in)  :: gp
    character(*), intent(in)  :: varname
    integer,      intent(out) :: vartype
    integer,      intent(out) :: ndim
    integer*8, dimension(*), intent(out) :: dims
    integer,      intent(out) :: timedim
    integer,      intent(out) :: err
end subroutine

subroutine adios_read_var (gp, varname, start, count, data, read_bytes)
    integer*8,    intent(in) :: gp
    character(*), intent(in) :: varname
    integer*8, dimension(*), intent(in)    :: start
    integer*8, dimension(*), intent(in)    :: count
    GENERIC,   dimension(*), intent(inout) :: data
    integer*8, intent(out) :: read_bytes   ! read_bytes < 0 indicates error
end subroutine

subroutine adios_get_varminmax (gp, varname, value, gmin, gmax, mins, maxs, err)
    integer*8,    intent(in)  :: gp
    character(*), intent(in)  :: varname
    GENERIC,      intent(out) :: value
    GENERIC,      intent(out) :: gmin
    GENERIC,      intent(out) :: gmax
    GENERIC, dimension(*), intent(inout) :: mins
    GENERIC, dimension(*), intent(inout) :: maxs
    integer,      intent(out) :: err
end subroutine

subroutine adios_inq_attr (gp, attrname, attrtype, attrsize, err)
    integer*8,    intent(in)  :: gp
    character(*), intent(in)  :: attrname
    integer,      intent(out) :: attrtype
    integer,      intent(out) :: attrsize
    integer,      intent(out) :: err
end subroutine

subroutine adios_get_attr_int1 (gp, attrname, attr, err)
    integer*8,    intent(in) :: gp
    character(*), intent(in) :: attrname
    GENERIC, dimension(*), intent(inout) :: attr
    integer,      intent(out) :: err
end subroutine

6.5 Compiling and linking applications

In a C code, include the adios_read.h header file. In a Fortran 90 code, you do not need to include anything.
It is strongly recommended to double check the integer parameters because the read API expects integer*8 arguments in several places; providing a default integer instead will break your code, and debugging it proves to be very difficult.

If you want to use the MPI version of the library, link your (C or Fortran) application with -ladiosread. If you want to use the non-MPI version of the library, you need to compile your code with the -D_NOMPI option and link your application with -ladiosread_nompi.

7 BP file format

7.1 Introduction

This chapter describes the file structure of BP, which is the ADIOS native binary file format, to aid in understanding ADIOS performance issues and how BP files are converted to other scientific file formats, such as netCDF and HDF5.

To avoid the 2-gigabyte file size limitation that comes with a signed 32-bit offset in the internal structure, the BP format uses an unsigned 64-bit datatype as the file offset. Therefore, it is possible to write BP files that exceed 2 gigabytes on platforms that have large file support.

Because the ADIOS read routines adapt to the endianness indication in the file footer, BP files are easily portable across different machines (e.g., between the Cray-XT4 and BlueGene).

To aid in data selection, we have a low-overhead concept of data characteristics to provide an efficient, inexpensive set of attributes that can be used to identify data sets without analyzing large data content.

As shown in Figure 15, the BP format comprises a series of process groups and the file footer. The remainder of this chapter describes each component in detail and helps the user to better understand (1) why BP is a self-describing and metadata-rich file format and (2) why it can achieve high I/O performance on different machine infrastructures.

Figure 15.
BP file structure

7.2 Footer

One known limitation of the NetCDF format is that the file contents are stored in a header that is exactly big enough for the information provided at file creation. Any changes to the length of that data will require moving data. To avoid this cost, we choose to employ a footer index instead. We place our version identifier and the offset to the beginning of the index as the last few bytes of our file, making it simple to find the index information and to add new and different data to our files without affecting any data already written.

7.2.1 Version

We reserve 4 bytes for the file version, in which the highest bit indicates endianness. Because ADIOS uses a fixed-size type for data, there is no need to store type size information in the footer.

7.2.2 Offsets of indices

In BP format, we store three 8-byte file offsets right before the version word, which allows users or developers to quickly seek to any of the index tables for process groups, variables, or attributes.

7.2.3 Indices

7.2.3.1 Characteristics

Before we dive into the structures of the three index tables mentioned earlier, let's first take a look at what a characteristic means in terms of the BP file format. To be able to make a summary inspection of the data to determine whether it contains the feature of greatest interest, we developed the idea of data characteristics. The idea of data characteristics is to collect some simple statistical and/or analytical data during the output operation, or later, for use in identifying the desired data sets. Simple statistics like array minimum and maximum values can be collected without extra overhead as part of the I/O operation. Other, more complex analytical measures, like standard deviations or specialized measures particular to the science being performed, require more processing. As part of our BP format, we store these values not only as part of the data payload, but also in our index.
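The footer layout described in Sections 7.2.1 and 7.2.2 can be illustrated with a small decoder for the last 28 bytes of a file (three 8-byte offsets followed by the 4-byte version word). This is only a sketch under stated assumptions: the order of the three offsets (process groups, variables, attributes) is taken from the order in which the text names them, the bytes are assumed to be in the reader's native byte order, and the remaining 31 bits of the version word are assumed to hold the version number. The names bp_footer and decode_footer are hypothetical and are not part of ADIOS.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 3 x 8-byte index offsets + 4-byte version word, as described in 7.2 */
#define BP_FOOTER_BYTES 28

/* Hypothetical in-memory view of the footer (not an ADIOS type). */
struct bp_footer {
    uint64_t pg_index_offset;     /* assumed order: process groups first */
    uint64_t vars_index_offset;   /* then variables */
    uint64_t attrs_index_offset;  /* then attributes */
    uint32_t version;             /* version word without the highest bit */
    int      endian_flag;         /* highest bit of the version word */
};

/* Decode the last BP_FOOTER_BYTES bytes of a file.
 * Assumes the bytes are already in the host's byte order. */
void decode_footer(const unsigned char *tail, struct bp_footer *out)
{
    uint32_t word;
    assert(tail != NULL && out != NULL);
    memcpy(&out->pg_index_offset,    tail,      8);
    memcpy(&out->vars_index_offset,  tail + 8,  8);
    memcpy(&out->attrs_index_offset, tail + 16, 8);
    memcpy(&word,                    tail + 24, 4);
    out->endian_flag = (int)(word >> 31);      /* highest bit */
    out->version     = word & 0x7FFFFFFFu;     /* remaining 31 bits */
}
```

A reader would seek to 28 bytes before the end of the file, read that many bytes, and pass them to decode_footer; a real reader would also have to byte-swap the fields when the endianness flag does not match the host.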
7.2.3.2 PG Index table

As shown in Figure 16, the process group (PG) index table holds the count and the total length of all the PGs as its first two entries. The rest of the table contains a set of information for each PG: the group name, the process ID, and the time index. The process ID specifies which process wrote the group; it is the rank value in the communicator if the MPI method is used. Most importantly, there is a file-offset entry for each PG, allowing a reader to skip quickly through the file in units of process groups.

Figure 16. Group index table

7.2.3.3 Variables index table

The variables index table is composed of the total count of variables in the BP file, the size of the variables index table, and a list of variable records. Each record contains the size of the record and the basic metadata to describe the variable. As shown in Figure 17, the metadata include the name of the variable, the name of the group the variable is associated with, the data type of the variable, and a series of characteristic features. Each characteristic entry contains an offset value, which points to a particular occurrence of the variable in the BP file. For instance, if n processes write out the variable "d" per time step, and m iterations have been completed during the whole simulation, then the variable will be written (m × n) times in the BP file that is produced. Accordingly, there will be the same number of elements in the list of characteristics. In this way, we can quickly retrieve the single dataset for all time steps or for any other selection of time steps. This flexibility and efficiency also apply to a scenario in which a portion of the records needs to be collected from a certain group of processes.

Figure 17.
Variables index table

7.2.3.4 Attributes index table

Since an attribute can be considered to be a special type of variable, its index table in BP format is organized in the same way as the variables index table and therefore supports the same types of features mentioned in the previous sections.

7.3 Process Groups

One of the major concepts in BP format is the "process group" or PG. The BP file format comprises a series of PG entries and the BP file footer. Each process group is the entire self-contained output from a single process and is written out independently into a contiguous disk space. In that way, we can enhance parallelism and reduce coordination among processes in the same communication group. The data diagram in Figure 18 illustrates the detailed content of each PG.

Figure 18. Process group structure

7.3.1 PG header

7.3.1.1 Unlimited dimension

BP format allows users to define an unlimited dimension, which is specified as the time-index in the XML file. Users can define variables having a dimension with undefined length, along which the variable can grow. Because a PG is a self-contained, independent data structure, the dataset in the local space of each time step is not reconstructed at write time, either across processes or across time steps. Theoretically, PGs can be appended indefinitely; they can be added one after another no matter how many processes or time steps take place during the simulation. This is one reason ADIOS is able to achieve high I/O performance.

7.3.1.2 Transport methods

One of the advantages of organizing output in terms of groups is that all the variables can be categorized based on their I/O patterns and logical relationships. This gives each group the flexibility to choose the optimized transport method according to the simulation environment and underlying hardware configuration, or to change the transport methods used for a performance study, without even changing the source code. In PG header structure, each entry in the method list has a method
In PG header structure, each entry in the method list has a method 48 ID and method parameters, such as system-‐tuning parameters or underneath driver selection. 7.3.2 Vars list 7.3.2.1 Var header 7.3.2.1.1 Dimensions structure Internal to bp is sufficient information to recreate any global structure and to place the local data into the structure. In the case of a global array, each process writes the size of the global array dimensions, specifies the local offsets into each, and then writes the local data, noting the size in each dimension. On conversion to another format, such as HDF5, this information is used to create hyperslabs for writing the data into the single, contiguous space. Otherwise, it is just read back in and used to note where the data came from. In this way, we can enhance parallelism and reduce coordination. All of our parallel writes occur independently unless the underlying transport specifically requires collective operations. Even in those cases, the collective calls are only for a full buffer write (assuming the transport was written appropriately) unless there is insufficient buffer space. As shown in Figure 18, the dimension structure contains a time index flag, which indicates whether this variable has an unlimited time dimension. Var_id is used to retrieve the dimension value if the dimension is defined as variable in the XML file; otherwise, the rank value is taken as the array dimension. 7.3.2.2 Payload Basic statistical characteristics give users the advantage for quick data inspection and analysis. In Figure 18, redundant information about characteristics is stored along with variable payload so that if the characteristics part in the file footer gets corrupted, it can still be recovered quickly. Currently, only simple statistical traits are saved in the file, but the characteristics structure will be easily expanded or modified according to the requirements of scientific applications or the analysis tools. 
7.3.3 Attributes list

The layout of the attributes list (see Figure 19) is very similar to that of the variables. However, instead of containing dimensional structures and physical data load, the attribute header has an is_var flag, which indicates either that the value of the attribute is referenced from a variable by looking up the var_id in the same group or that it is a static value defined in the XML file.

Figure 19. Attribute entry structure

8 Utilities

8.1 adios_lint

We provide a verification tool, called adios_lint, which comes with ADIOS 1.2. It can help users to eliminate semantic errors and to verify the integrity of the XML file. Use of adios_lint is very straightforward: enter the adios_lint command followed by the config file name.

8.2 bpls

The bpls utility is used to list the content of a BP file or to dump arbitrary subarrays of a variable. By default, it lists the variables in the file, including the type, name, and dimensionality. Here is the description of additional options (use bpls -h to print help on all options for this utility).

-l    Displays the global statistics associated with each array (minimum, maximum, average, and standard deviation) and the value of each scalar. Note that the detailed listing does not have extra processing overhead since this information is available in the footer of the BP file.
-t    When added to the -l option, displays the statistics associated with the variables for every timestep.
-p    Dumps the histogram binning intervals and their corresponding frequencies, if histograms were enabled while writing the BP file. This option generates a ".gpl" file that can be given to the 'gnuplot' program as input.
-a    Lists attributes besides the variables.
-A    Lists only the attributes.
-r    Sorts the full listing by names. Name masks to list only a subset of the variables/attributes can be given as with the ls command or as regular expressions (with the -e option).
-v    Verbose.
It prints some information about the file in the beginning before listing the variables.
-S    Dumps byte arrays as strings instead of with the default numerical listing. 2D byte arrays are printed as a series of strings.

Since bpls is written in C, the order of dimensions is reported with row-major ordering; i.e., if a Fortran application wrote an NxM 2D variable, bpls reports it as an MxN variable.

-d    Dumps the values of the variables. A subset of a variable can be dumped by using start and count values for each dimension with the -s and -c options, e.g., -s "10,20,30" -c "10,10,10" reads in a 10x10x10 sub-array of a variable starting from the (10,20,30) element. Indices start from 0. As in Python, -1 denotes the last element of an array, and negative values are counted backward from the end. Thus, -s "-1,-1" -c "1,1" reads in the very last element of a 2D array, and -s "0,0" -c "1,-1" reads in one row of a 2D array. Likewise, -s "1,1" -c "-2,-2" reads in the variable without the edge elements (row 0, column 0, last row, and last column).

Time is handled as an additional dimension; i.e., if a 2D variable is written several times into the same BP file, bpls lists it as a 3D array with the time dimension being the first (slowest changing) dimension.

In the example below, a 4-process application wrote a 4x4 array (each process wrote a 2x2 subset) with values from 0 to 15 once under the name /var/int_xy and 3 times under the name /var/int_xyt.
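The -s/-c conventions described above (negative values counted from the end) can be expressed as a small C function. This is a sketch inferred only from the examples given in this section, not taken from the bpls source; resolve_selection is a hypothetical name, and the error handling is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: resolve one dimension of a -s/-c style selection.
 *   dim   : length of the dimension
 *   start : non-negative index, or negative to count from the end
 *           (-1 is the last element)
 *   count : positive number of elements, or negative to stop relative
 *           to the end (-1 means "up to and including the last element")
 * On success stores the first index and element count and returns 0;
 * returns -1 if the selection falls outside the dimension. */
int resolve_selection(int64_t dim, int64_t start, int64_t count,
                      int64_t *first, int64_t *n)
{
    assert(first != NULL && n != NULL);

    int64_t st = (start < 0) ? dim + start : start;
    int64_t ct = (count < 0) ? dim - st + count + 1 : count;

    if (st < 0 || st >= dim || ct < 1 || st + ct > dim)
        return -1;

    *first = st;
    *n = ct;
    return 0;
}
```

With dim = 4, start = 1 and count = -2 resolve to first index 1 and 2 elements, which is the "without the edge elements" case; start = -1 and count = 1 resolve to the single last element.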
$ bpls -latv g_2x2_2x2_t3.bp
File info:
  of groups:     1
  of variables:  11
  of attributes: 7
  time steps:    3 starting from 1
  file size:     779 KB
  bp version:    1
  endianness:    Little Endian
Group genarray:
  integer  /dimensions/X             scalar = 4
  integer  /dimensions/Y             scalar = 4
  integer  /info/nproc               scalar = 4
  string   /info/nproc/description   attr   = "Number of writers"
  integer  /info/npx                 scalar = 2
  string   /info/npx/description     attr   = "Number of processors in x dimension"
  integer  /info/npy                 scalar = 2
  string   /info/npy/description     attr   = "Number of processors in y dimension"
  integer  /var/int_xy               {4, 4} = 0 / 15
  string   /var/int_xy/description   attr   = "2D array with 2D decomposition"
  integer  /var/int_xyt              {3, 4, 4} = 0 / 15
  string   /var/int_xyt/description  attr   = "3D array with 2D decomposition with time in 3rd dimension"

Figure 20. bpls utility

The content of /var/int_xy can be dumped with

$ bpls g_2x2_2x2_t3.bp -d -n 4 var/int_xy
  integer  /var/int_xy  {4, 4}
    (0,0)    0 1 2 3
    (1,0)    4 5 6 7
    (2,0)    8 9 10 11
    (3,0)    12 13 14 15

The "central" 2x2 subset of /var/int_xy can be dumped with

$ bpls g_2x2_2x2_t3.bp -d -s "1,1" -c "2,2" -n 2 var/int_xy
  integer  /var/int_xy  {4, 4}
    slice (1:2, 1:2)
    (1,1)    5 6
    (2,1)    9 10

The last element of /var/int_xyt for each timestep can be dumped with

$ bpls g_2x2_2x2_t3.bp -d -s "0,-1,-1" -c "-1,1,1" -n 1 var/int_xyt
  integer  /var/int_xyt  {3, 4, 4}
    slice (0:2, 3:3, 3:3)
    (0,3,3)    15
    (1,3,3)    15
    (2,3,3)    15

8.3 bpdump

The bpdump utility enables users to examine the contents of a BP file in a form closer to the actual BP format than bpls provides and to display all the contents or selected variables in that format on the standard output. Each writing process' output is printed separately.

It dumps the BP file content, including the indexes for all the process groups, variables, and attributes, followed by the variables and attributes list of individual process groups (see Figure 21).
bpdump [-d var|--dump var]
========================================================
Process Groups Index:
Group: temperature
    Process ID: 0
    Time Name:
    Time: 1
    Offset in File: 0
========================================================
Vars Index:
Var (Group) [ID]: /NX (temperature) [1]
    Datatype: integer
    Vars Characteristics: 20
    Offset(46) Value(10)
Var (Group) [ID]: /size (temperature) [2]
    Datatype: integer
    Vars Characteristics: 20
    Offset(77) Value(20)
    …
Var (Group) [ID]: /rank (temperature) [3]
    Datatype: integer
    Vars Characteristics: 20
    Offset(110) Value(0)
    …
Var (Group) [ID]: /temperature (temperature) [4]
    Datatype: double
    Vars Characteristics: 20
    Offset(143) Min(1.000000e-01) Max(9.100000e+00) Dims (l:g:o): (1:20:0,10:10:0)
    …
========================================================
Attributes Index:
Attribute (Group) [ID]: /recorded-date (temperature) [5]
    Datatype: string
    Attribute Characteristics: 20
    Offset(363) Value(Sep-19-2008)
    …

Figure 21. bpdump utility

9 Converters

To make BP files compatible with popular file formats, we provide a series of converters to convert BP files to HDF5, NetCDF, or ASCII. As long as users give the required schema via the configuration file, the converter tools currently in ADIOS can translate intermediate BP files to the expected HDF5, NetCDF, or ASCII formats.

9.1 bp2h5

This converter, as indicated by its name, converts BP files into HDF5 files. Therefore, the same postprocessing tools can be used to analyze or visualize the converted HDF5 files, which have the same data schema as the original ones. The converter can match the row-based or column-based memory layout for datasets inside the file, depending on which language the source code is written in. If the XML file specifies global-bounds information, the individual sub-blocks of the dataset from different process groups will be merged into one global dataset in the HDF5 file.
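The row-based versus column-based layout matching mentioned above rests on a simple identity: a column-major (Fortran-style) N x M array occupies memory exactly like a row-major (C-style) M x N array with the two indices swapped. The sketch below only illustrates that identity; the two function names are hypothetical and say nothing about how bp2h5 is implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of element (i, j) in a column-major array with n_rows rows
 * (0-based indices, first index varies fastest). */
size_t colmajor_offset(size_t i, size_t j, size_t n_rows)
{
    assert(i < n_rows);
    return i + j * n_rows;
}

/* Offset of element (i, j) in a row-major array with n_cols columns
 * (0-based indices, last index varies fastest). */
size_t rowmajor_offset(size_t i, size_t j, size_t n_cols)
{
    assert(j < n_cols);
    return i * n_cols + j;
}
```

For every valid (i, j), colmajor_offset(i, j, N) equals rowmajor_offset(j, i, N), which is why a tool written in C can report a Fortran N x M variable as M x N without moving any data.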
9.2 bp2ncd

The bp2ncd converter is used to translate BP files into NetCDF files. In Chap. 5, we describe the time-index as an attribute for adios-group. If the variable is time-based, one of its dimensions needs to be specified by this time-index variable, which is defined as an unlimited dimension in the file into which it is to be converted.

A NetCDF dimension has a name and a length. If a constant value is declared as a dimension value, the dimension in NetCDF will be named varname_n, in which varname is the name of the variable and n is the nth dimension for that variable. To make the name for the dimension value more meaningful, users can also declare the dimension value as an attribute, whose name can be picked up by the converter and used as the dimension name.

Based on the global bounds information given in a BP file, the converter can also reconstruct the individual pieces from each process group and create the global space array in NetCDF.

A final word about editing the XML file: the name string can contain only letters, numbers, or underscores ("_"). Therefore, the attribute or variable name should conform to this rule.

9.3 bp2ascii

Sometimes, scientists want to extract one variable with all the time steps or want to extract several variables at the same time steps and store the resulting data in ASCII format. The bp2ascii converter tool allows users to accomplish those tasks.

    bp2ascii bp_filename -v x1 … xn [-c/-r] -t m,n

-v    specifies the variables that need to be printed out in the ASCII file
-c    prints variable values for all the time steps in columns
-r    prints variable values for all the time steps in rows
-t    prints variable values for time steps m to n; if not defined, all the time steps will be printed out

9.4 Parallel Converter Tools

Currently, all of the converters mentioned above can only parse BP files sequentially. We will work on developing parallel versions of all of the converters for improved performance.
As a result, the extra conversion cost to translate BP into the expected file format can be unnoticeable compared with the file transfer time.

10 Group read/write process

In ADIOS 1.2, we provide a Python script, which takes a configuration file name as an input argument and produces a series of preprocessing files corresponding to the individual adios-groups in the XML file. Depending on which language (C or Fortran) is specified in the XML, the Python script either generates files gwrite_groupname.ch and gread_groupname.ch for C or files with extension .fh for Fortran. These files contain the size calculation for the group and the automatically generated adios_write calls for all the variables defined inside the adios-group. One needs to use only the "#include filename.ch" statement in the source code between the pair of adios_open and adios_close.

Users either type the following command line or incorporate it into a Makefile:

    python gpp.py

10.1 Gwrite/gread/read

Below are a few examples of the mapping from a var element to adios_write/read.

In adios-group "weather", we have a variable declared in the following forms:

1) When the python command is executed, two files are produced, gwrite_weather.ch and gread_weather.ch. The gwrite_weather.ch file contains

    adios_write (adios_handle, "temperature", t);

while gread_weather.ch contains

    adios_read (adios_handle, "temperature", t_read).

2) In this case, only the adios_write statement is generated in gwrite_weather.ch. The adios_read statement is not generated because the value of attribute read is set to "no".

3)
    adios_write (adios_handle, "temperature", temperature)
    adios_read (adios_handle, "temperature", t_read).

4)
    adios_write (adios_handle, "temperature", t)
    adios_read (adios_handle, "temperature", temperature)

10.2 Add conditional expression

Sometimes, the adios_write routines are not simply written out one after another. There might be some conditional expressions or loop statements.
The following example will show you how to address this type of issue via XML editing.

Rerun the python command; the following statements will be generated in gwrite_weather.ch:

    if (mype==0) {
        adios_write (adios_handle, "temperature", t)
    }

gread_weather.ch has the same conditional expression added.

10.3 Dependency in Makefile

Since we include the header files in the source, users need to include the header files as a part of the dependency rules in the Makefile.

11 C Programming with ADIOS

This chapter focuses on how to integrate ADIOS into the users' source code in C and how to write into separate files or a shared file from multiple processes in the same communication domain. These examples can be found in the source distribution under the examples/C/manual directory.

In the following steps we will create programs that use ADIOS to write
- a metadata-rich BP file per process
- one large BP file with the arrays from all processes
- N files from P processes, where N << P
- the data of all processes as one global array into one file
- a global-array over several timesteps into one file

The strength of the componentization of I/O in ADIOS allows us to switch between the first two modes by selecting a different transport method in a configuration file and to run the program without recompiling it.

11.1 Non-ADIOS Program

The starting programming example, shown in Figure 22, writes a double-precision array t with size of NX into a separate file per process (the array is uninitialized in the examples).

    #include <stdio.h>
    #include "mpi.h"
    #include "adios.h"

    int main (int argc, char ** argv)
    {
        char filename [256];
        int rank;
        int NX = 10;
        double t[NX];
        FILE * fp;

        MPI_Init (&argc, &argv);
        MPI_Comm_rank (MPI_COMM_WORLD, &rank);

        /* one output file per process, named after the rank */
        sprintf (filename, "restart_%5.5d.dat", rank);
        fp = fopen (filename, "w");
        fwrite (&NX, sizeof(int), 1, fp);
        fwrite (t, sizeof(double), NX, fp);
        fclose (fp);

        MPI_Finalize ();
        return 0;
    }

Figure 22. Original program (examples/C/manual/1_nonadios_example.c).
$ mpirun -np 4 1_nonadios_example
$ ls restart_*
restart_00000.dat  restart_00001.dat  restart_00002.dat  restart_00003.dat

11.2 Construct an XML File

In the example above, the program is designed to write a file for each process. There is a double-precision one-dimensional array called "t". We also need to declare and write all variables that are used for dimensions (i.e., NX in our example). Therefore, our configuration file is constructed as shown in Figure 23.

/* config.xml */

Figure 23. Example config.xml file

11.3 Generate .ch file(s)

The adios_group_size function and a set of adios_write functions can be automatically generated in the gwrite_temperature.ch file by using the following python command:

    gpp.py config.xml

The generated gwrite_temperature.ch file is shown in Figure 24.

    /* gwrite_temperature.ch */
    adios_groupsize = 4 \
                    + 8 * (NX);
    adios_group_size (adios_handle, adios_groupsize, &adios_totalsize);
    adios_write (adios_handle, "NX", &NX);
    adios_write (adios_handle, "temperature", t);

Figure 24. Example gwrite_temperature.ch file

11.4 POSIX transport method (P writers, P subfiles + 1 metadata file)

For our first program, we simply translate the program of Figure 22 so that all of the I/O operations are done with ADIOS routines. The POSIX method can be used to write out a separate file for each processor, as in Figure 25. The changes to the original example are highlighted. We need to use an MPI communicator in adios_open() because the subprocesses need to know the rank to create unique subfile names.
    /* write a separate file for each process by using POSIX */
    #include <stdio.h>
    #include "mpi.h"
    #include "adios.h"
    int main (int argc, char ** argv)
    {
        char filename [256];
        int rank;
        int NX = 10;
        double t[NX];

        /* ADIOS variables declarations for matching gwrite_temperature.ch */
        int adios_err;
        uint64_t adios_groupsize, adios_totalsize;
        int64_t adios_handle;
        MPI_Comm comm = MPI_COMM_WORLD;

        MPI_Init (&argc, &argv);
        MPI_Comm_rank (MPI_COMM_WORLD, &rank);
        sprintf (filename, "restart.bp");
        adios_init ("config.xml");
        adios_open (&adios_handle, "temperature", filename, "w", &comm);
        #include "gwrite_temperature.ch"
        adios_close (adios_handle);
        adios_finalize (rank);
        MPI_Finalize ();
        return 0;
    }

Figure 25. Example ADIOS program to write P files from P processors (examples/C/manual/2_adios_write.c)

The POSIX method makes a directory to store all subfiles. To name the directory, it appends ".dir" to the name of the file, e.g., restart.bp.dir. For each subfile, it appends the rank of the process (according to the supplied communicator) to the name of the file (here restart.bp), so, for example, process 2 will write the file restart.bp.dir/restart.bp.2. To facilitate reading of subfiles, the method also generates a global metadata file (restart.bp) which tracks all the variables in each subfile.

    $ mpirun -np 4 2_adios_write
    $ ls restart.bp
    restart.bp

    restart.bp.dir:
    restart.bp.0  restart.bp.1  restart.bp.2  restart.bp.3

    $ bpls -lad restart.bp.dir/restart.bp.2 -n 10
      integer  /NX                       scalar = 10
      double   /temperature              {10} = 20 / 29
        (0)    20 21 22 23 24 25 26 27 28 29
      string   /temperature/description  attr = "Temperature array"

11.5 MPI-IO transport method (P writers, 1 file)

Based on the same group description in the configuration file and the header file (.ch) generated by the python script, we can switch among different transport methods without changing or recompiling the source code.
Changing one entry in the config.xml file, the transport method, switches from POSIX to MPI. The MPI communicator is passed as an argument of adios_open(). Because it is already defined as MPI_COMM_WORLD in the POSIX example, the program does not need to be modified or recompiled.

    $ mpirun -np 4 2_adios_write
    $ ls restart.bp
    restart.bp
    $ bpls -l restart.bp
    Group temperature:
      integer  /NX           scalar = 10
      double   /temperature  {10} = 0 / 39

There are several ways to verify the binary results. We can either use bpdump to display the content of the file or use one of the converters (bp2ncd, bp2h5, or bp2ascii) to produce the user's preferred file format (NetCDF, HDF5 or ASCII, respectively) and use its dump utility to print the content to standard output.

bpls cannot list the individual arrays written by the processes because the generic read API it uses does not support this (it can see only one of them, as the size of /temperature suggests in the listing above). It is suggested to use global arrays (see the example below) to present the data written by many processes as one global array, which can then be listed and any slice of which can be read or dumped. This example, however, can be used for checkpoint/restart files where the application only reads data on the same number of processes as were used to write it (see the next example).

The transparent switch between the POSIX and MPI methods allows the user to choose the better performing method for a particular system without changing the source code.

11.6 Reading data from the same number of processors

Now let's move to examples of how to read the data from BP or other files. Assuming that we still use the same configuration file shown in Figure 23, the following steps illustrate how to change the code to read a variable.

1. Add another variable, adios_buf_size, specifying the size for the read.
2. Call adios_open with "r" (read only) mode.
3. Insert #include "gread_temperature.ch".

    /* Read in data on same number of processors */
    #include <stdio.h>
    #include "mpi.h"
    #include "adios.h"
    int main (int argc, char ** argv)
    {
        char filename [256];
        int rank;
        int NX = 10;
        double t[NX];

        /* ADIOS variables declarations for matching gread_temperature.ch */
        int adios_err;
        uint64_t adios_groupsize, adios_totalsize, adios_buf_size;
        int64_t adios_handle;
        MPI_Comm comm = MPI_COMM_WORLD;

        MPI_Init (&argc, &argv);
        MPI_Comm_rank (MPI_COMM_WORLD, &rank);
        sprintf (filename, "restart.bp");
        adios_init ("config.xml");
        adios_open (&adios_handle, "temperature", filename, "r", &comm);
        #include "gread_temperature.ch"
        adios_close (adios_handle);
        adios_finalize (rank);
        MPI_Finalize ();
        return 0;
    }

Figure 26. Read in data generated by 2_adios_write using gread_temperature.ch (examples/C/manual/3_adios_read.c)

The gread_temperature.ch file generated by gpp.py is the following:

    /* gread_temperature.ch */
    adios_group_size (adios_handle, adios_groupsize, &adios_totalsize);
    adios_buf_size = 4;
    adios_read (adios_handle, "NX", &NX, adios_buf_size);
    adios_buf_size = NX;
    adios_read (adios_handle, "temperature", t, adios_buf_size);

Figure 27. Example of a generated gread_temperature.ch file

11.7 Writing to Shared Files (P writers, N files)

As the number of processes increases to tens or hundreds of thousands, the number of files increases by the same magnitude if we use the POSIX method, while a single shared file may become too large if we use the MPI method. In this example we address a scenario in which multiple processes write to N files.

In the following example (Figure 28), we write out N files from P processes. This is achieved by creating a separate communicator for N subsets of the processes using MPI_Comm_split().
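The grouping itself can be checked without MPI. The sketch below mirrors the color and key arithmetic that Figure 28 passes to MPI_Comm_split() (color = rank % N, key = rank / N) and the file name derived from the color. The helper names split_for_rank and file_for_rank are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Result of the split arithmetic for one rank. */
typedef struct {
    int color;  /* which of the N files this rank writes to */
    int key;    /* position of the rank inside its subgroup */
} split_t;

/* Hypothetical helper: same arithmetic as in Figure 28. */
split_t split_for_rank (int rank, int nfiles)
{
    assert (rank >= 0 && nfiles > 0);
    split_t s;
    s.color = rank % nfiles;
    s.key   = rank / nfiles;
    return s;
}

/* Hypothetical helper: file name shared by all ranks of one color. */
void file_for_rank (char *out, size_t cap, int rank, int nfiles)
{
    split_t s = split_for_rank (rank, nfiles);
    snprintf (out, cap, "restart_%5.5d.bp", s.color);
}
```

With N = 3, ranks 1, 4 and 7 share color 1 and therefore the same file, and their keys 0, 1 and 2 give their order inside the new communicator.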
    #include <stdio.h>
    #include "mpi.h"
    #include "adios.h"
    int main (int argc, char ** argv)
    {
        char filename [256];
        int rank, size;
        int NX = 10;
        int N = 3;
        double t[NX];

        /* ADIOS variables declarations for matching gwrite_temperature.ch */
        int adios_err;
        uint64_t adios_groupsize, adios_totalsize;
        int64_t adios_handle;
        MPI_Comm comm;
        int color, key;

        MPI_Init (&argc, &argv);
        MPI_Comm_rank (MPI_COMM_WORLD, &rank);
        MPI_Comm_size (MPI_COMM_WORLD, &size);

        /* MPI_Comm_split partitions the world group into N disjoint subgroups,
         * the processes are ranked in terms of the argument key
         * a new communicator comm is returned for this specific grid configuration
         */
        color = rank % N;
        key = rank / N;
        MPI_Comm_split (MPI_COMM_WORLD, color, key, &comm);

        /* every P/N processes write into the same file
         * there are N files generated.
         */
        sprintf (filename, "restart_%5.5d.bp", color);
        adios_init ("config.xml");
        adios_open (&adios_handle, "temperature", filename, "w", &comm);
        #include "gwrite_temperature.ch"
        adios_close (adios_handle);
        adios_finalize (rank);
        MPI_Finalize ();
        return 0;
    }

Figure 28. Example ADIOS program writing N files from P processors (N)

The communicator comm created by the split is passed as an argument of the adios_open() call. Therefore, in this example, each file is written by the processes in the same communication domain. There is no need to change the XML file in this case because we are still using the MPI method.

11.8 Global Arrays

If each process writes out a sub-array that belongs to the same global space, ADIOS provides a way to write out global information so that the generic read API can see a single global array (as can the HDF5 or NetCDF file produced by our converters). This example demonstrates how to write global arrays, where the number of processes becomes a separate dimension. Each process writes the one-dimensional temperature array of size NX, and the result is a two-dimensional array of size PxNX.
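As a sketch of what "the number of processes becomes a separate dimension" means for the data layout, the code below places each process's local row of NX values at offset (rank, 0) of a PxNX array stored in C (row-major) order. Both helper functions are hypothetical, and the fill pattern value = rank*NX + i is an assumption inferred from the bpls listing in Section 11.8.1, not taken from the example source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copy one process's local 1 x nx block into a
 * size x nx global array (row-major). The block starts at (rank, 0). */
void place_local_block (double *global, int size, int nx,
                        int rank, const double *local)
{
    assert (global != NULL && local != NULL);
    assert (nx > 0 && rank >= 0 && rank < size);
    for (int i = 0; i < nx; i++)
        global[(size_t) rank * nx + i] = local[i];
}

/* Assumed fill pattern (inferred from the bpls output): rank*nx + i. */
void fill_local (double *local, int nx, int rank)
{
    assert (local != NULL && nx > 0 && rank >= 0);
    for (int i = 0; i < nx; i++)
        local[i] = (double) (rank * nx + i);
}
```

Placing the blocks of ranks 0 to 3 with NX = 10 yields the values 0 through 39 in order, which matches the {4, 10} array that bpls lists.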
Figure 29 shows how to define a global array in the XML file.

Figure 29. Config.xml for a global array (examples/C/global-array/adios_global.xml)

The variable is inserted into a … section. The global array's global dimension is defined by the variables size and NX, which are available in all processes with the same value. The offset of a local array written by a process is defined using the rank variable, which is different on every process. The variable itself is defined as a 1xNX two-dimensional array, although in the C code it is still a one-dimensional array.

The gwrite header file generated by gpp.py is the following:

    /* gwrite_temperature.ch */
    adios_groupsize = 4 \
                    + 4 \
                    + 4 \
                    + 8 * (1) * (NX);
    adios_group_size (adios_handle, adios_groupsize, &adios_totalsize);
    adios_write (adios_handle, "NX", &NX);
    adios_write (adios_handle, "size", &size);
    adios_write (adios_handle, "rank", &rank);
    adios_write (adios_handle, "temperature", t);

Figure 30. gwrite header file generated from config.xml

The program code is not very different from the one used in the above example. It needs to have the size and rank variables defined in the code (see examples/C/global-array/adios_global.c).

11.8.1 MPI-IO transport method (P writers, 1 file)

    $ mpirun -np 4 ./adios_global
    $ ls adios_global.bp
    adios_global.bp
    $ bpls -latd adios_global.bp -n 10
      integer  /NX           scalar = 10
      integer  /rank         scalar = 0
      integer  /size         scalar = 4
      double   /temperature  {4, 10} = 0 / 39 / 19.5 / 11.5434 {MIN / MAX / AVG / STD_DEV}
        (0,0)    0 1 2 3 4 5 6 7 8 9
        (1,0)    10 11 12 13 14 15 16 17 18 19
        (2,0)    20 21 22 23 24 25 26 27 28 29
        (3,0)    30 31 32 33 34 35 36 37 38 39
      string   /temperature/description  attr = "Global array written from 'size' processes"

The bp2ncd utility can be used to convert the bp file to a NetCDF file:

    $ bp2ncd adios_global.bp
    $ ncdump adios_global.nc
    netcdf adios_global {
    dimensions:
        NX = 10 ;
        size = 4 ;
        rank = 1 ;
    variables:
        double temperature(size, NX) ;
            temperature:description = "Global array written from \'size\' processes" ;
    data:

     temperature =
      0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
      10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
      20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
      30, 31, 32, 33, 34, 35, 36, 37, 38, 39 ;
    }

11.8.2 POSIX transport method (P writers, P subfiles + 1 metadata file)

To list variables output by the POSIX transport, the user only needs to pass the global metadata file (e.g., adios_global.bp) to bpls, not each individual file (e.g., adios_global.bp.dir/adios_global.bp.0). The outputs of the POSIX and MPI methods are equivalent from the reading point of view.
    $ mpirun -np 4 ./adios_global
    $ ls adios_global.bp
    adios_global.bp
    $ bpls -latd adios_global.bp -n 10
      integer  /NX           scalar = 10
      integer  /rank         scalar = 0
      integer  /size         scalar = 4
      double   /temperature  {4, 10} = 0 / 39 / 19.5 / 11.5434 {MIN / MAX / AVG / STD_DEV}
        (0,0)    0 1 2 3 4 5 6 7 8 9
        (1,0)    10 11 12 13 14 15 16 17 18 19
        (2,0)    20 21 22 23 24 25 26 27 28 29
        (3,0)    30 31 32 33 34 35 36 37 38 39
      string   /temperature/description  attr = "Global array written from 'size' processes"

The examples/C/global-array/adios_read_global.c program shows how to use the generic read API to read in the global array from an arbitrary number of processes.

11.9 Writing Time-Index into a Variable

The time-index allows the user to define a variable with an unlimited dimension, along which the variable can grow in time. Let's suppose the user wants to write out temperature after a certain number of iterations. First, we add the "time-index" attribute to the adios-group with an arbitrary name, e.g., "iter". Next, we find the (global) variable temperature in the adios-group and add "iter" as an extra dimension for it; the record number for that variable will be stored every time it gets written out. Note that we do not need to change the dimensions and offsets in the global bounds, only the individual variable. Also note that the time dimension must be the slowest changing dimension, i.e., the first one in C and the last one in Fortran.

    /* config.xml */

Figure 31. Config.xml for a global array with time (examples/C/global-array-time/adios_globaltime.xml)

The examples/C/global-array-time/adios_globaltime.c program is similar to the previous example, adios_global.c. The only difference is that it has an iteration loop in which each process writes out the data in each of its 13 iterations.
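The slice that this section dumps with bpls uses a start of "5,0,0" and a count of "2,-1,-1" on a {13, 4, 10} array, and bpls reports it as (5:6, 0:3, 0:9). The sketch below shows one way such a selection can be resolved into inclusive index ranges, assuming that a count of -1 means "to the end of that dimension", which is consistent with the bounds bpls prints. The function resolve_slice is hypothetical and is not how bpls is implemented.

```c
#include <assert.h>

/* Hypothetical helper: resolve a start/count selection against dims.
 * A count of -1 is taken to mean "to the end of the dimension".
 * Writes inclusive first/last indices per dimension.
 * Returns 0 on success, -1 if the selection is out of range. */
int resolve_slice (int ndim, const int *dims, const int *start,
                   const int *count, int *first, int *last)
{
    assert (ndim > 0 && dims && start && count && first && last);
    for (int d = 0; d < ndim; d++) {
        int c = (count[d] == -1) ? dims[d] - start[d] : count[d];
        if (start[d] < 0 || c <= 0 || start[d] + c > dims[d])
            return -1;
        first[d] = start[d];
        last[d]  = start[d] + c - 1;
    }
    return 0;
}
```

Because the time dimension is the first (slowest changing) one in C, selecting two consecutive timesteps only constrains index 0 and leaves the process and NX dimensions complete.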
    $ mpirun -np 4 ./adios_read_globaltime
    $ bpls -la adios_globaltime.bp
    Group temperature:
      integer  /NX           scalar = 10
      integer  /size         scalar = 4
      integer  /rank         scalar = 0
      double   /temperature  {13, 4, 10} = 100 / 1339 / 719.5 / 374.344 {MIN / MAX / AVG / STD_DEV}
      string   /temperature/description  attr = "Global array written from 'size' processes over several timesteps"

A slice of two timesteps (6th and 7th), dumped with bpls:

    $ bpls adios_globaltime.bp -s "5,0,0" -c "2,-1,-1" -n 10 -d temperature
      double   /temperature  {13, 4, 10}
        slice (5:6, 0:3, 0:9)
        (5,0,0)    600 601 602 603 604 605 606 607 608 609
        (5,1,0)    610 611 612 613 614 615 616 617 618 619
        (5,2,0)    620 621 622 623 624 625 626 627 628 629
        (5,3,0)    630 631 632 633 634 635 636 637 638 639
        (6,0,0)    700 701 702 703 704 705 706 707 708 709
        (6,1,0)    710 711 712 713 714 715 716 717 718 719
        (6,2,0)    720 721 722 723 724 725 726 727 728 729
        (6,3,0)    730 731 732 733 734 735 736 737 738 739

11.10 Reading statistics

In ADIOS 1.2, statistics like minimum, maximum, average and standard deviation can be aggregated inexpensively. This section shows how these statistics can be accessed from the BP file. The examples/C/stat/stat_write.c program is similar to the previous example, adios_globaltime.c. It writes an additional variable "complex" of type adios_double_complex along with "temperature." It also has a histogram enabled for the variable "temperature." Compared with the XML in the previous example, stat.xml has the following additions:

    /* stat.xml */

(Note: for Fortran, "iter" needs to be put at the end, i.e., dimension="NX,1,iter")

Figure 32. Config.xml for creating histogram for an array variable (examples/C/stat/stat.xml)

To include histogram calculation, only the XML file needs to be updated, and no change is required in the C code. The examples/C/stat/gwrite_stat.ch file requires an additional 8 * (2) * NX to be added to adios_groupsize and an adios_write (adios_handle, "complex", &c) to handle the complex numbers.
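One common reason such statistics are cheap to aggregate is that each process only has to contribute a few partial values (count, minimum, maximum, sum, and sum of squares), which can be merged without revisiting the data. The sketch below illustrates that general technique. It is an assumption for illustration, not a description of how ADIOS computes its statistics, and all names in it are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Partial statistics contributed by one process. */
typedef struct {
    size_t n;        /* number of values */
    double min, max;
    double sum;      /* sum of the values */
    double sumsq;    /* sum of the squared values */
} stats_t;

/* Partial statistics of one local array. */
stats_t stats_of (const double *x, size_t n)
{
    assert (x != NULL && n > 0);
    stats_t s = { n, x[0], x[0], 0.0, 0.0 };
    for (size_t i = 0; i < n; i++) {
        if (x[i] < s.min) s.min = x[i];
        if (x[i] > s.max) s.max = x[i];
        s.sum   += x[i];
        s.sumsq += x[i] * x[i];
    }
    return s;
}

/* Merge two partial results; no access to the original data is needed. */
stats_t stats_merge (stats_t a, stats_t b)
{
    stats_t m;
    m.n     = a.n + b.n;
    m.min   = (a.min < b.min) ? a.min : b.min;
    m.max   = (a.max > b.max) ? a.max : b.max;
    m.sum   = a.sum + b.sum;
    m.sumsq = a.sumsq + b.sumsq;
    return m;
}

double stats_mean (stats_t s)
{
    return s.sum / (double) s.n;
}

/* Population variance; the standard deviation is its square root. */
double stats_variance (stats_t s)
{
    double mean = stats_mean (s);
    return s.sumsq / (double) s.n - mean * mean;
}
```

For the values 0 through 39 split over four processes, the merged result has minimum 0, maximum 39, mean 19.5 and variance 133.25, whose square root is approximately 11.5434, the STD_DEV that bpls reports for the global array in Section 11.8.1. For data with a large mean relative to its spread, a numerically more careful merge formula would be preferable.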
    $ mpirun -np 2 ./stat_write
    [1]: adios_stat.bp written successfully
    [0]: adios_stat.bp written successfully

The examples/C/stat/stat_read.c program shows how to read back the statistics from the bp file. First, the statistics need to be populated into an ADIOS_VARINFO object. This is done with the following set of commands.

    ADIOS_FILE * f = adios_fopen ("adios_stat.bp", comm);
    ADIOS_GROUP * g = adios_gopen (f, "temperature");
    ADIOS_VARINFO * v = adios_inq_var (g, "temperature");

The object 'v' now contains all the statistical information for the variable "temperature." To access the histogram for temperature, we need to access the ADIOS_HIST data structure inside the ADIOS_VARINFO object. The code below prints the break points and the interval frequencies for the global histogram. For 'n' break points there are 'n + 1' intervals.

    /* Break points */
    for (j = 0; j < v->hist->num_breaks; j++)
        printf ("%lf ", v->hist->breaks[j]);

    /* Frequencies */
    for (j = 0; j <= v->hist->num_breaks; j++)
        printf ("%d\t", v->hist->gfrequencies[j]);

    adios_free_varinfo(v);

To access the statistics related to the variable "complex," we need:

    v = adios_inq_var (g, "complex");

The code below describes how to print the minimum values of the magnitude, real and imaginary part of complex data at each timestep. For complex variables only, all statistics need to be cast to double.

    double ** Cmin = (double **) v->mins;
    printf ("\nMagnitude Real Imaginary\n");
    for (j = 0; v->ndim >= 0 && (j < v->dims[0]); j ++)
        printf ("%lf %lf %lf\n", Cmin[j][0], Cmin[j][1], Cmin[j][2]);

    adios_free_varinfo(v);

12 Developer Manual

12.1 Create New Transport Methods

One of ADIOS's important features is the componentization of transport methods. Users can switch among the typical methods that we support or even create their own methods, which can be easily plugged into our library. The following sections describe the procedure for adding a new transport method called "abc" to the ADIOS library.
In this version of ADIOS, all the source files are located in /trunk/src/.

12.1.1 Add the new method macros in adios_transport_hooks.h

The first file users need to examine is adios_transport_hooks.h, which defines all the transport methods and the interface functions between the detailed transport implementation and the user APIs.

1. In the file, we first find the line that defines the enumeration type Adios_IO_methods_datatype, add the declaration of the method ID ADIOS_METHOD_ABC, and, because we add a new method, update the total number of transport methods ADIOS_METHOD_COUNT from 9 to 10.

    enum ADIOS_IO_METHOD {
         ADIOS_METHOD_UNKNOWN = -2
        ,ADIOS_METHOD_NULL    = -1
        ,ADIOS_METHOD_MPI     = 0
        ……
        ,ADIOS_METHOD_PHDF5   = 8
        ,ADIOS_METHOD_ABC     = 9     /* added */
        ,ADIOS_METHOD_COUNT   = 10    /* was 9 */
    };

2. Next, we need to declare the transport APIs for method "abc," including init/finalize, open/close, should_buffer, and read/write. Similar to the other methods, we need to add

    FORWARD_DECLARE (abc)

3. Then, we add the mapping of the string name "abc" of the new transport method to the method ID ADIOS_METHOD_ABC, which has already been defined in the enumeration type Adios_IO_methods_datatype. The last parameter is "1" if the method requires communications and "0" if not.

    MATCH_STRING_TO_METHOD ("abc", ADIOS_METHOD_ABC, 1)

4. Lastly, we add the mapping of the string name needed in the initialization functions to the method ID, which will be used by the adios_transport_struct variables defined in adios_internals.h.

    ASSIGN_FNS (abc, ADIOS_METHOD_ABC)

12.1.2 Create adios_abc.c

In this section, we demonstrate how to implement the transport APIs for method "abc." In adios_abc.c, we need to implement at least 11 required routines:

1. "adios_abc_init" allocates the method_data field in adios_method_struct to the user-defined transport data structure, such as adios_abc_data_struct, and initializes this data structure. Before the function returns, the initialization status can be set by the statement "adios_abc_initialized = 1."
2. "adios_abc_open" opens a file if there is only one processor writing to the file. Otherwise, this function does nothing; instead, we use adios_abc_should_buffer to coordinate the file open operations.
3. "adios_abc_should_buffer," called by the "common_adios_group_size" function in adios.c, needs to include coordination of open operations if multiple processes are writing to the same file.
4. "adios_abc_write," in the case of no buffering or overflow, writes data directly to disk. Otherwise, it verifies whether the internally recorded memory pointer is consistent with the vector variable's address passed in the function parameter and frees that block of memory if it is not needed any more.
5. "adios_abc_read" associates the internal data structure's address with the variable specified in the function parameter.
6. "adios_abc_close" simply closes the file if no buffering scheme is used. In general, however, this function performs most of the actual disk I/O, writing or reading the buffers to or from the file by one or more processors in the same communicator domain, and then closes the file.
7. "adios_abc_finalize" resets the initialization status back to 0 if it has been set to 1 by adios_abc_init.

If you are developing asynchronous methods, the following functions need to be implemented as well; otherwise you can leave them as empty implementations.

8. "adios_abc_get_write_buffer"
9. "adios_abc_end_iteration" is a tick counter for the I/O routines to time how fast they are emptying the buffers.
10. "adios_abc_start_calculation" indicates that it is now an ideal time to do bulk data transfers because the code will not be performing I/O for a while.
11. "adios_abc_stop_calculation" indicates that bulk data transfers should cease because the code is about to start communicating with other nodes.
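The idea behind this plug-in structure, a method name mapped to a set of routines that the library calls through a common interface, can be sketched with a table of function pointers. The code below is a deliberately simplified illustration: the type method_t, the lookup find_method, and the signatures of the abc_ routines are all hypothetical and do not match the real ADIOS structures or macros.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, simplified method table. The real ADIOS routines take
 * adios_file_struct and adios_method_struct arguments instead. */
typedef struct {
    const char *name;
    void (*init)     (void);
    int  (*open)     (const char *fname);
    void (*write)    (const void *data, int nbytes);
    void (*close)    (void);
    void (*finalize) (void);
} method_t;

/* State of the illustrative "abc" method. */
int abc_initialized = 0;
int abc_bytes_written = 0;

void abc_init (void)                 { abc_initialized = 1; }
int  abc_open (const char *fname)    { return fname != NULL; }
void abc_write (const void *data, int nbytes)
{
    (void) data;                     /* a real method would store the data */
    abc_bytes_written += nbytes;
}
void abc_close (void)                { }
void abc_finalize (void)
{
    if (abc_initialized)
        abc_initialized = 0;
}

/* Registry mapping string names to method implementations. */
const method_t methods[] = {
    { "abc", abc_init, abc_open, abc_write, abc_close, abc_finalize },
};

/* Look a method up by the name used in the configuration file. */
const method_t *find_method (const char *name)
{
    assert (name != NULL);
    for (size_t i = 0; i < sizeof methods / sizeof methods[0]; i++)
        if (strcmp (methods[i].name, name) == 0)
            return &methods[i];
    return NULL;
}
```

Application code that only calls through the table does not change when a different method name is selected, which is the property that lets the transport be switched in the XML file without recompiling.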
One of the most important things to note is the setting of fd->shared_buffer. If fd->shared_buffer = adios_flag_no, the method does not need a buffering scheme (PHDF5 is one such method), and the data is written out immediately, before adios_write returns. If fd->shared_buffer = adios_flag_yes, users can employ a self-defined buffering scheme to improve I/O performance.

12.1.3 A walk-through example

Now let's look at an example of adding an unbuffered POSIX method to ADIOS. According to the steps described above, we first open the header file adios_transport_hooks.h and add the following statements:

• enum ADIOS_IO_METHOD {
       ADIOS_METHOD_UNKNOWN        = -2
      ,ADIOS_METHOD_NULL           = -1
      ,ADIOS_METHOD_MPI            = 0
      …
      ,ADIOS_METHOD_PROVENANCE     = 8
      // method ID for the new transport method
      ,ADIOS_METHOD_POSIX_ASCII_NB = 9
      // total method number
      ,ADIOS_METHOD_COUNT          = 10
  };
• FORWARD_DECLARE (posix_ascii_nb);
• MATCH_STRING_TO_METHOD ("posix_ascii_nb", ADIOS_METHOD_POSIX_ASCII_NB, 0)
• ASSIGN_FNS (posix_ascii_nb, ADIOS_METHOD_POSIX_ASCII_NB)

Next, we must create adios_posix_ascii_nb.c, which defines all the required routines listed in Sect. 12.1.2. The listing below contains the data structures and required functions that developers need to implement in the source code.
    static int adios_posix_ascii_nb_initialized = 0;

    struct adios_POSIX_ASCII_UNBUFFERED_data_struct
    {
        FILE *f;
        uint64_t file_size;
    };

    void adios_posix_ascii_nb_init (const char *parameters
                                   ,struct adios_method_struct * method)
    {
        struct adios_POSIX_ASCII_UNBUFFERED_data_struct * md;
        if (!adios_posix_ascii_nb_initialized)
        {
            adios_posix_ascii_nb_initialized = 1;
        }
        method->method_data = malloc (
            sizeof(struct adios_POSIX_ASCII_UNBUFFERED_data_struct));
        md = (struct adios_POSIX_ASCII_UNBUFFERED_data_struct *)
                 method->method_data;
        md->f = 0;
        md->file_size = 0;
    }

    int adios_posix_ascii_nb_open (struct adios_file_struct * fd
                                  ,struct adios_method_struct * method)
    {
        char * name;
        struct adios_POSIX_ASCII_UNBUFFERED_data_struct * p;
        struct stat s;
        p = (struct adios_POSIX_ASCII_UNBUFFERED_data_struct *)
                method->method_data;
        name = malloc (strlen (method->base_path) + strlen (fd->name) + 1);
        sprintf (name, "%s%s", method->base_path, fd->name);
        if (stat (name, &s) == 0)
            p->file_size = s.st_size;

        switch (fd->mode)
        {
            case adios_mode_read:
            {
                p->f = fopen (name, "r");
                if (!p->f)      /* fopen returns NULL on failure */
                {
                    fprintf (stderr, "ADIOS POSIX ASCII UNBUFFERED: "
                                     "file not found: %s\n", fd->name);
                    free (name);
                    return 0;
                }
                break;
            }
            case adios_mode_write:
            {
                p->f = fopen (name, "w");
                if (!p->f)
                {
                    fprintf (stderr, "adios_posix_ascii_nb_open "
                                     "failed for base_path %s, name %s\n"
                            ,method->base_path, fd->name);
                    free (name);
                    return 0;
                }
                break;
            }
            case adios_mode_append:
            {
                int old_file = 1;
                p->f = fopen (name, "a");
                if (!p->f)
                {
                    fprintf (stderr, "adios_posix_ascii_nb_open"
                                     " failed for base_path %s, name %s\n"
                            ,method->base_path, fd->name);
                    free (name);
                    return 0;
                }
                break;
            }
            default:
            {
                fprintf (stderr, "Unknown file mode: %d\n", fd->mode);
                free (name);
                return 0;
            }
        }
        free (name);
        return 0;
    }

    enum ADIOS_FLAG adios_posix_ascii_nb_should_buffer (
                                 struct adios_file_struct * fd
                                ,struct adios_method_struct * method
                                ,void * comm)
    {
        /* in this case, we don't use shared_buffer */
        return adios_flag_no;
    }

    void adios_posix_ascii_nb_write (struct adios_file_struct * fd
                                    ,struct adios_var_struct * v
                                    ,void * data
                                    ,struct adios_method_struct * method)
    {
        struct adios_POSIX_ASCII_UNBUFFERED_data_struct * p;
        /* byte-wise view of data, so that offsets are valid standard C */
        char * bytes = (char *) data;
        p = (struct adios_POSIX_ASCII_UNBUFFERED_data_struct *)
                method->method_data;
        if (!v->dimensions)
        {
            /* scalar variable */
            switch (v->type)
            {
                case adios_byte:
                case adios_unsigned_byte:
                    fprintf (p->f, "%c\n", *((char *)data));
                    break;
                case adios_short:
                case adios_integer:
                case adios_unsigned_short:
                case adios_unsigned_integer:
                    fprintf (p->f, "%d\n", *((int *)data));
                    break;
                case adios_real:
                case adios_double:
                case adios_long_double:
                    fprintf (p->f, "%f\n", *((double *)data));
                    break;
                case adios_string:
                    fprintf (p->f, "%s\n", (char *)data);
                    break;
                case adios_complex:
                    fprintf (p->f, "%f+%fi\n", *((float *)bytes)
                                             , *((float *)(bytes+4)));
                    break;
                case adios_double_complex:
                    fprintf (p->f, "%f+%fi\n", *((double *)bytes)
                                             , *((double *)(bytes+8)));
                    break;
                default:
                    break;
            }
        }
        else
        {
            /* array variable */
            uint64_t j;
            int element_size = adios_get_type_size (v->type, v->data);
            printf ("element_size: %d\n", element_size);
            uint64_t var_size = adios_get_var_size (v, fd->group, v->data)
                                / element_size;
            switch (v->type)
            {
                case adios_byte:
                case adios_unsigned_byte:
                    for (j = 0; j < var_size; j++)
                        fprintf (p->f, "%c ", *((char *)(bytes+j)));
                    printf ("\n");
                    break;
                case adios_short:
                case adios_integer:
                case adios_unsigned_short:
                case adios_unsigned_integer:
                    for (j = 0; j < var_size; j++)
                        fprintf (p->f, "%d ", *((int *)(bytes+element_size*j)));
                    printf ("\n");
                    break;
                case adios_real:
                case adios_double:
                case adios_long_double:
                    for (j = 0; j < var_size; j++)
                        fprintf (p->f, "%f ", *((double *)(bytes+element_size*j)));
                    printf ("\n");
                    break;
                case adios_string:
                    for (j = 0; j < var_size; j++)
                        fprintf (p->f, "%s ", (char *)data);
                    printf ("\n");
                    break;
                case adios_complex:
                    for (j = 0; j < var_size; j++)
                        fprintf (p->f, "%f+%fi ", *((float *)(bytes+element_size*j))
                                ,*((float *)(bytes+4+element_size*j)));
                    printf ("\n");
                    break;
                case adios_double_complex:
                    for (j = 0; j < var_size; j++)
                        fprintf (p->f, "%f+%fi ", *((double *)(bytes+element_size*j))
                                ,*((double *)(bytes+element_size*j+8)));
                    printf ("\n");
                    break;
                default:
                    break;
            }
        }
    }

    void adios_posix_ascii_nb_get_write_buffer (struct adios_file_struct * fd
                                               ,struct adios_var_struct * v
                                               ,uint64_t * size
                                               ,void ** buffer
                                               ,struct adios_method_struct * method)
    {
        *buffer = 0;
    }

    void adios_posix_ascii_nb_read (struct adios_file_struct * fd
                                   ,struct adios_var_struct * v
                                   ,void * buffer
                                   ,uint64_t buffer_size
                                   ,struct adios_method_struct * method)
    {
        v->data = buffer;
        v->data_size = buffer_size;
    }

    int adios_posix_ascii_nb_close (struct adios_file_struct * fd
                                   ,struct adios_method_struct * method)
    {
        struct adios_POSIX_ASCII_UNBUFFERED_data_struct * p;
        p = (struct adios_POSIX_ASCII_UNBUFFERED_data_struct *)
                method->method_data;
        if (p->f)       /* close only a file that was actually opened */
        {
            fclose (p->f);
        }
        p->f = 0;
        p->file_size = 0;
    }

    void adios_posix_ascii_nb_finalize (int mype
                                       ,struct adios_method_struct * method)
    {
        if (adios_posix_ascii_nb_initialized)
            adios_posix_ascii_nb_initialized = 0;
    }

The POSIX_ASCII_NB transport method is a blocking method, for simplicity. Therefore, no special implementation of the three functions below is necessary and their function bodies can be left empty:

    adios_posix_ascii_nb_end_iteration (struct adios_method_struct * method) {}
    adios_posix_ascii_nb_start_calculation (struct adios_method_struct * method) {}
    adios_posix_ascii_nb_stop_calculation (struct adios_method_struct * method) {}

Above, we have implemented the POSIX_ASCII_NB transport method. When users specify the POSIX_ASCII_NB method in the XML file, their applications will generate ASCII files by using the common ADIOS APIs. However, in order to achieve better I/O performance, a buffering scheme needs to be added to this example.

12.2 Profiling the Application and ADIOS

There are two ways to get profiling information of ADIOS I/O operations.
One way is for the user to explicitly insert a set of profiling API calls around ADIOS API calls in the source code. The other way is to link the user code with a renamed ADIOS library and an ADIOS API wrapper library.

12.2.1 Use profiling API in source code

The profiling library called libadios_timing.a implements a set of profiling API calls. The user can use these API calls to wrap the ADIOS API calls in the source code to get profiling information. The adios-timing.h header file contains the declarations of those profiling functions.

    /*
     * initialize profiling
     *
     * Fortran interface
     */
    int init_prof_all_(char *prof_file_name, int prof_file_name_size);

    /*
     * record open start time for specified group
     *
     * Fortran interface
     */
    void open_start_for_group_(int64_t *gp_prof_handle, char *group_name,
                               int *cycle, int *gp_name_size);

    /*
     * record open end time for specified group
     *
     * Fortran interface
     */
    void open_end_for_group_(int64_t *gp_prof_handle, int *cycle);

    /*
     * record write start time for specified group
     *
     * Fortran interface
     */
    void write_start_for_group_(int64_t *gp_prof_handle, int *cycle);

    /*
     * record write end time for specified group
     *
     * Fortran interface
     */
    void write_end_for_group_(int64_t *gp_prof_handle, int *cycle);

    /*
     * record close start time for specified group
     *
     * Fortran interface
     */
    void close_start_for_group_(int64_t *gp_prof_handle, int *cycle);

    /*
     * record close end time for specified group
     *
     * Fortran interface
     */
    void close_end_for_group_(int64_t *gp_prof_handle, int *cycle);

    /*
     * Report timing info for all groups
     *
     * Fortran interface
     */
    int finalize_prof_all_();

    /*
     * record start time of a simulation cycle
     *
     * Fortran interface
     */
    void cycle_start_(int *cycle);

    /*
     * record end time of a simulation cycle
     *
     * Fortran interface
     */
    void cycle_end_(int *cycle);

An example of using these functions is given below.

    …
    ! initialization ADIOS
    CALL adios_init ("config.xml"//char(0))
    ! initialize profiling library; the parameter specifies the file
    ! where profiling information is written
    CALL init_prof_all("log"//char(0))
    …
    CALL MPI_Barrier(toroidal_comm, error)
    ! record start time of open
    ! group_prof_handle is an OUT parameter holding the handle for the
    ! group 'output3d.0'
    ! istep is iteration no.
    CALL open_start_for_group(group_prof_handle, "output3d.0"//char(0),istep)
    CALL adios_open(adios_handle, "output3d.0"//char(0), "w"//char(0))
    ! record end time of open
    CALL open_end_for_group(group_prof_handle,istep)
    ! record start time of write
    CALL write_start_for_group(group_prof_handle,istep)
    #include "gwrite_output3d.0.fh"
    ! record end time of write
    CALL write_end_for_group(group_prof_handle,istep)
    ! record start time of close
    CALL close_start_for_group(group_prof_handle,istep)
    CALL adios_close(adios_handle,adios_err)
    ! record end time of close
    CALL close_end_for_group(group_prof_handle,istep)
    …
    CALL adios_finalize (myid)
    ! finalize; profiling information is gathered and min/max/mean/var
    ! are calculated for each IO dump
    CALL finalize_prof()
    CALL MPI_FINALIZE(error)

When the code is run, profiling information will be saved to the file "./log" (specified in init_prof_all ()). Below is an example.
    Fri Aug 22 15:42:04 EDT 2008
    I/O Timing results
    Operations     : min                max                mean               var
    cycle no 3
    io count 0
    # Open         : 0.107671           0.108245           0.108032           0.000124
    # Open start   : 1219434228.866144  1219434230.775268  1219434229.748614  0.588501
    # Open end     : 1219434228.974225  1219434230.883335  1219434229.856646  0.588486
    # Write        : 0.000170           0.000190           0.000179           0.000005
    # Write start  : 1219434228.974226  1219434230.883336  1219434229.856647  0.588486
    # Write end    : 1219434228.974405  1219434230.883514  1219434229.856826  0.588484
    # Close        : 0.001608           0.001743           0.001656           0.000036
    # Close start  : 1219434228.974405  1219434230.883514  1219434229.856826  0.588484
    # Close end    : 1219434228.976040  1219434230.885211  1219434229.858482  0.588489
    # Total        : 0.109484           0.110049           0.109868           0.000137
    cycle no 6
    io count 1
    # Open         : 0.000007           0.000011           0.000009           0.000001
    # Open start   : 1219434240.098444  1219434242.007951  1219434240.981075  0.588556
    # Open end     : 1219434240.098452  1219434242.007962  1219434240.981083  0.588556
    # Write        : 0.000175           0.000196           0.000180           0.000004
    # Write start  : 1219434240.098452  1219434242.007962  1219434240.981083  0.588557
    # Write end    : 1219434240.098631  1219434242.008158  1219434240.981264  0.588558
    # Close        : 0.000947           0.003603           0.001234           0.000466
    # Close start  : 1219434240.098631  1219434242.008158  1219434240.981264  0.588558
    # Close end    : 1219434240.099665  1219434242.009620  1219434240.982498  0.588447
    # Total        : 0.001132           0.003789           0.001423           0.000466

The script "post_script.sh" extracts "open time", "write time", "close time", and "total time" from the raw profiling results and saves them in separate files: open, write, close, and total, respectively.

To compile the code, one should link it with the -ladios_timing -ladios options.

12.2.2 Use wrapper library

Another way to do profiling is to link the source code with a renamed ADIOS library and a wrapper library. The renamed ADIOS library implements the "real" ADIOS routines, but all ADIOS public functions are renamed with a prefix "P".
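The wrapper approach described in this section can be sketched in plain C: a function with the public name measures the time around a call to the P-prefixed "real" routine and forwards the result. Everything in the sketch is hypothetical. Pdemo_open stands in for a renamed library routine, demo_open for its wrapper, and profiling_enabled for the on/off flag that this section says is read from config.xml. None of these are actual ADIOS symbols.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for a "real" routine exported with the P prefix. */
int Pdemo_open (const char *fname)
{
    return (fname != NULL) ? 0 : -1;
}

/* Accumulated profile of the wrapped routine. */
typedef struct {
    long   calls;    /* number of profiled calls */
    double seconds;  /* CPU time spent inside the real routine */
} profile_t;

profile_t demo_open_profile = { 0, 0.0 };

/* Stands in for the per-group profiling flag from the configuration. */
int profiling_enabled = 1;

/* Wrapper with the public name: time the call and forward it. */
int demo_open (const char *fname)
{
    if (!profiling_enabled)
        return Pdemo_open (fname);

    clock_t start = clock ();
    int rc = Pdemo_open (fname);
    clock_t end = clock ();

    demo_open_profile.calls += 1;
    demo_open_profile.seconds += (double) (end - start) / CLOCKS_PER_SEC;
    return rc;
}
```

The application keeps calling demo_open unchanged, which is why no profiling calls have to be added to the source. The sketch uses clock(), which measures CPU time; a wall-clock timer would be a more natural choice for I/O profiling.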
For example, adios_open() is renamed as Padios_open(). The routine for parsing the config.xml file is also changed so that it parses extra flags in config.xml that turn profiling on or off. The wrapper library implements all ADIOS public functions (e.g., adios_open, adios_write, adios_close). Within each function, it calls the “real” function (Padios_xxx()) and measures the start and end time of the function call. There is an example wrapper library called libadios_profiling.a. Developers can implement their own wrapper library to customize the profiling.

To use the wrapper library, the user code should be linked with -ladios_profiling -ladios. The wrapper library should precede the “real” ADIOS library. There is no need to put additional profiling API calls in the source code. The user can turn profiling on or off for each ADIOS group by setting a flag in the config.xml file.

...

13 Appendix

Datatypes used in the ADIOS XML file

Size   Signed type                                                        Unsigned type
1      byte, integer*1                                                    unsigned byte, unsigned integer*1
2      short, integer*2                                                   unsigned short, unsigned integer*2
4      integer, integer*4, real, real*4, float                            unsigned integer, unsigned integer*4
8      long, integer*8, real*8, double, long float, complex, complex*8
16     real*16, long double, double complex, complex*16
       string

ADIOS APIs List

Function                  Purpose
adios_init                Load the XML configuration file, creating internal representations of the various data types and defining the methods used for writing.
adios_finalize            Clean up anything remaining before exiting the code.
adios_open                Prepare a data type for subsequent calls to write data using the io_handle. Mode is one of “r” (read), “w” (write) and “a” (append).
adios_close               Commit all writes to disk, close the file, and release the ADIOS file handle.
adios_group_size          Pass the required buffer size to the transport layer and return the total size to the source code.
adios_write               Submit a data element for writing. This does NOT actually perform the write in buffered mode. In the overflow case, this call writes to buffer directly.
adios_read                Submit a buffer space (var) for reading a data element into. This does NOT actually perform the read. Actual population of the buffer space will happen on the call to adios_close.
adios_set_path            Set the HDF5-style path for all variables in an ADIOS group. This will reset whatever is specified in the XML file.
adios_set_path_var        Set the HDF5-style path for the specified var in the group. This will reset whatever is specified in the XML file.
adios_get_write_buffer    For the given field, get a buffer of the given size that will be used at the transport level. If size == 0, the size is calculated automatically from the datatype in the XML file and any provided additional elements (such as array dimension elements). To return this buffer, just do a normal call to adios_write using the same io_handle, field_name, and the returned buffer.
adios_start_calculation   An indicator that it is now an ideal time to do bulk data transfers, as the code will not be performing IO for a while.
adios_end_calculation     An indicator that it is no longer a good time to do bulk data transfers, as the code is about to start communicating with other nodes, causing possible conflicts.
adios_end_iteration       A tick counter for the IO routines to time how fast they are emptying the buffers.
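The wrapper approach of Section 12.2.2 can be illustrated with a minimal C sketch. It is not taken from libadios_profiling.a: the names Pdemo_open, demo_open, prof_record, and profiling_enabled are hypothetical stand-ins chosen for illustration, and the "real" routine is a stub so that the example compiles without ADIOS. The sketch only assumes what the text states, namely that the public name is a thin wrapper that times a call to the P-prefixed routine and that profiling can be switched off by a flag.

```c
/* Sketch of a profiling wrapper around a P-prefixed "real" routine.
 * All names below are hypothetical; none of them are ADIOS symbols. */
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Accumulated timing for one wrapped routine. */
typedef struct {
    const char *name;   /* routine name used in the report */
    long calls;         /* number of timed calls */
    double total;       /* summed elapsed seconds */
    double min, max;    /* extremes over all timed calls */
} prof_record;

prof_record open_prof = { "demo_open", 0, 0.0, 0.0, 0.0 };
int profiling_enabled = 1;  /* stands in for the per-group flag in config.xml */

/* Wall-clock time in seconds (C11 timespec_get). */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Fold one measurement into the record. */
static void prof_add(prof_record *r, double elapsed)
{
    assert(r != NULL);
    if (r->calls == 0 || elapsed < r->min) r->min = elapsed;
    if (r->calls == 0 || elapsed > r->max) r->max = elapsed;
    r->total += elapsed;
    r->calls++;
}

/* Stub for the "real" routine, renamed with the P prefix. */
int Pdemo_open(const char *group, const char *mode)
{
    return (group != NULL && mode != NULL) ? 0 : -1;
}

/* Wrapper with the public name: times the call to the real routine. */
int demo_open(const char *group, const char *mode)
{
    if (!profiling_enabled)
        return Pdemo_open(group, mode);

    double start = now_seconds();
    int rc = Pdemo_open(group, mode);
    prof_add(&open_prof, now_seconds() - start);
    return rc;
}

/* Print min/max/mean for the wrapped routine. */
void demo_report(void)
{
    double mean = open_prof.calls ? open_prof.total / open_prof.calls : 0.0;
    printf("# %s : %f %f %f (%ld calls)\n", open_prof.name,
           open_prof.min, open_prof.max, mean, open_prof.calls);
}
```

A caller would invoke demo_open("output3d.0", "w") exactly as it would call the unwrapped routine and call demo_report() at the end of the run. Because the application only ever refers to the public name, no profiling calls appear in the user code, which is the property the text attributes to the wrapper library.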