PARALUTION User Manual

User manual

Version 1.1.0

January 19, 2016


This document is licensed under
Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License
http://creativecommons.org/licenses/by-nc-nd/3.0/legalcode


Imprint:
PARALUTION Labs UG (haftungsbeschränkt) & Co. KG
Am Hasensprung 6, 76571 Gaggenau, Germany
Commercial register: Amtsgericht Mannheim, HRA 706051
Represented by PARALUTION Labs Verwaltungs UG (haftungsbeschränkt)
Am Hasensprung 6, 76571 Gaggenau, Germany
Commercial register: Amtsgericht Mannheim, HRB 721277
Managing directors: Dimitar Lukarski, Nico Trost
www.paralution.com

Copyright © 2015-2016

Contents

1 Introduction 9
  1.1 Overview 9
  1.2 Purpose of this User Manual 10
  1.3 API Documentation 10
  1.4 Dual License 11
    1.4.1 Basic Versions 11
    1.4.2 Commercial Versions 11
  1.5 Version Nomenclature 12
  1.6 Features Nomenclature 12
  1.7 Cite 12
  1.8 Website 13

2 Installation 15
  2.1 Linux/Unix-like OS 15
    2.1.1 Makefile 15
    2.1.2 CMake 15
    2.1.3 Shared Library 16
  2.2 Windows OS 16
    2.2.1 OpenMP backend 16
    2.2.2 CUDA backend 17
    2.2.3 OpenCL backend 18
  2.3 Mac OS 18
  2.4 Supported Compilers 18
  2.5 Simple Test 19
  2.6 Compilation and Linking 19

3 Basics 21
  3.1 Design and Philosophy 21
  3.2 Operators and Vectors 21
    3.2.1 Local Operators/Vectors 22
    3.2.2 Global Operators/Vectors 22
  3.3 Functionality on the Accelerators 22
  3.4 Initialization of the Library 22
    3.4.1 Thread-core Mapping 24
    3.4.2 OpenMP Threshold Size 25
    3.4.3 Disable the Accelerator 25
    3.4.4 MPI and Multi-accelerators 25
  3.5 Automatic Object Tracking 26
  3.6 Verbose Output 26
  3.7 Verbose Output and MPI 26
  3.8 Debug Output 26
  3.9 File Logging 26
  3.10 Versions 26

4 Single-node Computation 27
  4.1 Introduction 27
  4.2 Code Structure 27
  4.3 Value Type 27
  4.4 Complex Support 28
  4.5 Allocation and Free 29
  4.6 Matrix Formats 29
    4.6.1 Coordinate Format – COO 30
    4.6.2 Compressed Sparse Row/Column Format – CSR/CSC 30
    4.6.3 Diagonal Format – DIA 31
    4.6.4 ELL Format 31
    4.6.5 HYB Format 31
    4.6.6 Memory Usage 32
    4.6.7 Backend support 32
  4.7 I/O 32
  4.8 Access 33
  4.9 Raw Access to the Data 34
  4.10 Copy CSR Matrix Host Data 35
  4.11 Copy Data 35
  4.12 Object Info 35
  4.13 Copy 35
  4.14 Clone 36
  4.15 Assembling 37
  4.16 Check 38
  4.17 Sort 38
  4.18 Keying 38
  4.19 Graph Analyzers 39
  4.20 Basic Linear Algebra Operations 40
  4.21 Local Stencils 41

5 Multi-node Computation 43
  5.1 Introduction 43
  5.2 Code Structure 44
  5.3 Parallel Manager 44
  5.4 Global Matrices and Vectors 45
    5.4.1 Value Type and Formats 45
    5.4.2 Matrix and Vector Operations 45
    5.4.3 Asynchronous SpMV 45
  5.5 Communication Channel 45
  5.6 I/O 45
    5.6.1 File Organization 45
    5.6.2 Parallel Manager 46
    5.6.3 Matrices 47
    5.6.4 Vectors 47

6 Solvers 49
  6.1 Code Structure 49
  6.2 Iterative Linear Solvers 51
  6.3 Stopping Criteria 51
  6.4 Building and Solving Phase 52
  6.5 Clear function and Destructor 53
  6.6 Numerical Update 53
  6.7 Fixed-point Iteration 53
  6.8 Krylov Subspace Solvers 54
    6.8.1 CG 54
    6.8.2 CR 54
    6.8.3 GMRES 54
    6.8.4 FGMRES 55
    6.8.5 BiCGStab 55
    6.8.6 IDR 55
    6.8.7 FCG 55
    6.8.8 QMRCGStab 56
    6.8.9 BiCGStab(l) 56
  6.9 Deflated PCG 56
  6.10 Chebyshev Iteration 57
  6.11 Mixed-precision Solver 57
  6.12 Multigrid Solver 58
    6.12.1 Geometric Multigrid 58
    6.12.2 Algebraic Multigrid 58
  6.13 Direct Linear Solvers 61
  6.14 Eigenvalue Solvers 61
    6.14.1 AMPE-SIRA 61

7 Preconditioners 63
  7.1 Code Structure 63
  7.2 Jacobi 63
  7.3 Multi-colored (Symmetric) Gauss-Seidel and SOR 64
  7.4 Incomplete LU with levels – ILU(p) 65
  7.5 Incomplete Cholesky – IC 65
  7.6 Incomplete LU with threshold – ILUT(t,m) 65
  7.7 Power(q)-pattern method – ILU(p,q) 66
  7.8 Multi-Elimination ILU 66
  7.9 Diagonal Preconditioner for Saddle-Point Problems 66
  7.10 Chebyshev Polynomial 67
  7.11 FSAI(q) 67
  7.12 SPAI 68
  7.13 Block Preconditioner 68
  7.14 Additive Schwarz and Restricted Additive Schwarz – AS and RAS 70
  7.15 Truncated Neumann Series Preconditioner (TNS) 71
  7.16 Variable Preconditioner 72
  7.17 CPR Preconditioner 72
  7.18 MPI Preconditioners 72

8 Backends 73
  8.1 Backend and Accelerators 73
  8.2 Copy 73
  8.3 CloneBackend 73
  8.4 Moving Objects To and From the Accelerator 74
  8.5 Asynchronous Transfers 75
  8.6 Systems without Accelerators 76

9 Advanced Techniques 77
  9.1 Hardware Locking 77
  9.2 Time Restriction/Limit 77
  9.3 OpenCL Kernel Encryption 78
  9.4 Logging Encryption 78
  9.5 Encryption 78
  9.6 Memory Allocation 78
    9.6.1 Allocation Problems 78
    9.6.2 Memory Alignment 78
    9.6.3 Pinned Memory Allocation (CUDA) 78

10 Plug-ins 79
  10.1 FORTRAN 79
  10.2 OpenFOAM 80
  10.3 Deal.II 81
  10.4 Elmer 83
  10.5 Hermes / Agros2D 84
  10.6 Octave/MATLAB 84

11 Remarks 87
  11.1 Performance 87
  11.2 Accelerators 88
  11.3 Plug-ins 88
  11.4 Correctness 88
  11.5 Compilation 88
  11.6 Portability 89

12 Performance Benchmarks 91
  12.1 Single-Node Benchmarks 91
    12.1.1 Hardware Description 91
    12.1.2 BLAS1 and SpMV 92
    12.1.3 CG Solver 92
  12.2 Multi-Node Benchmarks 95

13 Graph Analyzers 99

14 Functionality Table 103
  14.1 Backend Support 103
  14.2 Matrix-Vector Multiplication Formats 103
  14.3 Local Matrices and Vectors 104
  14.4 Global Matrices and Vectors 104
  14.5 Local Solvers and Preconditioners 105
  14.6 Global Solvers and Preconditioners 107

15 Code Examples 109
  15.1 Preconditioned CG 109
  15.2 Multi-Elimination ILU Preconditioner with CG 111
  15.3 Gershgorin Circles + Power Method + Chebyshev Iteration + PCG with Chebyshev Polynomial 113
  15.4 Mixed-precision PCG Solver 117
  15.5 PCG Solver with AMG 120
  15.6 AMG Solver with External Smoothers 123
  15.7 Laplace Example File with 4 MPI Processes 127

16 Change Log 133

Bibliography 139

Chapter 1

Introduction

1.1 Overview

PARALUTION is a sparse linear algebra library with a focus on exploring fine-grained parallelism, targeting
modern processors and accelerators including multi/many-core CPU and GPU platforms. The main goal of this
project is to provide a portable library for iterative sparse methods on state-of-the-art hardware. Figure 1.1
shows the PARALUTION framework as middleware between different parallel backends and application-specific
packages.

Figure 1.1: The PARALUTION library – middleware between hardware and problem specific packages.
The major features and characteristics of the library are:
• Various backends:
– Host – designed for multi-core CPU, based on OpenMP
– GPU/CUDA – designed for NVIDIA GPU
– OpenCL – designed for OpenCL-compatible devices (NVIDIA GPU, AMD GPU, CPU, Intel MIC)
– OpenMP(MIC) – designed for Intel Xeon Phi/MIC
• Multi-node/accelerator support – the library supports multi-node and multi-accelerator configurations
via MPI layer.
• Easy to use – the syntax and the structure of the library provide an easy learning curve. With the help of
the examples, anyone can try out the library – no knowledge of CUDA, OpenCL or OpenMP is required.
• No special hardware/library required – there are no hardware or library requirements to install and
run PARALUTION. If a GPU device and CUDA, OpenCL, or Intel MKL are available, the library will use
them.
• Most popular operating systems:
– Unix/Linux systems (via cmake/Makefile and gcc/icc)
– MacOS (via cmake/Makefile and gcc/icc)
– Windows (via Visual Studio)
• Various iterative solvers:
– Fixed-Point iteration – Jacobi, Gauss-Seidel, Symmetric-Gauss Seidel, SOR and SSOR
– Krylov subspace methods – CR, CG, BiCGStab, BiCGStab(l), GMRES, IDR, QMRCGSTAB, Flexible
CG/GMRES
– Deflated PCG
– Mixed-precision defect-correction scheme
– Chebyshev iteration
– Multigrid – geometric and algebraic
• Various preconditioners:
– Matrix splitting – Jacobi, (Multi-colored) Gauss-Seidel, Symmetric Gauss-Seidel, SOR, SSOR
– Factorization – ILU(0), ILU(p) (based on levels), ILU(p,q) (power(q)-pattern method) and Multi-elimination ILU (nested/recursive), ILUT (based on threshold), IC(0)
– Approximate Inverse - Chebyshev matrix-valued polynomial, SPAI, FSAI and TNS
– Diagonal-based preconditioner for Saddle-point problems
– Block-type of sub-preconditioners/solvers
– Additive Schwarz and Restricted Additive Schwarz
– Variable type of preconditioners
• Generic and robust design – PARALUTION is based on a generic and robust design allowing expansion
in the direction of new solvers and preconditioners, and support for various hardware types. Furthermore,
the design of the library allows the use of all solvers as preconditioners in other solvers. For example, you can
easily define a CG solver with a Multi-elimination preconditioner, where the last block is preconditioned
with another Chebyshev iteration method which is preconditioned with a multi-colored Symmetric Gauss-Seidel scheme (a minimal sketch of such nesting is shown after this feature list).
• Portable code and results – all code based on PARALUTION is portable and independent of GPU/CUDA,
OpenCL or MKL. The code will compile and run everywhere. All solvers and preconditioners are based on
a single source code, which delivers portable results across all supported backends (variations are possible
due to different rounding modes on the hardware). The only difference which you can see for a hardware
change is the performance variation.
• Support for several sparse matrix formats – Compressed Sparse Row (CSR), Modified Compressed
Sparse Row (MCSR), Dense (DENSE), Coordinate (COO), ELL, Diagonal (DIA), Hybrid format of ELL
and COO (HYB) formats.
• Plug-ins – the library provides a plug-in for the CFD package OpenFOAM [44] and for the finite element
library Deal.II [8, 9]. The library also provides a FORTRAN interface and a plug-in for MATLAB/Octave
[36, 2].
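
The following is a minimal sketch of the solver/preconditioner nesting mentioned above, written in the style of
the code examples of Chapter 15. The class and function names (CG, Jacobi, SetOperator(), SetPreconditioner(),
Build(), Solve()) are those used later in this manual; the input file matrix.mtx is only a placeholder.

#include <paralution.hpp>

using namespace paralution;

int main(int argc, char* argv[]) {

  init_paralution();

  // Read the system matrix and set up the vectors ("matrix.mtx" is a placeholder file name)
  LocalMatrix<double> mat;
  LocalVector<double> rhs, x;
  mat.ReadFileMTX("matrix.mtx");
  rhs.Allocate("rhs", mat.get_nrow());
  x.Allocate("x", mat.get_nrow());
  rhs.Ones();
  x.Zeros();

  // A CG solver preconditioned with Jacobi - any solver can act as a preconditioner of another
  CG<LocalMatrix<double>, LocalVector<double>, double> ls;
  Jacobi<LocalMatrix<double>, LocalVector<double>, double> p;

  ls.SetOperator(mat);
  ls.SetPreconditioner(p);
  ls.Build();
  ls.Solve(rhs, &x);

  ls.Clear();
  stop_paralution();

  return 0;
}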

1.2 Purpose of this User Manual

The purpose of this document is to present the PARALUTION library step-by-step. This includes the installation
process, internal structure and design, application programming interface (API) and examples. The change log
of the software is also presented here.
All related documentation (web site information, user manual, white papers, doxygen) follows the Creative
Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License [12]. A copy of the license can be found
in the library package.

1.3 API Documentation

The most important library functions are presented in this document. The library's full API (reference)
is documented via the automatic documentation system doxygen [55]. The references are available on the
PARALUTION web site.

Figure 1.2: The three PARALUTION versions – the Basic GPLv3 version (single-node), the commercial
single-node/accelerator version (adds advanced solvers and full complex support) and the commercial
multi-node/accelerator version (additionally adds more advanced solvers and the MPI communication layer).

1.4 Dual License

The PARALUTION library is distributed within a dual license model. Three different versions can be obtained
– a GPLv3 version which is free of charge and two commercial versions.

1.4.1 Basic Versions

The basic version of the code is released under the GPL v3 license [18]. It provides the core components of the
single-node functionality of the library. This version of the library is free. Please note that, due to the GPLv3
license model, any code which uses the library must be released as open source and must comply with the
GPLv3.

1.4.2 Commercial Versions

The user can purchase a commercial license which grants the ability to embed the code into a (commercial)
closed binary package. In addition, these versions come with more features and with support. The only restriction
imposed on the user is that he/she is not allowed to further distribute the source code.
Single-node/accelerator Version
In addition to the basic version, the commercial license features:
• License to embed the code into a commercial/non-opensource product
• Full complex numbers support on Host, CUDA and OpenCL
• More iterative solvers - QMRCGStab, BiCGStab(l), FCG
• More AMG schemes - Ruge-Stüben, Pairwise
• More preconditioners - VariablePreconditioners
Multi-node/accelerator Version
In addition to the commercial basic version, the multi-node license features:
• License to embed the code into a commercial/non-opensource product
• Same functionality as for the single-node/accelerator version
• MPI layer for HPC clusters

12

CHAPTER 1. INTRODUCTION
• Global AMG solver (Pairwise)

Note Please note that the multi-node version does not support Windows/Visual Studio.

1.5 Version Nomenclature

Please note the following compatibility policy with respect to versioning. The version number x.y.z represents:
x is the major version (increases when modifications of the library and/or a new API have been introduced),
y is the minor version (increases when new functionality (algorithms, schemes, etc.) has been added, possibly
with small or no modification of the API) and z is the revision (increases due to bug fixes or performance
improvements). Alpha and beta versions are denoted with a and b; typically, these are pre-release versions.
As mentioned above, there are three versions of PARALUTION; each release has the same version numbering
plus an additional suffix for its type. They are abbreviated with "B" for the Basic version (free under GPLv3),
with "S" for the single-node/accelerator commercial version and with "M" for the multi-node/accelerator
commercial version.
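
Since the version macros described in Section 3.10 encode exactly this numbering, a small compile-time check
can make the policy concrete. The following sketch assumes the PARALUTION_VER* macros from Section 3.10
are visible via the main header.

#include <paralution.hpp>
#include <cstdio>

int main(void) {

  // PARALUTION_VER = 10000*major + 100*minor + revision (see Section 3.10)
  std::printf("PARALUTION %d.%d.%d\n",
              PARALUTION_VER_MAJOR,
              PARALUTION_VER_MINOR,
              PARALUTION_VER_REV);

#if PARALUTION_VER < 10100
  // e.g. require at least version 1.1.0
  std::printf("Warning: library older than 1.1.0\n");
#endif

  return 0;
}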

1.6 Features Nomenclature

The functions described in this user manual follow this nomenclature:
• ValueType – type of values can be double (D), float (F), integer (I), and complex (double and float). The
particular bit representation (8,16,32,64bit) depends on your compiler and operating system.
• Computation – on which backend the computation can be performed, the abbreviation follows: Host backend
with OpenMP (H); CUDA backend (C); OpenCL backend (O); Xeon Phi backend with OpenMP (X).
• Available – in which version this functionality is available: Basic/GPLv3 (B); Single-node/accelerator
commercial version (S); Multi-node/accelerator commercial version (M).
The following example states that the function has double, float, integer and complex support; it can be
performed on Host backend with OpenMP, CUDA backend, OpenCL backend, Xeon Phi backend with OpenMP;
and it is available in all three versions of the library.
ValueType   Computation   Available
D,F,I,C     H,C,O,X       B,S,M

Solvers and preconditioners split the computation into a Building phase and a Solving phase, which can have
different computation backend support. In the following example the solver/preconditioner supports double,
float and complex; the building phase can be performed only on the Host with OpenMP or on CUDA; the solving
phase can be performed on all backends; and it is available only in the commercial versions of the library.
ValueType   Building phase   Solving phase   Available
D,F,C       H,C              H,C,O,X         S,M

Note The full complex support is available only in the commercial version, check Section 4.4.

1.7 Cite

If you would like to cite the PARALUTION library, you can do so by citing our web site. Please specify
the version of the software and/or the date of accessing our web page, for example:
@misc{paralution,
  author = "{PARALUTION Labs}",
  title  = "{PARALUTION vX.Y.Z}",
  year   = "20XX",
  note   = {\url{http://www.paralution.com/}}
}


1.8 Website

The official web site of the library (including all (free and commercial) versions) is
http://www.paralution.com


Chapter 2

Installation
PARALUTION can be compiled under Linux/Unix, Windows and Mac systems.
Note Please check Section 11.5 for additional remarks.

2.1 Linux/Unix-like OS

After downloading and unpacking the library, the user needs to compile it. We provide two compilation configurations – cmake and Makefile.

2.1.1 Makefile

In this case, the user needs to modify the Makefile which contains the information about the available compilers.
By default PARALUTION will only use gcc [20] compiler (no GPU support). The user can switch to icc [23] with
or without MKL [24] support. To compile with GPU support, the user needs to uncomment the corresponding
CUDA [43] lines in the Makefile (NVIDIA CUDA, when mentioned, also includes cuBLAS and cuSPARSE). The same procedure needs to be followed for the OpenCL [29] and for the
OpenMP(MIC) backend.
Note During the compilation only one backend can be selected, i.e. if a GPU is available the user needs to
select either CUDA or OpenCL support.
The default installation process can be summarized in the following lines:
  wget http://www.paralution.com/download/paralution-x.y.z.tar.gz
  tar zxvf paralution-x.y.z.tar.gz
  cd paralution-x.y.z/src
  make all
  make install
where x.y.z is the version of the library.
Note Please note that the multi-node version of PARALUTION can only be compiled using CMake.

2.1.2 CMake

The compilation with cmake [30] is easier to handle – all compiler specifications are determined automatically.
The compilation process can be performed by
  wget http://www.paralution.com/download/paralution-x.y.z.tar.gz
  tar zxvf paralution-x.y.z.tar.gz
  cd paralution-x.y.z
  mkdir build
  cd build
  cmake ..
  make

where x.y.z is the version of the library. Advanced compilation can be performed with cmake -i ..; you need this
option to compile the library with the OpenMP(MIC) backend.
The priority of the backends during the compilation process is: CUDA, OpenCL, MIC.
You can also choose specific options via the command line, for example CUDA:
  cd paralution-x.y.z
  mkdir build
  cd build
  cmake -DSUPPORT_CUDA=ON -DSUPPORT_OMP=ON ..
  make -j
For the Intel Xeon Phi, OpenMP(MIC) backend:

  cd paralution-x.y.z
  mkdir build
  cd build
  cmake -DSUPPORT_MIC=ON -DSUPPORT_CUDA=OFF -DSUPPORT_OCL=OFF ..
  make -j

Note A ptk file is generated in the build directory when using CMake.

2.1.3 Shared Library

Both compilation processes produce a shared library file libparalution.so. Ensure that the library object can be
found in your library path. If you do not copy the library to a specific location you can add the path under Linux
in the LD_LIBRARY_PATH variable.

  export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/paralution-x.y.z/build/lib/

2.2 Windows OS

This section will introduce a step-by-step guide to compile and use the PARALUTION library in a Windows-based
environment.
Note Please note that the multi-node version does not support Windows/Visual Studio.

2.2.1 OpenMP backend

PARALUTION with OpenMP backend comes with Visual Studio Project files that support Visual Studio 2010,
2012 and 2013. PARALUTION is built as a static library during this step-by-step guide.
• Open Visual Studio and navigate to File\Open Project.

Figure 2.1: Open an existing Visual Studio project.
• Load the corresponding PARALUTION project file, located in the visualstudio\paralution_omp directory.
The PARALUTION and CG projects should appear.
• Right-click the PARALUTION project, and start to build the library. Once finished, Visual Studio should
report a successful build.
• Next, repeat the building procedure with the CG example project. Once finished, a successful build should
be reported.


Figure 2.2: Build the PARALUTION library.

Figure 2.3: Visual Studio output for a successful build of the static PARALUTION library.

Figure 2.4: Build the PARALUTION CG example.

Figure 2.5: Visual Studio output for a successful build of the PARALUTION CG example.

• Finally, the CG executable should be located within the visualstudio\paralution_omp\Release directory.
Note For testing, Windows 7 and Visual Studio Express 2013 have been used.
Note OpenMP support can be enabled/disabled in the project properties. Navigate to C++\Language for
the corresponding switch.
Note The current version of PARALUTION does not support MPI (i.e. the M-version of the library) under
Windows.

2.2.2 CUDA backend

PARALUTION with CUDA backend comes with Visual Studio Project files that support Visual Studio 2010,
2012 and 2013. Please follow the same steps as for the OpenMP backend compilation but using the visualstudio\paralution_cuda directory.

2.2.3 OpenCL backend

PARALUTION with OpenCL backend comes with Visual Studio Project files that support Visual Studio 2010,
2012 and 2013. Please follow the same steps as for the OpenMP backend compilation but using the visualstudio\paralution_ocl directory. Additionally, the OpenCL include directory and OpenCL library directory need to
be specified within Visual Studio, as illustrated in Figures 2.6 and 2.7.

Figure 2.6: Setting up Visual Studio OpenCL include directory.

2.3 Mac OS

To compile PARALUTION under Mac OS, please follow the Linux/Unix-like OS instructions for the CMake compilation.

2.4 Supported Compilers

The library has been tested with the following compilers:

  cmake           2.8.12.2; 3.0.2; 3.1.3; 3.2.0     B,S,M
  gcc/g++         4.4.7; 4.6.3; 4.8.2               B,S,M
  icc (MKL)       12.0; 13.1; 14.0.4; 15.0.0        B,S,M
  CUDA            5.0; 5.5; 6.0; 6.5; 7.0; 7.5      B,S,M
  Intel OpenCL    1.2                               B,S,M
  NVIDIA OpenCL   1.2                               B,S,M
  AMD OpenCL      1.2; 2.0                          B,S,M
  Visual Studio   2010; 2012; 2013                  B,S
  MPICH           3.1.3                             M
  OpenMPI         1.5.3; 1.6.3; 1.6.5; 1.8.4        M
  Intel MPI       4.1.2; 5.0.1                      M

Note Please note that CUDA has limited ICC support.
Note Please note that Intel MPI >= 5.0.0 is only supported by CMake >= 3.2.0.


Figure 2.7: Setting up Visual Studio OpenCL library directory.

2.5 Simple Test

You can test the installation by running a CG solver on a Laplace matrix [38]. After compiling the library you
can perform the CG solver test by executing:
  cd paralution-x.y.z
  cd build/bin
  wget ftp://math.nist.gov/pub/MatrixMarket2/Harwell-Boeing/laplace/gr_30_30.mtx.gz
  gzip -d gr_30_30.mtx.gz
  ./cg gr_30_30.mtx

2.6 Compilation and Linking

After compiling the PARALUTION library, the user needs to specify the include and the linker path to compile
a program.

  g++ -O3 -Wall -I/path/to/paralution-x.y.z/build/inc -c main.cpp -o main.o
  g++ -o main main.o -L/path/to/paralution-x.y.z/build/lib/ -lparalution

"Compilation and linking"
Before the execution of a program which has been compiled with PARALUTION, the library path needs to
be added to the environment variables, similar to

  export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/paralution-x.y.z/build/lib/
When compiling with MPI, library and program need to be compiled using mpic++ or the corresponding
MPI C++ compiler.
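
As a point of reference for the compile and link commands above, a minimal main.cpp could look as follows. It
is only a sketch that initializes and shuts down the library, mirroring the example from Chapter 3; a real
application would build matrices, vectors and solvers in between.

#include <paralution.hpp>

using namespace paralution;

int main(int argc, char* argv[]) {

  // Allocate the hardware backends
  init_paralution();

  // Print the backend information
  info_paralution();

  // ... matrices, vectors and solvers go here ...

  // Release the allocated hardware
  stop_paralution();

  return 0;
}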


Chapter 3

Basics

3.1 Design and Philosophy

The main idea of the PARALUTION objects is that they are separated from the actual hardware specification.
Once you declare a matrix, a vector or a solver they are initially allocated on the host (CPU). Then every object
can be moved to a selected accelerator by a simple move-to-accelerator function. The whole execution mechanism
is based on run-time type information (RTTI) which allows you to select where and how you want to perform
the operations at run time. This is in contrast to the template-based libraries which need this information at
compile time.
The philosophy of the library is to abstract the hardware-specific functions and routines from the actual
program which describes the algorithm. It is hard and almost impossible for most of the large simulation software
based on sparse computation to adapt and port their implementation in order to use every new technology. On the
other hand, the new high performance accelerators and devices have the capability to decrease the computational
time significantly in many critical parts.
This abstraction layer of the hardware specific routines is the core of PARALUTION’s design, it is built to
explore fine-grained level of parallelism suited for multi/many-core devices. This is in contrast to most of the
parallel sparse libraries available which are mainly based on domain decomposition techniques. Thus, the design
of the iterative solvers and the preconditioners is very different. Another cornerstone of PARALUTION is the native
support of accelerators - the memory allocation, transfers and specific hardware functions are handled internally
in the library.
PARALUTION helps you to use accelerator technologies but does not force you to use them. As you can see
later in this chapter, even if you offload your algorithms and solvers to the accelerator device, the same source
code can be compiled and executed in a system without any accelerators.
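
To illustrate the move-to-accelerator idea, the following sketch moves a vector to the accelerator and back. It
assumes the MoveToAccelerator()/MoveToHost() functions described in Chapter 8; on a system without an
accelerator the same code simply runs on the host.

#include <paralution.hpp>

using namespace paralution;

int main(int argc, char* argv[]) {

  init_paralution();

  LocalVector<double> x;
  x.Allocate("x", 100);
  x.Ones();

  // Offload the object to the selected accelerator (a no-op on systems without one)
  x.MoveToAccelerator();

  // ... perform operations on the accelerator ...

  // Bring the object back to the host
  x.MoveToHost();

  x.Clear();
  stop_paralution();

  return 0;
}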

3.2 Operators and Vectors

The main objects in PARALUTION are linear operators and vectors. All objects can be moved to an accelerator
at run time – a structure diagram is presented in Figure 3.1. Currently, we support GPUs by CUDA (NVIDIA)
and OpenCL (NVIDIA, AMD) backends, and we provide OpenMP MIC backend for the Intel Xeon Phi.
Figure 3.1: Host and backends structure for different hardware

The linear operators are defined as local or global matrices (i.e. on a single node or distributed/multi-node)
and local stencils (i.e. matrix-free linear operations).

Figure 3.2: Operator and vector classes
The only template parameter of the operators and vectors is the data type (ValueType). The operator data
type could be float or double, while the vector data type can be float, double or int (int is used mainly for the
permutation vectors). In the current version, cross ValueType object operations are not supported.
Each of the objects contains a local copy of the hardware descriptor created by the init_platform() function.
This allows the user to modify it according to his needs and to obtain two or more objects with different hardware
specifications (e.g. different number of OpenMP threads, CUDA block sizes, etc.).

3.2.1 Local Operators/Vectors

By Local Operators/Vectors we refer to Local Matrices and Stencils, and to Local Vectors. By Local we mean
that they stay on a single system. The system can contain several CPUs via a UMA or NUMA memory system;
it can also contain an accelerator.

3.2.2 Global Operators/Vectors

By Global Operators/Vectors we refer to Global Matrix and to Global Vectors. By Global we mean the fact they
can stay on a single node or on multiple nodes in a network. For this type of computation, the communication is
based on MPI.

3.3 Functionality on the Accelerators

Naturally, not all routines and algorithms can be performed efficiently on many-core systems (i.e. on accelerators).
To provide full functionality the library has internal mechanisms to check if a particular routine is implemented
on the accelerator. If not, the object is moved to the host and the routine is computed there. This guarantees
that your code will run (maybe not in the most efficient way) with any accelerator regardless of the available
functionality for it.

3.4 Initialization of the Library

The body of a PARALUTION code is very simple; it should contain the header file and the namespace of the
library. The program must contain an initialization call, which will check and allocate the hardware, and a
finalizing call which will release the allocated hardware.
1 #i n c l u d e 

2 3 u s i n g namespace p a r a l u t i o n ; 4 5 i n t main ( i n t argc , c h a r ∗ argv [ ] ) { 6 7 init paralution () ; 8 9 info paralution () ; 10 11 // . . . 12 3.4. INITIALIZATION OF THE LIBRARY 13 stop paralution () ; 14 15 } ”Initialize and shutdown PARALUTION” 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 This v e r s i o n o f PARALUTION i s r e l e a s e d under GPL. By downloading t h i s package you f u l l y a g r e e with t h e GPL l i c e n s e . Number o f CPU c o r e s : 32 Host t h r e a d a f f i n i t y p o l i c y − t h r e a d mapping on e v e r y c o r e Number o f GPU d e v i c e s i n t h e system : 2 PARALUTION v e r B0 . 8 . 0 PARALUTION p l a t f o r m i s i n i t i a l i z e d A c c e l e r a t o r backend : GPU(CUDA) OpenMP t h r e a d s : 3 2 S e l e c t e d GPU d e v i c e : 0 −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− De vi c e number : 0 De vi c e name : T e s l a K20c totalGlobalMem : 4799 MByte c l o c k R a t e : 705500 compute c a p a b i l i t y : 3 . 5 ECCEnabled : 1 −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− De vi c e number : 1 De vi c e name : T e s l a K20c totalGlobalMem : 4799 MByte c l o c k R a t e : 705500 compute c a p a b i l i t y : 3 . 5 ECCEnabled : 1 −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− ”An example output for info paralution() on a GPU (CUDA) system” 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 This v e r s i o n o f PARALUTION i s r e l e a s e d under GPL. By downloading t h i s package you f u l l y a g r e e with t h e GPL l i c e n s e . Number o f CPU c o r e s : 32 Host t h r e a d a f f i n i t y p o l i c y − t h r e a d mapping on e v e r y c o r e Number o f OpenCL d e v i c e s i n t h e system : 2 PARALUTION v e r B0 . 8 . 0 PARALUTION p l a t f o r m i s i n i t i a l i z e d A c c e l e r a t o r backend : OpenCL OpenMP t h r e a d s : 3 2 S e l e c t e d OpenCL p l a t f o r m : 0 S e l e c t e d OpenCL d e v i c e : 0 −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− P l a t f o r m number : 0 P l a t f o r m name : NVIDIA CUDA De vi c e number : 0 De vi c e name : T e s l a K20c De vi c e type : GPU totalGlobalMem : 4799 MByte c l o c k R a t e : 705 OpenCL v e r s i o n : OpenCL 1 . 1 CUDA −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− P l a t f o r m number : 0 P l a t f o r m name : NVIDIA CUDA De vi c e number : 1 De vi c e name : T e s l a K20c De vi c e type : GPU totalGlobalMem : 4799 MByte 23 24 CHAPTER 3. BASICS 29 c l o c k R a t e : 705 30 OpenCL v e r s i o n : OpenCL 1 . 1 CUDA 31 −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− ”An example output for info paralution() on a GPU (OpenCL) system” 1 2 3 4 5 6 7 8 9 10 11 1 2 3 4 5 6 7 8 This v e r s i o n o f PARALUTION i s r e l e a s e d under GPL. By downloading t h i s package you f u l l y a g r e e with t h e GPL l i c e n s e . Number o f CPU c o r e s : 16 Host t h r e a d a f f i n i t y p o l i c y − t h r e a d mapping on e v e r y c o r e MIC backed i s i n i t i a l i z e d PARALUTION v e r B0 . 8 . 0 PARALUTION p l a t f o r m i s i n i t i a l i z e d A c c e l e r a t o r backend : MIC(OpenMP) OpenMP t h r e a d s : 1 6 Number o f MIC d e v i c e s i n t h e system : 2 S e l e c t e d MIC d e v i c e s : 0 ”An example output for info paralution() on Intel Xeon Phi (OpenMP) system” This v e r s i o n o f PARALUTION i s r e l e a s e d under GPL. 
By downloading t h i s package you f u l l y a g r e e with t h e GPL l i c e n s e . Number o f CPU c o r e s : 2 Host t h r e a d a f f i n i t y p o l i c y − t h r e a d mapping on e v e r y c o r e PARALUTION v e r B0 . 8 . 0 PARALUTION p l a t f o r m i s i n i t i a l i z e d A c c e l e r a t o r backend : None OpenMP t h r e a d s : 2 ”An example output for info paralution() on system without accelerator” The init paralution() function defines a backend descriptor with information about the hardware and its specifications. All objects created after that contain a copy of this descriptor. If the specifications of the global descriptor are changed (e.g. set different number of threads) and new objects are created, only the new objects will use the new configurations. For control the library provides the following functions • select device paralution(int dev) – this is a unified function which select a specific device. If you have compiled the library with a backend and for this backend there are several available cards you can use this function to select a particular one. This function works for all backends (CUDA, OpenCL, Xeon Phi). • set omp threads paralution(int nthreads) – with this function the user can set the number of OpenMP threads. This function has to be called after the init paralution(). • set gpu cuda paralution(int ndevice) – in a multi-GPU system, the user can select a specific GPU by this function. This function has to be called before the init paralution(). • set ocl paralution(int nplatform, int ndevice) – in a multi-platform/accelerator system, the user can select a specific platform and device by this function. This function has to be called before the init paralution(). 3.4.1 Thread-core Mapping The number of threads which PARALUTION will use can be set via the set omp threads paralution() function or by the global OpenMP environment variable (for Unix-like OS this is OMP NUM THREADS ). During the initialization phase, the library provides affinity thread-core mapping based on: • if the number of cores (including hyperthreading cores) is greater or equal than two times the number of threads – then all the threads can occupy every second core ID (e.g. 0, 2, 4, ...). This is to avoid having two threads working on the same physical core when hyperthreading is enabled. • if the number of threads is less or equal to the number of cores (including hyperthreading), and the previous clause is false. Then the threads can occupy every core ID (e.g. 0, 1, 2, 3, ...). • if non of the above criteria is matched – the default thread-core mapping is used (typically set by the OS). Note The thread-core mapping is available only for Unix-like OS. For Windows OS, the thread-core mapping is selected by the operating system. Note The user can disable the thread affinity by calling set omp affinity(false) (and enable it with set omp affinity(true)), before initializing the library (i.e. before init paralution()). 3.4. INITIALIZATION OF THE LIBRARY 3.4.2 25 OpenMP Threshold Size Whenever you want to work on a small problem, you might observe that the OpenMP Host backend is (slightly) slower than using no OpenMP. This is mainly attributed to the small amount of work which every thread should perform and the large overhead of forking/joining threads. This can be avoid by the OpenMP threshold size parameter in PARALUTION. 
The default threshold is set to 10, 000, which means that all matrices under (and equal) this size will use only one thread (disregarding the number of OpenMP threads set in the system). The threshold can be modified via the function set omp threshold(). 3.4.3 Disable the Accelerator If you want to disable the accelerator (without recompiling the code), you need to call disable accelerator paralution() function before the init paralution(). 3.4.4 MPI and Multi-accelerators When initializing the library with MPI, the user need to pass the rank of the MPI process as well as the number of accelerators which are available on each node. Basically, in this way the user can specify the mapping of MPI process and accelerators – the allocated accelerator will be rank % num dev per node. Thus the user can run two MPI process on systems with two GPUs by specifying the number of devices to 2. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 #i n c l u d e

#i n c l u d e u s i n g namespace p a r a l u t i o n ; i n t main ( i n t argc , c h a r ∗ argv [ ] ) { M PI I n it (& argc , &argv ) ; MPI Comm comm = MPI COMM WORLD; int int MPI MPI num processes ; rank ; Comm size (comm, &n u m p r o c e s s e s ) ; Comm rank (comm, &rank ) ; i n i t p a r a l u t i o n ( rank , 2 ) ; info paralution () ; // . . . stop paralution () ; } ”Initialize and shutdown PARALUTION with MPI” 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 The d e f a u l t OS t h r e a d a f f i n i t y c o n f i g u r a t i o n w i l l be used Number o f GPU d e v i c e s i n t h e system : 2 The d e f a u l t OS t h r e a d a f f i n i t y c o n f i g u r a t i o n w i l l be used PARALUTION v e r M0. 8 . 0 PARALUTION p l a t f o r m i s i n i t i a l i z e d A c c e l e r a t o r backend : GPU(CUDA) OpenMP t h r e a d s : 1 S e l e c t e d GPU d e v i c e : 0 −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− De vi c e number : 0 De vi c e name : T e s l a K20c totalGlobalMem : 4799 MByte c l o c k R a t e : 705500 compute c a p a b i l i t y : 3 . 5 ECCEnabled : 1 26 16 17 18 19 20 21 22 23 24 25 26 CHAPTER 3. BASICS −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− De vi c e number : 1 De vi c e name : T e s l a K20c totalGlobalMem : 4799 MByte c l o c k R a t e : 705500 compute c a p a b i l i t y : 3 . 5 ECCEnabled : 1 −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− MPI rank : 0 MPI s i z e : 2 ”An example output for info paralution() with 2 MPI processes” 3.5 Automatic Object Tracking By default, after the initialization of the library, it tracks all objects and releasing the allocated memory in them when the library is stopped. This ensure large memory leaks when the objects are allocated but not freed. The user can disable the tracking by editing src/utils/def.hpp. 3.6 Verbose Output PARALUTION provides different levels of output messages. They can be set in the file src/utils/def.hpp before the compilation of the library. By setting a higher level, the user will obtain more detailed information about the internal calls and data transfers to and from the accelerators. 3.7 Verbose Output and MPI To avoid all MPI processes to print information on the screen the default configuration is that only RANK 0 outputs information on the screen. The user can change the RANK or allow all RANK to print by modifying src/utils/def.hpp. If file logging is enabled, all ranks write into corresponding log files. 3.8 Debug Output You can also enable debugging output which will print almost every detail in the program, including object constructor/destructor, address of the object, memory allocation, data transfers, all function calls for matrices, vectors, solvers and preconditioners. The debug flag can be set in src/utils/def.hpp. When enabled, additional assert()s are being checked during the computation. This might decrease the performance of some operations. 3.9 File Logging All output can be logged into a file, the file name will be paralution-XXXX.log, where XXX will be a counter in milliseconds. To enable the file you need to edit src/utils/def.hpp. 3.10 Versions For checking the version in your code you can use the PARALUTION’s pre-defined macros. 
1 2 3 4 5 PARALUTION VER MAJOR −− g i v e s t h e v e r s i o n major v a l u e PARALUTION VER MINOR −− g i v e s t h e v e r s i o n minor v a l u e PARALUTION VER REV −− g i v e s t h e v e r s i o n r e v i s i o n PARALUTION VER PRE −− g i v e s pre−r e l e a s e s ( a s ”ALPHA” o r ”BETA” ) The final PARALUTION VER gives the version as 10000major + 100minor + revision. The different type of versions (Basic/GPLv3; Single-node; Multi-node) are defined as PARALUTION VER TYPE B for Basic/GPLv3; S for Single-node; and M for Multi-node. Chapter 4 Single-node Computation 4.1 Introduction In this chapter we describe all base objects (matrices, vectors and stencils) for computation on single-node (shared-memory) systems. A typical configuration is presented on Figure 4.1. Accelerator (e.g. GPU) Bus (PCI-E) Host (CPU) Figure 4.1: A typical single-node configuration, where gray-boxes represent the cores, blue-boxes represent the memory, arrows represent the bandwidth The compute node contains none, one or more accelerators. The compute node could be any kind of sharedmemory (single, dual, quad CPU) system. Note that the memory of the accelerator and of the host can be physically different. 4.2 Code Structure The Data is an object, pointing to the BaseMatrix class. The pointing is coming from either a HostMatrix or an AcceleratorMatrix. The AcceleratorMatrix is created by an object with an implementation in each of the backends (CUDA, OpenCL, Xeon Phi) and a matrix format. Switching between host or accelerator matrix is performed in the LocalMatrix class. The LocalVector is organized in the same way. Each matrix format has its own class for the host and for each accelerator backend. All matrix classes are derived from the BaseMatrix which provides the base interface for computation as well as for data accessing, see Figure4.4. The GPU (CUDA backend) matrix structure is presented in Figure 4.5, all other backend follows the same organization. 4.3 Value Type The value (data) type of the vectors and the matrices is defined as a template. The matrix can be of type float (32-bit), double (64-bit) and complex (64/128-bit). The vector can be float (32-bit), double (64-bit), complex 27 28 CHAPTER 4. SINGLE-NODE COMPUTATION Local Matrix/Vector New Backend Dynamic Switch Host Data Accelerator OpenMP Intel MIC OpenMP Data Data GPU, Accelerators OpenCL Data GPU CUDA Data Figure 4.2: Local Matrices and Vectors paralution::ParalutionObj paralution::ParalutionObj paralution::BaseParalution< ValueType > paralution::BaseParalution< ValueType > paralution::Operator< ValueType > paralution::Vector< ValueType > paralution::LocalMatrix< ValueType > paralution::LocalVector< ValueType > Figure 4.3: LocalMatrix and Local Vector paralution::BaseMatrix< ValueType > paralution::AcceleratorMatrix< ValueType > paralution::HostMatrix< ValueType > paralution::GPUAcceleratorMatrix< ValueType > paralution::HostMatrixBCSR< ValueType > paralution::MICAcceleratorMatrix< ValueType > paralution::HostMatrixCOO< ValueType > paralution::OCLAcceleratorMatrix< ValueType > paralution::HostMatrixCSR< ValueType > paralution::HostMatrixDENSE< ValueType > paralution::HostMatrixDIA< ValueType > paralution::HostMatrixELL< ValueType > paralution::HostMatrixHYB< ValueType > paralution::HostMatrixMCSR< ValueType > Figure 4.4: BaseMatrix (64/128-bit) and int (32/64-bit). The information about the precision of the data type is shown in the Print() function. 
4.4 Complex Support

PARALUTION supports complex computation in all functions due to its internal template structure. The host implementation is based on std::complex. The binary data layout is the same for the CUDA and the OpenCL backends. The complex support of the backends with respect to the versions is presented in Table 4.1.

Figure 4.5: GPUAcceleratorMatrix (CUDA) class hierarchy – one class per matrix format (BCSR, COO, CSR, DENSE, DIA, ELL, HYB, MCSR), derived from AcceleratorMatrix and BaseMatrix

          B    S    M
Host      Yes  Yes  Yes
CUDA      No   Yes  Yes
OpenCL    No   Yes  Yes
Xeon Phi  No   No   No

Table 4.1: Complex support

4.5 Allocation and Free

The allocation functions require a name of the object (for information purposes only) and the corresponding size description for vector and matrix objects.

LocalVector<ValueType> vec;

vec.Allocate("my vector", 100);
vec.Clear();
"Vector allocation/free"

LocalMatrix<ValueType> mat;

mat.AllocateCSR("my csr matrix", 456, 100, 100); // nnz, rows, columns
mat.Clear();

mat.AllocateCOO("my coo matrix", 200, 100, 100); // nnz, rows, columns
mat.Clear();
"Matrix allocation/free"

4.6 Matrix Formats

Matrices in which most of the elements are equal to zero are called sparse. In most practical applications the number of non-zero entries is proportional to the size of the matrix (e.g. typically, if the matrix A is in R^(N x N), then the number of elements is of order O(N)). To save memory, we can avoid storing the zero entries by introducing a structure that holds only the non-zero elements of the matrix. PARALUTION supports sparse CSR, MCSR, COO, ELL, DIA, HYB and dense matrices (DENSE).

To illustrate the different formats, let us consider the following matrix, shown in Figure 4.6. The matrix is A in R^(5 x 5) with 11 non-zero entries. The indexing in all formats described below is zero based (i.e. index values start at 0, not at 1).

      1  2  0 11  0
      0  3  4  0  0
A =   0  5  6  7  0
      0  0  0  8  0
      0  0  0  9 10

Figure 4.6: A sparse matrix example

Note The functionality of every matrix object is different and depends on the matrix format. The CSR format provides the widest support of functions. For a few operations an internal conversion is performed; however, for many routines an error message is printed and the program is terminated.

Note In the current version, some of the conversions are performed on the host (regardless of whether the object is allocated on the host or on the accelerator).

mat.ConvertToCSR();
// Perform a matrix-vector multiplication in CSR format
mat.Apply(x, &y);

mat.ConvertToELL();
// Perform a matrix-vector multiplication in ELL format
mat.Apply(x, &y);
"Conversion between matrix formats"

mat.ConvertTo(CSR);
// Perform a matrix-vector multiplication in CSR format
mat.Apply(x, &y);

mat.ConvertTo(ELL);
// Perform a matrix-vector multiplication in ELL format
mat.Apply(x, &y);
"Conversion between matrix formats (alternative)"

4.6.1 Coordinate Format – COO

The most intuitive sparse format is the coordinate format (COO). It represents the non-zero elements of the matrix by their coordinates: we store two index arrays (one for the row and one for the column index) and the values. Thus, our example matrix has the following structure:

row  0 0 0  1 1 2 2 2 3 4 4
col  0 1 3  1 2 1 2 3 3 3 4
val  1 2 11 3 4 5 6 7 8 9 10

Figure 4.7: Sparse matrix in COO format

4.6.2 Compressed Sparse Row/Column Format – CSR/CSC

One of the most popular formats in many scientific codes is the compressed sparse row (CSR) format. In this format, instead of storing the full row index array we only store the offsets to the position where each row starts. Thus, we can easily jump to any row and access all of its elements sequentially. However, this format does not allow sequential access to the column entries.

row offset  0 3 5 8 9 11
col         0 1 3  1 2 1 2 3 3 3 4
val         1 2 11 3 4 5 6 7 8 9 10

Figure 4.8: Sparse matrix in CSR format

The analogue of this format is the compressed sparse column (CSC) format, where the offsets are stored by column – then the column elements can be traversed sequentially.

In many finite element (difference/volume) applications, the diagonal elements are non-zero. In such cases, we can store them at the beginning of the value array. This format is often referred to as modified compressed sparse row/column (MCSR/MCSC).

4.6.3 Diagonal Format – DIA

If all (or most) of the non-zero entries belong to a few diagonals of the matrix, we can store them together with the corresponding offsets. In our example, we have 4 diagonals; the main diagonal (denoted with offset 0) is fully occupied, while the others also contain zero entries.

dia  -1  0  1  3

val   *  1  2 11
      0  3  4  0
      5  6  7  *
      0  8  0  *
      9 10  *  *

Figure 4.9: Sparse matrix in DIA format

Please note that the values in this format are stored as an array of size D x ND, where D is the number of diagonals in the matrix and ND is the number of elements in the main diagonal. Not all values in this array are occupied – the non-accessible entries are denoted with a star. They correspond to the offsets in the diagonal array (negative values represent offsets from the beginning of the array).

4.6.4 ELL Format

The ELL format can be seen as a modification of CSR in which we do not store the row offsets; instead, a fixed number of elements per row is stored.

4.6.5 HYB Format

As one can notice, DIA and ELL cannot represent a completely unstructured sparse matrix efficiently: to keep the memory footprint low, DIA requires the elements to belong to a few diagonals and ELL requires a fixed number of elements per row. For many applications this is too strong a restriction. A solution is to represent the more regular part of the matrix in such a format and the remaining part in COO format. The HYB format, implemented in PARALUTION, is a mix between ELL and COO, where the maximum number of elements per row for the ELL part is computed as nnz/num_rows.

col  0 1 3     val  1  2 11
     1 2 *          3  4  *
     1 2 3          5  6  7
     3 * *          8  *  *
     3 4 *          9 10  *

Figure 4.10: Sparse matrix in ELL format
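To make the CSR layout of Section 4.6.2 concrete, the following small host sketch multiplies the example matrix from Figure 4.8 by a vector using plain arrays. This is purely illustrative C++ with no PARALUTION calls; the arrays are exactly the ones shown in Figure 4.8.

#include <iostream>

int main() {

  // CSR arrays of the 5x5 example matrix (Figure 4.8)
  int    row_offset[] = {0, 3, 5, 8, 9, 11};
  int    col[]        = {0, 1, 3, 1, 2, 1, 2, 3, 3, 3, 4};
  double val[]        = {1, 2, 11, 3, 4, 5, 6, 7, 8, 9, 10};

  double x[5] = {1, 1, 1, 1, 1};
  double y[5];

  // y = A*x : for each row, accumulate over its non-zero entries only
  for (int i = 0; i < 5; ++i) {
    y[i] = 0.0;
    for (int j = row_offset[i]; j < row_offset[i+1]; ++j)
      y[i] += val[j] * x[col[j]];
  }

  for (int i = 0; i < 5; ++i)
    std::cout << "y[" << i << "] = " << y[i] << std::endl;

  return 0;
}
"CSR matrix-vector product with plain arrays (illustration only)"

The inner loop runs over the non-zeros of row i only, which is why CSR is the preferred format for the sparse matrix-vector product.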
4.6.6 Memory Usage

The memory footprint of the different matrix formats is presented in the following table. Here, we consider an N x N matrix where the number of non-zero entries is denoted by NNZ.

Format   Structure     Values
DENSE    --            N x N
COO      2 x NNZ       NNZ
CSR      N + 1 + NNZ   NNZ
ELL      M x N         M x N
DIA      D             D x ND

For the ELL matrix, M characterizes the maximal number of non-zero elements per row; for the DIA matrix, D defines the number of diagonals and ND the size of the main diagonal.

4.6.7 Backend support

         Host    CUDA   OpenCL   MIC/Xeon Phi
CSR      Yes     Yes    Yes      Yes
COO      Yes(1)  Yes    Yes      Yes(1)
ELL      Yes     Yes    Yes      Yes
DIA      Yes     Yes    Yes      Yes
HYB      Yes     Yes    Yes      Yes
DENSE    Yes     Yes    Yes(2)   No
BCSR     No      No     No       No

(1) Serial version
(2) Basic version

4.7 I/O

The user can read and write matrix files stored in the Matrix Market format [37].

LocalMatrix<ValueType> mat;

mat.ReadFileMTX("my_matrix.mtx");
mat.WriteFileMTX("my_matrix.mtx");
"I/O MTX Matrix"

Matrix files in binary format are also supported for the compressed sparse row storage format.

LocalMatrix<ValueType> mat;

mat.ReadFileCSR("my_matrix.csr");
mat.WriteFileCSR("my_matrix.csr");
"I/O CSR Binary Matrix"

The binary format stores the CSR data as follows:

out.write((char*) &nrow,      sizeof(IndexType));
out.write((char*) &ncol,      sizeof(IndexType));
out.write((char*) &nnz,       sizeof(IndexType));
out.write((char*) row_offset, (nrow+1)*sizeof(IndexType));
out.write((char*) col,        nnz*sizeof(IndexType));
out.write((char*) val,        nnz*sizeof(ValueType));
"CSR Binary Format"

Vectors can be read or written via ASCII formatted files:

LocalVector<ValueType> vec;

vec.ReadFileASCII("my_vector.dat");
vec.WriteFileASCII("my_vector.dat");
"I/O ASCII Vector"

4.8 Access

ValueType: D,F,I,C; Computation: H; Available: B,S,M

The elements of a vector can be accessed via the [] operator when the vector is allocated on the host. In the following example, a vector is allocated with 100 elements and initialized with 1 for all odd elements and -1 for all even elements.

LocalVector<ValueType> vec;

vec.Allocate("vector", 100);
vec.Ones();

for (int i=0; i<100; i=i+2)
  vec[i] = -1;
"Vector element access"

Note Accessing elements via the [] operator is slow. Use it for debugging purposes only.

There is no direct access to the elements of matrices due to the sparsity structure. Matrices can be imported and exported by copy functions; for a CSR matrix these are CopyFromCSR() and CopyToCSR().

// Allocate the CSR arrays
int *row_offsets = new int[100+1];
int *col         = new int[345];
ValueType *val   = new ValueType[345];

// fill the CSR matrix
// ...

// Create a PARALUTION matrix
LocalMatrix<ValueType> mat;

// Import the matrix into PARALUTION
mat.AllocateCSR("my_matrix", 345, 100, 100);
mat.CopyFromCSR(row_offsets, col, val);

// Export the matrix from PARALUTION
// (row_offsets, col, val have to be allocated)
mat.CopyToCSR(row_offsets, col, val);
"Matrix access"

4.9 Raw Access to the Data

ValueType: D,F,I,C; Computation: H,C; Available: B,S,M

For vector and matrix objects you can have direct access to the raw data via pointers. Already allocated data can be set with the SetDataPtr() function.

LocalVector<ValueType> vec;

ValueType *ptr_vec = new ValueType[200];

vec.SetDataPtr(&ptr_vec, "vector", 200);
"Set allocated data to a vector"

// Allocate the CSR arrays
int *row_offsets = new int[100+1];
int *col         = new int[345];
ValueType *val   = new ValueType[345];

// fill the CSR matrix
// ...

// Create a PARALUTION matrix
LocalMatrix<ValueType> mat;

// Set the CSR data in PARALUTION
mat.SetDataPtrCSR(&row_offsets, &col, &val, "my_matrix", 345, 100, 100);
"Set allocated data to a CSR matrix"

With LeaveDataPtr() you can obtain the raw data from the object. This will leave the object empty.

LocalVector<ValueType> vec;
ValueType *ptr_vec = NULL;

vec.Allocate("vector", 100);

vec.LeaveDataPtr(&ptr_vec);
"Get (steal) the data from a vector"

// Create a PARALUTION matrix
LocalMatrix<ValueType> mat;

// Allocate and fill the PARALUTION matrix
// ...

// Define the external CSR structure
int *row_offsets = NULL;
int *col         = NULL;
ValueType *val   = NULL;

mat.LeaveDataPtrCSR(&row_offsets, &col, &val);
"Get (steal) the data from a CSR matrix"

After calling the SetDataPtr*() functions (for vectors or matrices), the passed pointers are set to NULL.

Note If the object is allocated on the host, then the pointers obtained from SetDataPtr() and LeaveDataPtr() are host pointers; if the object is on the accelerator, then the data pointers are accelerator pointers.

Note If the object is moved to or from the accelerator, the original pointer becomes invalid.

Note Never rely on old pointers; hidden object movement to and from the accelerator will make them invalid.

Note Whenever you pass pointers to or obtain pointers from a PARALUTION object, you need to use the matching memory allocation/free functions; please check the source code for details (for the host src/utils/allocate_free.cpp and for Host/CUDA src/base/gpu/gpu_allocate_free.cu).

4.10 Copy CSR Matrix Host Data

ValueType: D,F,C; Computation: H,C,O; Available: B,S,M

If the CSR matrix data pointers are only accessible as constant, the user can create a PARALUTION matrix object and pass const CSR host pointers using the CopyFromHostCSR() function. PARALUTION will then allocate and copy the CSR matrix on the backend where the original object is located.

4.11 Copy Data

ValueType: D,F,I,C; Computation: H,C; Available: B,S,M

The user can copy data to and from a local vector via the CopyFromData() and CopyToData() functions. The vector must be allocated beforehand with the size corresponding to the data.

4.12 Object Info

Information about an object can be printed with the info() function.

mat.info();
vec.info();
"Vector/Matrix information"

LocalMatrix name=G3_circuit.mtx; rows=1585478; cols=1585478; nnz=7660826; prec=64bit; asm=no; format=CSR; backends={CPU(OpenMP), GPU(CUDA)}; current=CPU(OpenMP)

LocalVector name=x; size=900; prec=64bit; host backend={CPU(OpenMP)}; accelerator backend={No Accelerator}; current=CPU(OpenMP)
"Vector/Matrix information"

In this example, the matrix has been loaded and stored in CSR format with double precision; the library is compiled with CUDA support and without MKL, and the matrix is located on the host.
The vector information is coming from another compilation of the library with no OpenCL/CUDA/MKL support. 4.13 Copy ValueType D,F,I,C Computation H,C,O,X Available B,S,M All matrix and vector objects provide a CopyFrom() and a CopyTo() function. The destination object should have the same size or be empty. In the latter case the object is allocated at the source platform. Note This function allows cross platform copying - one of the objects could be allocated on the accelerator backend. 36 1 2 3 4 5 6 7 8 9 10 11 12 13 CHAPTER 4. SINGLE-NODE COMPUTATION L o c a l V e c t o r vec1 , vec2 ; // A l l o c a t e and i n i t vec1 and vec2 // . . . vec1 . MoveToAccelerator ( ) ; // now vec1 i s on t h e a c c e l e r a t o r ( i f any ) // and vec2 i s on t h e h o s t // we can copy vec1 t o vec2 and v i c e v e r s a vec1 . CopyFrom ( vec2 ) ; ”Vector copy” Note For vectors, the user can specify source and destination offsets and thus copy only a part of the whole vector into another vector. Note When copying a matrix - the source and destination matrices should be in the same format. 4.14 Clone ValueType D,F,I,C Computation H,C,O,X Available B,S,M The copy operators allow you to copy the values of the object to another object, without changing the backend specification of the object. In many algorithms you might need auxiliary vectors or matrices. These objects can be cloned with the function CloneFrom(). 1 2 3 4 5 6 7 8 9 10 L o c a l V e c t o r vec ; // a l l o c a t e and i n i t vec ( h o s t o r a c c e l e r a t o r ) // . . . L o c a l V e c t o r tmp ; // tmp w i l l have t h e same v a l u e s // and i t w i l l be on t h e same backend a s vec tmp . CloneFrom ( vec ) ; ”Clone” If the data of the object needs to be kept, then you can use the CloneBackend() function to copy (clone) only the backend. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 L o c a l V e c t o r vec ; LocalMatrix mat ; // a l l o c a t e and i n i t vec , mat ( h o s t o r a c c e l e r a t o r ) // . . . L o c a l V e c t o r tmp ; // tmp and vec w i l l have t h e same backend a s mat tmp . CloneBackend ( mat ) ; vec . CloneBackend ( mat ) ; // t h e matrix−v e c t o r m u l t i p l i c a t i o n w i l l be performed // on t h e backend s e l e c t e d i n mat mat . Apply ( vec , &tmp ) ; ”Clone backend” 4.15. ASSEMBLING 4.15 37 Assembling ValueType D,F,I,C Computation H (CSR-only) Available B,S,M In many codes based on finite element, finite difference or finite volume, the user need to assemble the matrix before solving the linear system. For this purpose we provide an assembling function – the user need to pass three arrays, representing the row and column index and the value; similar function is provided for the vector assembling. Ai,j = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X vi,j . (4.1) LocalMatrix mat ; L o c a l V e c t o r r h s ; // rows i n t i [ 1 1 ] = {0 , 1 , 2 , 3 , 4 , 5 , 1 , 2 , 4 , 3 , 3 } ; // c o l s i n t j [ 1 1 ] = {0 , 1 , 2 , 3 , 4 , 5 , 2 , 2 , 5 , 2 , 2 } ; // v a l u e s ValueType v [ 1 1 ] = { 2 . 3 , 3 . 5 , 4 . 7 , 6 . 3 , 0 . 4 , 4 . 3 , 6 . 2 , 4 . 4 , 4 . 6 , 0 . 7 , 4 . 8 } ; mat . Assemble ( i , j , v , // t u p l e ( i , j , v ) 11 , // s i z e o f t h e i n p u t a r r a y ”A” ) ; // name o f t h e matrix r h s . 
Assemble ( i , a , // a s s e m b l i n g data 11 , // s i z e o f t h e i n p u t a r r a y ” r h s ” ) ; // name o f t h e v e c t o r ”Matrix assembling” In this case the function will determine the matrix size, the number of non-zero elements, as well as the non-zero pattern and it will assemble the matrix in CSR format. If the size of the matrix is know, then the user can pass this information to the assembling function and thus it will avoid a loop for checking the indexes before the actual assembling procedure. 1 mat . Assemble ( i , j , a , 1 1 , ”A” , 6 , 6 ) ; ”Matrix assembling with known size” The matrix function has two implementations – serial and OpenMP parallel (host only). In many cases, one might want to assemble the matrix in a loop by modifying the original index pattern – typically for time-dependent/non-linear problems. In this case, the matrix does not need to be fully assembled, the user can use the AssembleUpdate() function to provide the new values. 1 2 3 4 5 6 7 8 9 10 11 12 13 L o c a l V e c t o r x , r h s ; LocalMatrix mat ; i n t i [ 1 1 ] = {0 , 1 , 2 , 3 , 4 , 5 , 1 , 2 , 4 , 3 , 3 } ; i n t j [ 1 1 ] = {0 , 1 , 2 , 3 , 4 , 5 , 2 , 2 , 5 , 2 , 2 } ; ValueType a [ 1 1 ] = { 2 . 3 , 3 . 5 , 4 . 7 , 6 . 3 , 0 . 4 , 4 . 3 , 6 . 2 , 4 . 4 , 4 . 6 , 0 . 7 , 4 . 8 } ; mat . Assemble ( i , j , a , 1 1 , ”A” ) ; r h s . Assemble ( i , a , 1 1 , ” r h s ” ) ; x . A l l o c a t e ( ”x” , mat . g e t n c o l ( ) ) ; mat . MoveToAccelerator ( ) ; r h s . MoveToAccelerator ( ) ; 38 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 CHAPTER 4. SINGLE-NODE COMPUTATION x . MoveToAccelerator ( ) ; GMRES, L o c a l V e c t o r , ValueType > l s ; l s . S e t O p e r a t o r ( mat ) ; l s . Build ( ) ; f o r ( i n t k=0; k <10; k++) { mat . Z e r o s ( ) ; x . Zeros () ; rhs . Zeros () ; // Modify t h e a s s e m b l i n g data a [ 4 ] = k∗k ∗ 3 . 3 ; a [ 8 ] = k∗k − 1 . 3 ; a [ 1 0 ] = k∗k / 2 . 0 ; a[3] = k; mat . AssembleUpdate ( a ) ; r h s . Assemble ( i , a , 1 1 , ” r h s ” ) ; l s . R e s e t O p e r a t o r ( mat ) ; l s . S o l v e ( rhs , &x ) ; } x . MoveToHost ( ) ; ”Several matrix assembling with static pattern” If the information for assembling is available (for calling the AssembleUpdate() function), then in the info() function will print asm=yes For the implementation of the assembly function we use an adopted and modified code based on [16]. Performance results for various cases can be found in [17]. 4.16 Check ValueType D,F,I,C Computation H (CSR-only) Available B,S,M Checks, if the object contains valid data via the Check() function. For vectors, the function checks if the values are not infinity and not NaN (not a number). For the matrices, this function checks the values and if the structure of the matrix is correct. 4.17 Sort ValueType D,F,I,C Computation H (CSR-only) Available B,S,M Sorts the column values in a CSR matrix via the Sort() function. 4.18 Keying ValueType D,F,I,C Computation H (CSR-only) Available S,M 4.19. GRAPH ANALYZERS 39 Typically it is hard to compare if two matrices have the same (structure and values) or they just have the same structure. To do this, we provide a key function which generates three keys, for the row index, column index and for the values. An example is presented in the following Listing. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 LocalMatrix mat ; mat . ReadFileMTX ( s t d : : s t r i n g ( argv [ 1 ] ) ) ; l o n g i n t row key ; long int col key ; long int val key ; mat . 
Key ( row key , col key , val key ) ; s t d : : c o u t << ”Row key = ” << row key << s t d : : e n d l << ” Col key = ” << c o l k e y << s t d : : e n d l << ” Val key = ” << v a l k e y << s t d : : e n d l ; ”Generating a matrix key” 4.19 Graph Analyzers ValueType D,F,I,C Computation H (CSR-only) Available B,S,M The following functions are available for analyzing the connectivity in graph of the underlying sparse matrix. • (R)CMK reordering • Maximal independent set • Multi-coloring • ZeroBlockPermutation • Connectivity ordering All graph analyzing functions return a permutation vector (int type) which is supposed to be used with the Permute() and PermuteBackward() functions in the matrix and vector classes. The CMK (Cuthill–McKee) and RCMK (Reverse Cuthill–McKee) orderings minimize the bandwidth of a given sparse matrix. 1 L o c a l V e c t o r cmk ; 2 3 mat .RCMK(&cmk ) ; 4 mat . Permute ( cmk ) ; ”CMK ordering” The Maximal independent set (MIS) algorithm finds a set with maximal size which contains elements that do not depend on other elements in this set. 1 2 3 4 5 6 L o c a l V e c t o r mis ; int size ; mat . MaximalIndependentSet ( s i z e , &mis ) ; mat . Permute ( mis ) ; ”MIS ordering” The Multi-coloring (MC) algorithm builds a permutation (coloring the matrix) in such way that no two adjacent nodes in the sparse matrix are having the same color. 40 1 2 3 4 5 6 7 8 9 CHAPTER 4. SINGLE-NODE COMPUTATION L o c a l V e c t o r mc ; i n t num colors ; i n t ∗ b l o c k c o l o r s = NULL; mat . M u l t i C o l o r i n g ( num colors , &b l o c k c o l o r s , &mc) ; mat . Permute (mc) ; ”MC ordering” For saddle-point problems (i.e. matrices with zero-diagonal entries) the ZeroBlockPermutation maps all zero-diagonal elements to the last block of the matrix. 1 2 3 4 5 6 L o c a l V e c t o r zbp ; int size ; mat . ZeroBlockPermutation ( s i z e , &zbp ) ; mat . Permute ( zbp ) ; ”Zero-Block-Permutation” Connectivity ordering returns a permutation which sorts the matrix by non-zero entries per row. 1 L o c a l V e c t o r conn ; 2 3 mat . C o n n e c t i v i t y O r d e r (&conn ) ; 4 mat . Permute ( conn ) ; ”Connectivity ordering” Visualization of the graph analyzers can be found in Chapter 13. 4.20 Basic Linear Algebra Operations There are more functions and routines which matrix and vector objects can perform - for further specifications and API, check the doxygen documentation. 
Matrix objects can also perform:

• Matrix-vector multiplication x := Ay – Apply()
• Matrix-vector multiplication and addition x := x + Ay – ApplyAdd()
• Symmetric permutation – Permute(); PermuteBackward()
• Sub-matrix / diagonal / L / U extractions – ExtractSubMatrix(); ExtractSubMatrices(); ExtractDiagonal(); ExtractInverseDiagonal(); ExtractU(); ExtractL(); ExtractColumnVector(); ExtractRowVector()
• Sparse triangular solves (L, U, LL, LU) – LSolve(); LLSolve(); LUSolve()
• Factorizations
  – ILU(0) – ILU with no fill-ins – ILU0()
  – ILU(p) – based on levels – ILUpFactorize()
  – ILU(p,q) – based on the power(q)-pattern method – via ILUpFactorize()
  – ILUT – based on a threshold – ILUTFactorize()
  – IC0 – Incomplete Cholesky with no fill-ins – ICFactorize()
• FSAI computation – FSAI()
• Symbolic power – compute the pattern of |A|^p, where |A| denotes the matrix of absolute values |a_ij| – SymbolicPower()
• Sparse matrix-matrix addition (with the same or different sparsity patterns) – MatrixAdd()
• Sparse matrix-matrix multiplication – MatrixMult()
• Sparse matrix multiplication with a diagonal matrix (left and right multiplication) – DiagonalMatrixMultL(); DiagonalMatrixMultR()
• Computation of the Gershgorin circles (eigenvalue approximations) – Gershgorin()
• Compression, i.e. deletion of elements below a specified threshold; this function also updates the structure – Compress()
• Transposition – Transpose()
• Creation of a restriction matrix via a restriction map – CreateFromMap()
• Scaling of all values / the diagonal / the off-diagonal entries with a scalar – Scale(); ScaleDiagonal(); ScaleOffDiagonal()
• Adding a scalar to all values / the diagonal / the off-diagonal entries – AddScalar(); AddScalarDiagonal(); AddScalarOffDiagonal()

Vector objects can also perform:

• Dot product (scalar product) – Dot()
• Scaling – Scale()
• Vector updates of the types x := x + ay, x := y + ax, x := ax + by, x := ax + by + gz – ScaleAdd(); AddScale(); ScaleAddScale(); ScaleAdd2()
• L1, L2 and Linf norms – Asum(); Norm(); Amax()
• Sum – Reduce()
• Point-wise (element-wise) multiplication of two vectors – PointWiseMult()
• Permutations (forward and backward) – Permute(); PermuteBackward()
• Copy with a given permutation – CopyFromPermute(); CopyFromPermuteBackward()
• Restriction/prolongation via a map – Restriction(); Prolongation()
• Initialization of the vector with random values – SetRandom()
• Power – Power()
• Copy from a double or float vector – CopyFromDouble(); CopyFromFloat()

4.21 Local Stencils

Instead of representing the linear operator as a matrix, the user can use stencil operators. The stencil operation has to be defined manually in each class and for each backend; this is necessary in order to obtain good performance in terms of spatial-blocking techniques and compiler optimization.

Figure 4.11: Code structure of a host LocalStencil (LocalStencil is derived from ParalutionObj and BaseParalution via Operator)

#include <iostream>
#include <cstdlib>

#include <paralution.hpp>

u s i n g namespace p a r a l u t i o n ; i n t main ( i n t argc , c h a r ∗ argv [ ] ) { init paralution () ; info paralution () ; 42 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 } CHAPTER 4. SINGLE-NODE COMPUTATION L o c a l V e c t o r x ; L o c a l V e c t o r r h s ; L o c a l S t e n c i l s t e n c i l ( Laplace2D ) ; s t e n c i l . S e t G r i d ( 1 0 0 ) ; // 100 x100 x . A l l o c a t e ( ”x” , s t e n c i l . g e t n r o w ( ) ) ; rhs . Alloc ate ( ” rhs ” , s t e n c i l . get nrow ( ) ) ; // L i n e a r S o l v e r CG, L o c a l V e c t o r , d o u b l e > l s ; r h s . Ones ( ) ; x . Zeros () ; l s . SetOperator ( s t e n c i l ) ; l s . Build ( ) ; stencil . info () ; double tick , tack ; tick = paralution time () ; l s . S o l v e ( rhs , &x ) ; tack = paralution time ( ) ; s t d : : c o u t << ” S o l v e r e x e c u t i o n : ” << ( tack−t i c k ) /1000000 << ” s e c ” << s t d : : e n d l ; stop paralution () ; return 0; ”Example of 2D Laplace stencil” Chapter 5 Multi-node Computation 5.1 Introduction In this chapter we describe all base objects (matrices and vectors) for computation on multi-node (distributedmemory) systems. Two typical configurations are presented on Figure 5.1 and Figure 5.2. Network Figure 5.1: An example for a multi-node configurations, all nodes are connected via network – single socket system with a single accelerator. Network Figure 5.2: An example for a multi-node configurations, all nodes are connected via network – dual socket systems with two accelerators attached to each node To each compute node, one or more accelerators can be attached. The compute node could be any kind of shared-memory (single, dual, quad CPU) system, details on a single-node can be found in Figure 4.1. Note that the memory of the accelerator and of the host are physically different. All nodes can communicate with each other via network. For the communication channel between the nodes (and between the accelerators on single or multiple nodes) we use the MPI library. Additional libraries for communication can be added on request. 43 44 CHAPTER 5. MULTI-NODE COMPUTATION The PARALUTION library supports non-overlapping type of distribution, where the computational domain is split into several sub-domain with the corresponding information about the boundary and ghost layers. An example is shown on Figure 5.3. The square box domain is distributed into four sub-domains. Each subdomain belongs to a process P 0, P 1, P 2 and P 3. P0 P1 P0 P1 P2 P3 P2 P3 Figure 5.3: An example for domain distribution To perform a sparse matrix-vector (SpMV) multiplication, each process need to multiply its own portion of the domain and update the corresponding ghost elements. For P 0 this multiplication reads Ax AI xI + AG xG = y, (5.1) = yI , (5.2) where I stands for interior and G stands for ghost, the xG is a vector with three sections coming from P 1, P 2 and P 3. The whole ghost part of the global vector is used mainly for the SpMV product and this part of the vector does not play any role in the computation of all vector-vector operations. 5.2 Code Structure Each object contains two local sub-objects. The global matrix stores the interior and the ghost matrix via local objects. The global vector also stores its data via two local objects. In addition to the local data, the global objects have information about the global communication via the parallel manager – see Figure 5.4. 
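Before turning to the code structure in Figure 5.4, the following minimal sketch illustrates the interior/ghost split of equation (5.2) with plain arrays: each process multiplies its interior block and then adds the contribution of the ghost block, both stored here in CSR form. This is illustrative C++ only, with hypothetical helper names, and is not the PARALUTION implementation.

#include <vector>
#include <algorithm>
#include <cstddef>

// Accumulate y += A*x for a CSR block (illustrative helper, not a PARALUTION API).
void spmv_csr_add(const std::vector<int>& row_offset, const std::vector<int>& col,
                  const std::vector<double>& val, const std::vector<double>& x,
                  std::vector<double>& y) {
  for (std::size_t i = 0; i < y.size(); ++i)
    for (int j = row_offset[i]; j < row_offset[i+1]; ++j)
      y[i] += val[j] * x[col[j]];
}

// y_I = A_I * x_I + A_G * x_G, following equation (5.2).
void ghosted_spmv(const std::vector<int>& int_ptr, const std::vector<int>& int_col,
                  const std::vector<double>& int_val,
                  const std::vector<int>& gst_ptr, const std::vector<int>& gst_col,
                  const std::vector<double>& gst_val,
                  const std::vector<double>& x_interior,
                  const std::vector<double>& x_ghost,   // values received from the neighbors
                  std::vector<double>& y_interior) {
  std::fill(y_interior.begin(), y_interior.end(), 0.0);
  spmv_csr_add(int_ptr, int_col, int_val, x_interior, y_interior); // interior part A_I * x_I
  spmv_csr_add(gst_ptr, gst_col, gst_val, x_ghost,    y_interior); // ghost part   A_G * x_G
}
"Interior/ghost SpMV split (illustration only)"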
Local Matrix/Vector New Backend Host Dynamic Switch Data/Mapping Data/Mapping Accelerator OpenMP Intel MIC OpenMP Data/Mapping Data/Mapping GPU, Accelerators OpenCL Data/Mapping GPU CUDA Data/Mapping Interior Local Matrix/Vector Ghost Local Matrix/Vector Mapping Figure 5.4: Global Matrices and Vectors 5.3 Parallel Manager The parallel manager class handles the communication and the mapping of the global matrix. Each global matrix and vector need to be initialized with a valid parallel manager in order to perform any operation. For many distributed simulation, the underlying matrix is already distributed. This information need to be passed to the parallel manager. 5.4. GLOBAL MATRICES AND VECTORS 45 The required information is: • Global size • Local size of the interior/ghost for each rank • Communication pattern (what information need to be sent to whom) 5.4 Global Matrices and Vectors The global matrices and vectors store their data via two local objects. For the global matrix, the interior can be access via the GetInterior() and GetGhost() functions, which point to two valid local matrices. The same function names exist for the the global vector, which point to two local vector objects. 5.4.1 Value Type and Formats The value type and the formats can be set in similar manner as in the local matrix. The supported formats are identical to the local case. 5.4.2 Matrix and Vector Operations Most functions which are available for the local vectors can be performed on the global vector objects as well. For the global matrix, only the SpMV function is supported but all other functions can be performed on the interior of the matrix (without the couplings via the ghost layer). 5.4.3 Asynchronous SpMV To minimize latency and to increase scalability the PARALUTION library supports asynchronous sparse matrixvector multiplication. The implementation of the SpMV starts with asynchronous transfer of the needed ghost buffers while at the same time it computes the interior matrix. When the computation of the interior SpMV is done, the ghost transfer is synchronized and the ghost SpMV is performed. To minimize the PCI-E bus, the OpenCL and CUDA implementation provide a special packaging technique for transferring all ghost data into a contiguous memory buffer. 5.5 Communication Channel For the communication channel we use MPI but this can be extended, on request, with some other communication mechanics. 5.6 I/O The user can store and load the full data from and to files. For a solver, the needed data would be: • Parallel manager • Sparse matrix • Vector The reading/writing of the data can be done fully in parallel without any communication. To visualize the data we use 4 × 4 grid which is distributed on 4 MPI processes (organized in 2 × 2), each local matrix store the local unknowns (with local index) – see Figure 5.5. The data associated with RANK 0 is presented on Figure 5.6. 5.6.1 File Organization When the parallel manager, global matrix or global vector are writing to a file, the main file (passed as a file name to this function) will contain information for all files on all ranks. 1 2 3 4 pm. dat . rank . 0 pm. dat . rank . 1 pm. dat . rank . 2 pm. dat . rank . 3 ”Parallel manager (main) file with 4 MPI ranks” 46 CHAPTER 5. 
Figure 5.5: An example of a 4 x 4 grid, distributed over 4 domains (2 x 2); the 16 global unknowns are split into four local blocks of 4 unknowns each

Figure 5.6: An example with 4 MPI processes and the data associated with RANK 0:

RANK = 0
GLOBAL_SIZE = 16
LOCAL_SIZE = 4
GHOST_SIZE = 4
BOUNDARY_SIZE = 4
NUMBER_OF_RECEIVERS = 2
NUMBER_OF_SENDERS = 2
RECEIVERS_RANK = {1, 2}
SENDERS_RANK = {1, 2}
RECEIVERS_INDEX_OFFSET = {0, 2, 4}
SENDERS_INDEX_OFFSET = {0, 2, 4}
BOUNDARY_INDEX = {1, 3, 2, 3}

matrix.mtx.interior.rank.0
matrix.mtx.ghost.rank.0
matrix.mtx.interior.rank.1
matrix.mtx.ghost.rank.1
matrix.mtx.interior.rank.2
matrix.mtx.ghost.rank.2
matrix.mtx.interior.rank.3
matrix.mtx.ghost.rank.3
"Matrix (main) file with 4 MPI ranks"

rhs.dat.rank.0
rhs.dat.rank.1
rhs.dat.rank.2
rhs.dat.rank.3
"Vector (main) file with 4 MPI ranks"

An example of the data in each file can be found in Section 15.7.

5.6.2 Parallel Manager

The data for each rank can be seen as two types of information: one for receiving data from the neighbors, presented in Figure 5.7 – here RANK 0 needs to know what type of data will be received and from whom; and one for sending – RANK 0 needs to know where and what to send, see Figure 5.8.

Figure 5.7: An example with 4 MPI processes, RANK 0 receives data (the associated data is marked in bold)

Figure 5.8: An example with 4 MPI processes, RANK 0 sends data (the associated data is marked in bold)

Receiving data – RANK 0 requires:

• The total size of the received information (GHOST_SIZE – integer value).
• The number of MPI ranks which will send data to RANK 0 (NUMBER_OF_RECEIVERS – integer value).
• Which MPI ranks send the data (RECEIVERS_RANK – integer array).
• How the received data (from each rank) is stored in the ghost vector (RECEIVERS_INDEX_OFFSET – integer array); in this example the first 2 elements [0, 2) are received from P1 and the next 2 elements [2, 4) from P2.

Sending data – RANK 0 requires:

• The total size of the sent information (BOUNDARY_SIZE – integer value).
• The number of MPI ranks which will receive data from RANK 0 (NUMBER_OF_SENDERS – integer value).
• Which MPI ranks receive the data (SENDERS_RANK – integer array).
• How the data for each rank is stored in the sending buffer (SENDERS_INDEX_OFFSET – integer array); in this example the first 2 elements [0, 2) are sent to P1 and the next 2 elements [2, 4) to P2.
• The elements which need to be sent (BOUNDARY_INDEX – integer array). In this example, the data which needs to be sent to P1 and P2 is the ghost layer marked as ghost of P0: the vertical stripe is sent to P1 and the horizontal stripe to P2.
The numbering of local unknowns (in local indexing) for P1 (the vertical stripes) are 1, 2 (size of 2) and they are stored in the BOUNDARY INDEX. After 2 elements, the elements for P2 are stored, they are 2, 3 (2 elements). 5.6.3 Matrices For each rank, two matrices are used – interior and ghost matrix. They can be stored in separate files, one for each matrix. The file format could be Matrix Market (MTX) or binary. 5.6.4 Vectors For each rank, the vector object needs only the local (interior) values. They are stored into a single file. The file could be ASCII or binary. 48 CHAPTER 5. MULTI-NODE COMPUTATION Chapter 6 Solvers In this chapter, all linear solvers are presented. Most of the iterative solvers can be performed on linear operators LocalMatrix, LocalStencil and GlobalMatrix – i.e. the iterative solvers can be performed locally (on a shared memory system) or in a distributed manner (on a cluster) via MPI. The only exception is the AMG (Algebraic Multigrid) which has two versions (one for the Local and one for the Global type of computation). The only pure local solvers (the one which does not support global/MPI operations) are the mixed-precision defect-correction solver, all direct solvers and the eigenvalue solvers. All solvers need three template parameters – Operators, Vectors and Scalar type. There are three possible combinations • LocalMatrix, LocalVector, float or double • LocalStencil, LocalVector, float or double • GlobalMatrix, GlobalVector, float or double where the Operators/Vectors need to use the same ValueType as the scalar for the solver. 6.1 Code Structure Variable Preconditioner Deflated PCG Block-Jacobi AS/RAS Solvers Multigrid AMG/GMG CPR Mixed-Precision Defect-Correction MultiElimination ILU Iterative Liner Solvers Chebyshev Iteration Fixed-Iteration Schemes Preconditioners Saddle-point Iteration Control ILU/IC CG, BiCGStab, GMRES, IDR, CR FSAI/SPAI Chebyshev MC-GS/SGS/SOR/SSOR Power(q)-pattern ILU Figure 6.1: Solver’s classes hierarchy The base class Solver is purely virtual, it provides an interface for: 49 50 CHAPTER 6. SOLVERS • SetOperator() – set the operator A - i.e. the user can pass the matrix here • Build() – build the solver (including preconditioners, sub-solvers, etc), the user need to specify the operator first • Solve() – solve the system Ax = b, the user need to pass a right-hand-side b and a vector x where the solution will be obtained • Print() – shows solver information • ReBuildNumeric() – only re-build the solver numerically (if possible), • MoveToHost() and MoveToAccelerator() – offload the solver (including preconditioners and sub-solvers) to the host/accelerator. The computation of the residual can be based on different norms (L1 , L2 and L∞ ). This is controlled by the function SetResidualNorm(). 
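Putting the interface above together, a minimal call sequence might look as follows. This is a sketch using the CG solver on a LocalMatrix (the available solver classes are shown in Figure 6.2); the file name is a placeholder.

#include <paralution.hpp>

using namespace paralution;

int main(int argc, char* argv[]) {

  init_paralution();

  LocalMatrix<double> mat;
  LocalVector<double> rhs, x;

  mat.ReadFileMTX("my_matrix.mtx");

  rhs.Allocate("rhs", mat.get_nrow());
  x.Allocate("x", mat.get_ncol());
  rhs.Ones();
  x.Zeros();

  CG<LocalMatrix<double>, LocalVector<double>, double> ls;

  ls.SetOperator(mat);  // 1. associate the operator A
  ls.Build();           // 2. build the solver (and any preconditioner)
  ls.Solve(rhs, &x);    // 3. solve A x = b (can be called several times)

  ls.Clear();           // release the solver data explicitly

  stop_paralution();

  return 0;
}
"Basic solver workflow (sketch)"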
Figure 6.2: Iterative solvers class hierarchy (BiCGStab, BiCGStab(l), CG, Chebyshev, CR, FCG, FGMRES, FixedPoint, GMRES, IDR and QMRCGStab, all derived from IterativeLinearSolver)

Figure 6.3: Multi-grid class hierarchy (MultiGrid derived from BaseMultiGrid and IterativeLinearSolver)

Figure 6.4: Direct solvers class hierarchy (Inversion, LU and QR derived from DirectLinearSolver)

6.2 Iterative Linear Solvers

The iterative solvers are controlled by an iteration control object which monitors the convergence properties of the solver, i.e. the maximum number of iterations, the relative tolerance, the absolute tolerance and the divergence tolerance. The iteration control can also record the residual history and store it in an ASCII file.

• Init(), InitMinIter(), InitMaxIter(), InitTol() – initialize the solver and set the stopping criteria
• RecordResidualHistory(), RecordHistory() – start the recording of the residual and store it into a file
• Verbose() – set the level of verbose output of the solver (0 – no output, 2 – detailed output, including residual and iteration information)
• SetPreconditioner() – set the preconditioner

6.3 Stopping Criteria

All iterative solvers are controlled based on:

• an absolute stopping criterion, when |r_k|_Lp < eps_abs
• a relative stopping criterion, when |r_k|_Lp / |r_1|_Lp <= eps_rel
• a divergence stopping criterion, when |r_k|_Lp / |r_1|_Lp >= eps_div
• a maximum number of iterations N, when k = N

where k is the current iteration, r_k is the residual of the current iteration k (i.e. r_k = b - A x_k), and r_1 is the starting residual (i.e. r_1 = b - A x_initguess). In addition, a minimum number of iterations M can be specified; in this case the solver will not stop iterating before k >= M.

CG<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> ls;

ls.Init(1e-10,  // abs tol
        1e-8,   // rel tol
        1e+8,   // div tol
        10000); // max iter

ls.SetOperator(mat);
ls.Build();

ls.Solve(rhs, &x);
"Setting different stopping criteria in a CG solver"

Lp is the norm used for the computation, where p can be 1, 2 or infinity. The norm computation can be set via the SetResidualNorm() function with 1 for L1, 2 for L2 and 3 for Linf. For the computation with the Linf norm, the index of the maximum value can be obtained via the GetAmaxResidualIndex() function.
If this function is called and the L∞ has not been selected then this function will return −1. The reached criteria can be obtained via GetSolverStatus(), which will return: • 0 – if no criteria has been reached yet • 1 – if absolute tolerance has been reached • 2 – if relative tolerance has been reached • 3 – if divergence tolerance has been reached • 4 – if maximum number of iteration has been reached 1 2 3 4 5 6 7 8 CG, L o c a l V e c t o r , ValueType > l s ; l s . S e t O p e r a t o r ( mat ) ; l s . SetResidualNorm ( 1 ) ; l s . Build ( ) ; l s . S o l v e ( rhs , &x ) ; ”CG solver with L1 norm” 52 1 2 3 4 5 6 7 8 9 10 11 12 CHAPTER 6. SOLVERS CG, L o c a l V e c t o r , ValueType > l s ; l s . S e t O p e r a t o r ( mat ) ; l s . SetResidualNorm ( 3 ) ; l s . Build ( ) ; l s . S o l v e ( rhs , &x ) ; s t d : : c o u t << ” Index o f t h e L \ i n f t y = ” << l s . GetAmaxResidualIndex ( ) << s t d : : end ; s t d : : c o u t << ” S o l v e r s t a t u s = ” << l s . G e t S o l v e r S t a t u s ( ) << s t d : : e n d l ; ”CG solver with L∞ norm” 6.4 Building and Solving Phase Each iterative solver consists of a building step and a solving step. During the building step all necessary auxiliary data is allocated and the preconditioner is constructed. After that the user can call the solving procedure, the solving step can be called several times. When the initial matrix associated with the solver is on the accelerator, the solver will try to build everything on the accelerator. However, some preconditioners and solvers (such as FSAI and AMG) need to be constructed on the host and then they are transferred to the accelerator. If the initial matrix is on the host and we want to run the solver on the accelerator then we need to move the solver to the accelerator as well as the matrix, the right-hand-side and the solution vector. Note that if you have a preconditioner associate with the solver, it will be moved automatically to the accelerator when you move the solver. In the following Listing we present these two scenarios. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 CG, L o c a l V e c t o r , ValueType > l s ; MultiColoredILU, L o c a l V e c t o r , ValueType > p ; 1 2 3 4 5 6 7 8 9 10 11 12 13 14 CG, L o c a l V e c t o r , ValueType > l s ; MultiColoredILU, L o c a l V e c t o r , ValueType > p ; mat . MoveToAccelerator ( ) ; r h s . MoveToAccelerator ( ) ; x . MoveToAccelerator ( ) ; l s . S e t O p e r a t o r ( mat ) ; l s . SetPreconditioner (p) ; // The s o l v e r and t h e p r e c o n d i t i o n e r w i l l be c o n s t r u c t e d on t h e a c c e l e r a t o r l s . Build ( ) ; // The s o l v i n g phase i s performed on t h e a c c e l e r a t o r l s . S o l v e ( rhs , &x ) ; ”An example for a preconditioend CG where the building phase and the solution phase are performed on the accelerator” l s . S e t O p e r a t o r ( mat ) ; l s . SetPreconditioner (p) ; // The s o l v e r and t h e p r e c o n d i t i o n e r w i l l be c o n s t r u c t e d on t h e h o s t l s . Build ( ) ; // Now we move a l l o b j e c t s ( i n c l u d i n g t h e s o l v e r and t h e p r e c o n d i t i o n e r // t o t h e a c c e l e r a t o r mat . MoveToAccelerator ( ) ; r h s . MoveToAccelerator ( ) ; 6.5. CLEAR FUNCTION AND DESTRUCTOR 15 16 17 18 19 20 53 x . MoveToAccelerator ( ) ; l s . MoveToAccelerator ( ) ; // The s o l v i n g phase i s performed on t h e a c c e l e r a t o r l s . 
Solve(rhs, &x);
"An example for a preconditioned CG where the building phase is performed on the host and the solution phase is performed on the accelerator"

6.5 Clear Function and Destructor

The Clear() function clears all data in the solver, including the associated preconditioner; afterwards the solver is no longer associated with this preconditioner. Note that the preconditioner is not deleted (via its destructor), only its Clear() is called. When the destructor of the solver class is called, it automatically calls the Clear() function. Be careful when declaring your solver and preconditioner in different scopes – we highly recommend calling the Clear() function of the solver manually and not relying on the destructor of the solver.

6.6 Numerical Update

ValueType: D,F,C; Building phase: H,C; Available: S,M

Some preconditioners require two phases in their construction: an algebraic one (e.g. computing a pattern or structure) and a numerical one (computing the actual values). In cases where the structure of the input matrix is constant (e.g. Newton-like methods) it is not necessary to fully re-construct the preconditioner. Instead, the user can apply a numerical update to the current preconditioner and pass the new operator with the ReBuildNumeric() function. If the preconditioner/solver does not support the numerical update, a full Clear() and Build() is performed.

6.7 Fixed-point Iteration

The fixed-point iteration is based on an additive splitting of the matrix, A = M + N. The scheme reads

x_{k+1} = M^{-1} (b - N x_k).                (6.1)

It can also be reformulated as a weighted defect-correction scheme

x_{k+1} = x_k - w M^{-1} (A x_k - b).        (6.2)

The inversion of M can be performed by preconditioners (Jacobi, Gauss-Seidel, ILU, etc.) or by any type of solver.

ValueType: D,F,C; Building phase: H,C,O,X; Solving phase: H,C,O,X; Available: B,S,M

FixedPoint<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> ls;
FixedPoint<LocalStencil<ValueType>, LocalVector<ValueType>, ValueType> ls;
FixedPoint<GlobalMatrix<ValueType>, GlobalVector<ValueType>, ValueType> ls;
"Fixed-Point declaration"

FixedPoint<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> fp;
Jacobi<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> p;

fp.SetOperator(mat);
ValueType D,F,C Building phase H,C,O,X Solving phase H,C,O,X Available B,S,M 1 CR, L o c a l V e c t o r , ValueType > l s ; 2 CR, L o c a l S t e n c i l , ValueType > l s ; 3 CR, GlobalVector , ValueType > l s ; ”CR declaration” 6.8.3 GMRES GMRES (Generalized Minimal Residual Method) is a method for solving non-symmetric problems. The pure GMRES solvers is based on restarting technique. The default size of the Krylov subspace for the GMRES is set to 30, it can be modify by SetBasisSize() function. ValueType D,F,C Building phase H,C,O,X Solving phase H,C,O,X Available B,S,M 1 GMRES, L o c a l V e c t o r , ValueType > l s ; 2 GMRES, L o c a l S t e n c i l , ValueType > l s ; 3 GMRES, GlobalVector , ValueType > l s ; ”GMRES declaration” 6.8. KRYLOV SUBSPACE SOLVERS 6.8.4 55 FGMRES FGMRES (Flexible Generalized Minimal Residual Method) is a method for solving non-symmetric problems. The Flexible GMRES solvers is based on a window shifting of the Krylov subspace. The default size of the Krylov subspace for the GMRES is set to 30, it can be modify by SetBasisSize() function. ValueType D,F,C Building phase H,C,O,X Solving phase H,C,O,X Available B,S,M 1 FGMRES, L o c a l V e c t o r , ValueType > l s ; 2 FGMRES, L o c a l S t e n c i l , ValueType > l s ; 3 FGMRES, GlobalVector , ValueType > l s ; ”FGMRES declaration” 6.8.5 BiCGStab BiCGStab (Bi-Conjugate Gradient Stabilized) is a method for solving non-symmetric problems. ValueType D,F,C Building phase H,C,O,X Solving phase H,C,O,X Available B,S,M 1 BiCGStab, L o c a l V e c t o r , ValueType > l s ; 2 BiCGStab, L o c a l S t e n c i l , ValueType > l s ; 3 BiCGStab, GlobalVector , ValueType > l s ; ”BiCGStab declaration” 6.8.6 IDR IDR (Induced Dimension Reduction) is a method for solving non-symmetric problems. The dimension of the shadow space for the IDR(s) method can be set by SetShadowSpace() function, the default value is 4. ValueType D,F,C Building phase H,C,O,X Solving phase H,C,O,X Available B,S,M 1 IDR, L o c a l V e c t o r , ValueType > l s ; 2 IDR, L o c a l S t e n c i l , ValueType > l s ; 3 IDR, GlobalVector , ValueType > l s ; ”IDR Solver” Note The orthogonal system in IDR method is based on random numbers, thus it is normal to obtain slightly different number of iterations every time you run the program. 6.8.7 FCG FCG (Flexible Conjugate Gradient) is a method for solving symmetric and positive definite matrices (SPD). The method can be preconditioned where the approximation should also be SPD. For additional informations see [41]. ValueType D,F,C Building phase H,C,O,X Solving phase H,C,O,X Available S,M 1 FCG, L o c a l V e c t o r , ValueType > l s ; 2 FCG, L o c a l S t e n c i l , ValueType > l s ; 3 FCG, GlobalVector , ValueType > l s ; ”FCG declaration” 56 6.8.8 CHAPTER 6. SOLVERS QMRCGStab QMRCGStab is a method for solving non-symmetric problems. More details are given in [34]. ValueType D,F,C Building phase H,C,O,X Solving phase H,C,O,X Available S,M 1 QMRCGStab, L o c a l V e c t o r , ValueType > l s ; 2 QMRCGStab, L o c a l S t e n c i l , ValueType > l s ; 3 QMRCGStab, GlobalVector , ValueType > l s ; ”QMRCGStab declaration” 6.8.9 BiCGStab(l) BiCGStab(l) (Bi-Conjugate Gradient Stabilized) is a method for solving non-symmetric problems. The degree l can be set via the function SetOrder(). More details can be found in [19]. 
ValueType D,F,C Building phase H,C,O,X Solving phase H,C,O,X Available S,M 1 BiCGStabl, L o c a l V e c t o r , ValueType > l s ; 2 BiCGStabl, L o c a l S t e n c i l , ValueType > l s ; 3 BiCGStabl, GlobalVector , ValueType > l s ; ”BiCGStab(l) declaration” Example 1 2 3 4 5 6 7 8 9 CG, L o c a l V e c t o r , ValueType > l s ; MultiColoredILU, L o c a l V e c t o r , ValueType > p ; l s . S e t O p e r a t o r ( mat ) ; l s . SetPreconditioner (p) ; l s . Build ( ) ; l s . S o l v e ( rhs , &x ) ; ”Preconditioned CG solver with ILU(0 1)” 6.9 Deflated PCG The Deflated Preconditioned Conjugate Gradient Algorithm (DPCG) is a two-level preconditioned CG algorithm to iteratively solve an ill-conditioned linear system, see [51]. ValueType Building phase Solving phase Available D,F H H,C,O,X B (GPLv3 only) Deflation attempts to remove the bad eigenvalues from the spectrum of the preconditioned matrix M −1 A. The preconditioned system can be ’deflated’ such that it is transformed into M −1 P Ax̂ = M −1 P b, (6.3) where P := I − AQ, Q := ZE −1 Z T and E := Z T AZ. Z is called the deflation sub-space matrix. Its columns are approximations of the eigenvectors corresponding to the bad eigenvalues remaining in the spectrum of the preconditioned system given in M −1 Ax = M −1 b. There can be many choices for Z. In the current implementation, we are using a construction scheme tuned for matrices arising from discretizing the pressure-correction equation in two-phase flow, details are discussed in [50]. 6.10. CHEBYSHEV ITERATION 1 2 3 4 5 6 7 57 DPCG, L o c a l V e c t o r , ValueType > l s ; // S e t t h e number o f d e f l a t i o n v e c t o r s l s . SetNVectors ( 2 ) ; l s . Build ( ) ; l s . S o l v e ( rhs , &x ) ; ”Deflated PCG” Note The number of the deflated vectors can be set via SetNVectors() function Note The dimension of the problem is hard-coded in the MakeZ COO(), MakeZ CSR() function in the solver code (dpcg.cpp, see function Build()) 6.10 Chebyshev Iteration The Chebyshev iteration (also known as acceleration scheme) is similar to the CG method but it requires the minimum and the maximum eigenvalue of the matrix. Additional information can be found in [10]. ValueType D,F,C 1 2 3 4 5 6 7 Building phase H,C,O,X Solving phase H,C,O,X Available B,S,M Chebyshev, L o c a l V e c t o r , ValueType > l s ; l s . S e t O p e r a t o r ( mat ) ; l s . S e t ( lambda min , lambda max ) ; l s . Build ( ) ; l s . S o l v e ( rhs , &x ) ; ”Chebyshev iteration” 6.11 Mixed-precision Solver The library provides mixed-precision solvers based on defect-correction scheme. The current implementation of the library is based on host correction in double precision and accelerator computation in single precision. The solver is based on the following scheme: xk+1 = xk + A−1 rk , (6.4) where the computation of the residual rk = b − Axk and of the update xk+1 = xk + dk are performed on the host in double precision. The computation of the residual system Adk = rk is performed on the accelerator in single precision. In addition to the setup functions of the iterative solver, the user need to specify the inner (for Adk = rk ) solver. ValueType D-F 1 2 3 4 5 6 7 8 9 10 Building phase H,C,O,X Solving phase H,C,O,X Available B,S,M MixedPrecisionDC, L o c a l V e c t o r , double , LocalMatrix , L o c a l V e c t o r , f l o a t > mp; CG, L o c a l V e c t o r , f l o a t > cg ; MultiColoredILU, L o c a l V e c t o r , f l o a t > p ; // s e t u p a l o w e r t o l f o r t h e i n n e r s o l v e r cg . 
S e t P r e c o n d i t i o n e r ( p ) ; cg . I n i t ( 1 e −5, 1 e −2, 1 e +20 , 100000) ; 58 11 12 13 14 15 16 17 18 CHAPTER 6. SOLVERS // s e t u p t h e mixed−p r e c i s i o n DC mp. S e t O p e r a t o r ( mat ) ; mp. S e t ( cg ) ; mp. B u i l d ( ) ; mp. S o l v e ( rhs , &x ) ; ”Mixed-precision solver” 6.12 Multigrid Solver The library provides algebraic multigrid as well as a skeleton for geometric multigrid methods. The BaseMultigrid class itself is not constructing the data for the method. It contains the solution procedure for V, W, K-cycles, for details see [53]. The AMG has two different versions for Local (non-MPI) and for Global (MPI) type of computations. 6.12.1 Geometric Multigrid ValueType D,F,C Building phase - Solving phase H,C,O,X Available B,S,M For the geometric multgrid the user need to pass all information for each level and for its construction. This includes smoothing step, prolongation/restriction, grid traversing and coarsest level solver. This data need to be passed to the solver: • Restriction and prolongation operations – they can be performed in two ways, based on Restriction() and Prolongation() function of the LocalVector class, or by matrix-vector multiplication. This is configured by a set function. • Smoothers – they can be of any iterative linear solver’s type. Valid options are Jacobi, Gauss-Seidel, ILU, etc. using a FixedPoint iteration scheme with pre-defined number of iterations. The smoothers could also be a solver such as CG, BiCGStab, etc. • Coarse grid solver – could be of any iterative linear solver type. The class also provides mechanisms to specify where the coarse grid solver has to be performed on the host or on the accelerator. The coarse grid solver can be preconditioned. • Grid scaling – computed based on a L2 norm ratio. • Operator matrices - the operator matrices need to be passed on each grid level • All objects need to be passed already initialized to the multigrid class. 6.12.2 Algebraic Multigrid Plain and Smoothed Aggregation ValueType D,F,C Building phase H Solving phase H,C,O,X Available B,S,M The algebraic multigrid solver (AMG) is based on the Multigrid class. The coarsening is obtained by different aggregation techniques. Currently, we support interpolation schemes based on aggregation [49] and smoothed aggregation [56]. The smoothers could be constructed inside or outside of the class. Detailed examples are given in the examples section. When building the AMG if not additional information is set, the solver is built based on its default values. 1 AMG, L o c a l V e c t o r , ValueType > l s ; 2 3 l s . S e t O p e r a t o r ( mat ) ; 6.12. MULTIGRID SOLVER 59 4 5 l s . Build ( ) ; 6 7 l s . S o l v e ( rhs , &x ) ; ”AMG as a standalone solver” All parameters can in the AMG can be set externally, including smoothers and coarse-grid solver - any type of solvers can be used. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 LocalMatrix mat ; // L i n e a r S o l v e r AMG, L o c a l V e c t o r , ValueType > l s ; l s . S e t O p e r a t o r ( mat ) ; // c o u p l i n g s t r e n g t h l s . SetCouplingStrength (0.001) ; // number o f unknowns on c o a r s e s t l e v e l l s . SetCoarsestLevel (300) ; // i n t e r p o l a t i o n type f o r g r i d t r a n s f e r o p e r a t o r s l s . S e t I n t e r p o l a t i o n ( SmoothedAggregation ) ; // R e l a x a t i o n parameter f o r smoothed i n t e r p o l a t i o n a g g r e g a t i o n l s . 
SetInterpRelax ( 2 . / 3 . ) ; // Manual s moo t h er s l s . SetManualSmoothers ( t r u e ) ; // Manual c o u r s e g r i d s o l v e r l s . SetManualSolver ( true ) ; // g r i d t r a n s f e r s c a l i n g l s . SetScaling ( true ) ; l s . BuildHierarchy () ; i n t l e v e l s = l s . GetNumLevels ( ) ; // Smoother f o r each l e v e l I t e r a t i v e L i n e a r S o l v e r , L o c a l V e c t o r , ValueType > ∗∗sm = NULL; MultiColoredGS, L o c a l V e c t o r , ValueType > ∗∗ g s = NULL; // Coarse Grid S o l v e r CG, L o c a l V e c t o r , ValueType > c g s ; c g s . Verbose ( 0 ) ; sm = new I t e r a t i v e L i n e a r S o l v e r , L o c a l V e c t o r , ValueType >∗[ l e v e l s − 1 ] ; g s = new MultiColoredGS, L o c a l V e c t o r , ValueType >∗[ l e v e l s − 1 ] ; // P r e c o n d i t i o n e r f o r t h e c o a r s e g r i d s o l v e r // MultiColoredILU, L o c a l V e c t o r , ValueType > p; 44 // cgs−>S e t P r e c o n d i t i o n e r ( p ) ; 45 46 f o r ( i n t i =0; i , L o c a l V e c t o r , ValueType > ∗ f p ; 48 f p = new FixedPoint , L o c a l V e c t o r , ValueType >; 49 fp−>S e t R e l a x a t i o n ( 1 . 3 ) ; 50 sm [ i ] = f p ; 51 60 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 CHAPTER 6. SOLVERS g s [ i ] = new MultiColoredGS, L o c a l V e c t o r , ValueType >; g s [ i ]−>SetPrecondMatrixFormat (ELL) ; sm [ i ]−> S e t P r e c o n d i t i o n e r ( ∗ g s [ i ] ) ; sm [ i ]−>Verbose ( 0 ) ; } ls ls ls ls ls ls ls . SetOperatorFormat (CSR) ; . SetSmoother (sm) ; . SetSolver ( cgs ) ; . SetSmootherPreIter (1) ; . SetSmootherPostIter (2) ; . I n i t ( 1 e −10 , 1 e −8, 1 e +8, 1 0 0 0 0 ) ; . Verbose ( 2 ) ; l s . Build ( ) ; l s . S o l v e ( rhs , &x ) ; ”AMG with manual settings” The AMG can be used also as a preconditioner within a solver. 1 2 3 4 5 6 7 8 9 10 11 12 13 CG, L o c a l V e c t o r , ValueType > l s ; // AMG P r e c o n d i t i o n e r AMG, L o c a l V e c t o r , ValueType > p ; p . InitMaxIter (2) ; p . Verbose ( 0 ) ; l s . SetPreconditioner (p) ; l s . S e t O p e r a t o r ( mat ) ; l s . Build ( ) ; l s . S o l v e ( rhs , &x ) ; ”AMG as a preconditioner” Ruge-Stueben The classic Ruge-Stueben coarsening algorithm is implemented following the [49]. ValueType D,F,C Building phase H Solving phase H,C,O,X Available S,M The solver provides high-efficiency in terms of complexity of the solver (i.e. number of iterations). However, most of the time it has a higher building step and requires higher memory usage. Pair-wise The pairwise aggregation scheme is based on [42]. ValueType D,F,C Building phase H Solving phase H,C,O,X Available S,M The pairwise AMG delivers very efficient building phase which is suitable for Poisson-like equation. Most of the time it requires K-cycle for the solving phase to provide low number of iterations. 6.13. DIRECT LINEAR SOLVERS 61 Global AMG (MPI) The global AMG is based on a variation of the pairwise aggregation scheme, using the MPI standard for inter-node communication. ValueType D,F,C Building phase H Solving phase H,C,O,X Available M The building and the solving phase are fully MPI parallel. This solver works well with updating technique suitable for time-dependent problems (i.e. time-stepping schemes) which decrease significantly the building phase. 6.13 Direct Linear Solvers The library provides three direct methods – LU, QR and full inversion (based on QR decomposition). 
The user can pass a sparse matrix; internally it will be converted to a dense format and then the selected method will be applied. Because of this dense conversion, these methods are not efficient for large problems and should be used only for very small matrices.

Inversion<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> ls;

ls.SetOperator(mat);
ls.Build();

ls.Solve(rhs, &x);

"A direct solver"

Note These methods work only for Local-type problems (no distributed problems).

6.14 Eigenvalue Solvers

6.14.1 AMPE-SIRA

The Shift Invert Residual Arnoldi (SIRA) method is an eigenvalue solver which computes a subset of the eigenvalues of a matrix, namely the eigenvalues closest to zero of a symmetric positive definite (SPD) matrix. Details can be found in [27, 28, 32, 33].

ValueType: D,D-F | Building phase: H,C | Solving phase: H,C,O | Available: B (GPLv3 only)

SIRA<LocalMatrix<double>, LocalVector<double>, double> sira;

sira.SetOperator(mat);

// Set the number of target eigenvalues
sira.SetNumberOfEigenvalues(3);

// Set preconditioner of the inner linear system
MultiColoredILU<LocalMatrix<double>, LocalVector<double>, double> p;
sira.SetInnerPreconditioner(p);

sira.Build();

sira.Solve(&eigenvalues);

"SIRA solver with 3 target eigenvalues"

An adaptive mixed precision (AMP) scheme can be applied to the SIRA method, where the stopping criterion of the inner loop is inspired by Hochstenbach and Notay (HN criterion, see [22]). In this case, the declaration takes two different value types.

SIRA<LocalMatrix<double>, LocalVector<double>, double,
     LocalMatrix<float>, LocalVector<float>, float> ampe_sira;

ampe_sira.SetOperator(mat);
ampe_sira.SetNumberOfEigenvalues(3);

// Set preconditioners of the inner linear system
MultiColoredILU<LocalMatrix<double>, LocalVector<double>, double> p1;
MultiColoredILU<LocalMatrix<float>, LocalVector<float>, float> p2;
ampe_sira.SetInnerPreconditioner(p1, p2);

ampe_sira.Build();
ampe_sira.Init(1e-10, 200, 0.8);
ampe_sira.Solve(&eigenvalues);

"SIRA solver with adaptive mixed precision scheme"

The call Init(1e-10, 200, 0.8) after ampe_sira.Build() is optional and configures the iteration controller of the SIRA solver. A parameter keeps its default value when it is set to zero. The first argument of Init() sets the absolute tolerance for the eigenvalues; the default value is 1e-7. The second argument sets the maximum number of outer loop iterations for solving one eigenvalue; the default value is 300, and the final number is the set value times the number of target eigenvalues. The third argument sets the maximum number of inner loop iterations; the default is 100000. If the value is in (0, 1], the maximum number becomes the value times the dimension of the operator; if it is in (1, ∞), it is used directly.

If there is a special starting vector for the SIRA solver, say vec, it can be passed as the first argument of sira.Solve(), i.e. sira.Solve(vec, &eigenvalues).

Note The stopping criterion of SIRA without the mixed precision scheme is by default a constant tolerance. To apply the HN criterion, add sira.SetInnerStoppingCriterion(1) before sira.Build().
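As a minimal sketch of the note above, enabling the HN criterion only changes the setup order around Build(); the rest of the SIRA configuration stays as in the previous listing (the argument value 1 is taken from the note, everything else is unchanged):

// switch the inner stopping criterion to the HN criterion
// before the solver is built
sira.SetInnerStoppingCriterion(1);
sira.Build();

sira.Solve(&eigenvalues);

"Enabling the HN criterion for the SIRA solver"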
Note The preconditioner in the inner linear system of SIRA is set by default FSAI level 1. Chapter 7 Preconditioners In this chapter, all preconditioners are presented. All preconditioners support local operators. They can be used as a global preconditioner via block-jacobi scheme which works locally on each interior matrix. To provide fast application, all preconditioners require extra memory to keep the approximated operator. 7.1 Code Structure The preconditioners provide a solution to the system M z = r, where either the solution z is directly computed by the approximation scheme or it is iterativily obtained with z = 0 initial guess. Variable Preconditioner Deflated PCG Block-Jacobi AS/RAS Solvers Multigrid AMG/GMG CPR Mixed-Precision Defect-Correction MultiElimination ILU Iterative Liner Solvers Chebyshev Iteration Fixed-Iteration Schemes Preconditioners Saddle-point Iteration Control ILU/IC CG, BiCGStab, GMRES, IDR, CR FSAI/SPAI Chebyshev MC-GS/SGS/SOR/SSOR Power(q)-pattern ILU Figure 7.1: Solver’s classes hierarchy 7.2 Jacobi The Jacobi preconditioner is the simplest parallel preconditioner, details can be found in [46, 10, 15]. ValueType D,F,C Building phase H,C Solving phase H,C,O,X Available B,S,M 63 64 CHAPTER 7. PRECONDITIONERS paralution::ParalutionObj paralution::Solver< OperatorType, VectorType, ValueType > paralution::Preconditioner< OperatorType, VectorType, ValueType > paralution::AIChebyshev< OperatorType, VectorType, ValueType > paralution::AS< OperatorType, VectorType, ValueType > paralution::BlockJacobi< OperatorType, VectorType, ValueType > paralution::BlockPreconditioner< OperatorType, VectorType, ValueType > paralution::CPR< OperatorType, VectorType, ValueType > paralution::DiagJacobiSaddlePointPrecond< OperatorType, VectorType, ValueType > paralution::FSAI< OperatorType, VectorType, ValueType > paralution::GS< OperatorType, VectorType, ValueType > paralution::IC< OperatorType, VectorType, ValueType > paralution::ILU< OperatorType, VectorType, ValueType > paralution::ILUT< OperatorType, VectorType, ValueType > paralution::Jacobi< OperatorType, VectorType, ValueType > paralution::MultiColored< OperatorType, VectorType, ValueType > paralution::MultiElimination< OperatorType, VectorType, ValueType > paralution::SGS< OperatorType, VectorType, ValueType > paralution::SPAI< OperatorType, VectorType, ValueType > paralution::TNS< OperatorType, VectorType, ValueType > paralution::VariablePreconditioner< OperatorType, VectorType, ValueType > Figure 7.2: Preconditioners 1 J a c o b i , L o c a l V e c t o r , ValueType > p ; 2 3 l s . SetPreconditioner (p) ; 4 l s . Build ( ) ; ”Jacobi preconditioner” 7.3 Multi-colored (Symmetric) Gauss-Seidel and SOR ValueType D,F,C Building phase H,C, Solving phase H,C,O,X Available B,S,M The additive preconditioners are based on the splitting of the original matrix. Higher parallelism in solving the forward and backward substitution is obtained by performing a multi-colored decomposition. Details can be found in [35, 46]. 1 MultiColoredSGS, L o c a l V e c t o r , ValueType > p ; 2 3 l s . SetPreconditioner (p) ; 4 l s . Build ( ) ; ”Multi-colored symmetric Gauss-Seidel preconditioner” 7.4. INCOMPLETE LU WITH LEVELS – ILU(P ) 1 2 3 4 5 6 65 MultiColoredGS, L o c a l V e c t o r , ValueType > p ; p . SetRelaxation (1 .6 ) ; l s . SetPreconditioner (p) ; l s . Build ( ) ; ”Multi-colored SOR with relaxation parameter 1.6” 7.4 Incomplete LU with levels – ILU(p) ILU(p) factorization based on p-levels. Details can be found in [46]. 
ValueType D,F,C 1 2 3 4 5 Building phase H Solving phase H,C Available B,S,M ILU, L o c a l V e c t o r , ValueType > p ; p . Set (1) ; l s . SetPreconditioner (p) ; l s . Build ( ) ; ”ILU(1) preconditioner - based on levels” 7.5 Incomplete Cholesky – IC IC factorization without additional fill-ins. Details are given in [46]. ValueType D,F,C Building phase H Solving phase H,C Available B,S,M 1 IC, L o c a l V e c t o r , ValueType > p ; 2 3 l s . SetPreconditioner (p) ; 4 l s . Build ( ) ; ”IC preconditioner” Note This implementation is still experimental and it is highly recommended to use the ILU preconditioner instead. 7.6 Incomplete LU with threshold – ILUT(t,m) Incomplete LU (ILUT(t,m)) factorization based on threshold (t) and maximum number of elements per row (m). Details can be found in [46]. The preconditioner can be initialized with the threshold value only or with threshold and maximum number of elements per row. ValueType D,F,C 1 2 3 4 5 Building phase H Solving phase H,C Available B,S,M ILUT, L o c a l V e c t o r , ValueType > p ; p . Set ( 0 . 0 1 ) ; l s . SetPreconditioner (p) ; l s . Build ( ) ; ”ILUT(0.01) preconditioner” Note This implementation is still experimental. 66 CHAPTER 7. PRECONDITIONERS 7.7 Power(q)-pattern method – ILU(p,q) ILU(p,q) is based on the ILU(p) factorization with a power(q)-pattern method, the algorithm can be found in[35]. This method provides a higher degree of parallelism of forward and backward substitution compared to the standard ILU(p) preconditioner. ValueType D,F,C Building phase H,C Solving phase H,C,O,X Available B,S,M Note If the preconditioner is initialized with only the first argument (p) then q is taken to be p + 1. 1 2 3 4 5 MultiColoredILU, L o c a l V e c t o r , ValueType > p ; p . Set (1 , 2) ; l s . SetPreconditioner (p) ; l s . Build ( ) ; ”ILU(1 2) preconditioner - based on power(q)-pattern method” 7.8 Multi-Elimination ILU The multi-elimination ILU preconditioner is based on the following decomposition       D F D F I 0 A= = × , E C ED−1 I 0  (7.1) where  = C − ED−1 F . To make the inversion of D easier, we permute the preconditioning before the factorization with a permutation P to obtain only diagonal elements in D. The permutation here is based on a maximal independent set. Details can be found in [46]. This procedure can be applied to the block matrix Â, in this way we can perform the factorization recursively. In the last level of the recursion, we need to provide a solution procedure. By the design of the library, this can be any kind of solver. In the following example we build a preconditioned CG solver with a multi-elimination preconditioner defined with 2 levels and without drop-off tolerance. The last block of preconditioner is solved using a Jacobi preconditioner. ValueType D,F,C 1 2 3 4 5 6 7 8 9 10 11 Building phase H,C Solving phase H,C,O Available B,S,M CG, L o c a l V e c t o r , ValueType > cg ; M u l t i E l i m i n a t i o n , L o c a l V e c t o r , ValueType > p ; J a c o b i , L o c a l V e c t o r , ValueType > j p ; p . Set ( j p , 2) ; cg . S e t O p e r a t o r ( mat ) ; cg . S e t P r e c o n d i t i o n e r ( p ) ; cg . B u i l d ( ) ; cg . S o l v e ( rhs , &x ) ; ”PCG solver with multi-elimination ILU preconditioner and Jacobi” 7.9 Diagonal Preconditioner for Saddle-Point Problems Consider the following saddle-point problem  A= K E F 0  . (7.2) 7.10. 
CHEBYSHEV POLYNOMIAL 67 For such problems we can construct a diagonal Jacobi-like preconditioner of type   K 0 P = 0 S (7.3) with S = ED−1 F , where D are the diagonal elements of K. The matrix S is fully constructed (via sparse matrix-matrix multiplication). The preconditioner needs to be initialized with two external solvers/preconditioners – one for the matrix K and one for the matrix S. ValueType D,F,C Building phase H,C Solving phase H,C,O,X Available B,S,M 1 GMRES, L o c a l V e c t o r , ValueType > gmres ; 2 3 D i a g J a c o b i S a d d l e P o i n t P r e c o n d , L o c a l V e c t o r , ValueType > p ; 4 MultiColoredILU, L o c a l V e c t o r , ValueType > p k ; 5 MultiColoredILU, L o c a l V e c t o r , ValueType > p s ; 6 7 p . Set ( p k , p s ) ; 8 9 gmres . S e t O p e r a t o r ( mat ) ; 10 gmres . S e t B a s i s S i z e ( 5 0 ) ; 11 gmres . S e t P r e c o n d i t i o n e r ( p ) ; 12 gmres . B u i l d ( ) ; 13 14 gmres . S o l v e ( rhs , &x ) ; ”GMRES solver with diagonal Jacobi-like preconditioner for Saddle-Point problems” 7.10 Chebyshev Polynomial The Chebyshev approximate inverse matrix is based on the work [14]. ValueType D,F,C 1 2 3 4 5 6 7 Building phase H,C Solving phase H,C,O,X Available B,S,M Chebyshev, L o c a l V e c t o r , ValueType > l s ; l s . S e t O p e r a t o r ( mat ) ; l s . S e t ( lambda min , lambda max ) ; l s . Build ( ) ; l s . S o l v e ( rhs , &x ) ; ”Chebyshev polynomial preconditioner” 7.11 FSAI(q) The Factorized Sparse Approximate Inverse preconditioner computes a direct approximation of M −1 by minimizing the Frobenius norm ||I − GL||F where L denotes the exact lower triangular part of A and G := M −1 . The q FSAI preconditioner is initialized by q, based on the sparsity pattern of |A| [35]. However, it is also possible to supply external sparsity patterns in form of the LocalMatrix class. Further details on the algorithm are given in [31]. 68 CHAPTER 7. PRECONDITIONERS ValueType D,F,C 1 2 3 4 5 Building phase H Solving phase H,C,O,X Available B,S,M FSAI, L o c a l V e c t o r , ValueType > p ; p . Set (2) ; l s . SetPreconditioner (p) ; l s . Build ( ) ; ”FSAI(2) preconditioner” Note The FSAI(q) preconditioner is only suited for SPD matrices. 7.12 SPAI The SParse Approximate Inverse algorithm is an explicitly computed preconditioner for general sparse linear systems. In its current implementation, only the sparsity pattern of the system matrix is supported. The SPAI computation is based on the minimization of the Frobenius norm ||AM − I||F . Details can be found in [21]. ValueType D,F,C Building phase H Solving phase H,C,O,X Available B,S,M 1 SPAI, L o c a l V e c t o r , ValueType > p ; 2 3 l s . SetPreconditioner (p) ; 4 l s . Build ( ) ; ”SPAI preconditioner” Note The SPAI implementation is still experimental. The current version is based on a original static matrix pattern (similar to ILU0). 7.13 Block Preconditioner When handling vector fields, typically one can try to use different preconditioners/solvers for the different blocks. For such problems, the library provides a block-type preconditioner. This preconditioner builds the following block-type matrix  Ad  B1 P =  . Z1 0 Bd . Z2 . . . .  0 0   .  Zd (7.4) The solution of P can be performed in two ways. It can be solved by block-lower-triangular sweeps with inversion of the blocks Ad ...Zd and with a multiplication of the corresponding blocks, this is set by the function SetLSolver() (which is the default solution scheme). 
Alternatively, it can be used only with an inverse of the diagonal Ad ...Zd (such as Block-Jacobi type) by using the function SetDiagonalSolver(). ValueType D,F,C Building phase H,C Solving phase H,C,O,X Available B,S,M 1 GMRES, L o c a l V e c t o r , ValueType > l s ; 2 3 B l o c k P r e c o n d i t i o n e r , L o c a l V e c t o r , 4 ValueType > p ; 5 S o l v e r , L o c a l V e c t o r , ValueType > ∗∗ p2 ; 6 7.13. BLOCK PRECONDITIONER 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 int n = 2; int ∗ size ; s i z e = new i n t [ n ] ; p2 = new S o l v e r , L o c a l V e c t o r , ValueType >∗[n ] ; f o r ( i n t i =0; i , L o c a l V e c t o r , ValueType > ∗mc ; mc = new MultiColoredILU, L o c a l V e c t o r , ValueType >; p2 [ i ] = mc ; } p . S e t ( n , s i z e , p2 ) ; l s . S e t O p e r a t o r ( mat ) ; l s . SetPreconditioner (p) ; l s . Build ( ) ; l s . S o l v e ( rhs , &x ) ; ”Block Preconditioner for two MC-ILU” 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 GMRES, L o c a l V e c t o r , ValueType > l s ; B l o c k P r e c o n d i t i o n e r , L o c a l V e c t o r , ValueType > p ; S o l v e r , L o c a l V e c t o r , ValueType > ∗∗ p2 ; int n = 2; int ∗ size ; s i z e = new i n t [ n ] ; p2 = new S o l v e r , L o c a l V e c t o r , ValueType >∗[n ] ; // Block 0 s i z e [ 0 ] = mat . g e t n r o w ( ) / n ; MultiColoredILU, L o c a l V e c t o r , ValueType > ∗mc ; mc = new MultiColoredILU, L o c a l V e c t o r , ValueType >; 18 p2 [ 0 ] = mc ; 19 20 // Block 1 21 s i z e [ 1 ] = mat . g e t n r o w ( ) / n ; 22 AMG, L o c a l V e c t o r , ValueType > ∗amg ; 23 amg = new AMG, L o c a l V e c t o r , ValueType >; 24 amg−>I n i t M a x I t e r ( 2 ) ; 25 amg−>Verbose ( 0 ) ; 26 p2 [ 1 ] = amg ; 27 28 29 p . S e t ( n , s i z e , p2 ) ; 30 p . S e t D i a g o n a l S o l v e r ( ) ; 31 32 l s . S e t O p e r a t o r ( mat ) ; 33 l s . S e t P r e c o n d i t i o n e r ( p ) ; 69 70 CHAPTER 7. PRECONDITIONERS 34 35 l s . B u i l d ( ) ; 36 37 l s . S o l v e ( rhs , &x ) ; ”Block Preconditioner for MC-ILU and AMG” 7.14 Additive Schwarz and Restricted Additive Schwarz – AS and RAS As a preconditioning technique, we can decompose the linear system Ax = b into small sub-problems based on Ai = RiT ARi (7.5) where Ri are restriction operators. Thus, we can define: • Additive Schwarz (AS) – the restriction operators produce sub-matrices which overlap. This leads to contributions from two preconditioners on the overlapped area, see Figure 7.3 (left). Thus these contribution sections are scaled by 1/2. • Restricted Additive Schwarz (RAS) – this is a mixture of the pure block-Jacobi and the additive Schwarz scheme. Again, the matrix A is decomposed into squared sub-matrices. The sub-matrices are large as in the additive Schwartz approach – they include overlapped areas from other blocks. After we solve the preconditioning sub-matrix problems, we provide solutions only to the non-overlapped area, see Figure 7.3 (right). ValueType D,F,C Building phase H,C Solving phase H,C,O,X Available B,S,M M1 M1 M2 M2 M3 M3 M4 M4 Figure 7.3: Example of a 4 block-decomposed matrix – Additive Schwarz (AS) with overlapping preconditioner (left) and Restricted Additive Schwarz (RAS) preconditioner (right) Details can be found in [46, 11, 45, 47, 52]. The library provides Additive Schwarz (called AS ) and Restricted Additive Schwarz (called RAS ) preconditioners. 
For both preconditioners, the user need to to specify the number of blocks, the size of the overlapping region and the type of preconditioners which should be used on the blocks. 1 2 3 4 5 6 7 8 9 // L i n e a r S o l v e r GMRES, L o c a l V e c t o r , ValueType > l s ; // P r e c o n d i t i o n e r // AS, L o c a l V e c t o r , ValueType > p ; RAS, L o c a l V e c t o r , ValueType > p ; // Second l e v e l p r e c o n d i t i o n e r s S o l v e r , L o c a l V e c t o r , ValueType > ∗∗ p2 ; 7.15. TRUNCATED NEUMANN SERIES PRECONDITIONER (TNS) 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 71 int n = 3; int ∗ size ; s i z e = new i n t [ n ] ; p2 = new S o l v e r , L o c a l V e c t o r , ValueType >∗[n ] ; f o r ( i n t i =0; i , L o c a l V e c t o r , ValueType > ∗mc ; mc = new MultiColoredILU, L o c a l V e c t o r , ValueType >; p2 [ i ] = mc ; } p . Set (n , 20 , p2 ) ; l s . S e t O p e r a t o r ( mat ) ; l s . SetPreconditioner (p) ; l s . Build ( ) ; l s . S o l v e ( rhs , &x ) ; ”(Restricted) Additive Schwarz defined with 3 blocks and 10 overlapping elements” 7.15 Truncated Neumann Series Preconditioner (TNS) The Truncated Neumann Series Preconditioner (TNS) is based on M −1 = K T D−1 K, (7.6) where K = (I − LD−1 + (LD−1 )2 ), D is the diagonal matrix of A and L is strictly the lower triangular part of A. The preconditioner can be computed in two forms - explicitly and implicitly. In the implicit form, the full construction of M is performed via matrix-matrix operations. Whereas in the explicit from, the application of the preconditioner is based on matrix-vector operations only. The matrix format for the stored matrices can be specified. ValueType D,F,C 1 2 3 4 5 6 7 8 9 10 11 12 Building phase H,C Solving phase H,C,O,X Available B,S,M CG, L o c a l V e c t o r , d o u b l e > l s ; TNS, L o c a l V e c t o r , d o u b l e > p ; // E x p l i c i t o r i m p l i c i t // p . S e t ( f a l s e ) ; l s . S e t O p e r a t o r ( mat ) ; l s . SetPreconditioner (p) ; l s . Build ( ) ; 72 13 CHAPTER 7. PRECONDITIONERS l s . S o l v e ( rhs , &x ) ; ”Truncated Neumann Series Preconditioner (TNS)” 7.16 Variable Preconditioner The Variable Preconditioner can hold a selection of preconditioners. In this way any type of preconditioners can be combined. As example, the variable preconditioner can combine Jacobi, GS and ILU – then, the first iteration of the iterative solver will apply Jacobi, the second iteration will apply GS and the third iteration will apply ILU. After that, the solver will start again with Jacobi, GS, ILU. It is important to be used with solvers which can handle such type of preconditioners. ValueType D,F 7.17 Building phase H,C,O Solving phase H,C,O Available S,M CPR Preconditioner This preconditioner is still under development and it is not part of the official distributions The CPR (Constrained Pressure Residual) preconditioner is a special type of preconditioner for reservoir simulations. The preconditioner contains two (or three) stage sub-preconditioners. ValueType D,F,C Building phase H,C,O,X Solving phase H,C,O,X Available S,M The user has the ability to select any combination of the two (or three) sub-preconditioners, as well as to use the default ILU-like and AMG configuration. 7.18 MPI Preconditioners The MPI preconditioners are designed to work in parallel on all MPI processes. 
In this way, any type of preconditioner can be wrapped in a Block-Jacobi type, where the preconditioner will be applied locally on each interior matrix. ValueType D,F,C Building phase H,C,O Solving phase H,C,O Available M Chapter 8 Backends The support of accelerator devices is embedded in the structure of PARALUTION. The primary goal is to use this technology whenever possible to decrease the computational time. Note Not all functions are ported and presented on the accelerator backend. This limited functionality is natural since not all operations can be performed efficiently on the accelerators (e.g. sequential algorithms, I/O from the file system, etc). 8.1 Backend and Accelerators Currently, the library supports CUDA GPU [43], OpenCL [29] and MIC [26] devices. Due to its design the library can be extended to support Cilk [40], Intel TBB [25] backends. The extension of the library will not reflect the algorithms which are based on it. If a particular function is not implemented for the used accelerator, the library will move the object to the host and compute the routine there. In this a case warning message of level 2 will be printed. For example if we use an OpenCL backend and we want to perform an ILU(0) factorization which is currently not available, the library will move the object to the host, perform the routine there and print the following warning message: 1 ∗∗∗ warning : L o c a l M a t r i x : : I L U 0 F a c t o r i z e ( ) i s performed on t h e h o s t ”Warning message for performing the ILU(0) factorization on the host” 8.2 Copy All matrices and vectors have a CopyFrom() function which can be used to transfer data from and to the accelerator. 1 2 3 4 5 6 7 8 9 10 11 12 13 L o c a l V e c t o r vec1 , vec2 ; // vec1 i s on t h e h o s t vec1 . MoveToHost ( ) ; // vec2 i s on t h e a c c e l r a t o r vec2 . MoveToAcclerator ( ) ; // copy vec2 t o vec1 vec1 . CopyFrom ( vec2 ) ; // copy vec1 t o vec2 vec2 . CopyFrom ( vec1 ) ; ”Copying data to and from the accelerator” 8.3 CloneBackend When creating new objects, often the user has to ensure that it is allocated on the same backend as already existing objects. This can be achieved via the CloneBackend function. For example, consider a matrix mat and a vector vec. If a SpMV operation should be performed, both objects need to be on the same backend. This can 73 74 CHAPTER 8. BACKENDS be achieved by calling vec.CloneBackend(mat). In this way, the vector will have the same backend as the matrix. Analoguely, mat.CloneBackend(vec) can be called. Then, the matrix will end up with the same backend as the vector. 8.4 Moving Objects To and From the Accelerator All object in PARALUTION can be moved to the accelerator and to the host. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 LocalMatrix mat ; L o c a l V e c t o r vec1 , vec2 ; // P e r f o r m i n g t h e matrix−v e c t o r m u l t i p l i c a t i o n on h o s t mat . Apply ( vec1 , &vec2 ) ; // Move data t o t h e a c c e l e r a t o r mat . MoveToAccelerator ( ) ; vec1 . MoveToAccelerator ( ) ; vec2 . MoveToAccelerator ( ) ; // P e r f o r m i n g t h e matrix−v e c t o r m u l t i p l i c a t i o n on a c c e l e r a t o r mat . Apply ( vec1 , &vec2 ) ; // Move data t o t h e h o s t mat . MoveToHost ( ) ; vec1 . MoveToHost ( ) ; vec2 . MoveToHost ( ) ; ”Using an accelerator for sparse matrix-vector multiplication” 1 2 3 4 5 6 7 8 9 10 11 12 13 14 CG, L o c a l V e c t o r , ValueType > l s ; MultiColoredILU, L o c a l V e c t o r , ValueType > p ; l s . 
S e t O p e r a t o r ( mat ) ; l s . SetPreconditioner (p) ; l s . Build ( ) ; mat . MoveToAccelerator ( ) ; r h s . MoveToAccelerator ( ) ; x . MoveToAccelerator ( ) ; l s . MoveToAccelerator ( ) ; l s . S o l v e ( rhs , &x ) ; ”Using an accelerator for preconditioned CG solver (building phase on the host)” 1 2 3 4 5 6 7 8 9 10 11 12 13 CG, L o c a l V e c t o r , ValueType > l s ; MultiColoredILU, L o c a l V e c t o r , ValueType > p ; mat . MoveToAccelerator ( ) ; r h s . MoveToAccelerator ( ) ; x . MoveToAccelerator ( ) ; l s . S e t O p e r a t o r ( mat ) ; l s . SetPreconditioner (p) ; l s . Build ( ) ; l s . S o l v e ( rhs , &x ) ; ”Using an accelerator for preconditioned CG solver (building phase on the accelerator)” 8.5. ASYNCHRONOUS TRANSFERS 8.5 75 Asynchronous Transfers The PARALUTION library also provides asynchronous transfers of data (currently, only for CUDA backend). This can be done with the CopyFromAsync() function or with the MoveToAcceleratorAsync() and MoveToHostAsync(). These functions return immediately and perform the asynchronous transfer in background mode. The synchronization is done with the Sync() function. 1 2 3 4 5 6 7 8 9 10 11 12 L o c a l V e c t o r x , y ; LocalMatrix mat ; mat . MoveToAcceleratorAsync ( ) ; x . MoveToAcceleratorAsync ( ) ; y . MoveToAcceleratorAsync ( ) ; // do some computation mat . Sync ( ) ; x . Sync ( ) ; y . Sync ( ) ; ”Asynchronous Transfers with MoveToAcceleratorAsync” 1 2 3 4 5 6 7 8 9 10 L o c a l V e c t o r x , y ; y . MoveToHost ( ) ; x . MoveToAccelerator ( ) ; x . CopyFromAsync ( y ) ; // do some computation x . Sync ( ) ; ”Asynchronous Transfers with CopyFromAsync” When using the MoveToAcceleratorAsync() and MoveToHostAsync() functions, the object will still point to its original location (i.e. host for calling MoveToAcceleratorAsync() and accelerator for MoveToHostAsync()). The object will switch to the new location after the Sync() function is called. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 L o c a l V e c t o r x , y ; LocalMatrix mat ; // mat , x and y a r e i n i t i a l l y on t h e h o s t ... // S t a r t async t r a n s f e r mat . MoveToAcceleratorAsync ( ) ; x . MoveToAcceleratorAsync ( ) ; // t h i s w i l l be performed on t h e h o s t mat . Apply ( x , &y ) ; mat . Sync ( ) ; x . Sync ( ) ; // Move y y . MoveToAccelerator ( ) ; // t h i s w i l l be performed on t h e a c c e l e r a t o r mat . Apply ( x , &y ) ; ”Asynchronous Transfers with MoveToAcceleratorAsync()” Note The objects should not be modified during an active asynchronous transfer. However, if this happen, the values after the synchronization might be wrong. 76 CHAPTER 8. BACKENDS Note CUDA backend – to use the asynchronous transfers you need to enable the pinned memory allocation. Uncomment #define PARALUTION CUDA PINNED MEMORY in file src/utils/allocate free.hpp 8.6 Systems without Accelerators PARALUTION provides full code compatibility on systems without accelerators - i.e. the user can take the code from the GPU systems, re-compile the same code on a machine without a GPU and it will provide the same results. For example, if one compiles the above matrix-vector multiplication code on a system without GPU support, it will just perform two multiplications on the host - the MoveToAccelerator() and MoveFromAccelerator() calls will be ignored. Chapter 9 Advanced Techniques 9.1 Hardware Locking PARALUTION also provides a locking mechanism based on keys. 
Each key is generated by the MAC address of the network card and the processor’s ID. The library can handle various keys (for running on various computer/nodes). See Figure 9.1 and Figure 9.2. Available S,M Computer MAC addr CPU ID Figure 9.1: Hardware locking based on keys Computer 1 Computer 2 Key 1 Key 2 Key 1 Key 2 Figure 9.2: Hardware locking based on keys The user can add the node keys directly in the code by editing src/utils/hw check.cpp file. A detailed description, on the procedure how new keys are added and computed is given in the source file src/utils/hw check.cpp. Note This feature is available only under Linux OS. 9.2 Time Restriction/Limit The user can specify also an expiration date. After this date the library cannot be initialized. Available S,M To disable/enable as well as to set the date, edit file src/utils/time check.cpp. In this file, there is a detailed description, on how to edit this feature. 77 78 9.3 CHAPTER 9. ADVANCED TECHNIQUES OpenCL Kernel Encryption The OpenCL kernels are compiled at runtime by the OpenCL compiler. Due to this fact, the source code of the OpenCl kernels is stored in a readable form in the library. To avoid an easy access to it, PARALUTION provides encryption of the kernels based on RSA mechanism. Available S,M 9.4 Logging Encryption All logging messages (printed by the library) can be encrypted if the developer needs to ensure limited readability of logs. The encryption is based on RSA mechanism. Available S,M 9.5 Encryption The encryption (for the OpenCL kernels or for the log files) can be turn on or off by editing src/utils/def.hpp. The keys for the encryption and for the decryption are based on the RSA mechanism, the user needs to provide the three of them in the code (encryption, decryption, mask), see file src/utils/enc.cpp 9.6 Memory Allocation All data which is passed to and from PARALUTION (via SetDataPtr/LeaveDataPtr) is using the memory handling functions described in the code. By default, the library uses standard new and delete functions for the host data. Available B,S,M 9.6.1 Allocation Problems If the allocation fails, the library will report an error and exits. If the user requires a special treatment, it has to be placed in the file src/utils/allocate free.cpp. Available B,S,M 9.6.2 Memory Alignment The library can also handle special memory alignment functions. This feature need to be uncommented before the compilation process in the file src/utils/allocate free.cpp. 9.6.3 Pinned Memory Allocation (CUDA) By default, the standard host memory allocation is realized by new and delete. For a better PCI-E transfers for NVIDIA GPUs, the user can also use pinned host memory. This can be activated by uncommenting the corresponding macro in src/utils/allocate free.hpp. Available B,S,M Chapter 10 Plug-ins Note that all plug-ins, except the FORTRAN one, are provided only under GPLv3 license in the Basic version (B), and there is no MPI support. 10.1 FORTRAN PARALUTION comes with an easy to use Fortran plug-in. Currently it supports COO and CSR input matrix formats and uses the intrinsic ISO BIND C to transfer data between Fortran and PARALUTION. The argument passing for the COO and CSR subroutine calls only differ in the matrix related arrays. 
1 2 3 4 5 6 7 8 9 10 call paralution fortran solve coo ( & n , m, nnz , & ’GMRES ’ // C NULL CHAR, & ’HYB ’ // C NULL CHAR, & ’ MultiColoredILU ’ // C NULL CHAR, & ’ELL ’ // C NULL CHAR, & C LOC( row ) , C LOC( c o l ) , C LOC( v a l ) , C LOC( r h s ) , & 1 e −10 C DOUBLE, 1 e−6 C DOUBLE, 1 e+8 C DOUBLE, 5 0 0 0 , & 30 , 0 , 1 , & C LOC( x ) , i t e r , resnorm , s t a t ) ”Example of Fortran subroutine call using COO matrix format” & & & & & & & & & 1 2 3 4 5 6 7 8 9 10 call paralution fortran solve csr ( & n , m, nnz , & ’CG ’ // C NULL CHAR, & ’CSR ’ // C NULL CHAR, & ’ MultiColoredSGS ’ // C NULL CHAR, & ’CSR ’ // C NULL CHAR, & C LOC( r o w o f f s e t ) , C LOC( c o l ) , C LOC( v a l ) , C LOC( r h s ) , & 1 e −10 C DOUBLE, 1 e−6 C DOUBLE, 1 e+8 C DOUBLE, 2 0 0 0 , & 0, 0, 1, & C LOC( x ) , i t e r , resnorm , s t a t ) ”Example of Fortran subroutine call using CSR matrix format” & & & & & & & & & The arguments include: • (2) Number of rows, number of columns, number of non-zero elements • (3) Solver: CG, BiCGStab, GMRES, Fixed-Point • (4) Operator matrix format: DENSE, CSR, MCSR, COO, DIA, ELL, HYB • (5) Preconditioner: None, Jacobi, MultiColoredGS, MultiColoredSGS, ILU, MultiColoredILU • (6) Preconditioner matrix format: DENSE, CSR, MCSR, COO, DIA, ELL, HYB • (7) Row index (COO) or row offset pointer (CSR), column index, right-hand side • (8) Absolute tolerance, relative tolerance, divergence tolerance, maximum number of iterations • (9) Size of the Krylov subspace (GMRES), ILU(p), ILU(q) • (10) Outputs: solution vector, number of iterations needed to converge, final residual norm, status code 79 80 CHAPTER 10. PLUG-INS A detailed listing is also given in the header of the PARALUTION Fortran plug-in file. For a successful integration of PARALUTION into Fortran code a compiled PARALUTION library is necessary. Furthermore, you need to compile and link the Fortran plug-in (located in src/plug-ins) because it is not included in the compiled library file. To achieve this, a simple Makefile can be used. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 CC=g++ FC=g f o r t r a n FLAGS=−O3 −l s t d c++ −fopenmp INC=−I . . / . . / . . / b u i l d / i n c LIB = . . / . . / . . / b u i l d / l i b / l i b p a r a l u t i o n . s o OBJ=p a r a l u t i o n f o r t r a n . o f o r t r a n c o d e . o default : fortran code f o r t r a n c o d e : $ (OBJ) $ (FC) −o f o r t r a n c o d e $ (OBJ) $ ( LIB ) $ (FLAGS) fortran code . o : fortran code . f90 $ (FC) $ (FLAGS) −c f o r t r a n c o d e . f 9 0 p a r a l u t i o n f o r t r a n . o : . . / . . / plug−i n s / p a r a l u t i o n f o r t r a n . cpp $ (CC) $ (FLAGS) $ (INC) −c . . / . . / plug−i n s / p a r a l u t i o n f o r t r a n . cpp clean : rm − r f ∗ . o f o r t r a n c o d e ”Example Makefile for PARALUTION integration to Fortran” Note Examples are in src/examples/fortran. 10.2 OpenFOAM OpenFOAM [44] is a C++ toolbox for the development of customized numerical solvers, and pre-/post-processing utilities for the solution of continuum mechanics problems, including computational fluid dynamics (CFD). 
To use the PARALUTION solvers in an OpenFOAM application you need to compile the library and include it in the application: • Include the chosen solvers in Make/files, for example if you want to add PCG you need to include PARALUTION DIR/build/inc/plug-ins/OpenFOAM/matrices/lduMatrix/solvers/paralution PCG/paralution PCG.C • Add in the Make/options file the path to the include files -IPARALUTION DIR/build/inc in EXE INC part, and in EXE LIBS the path to the PARALUTION library PARALUTION DIR/build/lib/libparalution.so needs to be added. • Include the libraries which you want to use (-fopenmp for OpenMP, -lOpenCL for OpenCL, and -lcudart -lcublas -lcusparse -L/usr/local/cuda/lib for CUDA). • Edit the fvSolution configuration file. • See the example in src/examples/OpenFOAM 1 solvers 2 { 3 val 4 { 5 6 7 8 9 10 11 12 13 14 solver preconditioner ILUp ILUq MatrixFormat PrecondFormat useAccelerator paralution PCG ; paralution MultiColoredILU ; 0; 1; HYB; ELL ; true ; tolerance relTol 1 e −09; 1 e −05; 10.3. DEAL.II 15 16 17 } 81 maxIter 100000; }; ”Example configuration for PARALUTION PCG solver” 1 solvers 2 { 3 val 4 { 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 }; 30 } solver MatrixFormat paralution AMG ; HYB; smoother SmootherFormat nPreSweeps nPostSweeps paralution MultiColoredGS ; ELL ; 1; 2; CoarseGridSolver preconditioner PrecondFormat CG; none ; CSR ; InterpolationType AggrRelax nCellsInCoarsestLevel couplingStrength Relaxation scaleCorrection SmoothedAggregation ; 2.0/3.0; 300; 0.01; 1.3; true ; tolerance relTol maxIter 1 e −09; 1 e −05; 10000; useAccelerator true ; ”Example configuration for PARALUTION AMG solver” The configuration includes: • Matrix formats: DENSE, CSR, MCSR, COO, DIA, ELL, HYB. • Operator matrix format: field MatrixFormat. • Preconditioner/smoother matrix format: field PrecondFormat and SmootherFormat. • Solvers: paralution PCG, paralution PBiCG, paralution PGMRES, paralution AMG. • Preconditioners/smoothers: preconditioners/smoothers are: paralution Jacobi, paralution MultiColoredGS (Gauss-Seidel), paralution MultiColoredSGS (Symmetric Gauss-Seidel), paralution MultiColoredILU (also with fields ILUp and ILUq), paralution FSAI, paralution MultiElimination (with fields MEp and LastBlockPrecond). • Using accelerator (GPU): useAccelerator true/false. Note You need to call the paralution::init paralution() and paralution::stop paralution() function in the OpenFOAM solver application file. Note Optimal configuration with respect to the performance could be different for the host and for the accelerator backends. Note For details, please check the source files of the solvers. 10.3 Deal.II Deal.II [8] is a C++ program library targeted at the computational solution of partial differential equations using adaptive finite elements. It uses state-of-the-art programming techniques to offer you a modern interface to the 82 CHAPTER 10. PLUG-INS complex data structures and algorithms required. The plug-in for the finite element package Deal.II consists of 3 functions: • import dealii matrix – imports a sparse Deal.II matrix to PARALUTION, the user also needs to specify the sparsity pattern of the Deal.II matrix. • import dealii vector – converts a Deal.II vector into a PARALUTION vector • export dealii vector – exports a PARALUTION vector to a Deal.II vector. For vector functions both of the vectors have to be allocated and they need to have the same sizes. 
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 #i n c l u d e

#i n c l u d e ...... // Deal . I I SparseMatrix matrix ; SparsityPattern sparsity pattern ; Vector r h s ; Vector s o l ; ...... // PARALUTION p a r a l u t i o n : : LocalMatrix p a r a l u t i o n m a t r i x ; p a r a l u t i o n : : L o c a l V e c t o r p a r a l u t i o n r h s , p a r a l u t i o n s o l ; p a r a l u t i o n : : CG

, p a r a l u t i o n : : L o c a l V e c t o r , double> l s ; p a r a l u t i o n : : MultiColoredILU

, p a r a l u t i o n : : L o c a l V e c t o r , double> p a r a l u t i o n p r e c o n d ; ...... // A l l o c a t e PARALUTION v e c t o r s paralution rhs . Allocate ( ” rhs ” , rhs . s i z e () ) ; paralution sol . Allocate (” sol ” , sol . size () ) ; // Import t h e Deal . I I matrix import dealii matrix ( sparsity pattern , matrix , &p a r a l u t i o n m a t r i x ) ; p a r a l u t i o n m a t r i x . ConvertToCSR ( ) ; // Import t h e r h s and t h e s o l u t i o n v e c t o r s i m p o r t d e a l i i v e c t o r ( rhs , &p a r a l u t i o n r h s ) ; i m p o r t d e a l i i v e c t o r ( s o l , &p a r a l u t i o n s o l ) ; // Setup t h e l i n e a r s o l v e r l s . Clear () ; l s . SetOperator ( paralution matrix ) ; l s . SetPreconditioner ( paralution precond ) ; // B u i l d t h e p r e c o n d i t i o n e d CG s o l v e r l s . Build ( ) ; 10.4. ELMER 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 83 // S e t t h e matrix t o HYB f o r b e t t e r p e r f o r m a n c e on GPUs p a r a l u t i o n m a t r i x . ConvertToHYB ( ) ; // Move a l l t h e data t o t h e a c c e l e r a t o r p a r a l u t i o n m a t r i x . MoveToAccelerator ( ) ; p a r a l u t i o n s o l . MoveToAccelerator ( ) ; p a r a l u t i o n r h s . MoveToAccelerator ( ) ; l s . MoveToAccelerator ( ) ; // S o l v e t h e l i n e a r system l s . S o l v e ( p a r a l u t i o n r h s , &p a r a l u t i o n s o l ) ; // Get t h e r e s u l t s back i n t h e Deal . I I v e c t o r e x p o r t d e a l i i v e c t o r ( p a r a l u t i o n s o l , &s o l ) ; ”Integration of a PCG solver into Deal.II” Note Compatibility problems could occur if the Deal.II library has been compiled with TBB support and the PARALUTION library is with Intel OpenCL (CPU) backend support. Compile the Deal.II code in debug mode for details. Note Examples are in src/examples/dealii. 10.4 Elmer Elmer [13] is an open source multiphysical simulation software. To use PARALUTION within Elmer, several files need to be added and modified. Please make sure that you have a working and compiled copy of Elmer available. In the following, a step-by-step guide is introduced to integrate PARALUTION into the Elmer FEM package. 1. Compile the PARALUTION library with flags -m64 -fPIC. 2. Edit elmer/fem/src/SolverUtils.src and search for IF ( ParEnv % PEs <= 1 ) THEN SELECT CASE( Method ) where all the iterative solvers are added. Add the PARALUTION solver by inserting the case CASE( ’ p a r a l u t i o n ’ ) CALL P a r a l u t i o n S o l v e ( A, x , b , S o l v e r ) 3. Copy the PARALUTION plug-in files Paralution.src and SolveParalution.cxx into elmer/fem/src. 4. Add PARALUTION to the Makefile. This can be achieved by editing elmer/fem/src/Makefile.in. Here, you need to add the PARALUTION objects Paralution.$(OBJ EXT) and SolveParalution.$(OBJ EXT) to the SOLVEROBJS list as well as the rule P a r a l u t i o n . $ (OBJEXT) : P a r a l u t i o n . f 9 0 Types . $ (OBJEXT) L i s t s . $ (OBJEXT) Furthermore, Paralution.$(OBJEXT) need to be added as a dependency to the Solver.$(OBJEXT) and SolverUtils.$(OBJEXT) rule. 5. In the file elmer/fem/src/Makefile.am the list EXTRA DIST need to be extended with SolveParalution.cxx. 6. Finally, you need to add the library dependencies of PARALUTION (-fopenmp, -lcudart, etc. dependent on the backends you compiled PARALUTION with) to the Elmer Makefile . 7. Now you should be able to recompile the fem package of Elmer with PARALUTION. 8. 
Finally, to use the PARALUTION solvers in the supplied Elmer test examples, you will need to change the ASCII .sif file. Simply modify the solver to Linear System Solver = Paralution. The PARALUTION solver, preconditioner, etc. can now be modified in the SolveParalution.cxx file using all available PARALUTION functions and classes. 84 CHAPTER 10. PLUG-INS 10.5 Hermes / Agros2D Hermes [4, 1] is a C++ library for rapid development of adaptive hp-FEM / hp-DG solvers. Novel hp-adaptivity algorithms help to solve a large variety of problems ranging from ODE and stationary linear PDE to complex time-dependent nonlinear multiphysics PDE systems. How to compile Hermes/Agros2D with PARALUTION (step-by-step): 1. Download Hermes and read the quick description on how to compile it with various features 2. Download PARALUTION 3. Compile, build PARALUTION, and copy headers, and libraries so that Hermes’s CMake system can find them – On Linux, installing PARALUTION to default install directories is sufficient, on Windows some paths have to be set 4. In your CMake.vars file (or directly in CMakeLists.txt) in the root of Hermes add set(WITH PARALUTION YES) – It is on by default, so by default one has to include PARALUTION to build Hermes 5. That is it, build Hermes, it will automatically link to PARALUTION, include headers and make it usable. How to use PARALUTION: 1. Check the doxygen manual of the classes (a) Hermes::Algebra::ParalutionMatrix (b) Hermes::Algebra::ParalutionVector (c) Hermes::Preconditioners::ParalutionPrecond (d) Hermes::Solvers::IterativeParalutionLinearMatrixSolver (e) Hermes::Solvers::AMGParalutionLinearMatrixSolver (f) and all classes that these inherit from 2. If you want to see Hermes and PARALUTION readily work together, take any test example in the hermes2d folder in the Hermes root and add one of these lines at the beginning of your main() (a) HermesCommonApi.set integral param value(matrixSolverType, SOLVER PARALUTION ITERATIVE); // to use iterative solver (b) HermesCommonApi.set integral param value(matrixSolverType, SOLVER PARALUTION AMG); // to use AMG solver 3. Solver classes of Hermes (NewtonSolver, PicardSolver, LinearSolver, ...) will then take this API setting into account and use PARALUTION as the matrix solver. 10.6 Octave/MATLAB MATLAB [36] and GNU Octave [2] are computing languages for numerical computations. How to compile the PARALUTION example for Octave/MATLAB: 1. The Octave/MATLAB example is located in src/plug-ins/MATLAB. 2. To compile the example for Octave, the Octave development environment is required. (Ubuntu: liboctavedev ) 3. For compilation, open a terminal and run make octave to compile the example for Octave or make matlab for MATLAB respectively. 4. Once compiled, a file called paralution pcg.mex should be listed. 10.6. OCTAVE/MATLAB 85 1 user@pc : ˜ / p a r a l u t i o n / s r c / plug−i n s /MATLAB$ make o c t a v e 2 m k o c t f i l e −−mex −c p a r a l u t i o n p c g . cpp −o o c t a v e p a r a l u t i o n p c g . o −I . . / . . / . . / b u i l d / i n c 3 m k o c t f i l e −−mex o c t a v e p a r a l u t i o n p c g . o −L . . / . . / . . / b u i l d / l i b −l p a r a l u t i o n 4 mv o c t a v e p a r a l u t i o n p c g . mex p a r a l u t i o n p c g . mex 5 user@pc : ˜ / p a r a l u t i o n / s r c / plug−i n s /MATLAB$ ”The PARALUTION Octave example compile process” How to run the PARALUTION Octave/MATLAB examples: 1. Start Octave/MATLAB 2. Start the PARALUTION example by the command pcg example. 
1 o c t a v e :1> pcg example 2 pcg : c o n v e r g e d i n 159 i t e r a t i o n s . t h e i n i t i a l r e s i d u a l norm was r e d u c e d 1 . 0 6 6 6 4 e+06 t i m e s . 3 Number o f CPU c o r e s : 8 4 Host t h r e a d a f f i n i t y p o l i c y − t h r e a d mapping on e v e r y s e c o n d c o r e ( a v o i d i n g HyperThreading ) 5 Number o f GPU d e v i c e s i n t h e system : 2 6 L o c a l M a t r i x name=A; rows =10000; c o l s =10000; nnz =49600; p r e c =64 b i t ; asm=no ; format=CSR ; h o s t backend={CPU(OpenMP) } ; a c c e l e r a t o r backend={GPU(CUDA) } ; c u r r e n t=GPU(CUDA) 7 PCG s o l v e r s t a r t s , with p r e c o n d i t i o n e r : 8 Jacobi preconditioner 9 I t e r a t i o n C o n t r o l c r i t e r i a : abs t o l =0; r e l t o l =1e −06; d i v t o l =1e +08; max i t e r =500 10 I t e r a t i o n C o n t r o l i n i t i a l r e s i d u a l = 100 11 I t e r a t i o n C o n t r o l RELATIVE c r i t e r i a has been r e a c h e d : r e s norm =9.37523 e −05; r e l v a l =9.37523 e −07; i t e r =159 12 PCG ends 13 dim = 100 14 L2 norm | x matlab − x p a r a l u t i o n | i s 15 1 . 9 7 3 9 e −09 16 o c t a v e :2> ”The PARALUTION Octave example output” Create your own solver and preconditioner setup: 1. Open the PARALUTION example file paralution pcg.cpp. 2. Change all PARALUTION relevant structures, similar to the supplied C++ examples. 3. Save paralution pcg.cpp and repeat the compilation steps. 86 CHAPTER 10. PLUG-INS Chapter 11 Remarks 11.1 Performance • Solvers – typically the PARALUTION solvers perform better than MKL/CUSPARSE based solvers due to faster preconditioners and better routines for matrix and vector operations. • Solvers – you can also build the solvers (via Build()) on the accelerator. In many cases this is faster than computing it on the host, especially for GPU backends. • Sizes – small-sized problems tend to perform better on the host (CPU) due to the good caching system, while large-sized problems typically perform better on GPU devices. • Vectors – avoid accessing via [] operators, use techniques based on SetDataPtr() and LeaveDataPtr() instead. • Host/Threads – by default the host OpenMP backend picks the maximum number of threads available. However, if your CPU supports HyperThreading, it will allow to run two times more threads than number of cores. This, in many cases, leads to lower performance. If your system supports HyperThreading, you may observe a performance increase by setting the number of threads (via the set omp threads paralution function) equal to the number of physical cores. • Solving a system with multiple right-hand-sides – if you need to solve a linear system multiple times, avoid constructing the solver/preconditioner every time. • Solving similar systems – if you are solving similar linear systems, you might want to consider to use the same preconditioner to avoid long building phases. • Matrix formats – in most of the cases, the classical CSR format performs very similar to all other formats on the CPU. On accelerators with many-cores (like GPU), formats like DIA and ELL typically perform better. However, for general sparse matrices one could use HYB format to avoid large memory overhead (e.g. in DIA or ELL formats). The multi-colored preconditioners could be performed in ELL for most of the matrices. • Matrix formats - not all matrix conversions are performed on the device, the platform will give you a warning if the object need to be moved. 
• Integration – if you are deploying the PARALUTION library into another software framework, design your integration functions so that init_paralution() and stop_paralution() are not called every time a solver of the library is invoked.
• Compilation – be sure to compile the library with the correct optimization level (-O3).
• Info – check whether your solver is really executed on the accelerator by printing the matrix information (my_matrix.info()) just before calling the ls.Solve() function.
• Info – check the configuration of the library for your hardware with info_paralution().
• Mixed-precision defect correction – this technique is recommended for accelerators (e.g. GPUs) with partial or no double precision support. The stopping criterion for the inner solver has to be tuned well for good performance.
• Plug-ins – all plug-ins perform an additional copy of the data to construct the matrix, solver, preconditioner, etc. This results in overhead when calling the PARALUTION solvers/schemes. To avoid this, adapt the plug-in to your application as much as possible.
• Verbose level – it is good practice to use verbose level 2 to obtain critical messages regarding the performance.
• Xeon Phi – the allocation of memory on the Xeon Phi is slow; avoid unnecessary data allocation whenever possible.

11.2 Accelerators

• Old GPUs – PARALUTION requires an NVIDIA GPU with minimum compute capability 2.0; if the GPU does not provide it, the computation falls back to the OpenMP host backend.
• Avoid PCI-Express communication – whenever possible avoid extra PCI-Express communication (such as copying data from/to the accelerator); check also the internal structure of the functions.
• GPU init time – if you are running your computation on an NVIDIA GPU where no X Window system is running (on Linux), you can minimize the initialization time of the GPUs with nvidia-smi -pm 1, which avoids reloading the driver every time you launch your GPU program.
• Pinned memory – pinned (page-locked) memory allocation can be used for all host memory allocations when using the CUDA backend. This provides faster transfers over PCI-Express and allows asynchronous data movement. By default this option is disabled; to enable pinned memory allocation, uncomment #define PARALUTION_CUDA_PINNED_MEMORY in the file src/utils/allocate_free.hpp.
• Async transfers – asynchronous transfers are available only for the CUDA backend so far. If asynchronous transfers are called from other backends, they perform a simple synchronous move or copy.
• Xeon Phi – the Intel MIC architecture can also be used via the OpenCL backend. However, the performance is not very good because most of the kernels are optimized for GPU devices.
• Xeon Phi – you can tune the OpenCL parameters: after the execution of cmake and before compiling the library with make, edit the OpenCL hardware parameters located in src/utils/HardwareParameters.hpp.
• OpenCL (x86 CPU) – the sparse matrix-vector multiplication in COO format is hanging; we are working to fix that.
• OpenCL – after calling cmake you can set the block size and the warp size by editing src/utils/HardwareParameters.hpp. After that you just need to compile the library with make. Alternatively, you can modify the file src/utils/ocl_check_hw.cpp.

11.3 Plug-ins

• Deal.II – to avoid the initialization time of the accelerator (if used), put the init_paralution() and stop_paralution() calls in the main application file and include the library in the header.
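The following sketch illustrates these two remarks: the library is initialized exactly once per application run and only then the solver routines are called repeatedly. The header name paralution.hpp and the do_one_solve() helper are assumptions of this example, not part of a specific plug-in; init_paralution(), info_paralution() and stop_paralution() are the library functions mentioned above.

#include <paralution.hpp>

// hypothetical helper: assemble and solve one linear system with PARALUTION
void do_one_solve(void) {
  // ... build LocalMatrix/LocalVector objects, set up a solver, call Solve() ...
}

int main(int argc, char* argv[]) {

  // initialize the library (and the accelerator, if any) only once
  paralution::init_paralution();
  paralution::info_paralution();   // print the backend configuration

  // call the solver as often as needed without re-initializing the library
  for (int i = 0; i < 10; ++i)
    do_one_solve();

  // release the library once, at the very end of the application
  paralution::stop_paralution();

  return 0;
}

"Calling init_paralution()/stop_paralution() once per application run"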
11.4 Correctness

• Matrix – if you are assembling or modifying your matrix, you can check it in Octave/MATLAB by simply writing it into a MatrixMarket file and reading it via the mmread() function [39]. You can also import a MATLAB/Octave matrix in the same way.
• Solver tolerance – be sure you are setting the correct relative and absolute tolerance values for your problem.
• Solver stopping criteria – check whether the relative stopping criterion is computed as ||b - Ax_k||_2 / ||b - Ax_0||_2 or as ||b - Ax_k||_2 / ||b||_2.
• Ill-conditioned problems – solving very ill-conditioned problems by iterative methods without a proper preconditioning technique might produce wrong results. The solver could stop reporting a low relative residual, but the result might still be wrong.
• Ill-conditioned problems – building the Krylov subspace for many ill-conditioned problems can be a tricky task. To ensure orthogonality in the subspace, you might want to perform double orthogonalization (i.e. re-orthogonalization) to avoid rounding errors.
• I/O files – if you read/write matrices/vectors from files, check the ASCII format of the values (e.g. 34.3434 or 3.43434E+01).

11.5 Compilation

• OpenMP backend – OpenMP support is enabled by default. To disable it, pass -DSUPPORT_OMP=OFF to cmake.
• CUDA 5.5 and gcc 4.8 – these compiler versions are incompatible (the compilation will report the error "kernel launches from templates are not allowed in system files"). Please use a lower gcc version; you can make the nvcc compiler use it by setting a link in the default CUDA installation directory (/usr/local/cuda), e.g. by running as root ln -s /usr/bin/gcc-4.4 /usr/local/cuda/bin/gcc, or try the -ccbin option.
• CUDA backend – be sure that the paths are set correctly (e.g. for Linux export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64/ and export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/usr/local/cuda/include/). Then you can run cmake with cmake -DSUPPORT_OCL=OFF -DSUPPORT_CUDA=ON ..
• OpenCL backend – similarly to the CUDA backend, you need to set the correct paths for the OpenCL library. Then you can run cmake with cmake -DSUPPORT_OCL=ON -DSUPPORT_CUDA=OFF ..
• OpenCL backend – when compiling the library with OpenCL (with cmake or with the Makefile), you will be asked during the compilation process to select an OpenCL platform and device (if you have more than one). Based on this selection, the library determines the optimal number of threads and blocks for your hardware. Later on, you can change the platform and device via the set_ocl_paralution() or select_device_paralution() functions, but the thread configuration will not be changed.
• MIC backend – the Intel compiler (icc) should be loaded; then run cmake with cmake -DSUPPORT_MIC=ON -DSUPPORT_CUDA=OFF -DSUPPORT_OCL=OFF ..

11.6 Portability

• Different backends – you do not have to recompile your code if you want to run your program with different backends; you just need to load the corresponding dynamic library. As an example, if you compile the library with OpenCL support in /usr/local/paralution-ocl/build/lib and with CUDA support in /usr/local/paralution-cuda/build/lib, you only need to set the path (i.e. export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/paralution-ocl/build/lib or export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/paralution-cuda/build/lib) and run your program.
Chapter 12 Performance Benchmarks

12.1 Single-Node Benchmarks

12.1.1 Hardware Description

Hardware              Cores/SP   Memory [GB]   Theoretical Bandwidth [GB/s]   Backend         Notes
2x Xeon E5-2680       2x 8       64            2x 51.2                        OpenMP(Host)    NUMA
2x Xeon E5620         2x 4       96            2x 25.6                        OpenMP(Host)    NUMA
Core i7 4770K         4          32            25.6                           OpenMP(Host)    UMA
Core i7 620M          2          8             17.1                           OpenMP(Host)    UMA
MIC/Xeon Phi 5110     60         8             320                            OpenMP(MIC)     ECC
K20X                  2688       6             250                            CUDA/OpenCL     ECC
K20c                  2496       5             208                            CUDA/OpenCL     ECC
GTX Titan             2688       6             288                            CUDA/OpenCL     no ECC
GTX680                1536       2             192                            CUDA/OpenCL     no ECC
GTX590                512        1.5           163                            CUDA/OpenCL     no ECC
GTX560Ti              384        2             128                            CUDA/OpenCL     no ECC
FirePro (FP) W8000    1792       2             176                            OpenCL          ECC

Table 12.1: Hardware configurations

The performance values are obtained with the "benchmark" tool from the PARALUTION package. The tool is compiled on the various systems; no code modification is applied. The operating system (OS) for all hardware configurations is Linux. All tests are performed in double precision. The hardware specifications are obtained from the Intel (1-5), AMD (6) and NVIDIA (7-11) websites.

The configuration is:
• PARALUTION – version 0.4.0
• Host/OpenMP – the number of threads is equal to the number of real cores (no HT); gcc versions 4.4.6, 4.4.7 and 4.6.3
• CUDA – CUDA version 5.0
• OpenCL – NVIDIA OpenCL version 1.1 (comes with CUDA 5.0); AMD OpenCL version 1.2 (comes with AMD SDK 2.8.1.0)
• MIC/OpenMP – for the Xeon Phi, the number of threads for the OpenMP section is not set (the default configuration is used); icc version 13.1

1 Intel E5620: http://ark.intel.com/de/products/47925/Intel-Xeon-Processor-E5620-12M-Cache-2_40-GHz-5_86-GTs-Intel-QPI
2 Intel E5-2680: http://ark.intel.com/de/products/64583
3 Intel Core i7 4770K: http://ark.intel.com/products/75123/
4 Intel Core i7 620M (Notebook): http://ark.intel.com/products/43560
5 Intel Xeon Phi 5110P: http://ark.intel.com/de/products/71992/Intel-Xeon-Phi-Coprocessor-5110P-8GB-1_053-GHz-60-core
6 AMD FirePro W8000: http://www.amd.com/us/products/workstation/graphics/ati-firepro-3d/w8000/Pages/w8000.aspx#3
7 NVIDIA GTX560 Ti: http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-560ti/specifications
8 NVIDIA GTX590: http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-590/specifications
9 NVIDIA GTX680: http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-680/specifications
10 NVIDIA GTX TITAN: http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-titan/specifications
11 NVIDIA K20 and K20X: http://www.nvidia.com/object/tesla-servers.html

12.1.2 BLAS1 and SpMV

The vector-vector routines (BLAS1) are performed with size 4,000,000 – Figure 12.1.
• ScaleAdd is x = αx + y, where x, y ∈ R^n and α ∈ R
• Dot product is α = Σ_{i=0}^{n} x_i y_i, where x, y ∈ R^n and α ∈ R
• Reduce is α = Σ_{i=0}^{n} x_i, where x ∈ R^n and α ∈ R
• L2 Norm is α = sqrt( Σ_{i=0}^{n} x_i^2 ), where x ∈ R^n and α ∈ R

The backends for all vector-vector routines are CUDA for the NVIDIA GPUs, OpenCL for the AMD GPU, and OpenMP for Host/MIC.

Figure 12.1: BLAS1 routines – "Performance of vector routines", bandwidth in GB/s for ScaleAdd, Dot, Reduce and L2 Norm on all listed platforms

The SpMV (sparse matrix-vector multiplication) is computed using a 2D Laplace (structured grid, finite difference) matrix on a grid with 2,000 × 2,000 = 4,000,000 DoFs – Figure 12.2. All routines are executed 100 times and the average time (in ms resolution) is taken.
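The numbers above are produced with the benchmark tool shipped with PARALUTION. As a rough hand-written equivalent, the following sketch times 100 SpMV calls and one dot product with paralution_time(); the matrix file passed on the command line and the repetition count are illustrative assumptions, and the individual calls follow the code examples of Chapter 15.

#include <iostream>
#include <string>
#include <paralution.hpp>

using namespace paralution;

int main(int argc, char *argv[]) {
  init_paralution();

  LocalMatrix<double> mat;
  LocalVector<double> x, y;

  mat.ReadFileMTX(std::string(argv[1]));   // e.g. a 2D Laplace matrix in matrix-market format
  mat.MoveToAccelerator();

  x.MoveToAccelerator();
  y.MoveToAccelerator();
  x.Allocate("x", mat.get_nrow());
  y.Allocate("y", mat.get_nrow());
  x.Ones();

  double tick = paralution_time();
  for (int i = 0; i < 100; ++i)            // 100 repetitions, as in the benchmark above
    mat.Apply(x, &y);                      // SpMV: y = A*x
  double tack = paralution_time();

  std::cout << "SpMV average: " << (tack - tick) / 1000000 / 100 << " sec" << std::endl;
  std::cout << "Dot product:  " << x.Dot(y) << std::endl;   // BLAS1 dot product

  stop_paralution();
  return 0;
}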
12.1.3 CG Solver

Furthermore, a non-preconditioned CG has been performed on a Laplace matrix resulting from a finite difference discretization of the unit square with 4,000,000 unknowns (as for the SpMV tests) on all listed architectures – Figure 12.3 and Figure 12.4. The right-hand side is set to one, the initial guess to zero. A relative tolerance of 1e-6 based on the L2 norm is used. For the speed-up comparison of the CG method, we compare the best configurations for all setups – Figure 12.5.

Figure 12.2: SpMV – "Performance of SpMV (2D Laplace matrix)", bandwidth in GB/s for the CSR and ELL formats on all listed platforms

Figure 12.3: CG CPU Performance – "CG - 2D Laplace", time in seconds for OpenMP (CSR) and OpenMP (ELL) on i7 620M, 2x Xeon E5620, i7 4770K, 2x Xeon E5-2680 and MIC 5110

Figure 12.4: CG GPU Performance – "CG - 2D Laplace", time in seconds for CUDA and OpenCL with CSR and ELL on GTX560Ti, GTX590, GTX680, GTX Titan, K20c, K20X and FP W8000

Figure 12.5: CG Speed-up – speed-up versus 2x Xeon E5620 and versus 2x Xeon E5-2680 for GTX560Ti, GTX590, GTX680, GTX Titan, K20c, K20X, MIC 5110 and FP W8000

12.2 Multi-Node Benchmarks

For the multi-node benchmark, we use the following setup:
• One MPI rank per GPU (1:1 mapping)
• Solver: CG with FSAI preconditioner
• Solver: CG with Pair-wise (global) AMG

Hardware/Compiler/Software Configuration
• PARALUTION 1.0.0 M
• CPU: 2×8-core Xeon per node
• GPU: 2×K40 per node
• Intel C++ compiler 15.0.0
• Intel MPI compiler 4.1.0.024
• CUDA 6.5

Benchmark Problems
• Pressure solver only
• 3D Cavity, 27M unknowns
• PitzDaily, 16M unknowns (LES)

In the following figures we show strong scalability (i.e. the size of the problem is fixed, while the number of nodes/GPUs varies). The examples represent the solution of the pressure problem of standard benchmarks, where the data (matrix, right-hand side, initial guess, normalization factor) is extracted from OpenFOAM [44]. The timings represent the solving phase of the iterative solver.

Figure 12.6: 3D Cavity – CG and FSAI, GPU Scalability
Figure 12.7: 3D Cavity – Pairwise AMG, CPU and GPU Performances
Figure 12.8: 3D Cavity – Pairwise AMG and CG-FSAI Performances
Figure 12.9: Pitzdaily – CG and FSAI, GPU Scalability
Figure 12.10: Pitzdaily – Pairwise AMG, CPU and GPU Performances
Figure 12.11: Pitzdaily – Pairwise AMG and CG-FSAI Performances

Chapter 13 Graph Analyzers

Figure 13.1: gr3030 and nos5 matrices, see [3] and [6]
Figure 13.2: CMK permutation of gr3030 and nos5
Figure 13.3: RCMK permutation of gr3030 and nos5
Figure 13.4: Multi-coloring permutation of gr3030 and nos5
Figure 13.5: MIS permutation of gr3030 and nos5
Figure 13.6: Connectivity ordering of gr3030 and nos5
Figure 13.7: impcolc and tols340 matrices, see [5] and [7]
Figure 13.8: Zero-block permutation of impcolc and tols340

Chapter 14 Functionality Table

14.1 Backend Support

Backend name   Status   Target           Parallelism             Required Lib          Optional Lib   Compiler         OS
Host           Stable   Intel/AMD CPU    OpenMP                  None                  Intel MKL      icc, gcc, VS     Linux, Mac, Windows
CUDA           Stable   NVIDIA GPU       CUDA                    CUBLAS, CUSPARSE (1)  None           nvcc             Linux, Mac, Windows
OpenCL         Stable   NVIDIA/AMD GPU   OpenCL                  None                  None           NVIDIA/AMD OCL   Linux, Mac, Windows
MIC/Xeon Phi   Beta     MIC/Xeon Phi     OpenMP (offload mode)   None                  None           icc              Linux

14.2 Matrix-Vector Multiplication Formats

               CSR   COO       ELL   DIA   HYB   DENSE     BCSR
Host           Yes   Yes (2)   Yes   Yes   Yes   Yes       No
CUDA           Yes   Yes       Yes   Yes   Yes   Yes       No
OpenCL         Yes   Yes       Yes   Yes   Yes   Yes (3)   No
MIC/Xeon Phi   Yes   Yes (2)   Yes   Yes   Yes   No        No

All matrix conversions are performed via the CSR format (either as a source or as a destination – e.g. the user cannot directly convert a DIA matrix to an ELL matrix). The conversions can be computed on the host or on the CUDA backend (except back conversions to a CSR matrix and the DENSE, BCSR and MCSR conversions). All other backends perform the conversions via the host.

               B     S     M
Host           Yes   Yes   Yes
CUDA           No    Yes   Yes
OpenCL         No    Yes   Yes
MIC/Xeon Phi   No    No    No

Table 14.1: Complex support

The base version supports only complex computation on the host.

(1) CUBLAS and CUSPARSE are delivered with the CUDA package.
(2) Serial.
(3) Basic version.

14.3 Local Matrices and Vectors

All matrix operations (except the sparse matrix-vector multiplication, SpMV) require a CSR matrix. Note that if the input matrix is not a CSR matrix, an internal conversion to the CSR format will be performed, followed by a back conversion to the original matrix format after the operation. In this case, a warning message on verbose level 2 will be printed.
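Since all conversions pass through the CSR format, and operations on a non-CSR matrix trigger an internal conversion and back conversion as described above, it can pay off to convert explicitly once and to stay in one format for a longer sequence of operations. Below is a minimal sketch, under the assumption that the ConvertToELL()/ConvertToCSR() routines (named after the format identifiers used elsewhere in this manual) are available on LocalMatrix; the repetition count is illustrative.

#include <paralution.hpp>

using namespace paralution;

// Illustrative sketch: convert once to ELL before a long series of SpMV calls,
// then restore CSR for operations that require the CSR format.
void spmv_in_ell(LocalMatrix<double> &mat,
                 const LocalVector<double> &x, LocalVector<double> *y) {
  mat.ConvertToELL();               // one explicit conversion (performed via CSR)
  for (int i = 0; i < 1000; ++i)    // repetition count is illustrative
    mat.Apply(x, y);                // SpMV in ELL format, no internal conversions
  mat.ConvertToCSR();               // back to CSR for the remaining operations
}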
Local Matrices I/O file Direct memory access Extract sub matrix Extract diagonal Extract inv diagonal Assemble Permutation Graph Analyzer Transpose Sparse Mat-Mat Mult Local Vectors I/O file Direct memory access Vector updates Dot product Dot product (conj/complex) Sub copy Point-wise mult Scaling Permutation 14.4 Host CUDA OpenCL MIC/Xeon Phi Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes via Host Yes Yes Yes Yes via Host Yes via Host Yes Yes via Host Yes Yes Yes Yes via Host via Host via Host via Host via Host via Host Yes via Host via Host via Host via Host via Host via Host via Host via Host Yes Yes Yes Yes Yes Yes Yes Yes Yes via Host Yes Yes Yes Yes Yes Yes Yes Yes via Host Yes Yes Yes Yes Yes Yes Yes Yes via Host Yes Yes Yes No Yes Yes Yes Yes Global Matrices and Vectors All Local functions (for Vectors or Matrices) can be applied on the interior of each Global Matrix. Global Matrices I/O file Direct memory access Global Vectors I/O file Direct memory access Vector updates Dot product Dot product (conj/complex) Point-wise mult Scaling Host CUDA OpenCL MIC/Xeon Phi Yes Yes via Host Yes via Host Yes via Host Yes Yes Yes Yes Yes Yes Yes Yes via Host Yes Yes Yes Yes Yes Yes via Host Yes Yes Yes Yes Yes Yes via Host Yes Yes Yes No Yes Yes Note The OpenCL backend supports only GPUs in the multi-node version. 14.5. LOCAL SOLVERS AND PRECONDITIONERS 14.5 105 Local Solvers and Preconditioners Local Solvers CG – Building CG – Solving FCG – Building FCG – Solving CR – Building CR – Solving BiCGStab – Building BiCGStab – Solving BiCGStab(l) – Building BiCGStab(l) – Solving QMRCGStab – Building QMRCGStab – Solving GMRES – Building GMRES – Solving FGMRES – Building FGMRES – Solving Chebyshev – Building Chebyshev – Solving DPCG – Building DPCG – Solving Mixed-Precision DC – Building Mixed-Precision DC – Solving Fixed-Point Iteration – Building Fixed-Point Iteration – Solving AMG(Plain Aggregation) – Building AMG(Plain Aggregation) – Solving AMG(Smoothed Aggregation) – Building AMG(Smoothed Aggregation) – Solving AMG(RS) – Building AMG(RS) – Solving AMG(Pair-wise) – Building AMG(Pair-wise) – Solving Local Direct Solvers (DENSE only) LU – Building LU – Solving QR – Building QR – Solving Inversion – Building Inversion – Solving Host CUDA OpenCL MIC/Xeon Phi Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes via Host Yes via Host Yes via Host Yes via Host Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes via Host via Host Yes Yes Yes Yes via Host Yes via Host Yes via Host Yes via Host Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes via Host via Host Yes Yes Yes Yes via Host Yes via Host Yes via Host Yes via Host Yes Yes Yes Yes Yes Yes Yes via Host via Host via Host via Host via Host Yes via via via via via via via via via via via via Host Host Host Host Host Host Host Host Host Host Host Host Note The building phase of the iterative solver depends also on the selected preconditioner. 
106 Local Preconditioners Jacobi – Building Jacobi – Solving MC-ILU(0,1) – Building MC-ILU(0,1) – Solving MC-ILU(> 0,> 1) – Building MC-ILU(> 0,> 1) – Solving ME-(I)LU – Building ME-(I)LU – Solving ILU(0) – Building ILU(0) – Solving ILU(> 0) – Building ILU(> 0) – Solving ILUT – Building ILUT – Solving IC(0) – Building IC(0) – Solving FSAI – Building FSAI – Solving SPAI – Building SPAI – Solving Chebyshev – Building Chebyshev – Solving MC-GS/SGS – Building MC-GS/SGS – Solving GS/SGS – Building GS/SGS – Solving AS/RAS – Building AS/RAS – Solving Block – Building Block – Solving SaddlePoint – Building SaddlePoint – Solving CHAPTER 14. FUNCTIONALITY TABLE Host CUDA OpenCL MIC/Xeon Phi Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes via Host Yes Yes Yes Yes Yes via Host Yes via Host Yes Yes Yes via Host Yes via Host Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes via Host Yes via Host Yes via Host Yes via Host via Host via Host via Host via Host via Host via Host via Host via Host Yes via Host Yes via Host Yes via Host Yes via Host via Host Yes Yes Yes Yes via Host Yes Yes Yes via Host Yes via Host Yes via Host Yes via Host via Host via Host via Host via Host via Host via Host via Host via Host Yes via Host Yes via Host Yes via Host Yes via Host via Host via Host Yes via Host Yes via Host Yes 14.6. GLOBAL SOLVERS AND PRECONDITIONERS 14.6 107 Global Solvers and Preconditioners Global Solvers CG – Building CG – Solving FCG – Building FCG – Solving CR – Building CR – Solving BiCGStab – Building BiCGStab – Solving BiCGStab(l) – Building BiCGStab(l) – Solving QMRCGStab – Building QMRCGStab – Solving GMRES – Building GMRES – Solving FGMRES – Building FGMRES – Solving Mixed-Precision DC – Building Mixed-Precision DC – Solving Fixed-Point Iteration – Building Fixed-Point Iteration – Solving AMG(Pair-wise) – Building AMG(Pair-wise) – Solving Host CUDA OpenCL MIC/Xeon Phi Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes via Host Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes via Host Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes No No Note The building phase of the iterative solver depends also on the selected preconditioner. All local preconditioner can be used on a Global level by Block-Jacobi type of preconditioner, where the local preconditioner will be applied on each node/process interior matrix. 108 CHAPTER 14. FUNCTIONALITY TABLE Chapter 15 Code Examples 15.1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 Preconditioned CG #i n c l u d e #i n c l u d e #i n c l u d e

u s i n g namespace p a r a l u t i o n ; i n t main ( i n t argc , c h a r ∗ argv [ ] ) { i f ( a r g c == 1 ) { s t d : : c e r r << argv [ 0 ] << ” [Num t h r e a d s ] ” << s t d : : e n d l ; exit (1) ; } init paralution () ; i f ( argc > 2) { s e t o m p t h r e a d s p a r a l u t i o n ( a t o i ( argv [ 2 ] ) ) ; } info paralution () ; L o c a l V e c t o r x ; L o c a l V e c t o r r h s ; LocalMatrix mat ; mat . ReadFileMTX ( s t d : : s t r i n g ( argv [ 1 ] ) ) ; mat . MoveToAccelerator ( ) ; x . MoveToAccelerator ( ) ; r h s . MoveToAccelerator ( ) ; x . A l l o c a t e ( ”x” , mat . g e t n r o w ( ) ) ; r h s . A l l o c a t e ( ” r h s ” , mat . g e t n r o w ( ) ) ; // L i n e a r S o l v e r CG, L o c a l V e c t o r , d o u b l e > l s ; // P r e c o n d i t i o n e r MultiColoredILU, L o c a l V e c t o r , d o u b l e > p ; r h s . Ones ( ) ; x . Zeros () ; 109 110 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 } CHAPTER 15. CODE EXAMPLES l s . S e t O p e r a t o r ( mat ) ; l s . SetPreconditioner (p) ; l s . Build ( ) ; // l s . Verbose ( 2 ) ; mat . i n f o ( ) ; double tick , tack ; tick = paralution time () ; l s . S o l v e ( rhs , &x ) ; tack = paralution time ( ) ; s t d : : c o u t << ” S o l v e r e x e c u t i o n : ” << ( tack−t i c k ) /1000000 << ” s e c ” << s t d : : e n d l ; stop paralution () ; return 0; ”Preconditioned CG solver” This v e r s i o n o f PARALUTION i s r e l e a s e d under GPL. By downloading t h i s package you f u l l y a g r e e with t h e GPL l i c e n s e . Number o f CPU c o r e s : 2 Host t h r e a d a f f i n i t y p o l i c y − t h r e a d mapping on e v e r y c o r e PARALUTION v e r B0 . 8 . 0 PARALUTION p l a t f o r m i s i n i t i a l i z e d A c c e l e r a t o r backend : None OpenMP t h r e a d s : 2 ReadFileMTX : f i l e n a m e =/home/ d i m i t a r / m a t r i c e s / s m a l l /sym/ g r 3 0 3 0 . mtx ; r e a d i n g . . . ReadFileMTX : f i l e n a m e =/home/ d i m i t a r / m a t r i c e s / s m a l l /sym/ g r 3 0 3 0 . mtx ; done L o c a l M a t r i x name=/home/ d i m i t a r / m a t r i c e s / s m a l l /sym/ g r 3 0 3 0 . mtx ; rows =900; c o l s =900; nnz =7744; p r e c =64 b i t ; asm=no ; format=CSR ; h o s t backend={CPU(OpenMP) } ; a c c e l e r a t o r backend={None } ; c u r r e n t=CPU(OpenMP) PCG s o l v e r s t a r t s , with p r e c o n d i t i o n e r : M u l t i c o l o r e d ILU p r e c o n d i t i o n e r ( power ( q )−p a t t e r n method ) , ILU ( 0 , 1 ) number o f c o l o r s = 4 ; ILU nnz = 7744 I t e r a t i o n C o n t r o l c r i t e r i a : abs t o l =1e −15; r e l t o l =1e −06; d i v t o l =1e +08; max i t e r =1000000 I t e r a t i o n C o n t r o l i n i t i a l r e s i d u a l = 30 I t e r a t i o n C o n t r o l RELATIVE c r i t e r i a has been r e a c h e d : r e s norm =2.532 e −05; r e l v a l =8.43999 e −07; i t e r =24 PCG ends Solver execution :0.001982 sec ”Output of a preconditioned CG test with matrix gr 30 30.mtx” 15.2. MULTI-ELIMINATION ILU PRECONDITIONER WITH CG 15.2 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 Multi-Elimination ILU Preconditioner with CG #i n c l u d e #i n c l u d e #i n c l u d e

u s i n g namespace p a r a l u t i o n ; i n t main ( i n t argc , c h a r ∗ argv [ ] ) { i f ( a r g c == 1 ) { s t d : : c e r r << argv [ 0 ] << ” [Num t h r e a d s ] ” << s t d : : e n d l ; exit (1) ; } init paralution () ; i f ( argc > 2) { s e t o m p t h r e a d s p a r a l u t i o n ( a t o i ( argv [ 2 ] ) ) ; } info paralution () ; L o c a l V e c t o r x ; L o c a l V e c t o r r h s ; LocalMatrix mat ; mat . ReadFileMTX ( s t d : : s t r i n g ( argv [ 1 ] ) ) ; r h s . MoveToAccelerator ( ) ; x . MoveToAccelerator ( ) ; mat . MoveToAccelerator ( ) ; x . A l l o c a t e ( ”x” , mat . g e t n r o w ( ) ) ; r h s . A l l o c a t e ( ” r h s ” , mat . g e t n r o w ( ) ) ; x . Zeros () ; r h s . Ones ( ) ; double tick , tack ; // S o l v e r CG, L o c a l V e c t o r , d o u b l e > cg ; // P r e c o n d i t i o n e r ( main ) M u l t i E l i m i n a t i o n , L o c a l V e c t o r , d o u b l e > p ; // Last blo ck −p r e c o n d i t i o n e r MultiColoredILU, L o c a l V e c t o r , d o u b l e > m c i l u p ; mcilu p . Set (0) ; p . S e t ( m ci lu p , 2 , 0 . 4 ) ; cg . S e t O p e r a t o r ( mat ) ; cg . S e t P r e c o n d i t i o n e r ( p ) ; cg . B u i l d ( ) ; mat . i n f o ( ) ; tick = paralution time () ; 111 112 60 61 62 63 64 65 66 67 68 69 70 } CHAPTER 15. CODE EXAMPLES cg . S o l v e ( rhs , &x ) ; tack = paralution time ( ) ; s t d : : c o u t << ” S o l v e r e x e c u t i o n : ” << ( tack−t i c k ) /1000000 << ” s e c ” << s t d : : e n d l ; stop paralution () ; return 0; ”Multi-Elimination ILU Preconditioner with CG” This v e r s i o n o f PARALUTION i s r e l e a s e d under GPL. By downloading t h i s package you f u l l y a g r e e with t h e GPL l i c e n s e . Number o f CPU c o r e s : 2 Host t h r e a d a f f i n i t y p o l i c y − t h r e a d mapping on e v e r y c o r e PARALUTION v e r B0 . 8 . 0 PARALUTION p l a t f o r m i s i n i t i a l i z e d A c c e l e r a t o r backend : None OpenMP t h r e a d s : 2 ReadFileMTX : f i l e n a m e =/home/ d i m i t a r / m a t r i c e s / s m a l l /sym/ g r 3 0 3 0 . mtx ; r e a d i n g . . . ReadFileMTX : f i l e n a m e =/home/ d i m i t a r / m a t r i c e s / s m a l l /sym/ g r 3 0 3 0 . mtx ; done L o c a l M a t r i x name=/home/ d i m i t a r / m a t r i c e s / s m a l l /sym/ g r 3 0 3 0 . mtx ; rows =900; c o l s =900; nnz =7744; p r e c =64 b i t ; asm=no ; format=CSR ; h o s t backend={CPU(OpenMP) } ; a c c e l e r a t o r backend={None } ; c u r r e n t=CPU(OpenMP) PCG s o l v e r s t a r t s , with p r e c o n d i t i o n e r : M u l t i E l i m i n a t i o n ( I )LU p r e c o n d i t i o n e r with 2 l e v e l s ; d i a g o n a l s i z e = 225 ; drop t o l = 0 . 4 ; l a s t −b l o c k s i z e = 675 ; l a s t −b l o c k nnz = 4097 ; l a s t −b l o c k solver : M u l t i E l i m i n a t i o n ( I )LU p r e c o n d i t i o n e r with 1 l e v e l s ; d i a g o n a l s i z e = 225 ; drop t o l = 0 . 
4 ; l a s t −b l o c k s i z e = 450 ; l a s t −b l o c k nnz = 1320 ; l a s t −b l o c k solver : M u l t i c o l o r e d ILU p r e c o n d i t i o n e r ( power ( q )−p a t t e r n method ) , ILU ( 0 , 1 ) number o f c o l o r s = 2 ; ILU nnz = 1320 I t e r a t i o n C o n t r o l c r i t e r i a : abs t o l =1e −15; r e l t o l =1e −06; d i v t o l =1e +08; max i t e r =1000000 I t e r a t i o n C o n t r o l i n i t i a l r e s i d u a l = 30 I t e r a t i o n C o n t r o l RELATIVE c r i t e r i a has been r e a c h e d : r e s norm =2.532 e −05; r e l v a l =8.43999 e −07; i t e r =24 PCG ends Solver execution :0.001641 sec ”Output of a preconditioned CG (ME-ILU) test with matrix gr 30 30.mtx” 15.3. GERSHGORIN CIRCLES+POWER METHOD+CHEBYSHEV ITERATION+PCG WITH CHEBYSHEV POLYNOM 15.3 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 Gershgorin Circles+Power Method+Chebyshev Iteration+PCG with Chebyshev Polynomial #i n c l u d e #i n c l u d e #i n c l u d e

u s i n g namespace p a r a l u t i o n ; i n t main ( i n t argc , c h a r ∗ argv [ ] ) { i f ( a r g c == 1 ) { s t d : : c e r r << argv [ 0 ] << ” [Num t h r e a d s ] ” << s t d : : e n d l ; exit (1) ; } init paralution () ; i f ( argc > 2) { s e t o m p t h r e a d s p a r a l u t i o n ( a t o i ( argv [ 2 ] ) ) ; } info paralution () ; L o c a l V e c t o r b , b o l d , ∗ b k , ∗ b k1 , ∗ b tmp ; LocalMatrix mat ; mat . ReadFileMTX ( s t d : : s t r i n g ( argv [ 1 ] ) ) ; // G e r s h g o r i n spectrum a p p r o x i m a t i o n d o u b l e glambda min , glambda max ; // Power method spectrum a p p r o x i m a t i o n d o u b l e plambda min , plambda max ; // Maximum number o f i t e r a t i o n f o r t h e power method int iter max = 10000; double tick , tack ; // G e r s h g o r i n a p p r o x i m a t i o n o f t h e e i g e n v a l u e s mat . G e r s h g o r i n ( glambda min , glambda max ) ; s t d : : c o u t << ” G e r s h g o r i n : Lambda min = ” << glambda min << ” ; Lambda max = ” << glambda max << s t d : : e n d l ; mat . MoveToAccelerator ( ) ; b . MoveToAccelerator ( ) ; b o l d . MoveToAccelerator ( ) ; b . A l l o c a t e ( ” b k+1” , mat . g e t n r o w ( ) ) ; b k1 = &b ; b o l d . A l l o c a t e ( ” b k ” , mat . g e t n r o w ( ) ) ; b k = &b o l d ; b k−>Ones ( ) ; 114 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 CHAPTER 15. CODE EXAMPLES mat . i n f o ( ) ; tick = paralution time () ; // compute lambda max f o r ( i n t i =0; i<=i t e r m a x ; ++i ) { mat . Apply ( ∗ b k , b k1 ) ; // s t d : : c o u t << b k1−>Dot ( ∗ b k ) << s t d : : e n d l ; b k1−>S c a l e ( d o u b l e ( 1 . 0 ) / b k1−>Norm ( ) ) ; b tmp = b k1 ; b k1 = b k ; b k = b tmp ; } // g e t lambda max ( R a y l e i g h q u o t i e n t ) mat . Apply ( ∗ b k , b k1 ) ; plambda max = b k1−>Dot ( ∗ b k ) ; tack = paralution time ( ) ; s t d : : c o u t << ”Power method ( lambda max) e x e c u t i o n : ” << ( tack−t i c k ) /1000000 << ” s e c ” << s t d : : e n d l ; mat . Ad d Sc a l ar D ia g on a l ( d o u b l e ( −1.0) ∗ plambda max ) ; b k−>Ones ( ) ; tick = paralution time () ; // compute lambda min f o r ( i n t i =0; i<=i t e r m a x ; ++i ) { mat . Apply ( ∗ b k , b k1 ) ; // s t d : : c o u t << b k1−>Dot ( ∗ b k ) + plambda max << s t d : : e n d l ; b k1−>S c a l e ( d o u b l e ( 1 . 0 ) / b k1−>Norm ( ) ) ; b tmp = b k1 ; b k1 = b k ; b k = b tmp ; } // g e t lambda min ( R a y l e i g h q u o t i e n t ) mat . Apply ( ∗ b k , b k1 ) ; plambda min = ( b k1−>Dot ( ∗ b k ) + plambda max ) ; // back t o t h e o r i g i n a l matrix mat . Ad d Sc a l ar D ia g on a l ( plambda max ) ; tack = paralution time ( ) ; s t d : : c o u t << ”Power method ( lambda min ) e x e c u t i o n : ” << ( tack−t i c k ) /1000000 << ” s e c ” << s t d : : e n d l ; s t d : : c o u t << ”Power method Lambda min = ” << plambda min << ” ; Lambda max = ” << plambda max 15.3. GERSHGORIN CIRCLES+POWER METHOD+CHEBYSHEV ITERATION+PCG WITH CHEBYSHEV POLYNOM 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 } << ” ; i t e r =2x” << i t e r m a x << s t d : : e n d l ; L o c a l V e c t o r x ; L o c a l V e c t o r r h s ; x . CloneBackend ( mat ) ; r h s . CloneBackend ( mat ) ; x . 
A l l o c a t e ( ”x” , mat . g e t n r o w ( ) ) ; r h s . A l l o c a t e ( ” r h s ” , mat . g e t n r o w ( ) ) ; // Chebyshev i t e r a t i o n Chebyshev, L o c a l V e c t o r , d o u b l e > l s ; r h s . Ones ( ) ; x . Zeros () ; l s . S e t O p e r a t o r ( mat ) ; l s . S e t ( plambda min , plambda max ) ; l s . Build ( ) ; tick = paralution time () ; l s . S o l v e ( rhs , &x ) ; tack = paralution time ( ) ; s t d : : c o u t << ” S o l v e r e x e c u t i o n : ” << ( tack−t i c k ) /1000000 << ” s e c ” << s t d : : e n d l ; // PCG + Chebyshev p o l y n o m i a l CG, L o c a l V e c t o r , d o u b l e > cg ; AIChebyshev, L o c a l V e c t o r , d o u b l e > p ; // damping f a c t o r plambda min = plambda max / 7 ; p . S e t ( 3 , plambda min , plambda max ) ; r h s . Ones ( ) ; x . Zeros () ; cg . S e t O p e r a t o r ( mat ) ; cg . S e t P r e c o n d i t i o n e r ( p ) ; cg . B u i l d ( ) ; tick = paralution time () ; cg . S o l v e ( rhs , &x ) ; tack = paralution time ( ) ; s t d : : c o u t << ” S o l v e r e x e c u t i o n : ” << ( tack−t i c k ) /1000000 << ” s e c ” << s t d : : e n d l ; stop paralution () ; return 0; ”Gershgorin circles + Power method + Chebyshev iteration + PCG with Chebyshev polynomial” This v e r s i o n o f PARALUTION i s r e l e a s e d under GPL. By downloading t h i s package you f u l l y a g r e e with t h e GPL l i c e n s e . Number o f CPU c o r e s : 2 116 CHAPTER 15. CODE EXAMPLES Host t h r e a d a f f i n i t y p o l i c y − t h r e a d mapping on e v e r y c o r e PARALUTION v e r B0 . 8 . 0 PARALUTION p l a t f o r m i s i n i t i a l i z e d A c c e l e r a t o r backend : None OpenMP t h r e a d s : 2 ReadFileMTX : f i l e n a m e =/home/ d i m i t a r / m a t r i c e s / s m a l l /sym/ g r 3 0 3 0 . mtx ; r e a d i n g . . . ReadFileMTX : f i l e n a m e =/home/ d i m i t a r / m a t r i c e s / s m a l l /sym/ g r 3 0 3 0 . mtx ; done G e r s h g o r i n : Lambda min = 0 ; Lambda max = 16 L o c a l M a t r i x name=/home/ d i m i t a r / m a t r i c e s / s m a l l /sym/ g r 3 0 3 0 . mtx ; rows =900; c o l s =900; nnz =7744; p r e c =64 b i t ; asm=no ; format=CSR ; h o s t backend={CPU(OpenMP) } ; a c c e l e r a t o r backend={None } ; c u r r e n t=CPU(OpenMP) Power method ( lambda max) e x e c u t i o n : 0 . 3 3 8 0 0 8 s e c Power method ( lambda min ) e x e c u t i o n : 0 . 3 1 0 9 4 5 s e c Power method Lambda min = 0 . 0 6 1 4 6 2 8 ; Lambda max = 1 1 . 
9 5 9 1 ; i t e r =2x10000 Chebyshev ( non−precond ) l i n e a r s o l v e r s t a r t s I t e r a t i o n C o n t r o l c r i t e r i a : abs t o l =1e −15; r e l t o l =1e −06; d i v t o l =1e +08; max i t e r =1000000 I t e r a t i o n C o n t r o l i n i t i a l r e s i d u a l = 30 I t e r a t i o n C o n t r o l RELATIVE c r i t e r i a has been r e a c h e d : r e s norm =2.98708 e −05; r e l v a l =9.95692 e −07; i t e r =919 Chebyshev ( non−precond ) ends Solver execution :0.028368 sec PCG s o l v e r s t a r t s , with p r e c o n d i t i o n e r : Approximate I n v e r s e Chebyshev ( 3 ) p r e c o n d i t i o n e r AI matrix nnz = 62500 I t e r a t i o n C o n t r o l c r i t e r i a : abs t o l =1e −15; r e l t o l =1e −06; d i v t o l =1e +08; max i t e r =1000000 I t e r a t i o n C o n t r o l i n i t i a l r e s i d u a l = 30 I t e r a t i o n C o n t r o l RELATIVE c r i t e r i a has been r e a c h e d : r e s norm =2.55965 e −05; r e l v a l =8.53216 e −07; i t e r =17 PCG ends Solver execution :0.002306 sec ”Output of the program with matrix gr 30 30.mtx” 15.4. MIXED-PRECISION PCG SOLVER 15.4 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 Mixed-precision PCG Solver #i n c l u d e #i n c l u d e #i n c l u d e

u s i n g namespace p a r a l u t i o n ; i n t main ( i n t argc , c h a r ∗ argv [ ] ) { i f ( a r g c == 1 ) { s t d : : c e r r << argv [ 0 ] << ” [Num t h r e a d s ] ” << s t d : : e n d l ; exit (1) ; } init paralution () ; i f ( argc > 2) { s e t o m p t h r e a d s p a r a l u t i o n ( a t o i ( argv [ 2 ] ) ) ; } info paralution () ; L o c a l V e c t o r x ; L o c a l V e c t o r r h s ; LocalMatrix mat ; // r e a d from f i l e mat . ReadFileMTX ( s t d : : s t r i n g ( argv [ 1 ] ) ) ; x . A l l o c a t e ( ”x” , mat . g e t n r o w ( ) ) ; r h s . A l l o c a t e ( ” r h s ” , mat . g e t n r o w ( ) ) ; MixedPrecisionDC, L o c a l V e c t o r , double , LocalMatrix , L o c a l V e c t o r , f l o a t > mp; CG, L o c a l V e c t o r , f l o a t > cg ; MultiColoredILU, L o c a l V e c t o r , f l o a t > p ; double tick , tack ; r h s . Ones ( ) ; x . Zeros () ; // s e t u p a l o w e r t o l f o r t h e i n n e r s o l v e r cg . S e t P r e c o n d i t i o n e r ( p ) ; cg . I n i t ( 1 e −5, 1 e −2, 1 e +20 , 100000) ; // s e t u p t h e mixed−p r e c i s i o n DC mp. S e t O p e r a t o r ( mat ) ; mp. S e t ( cg ) ; mp. B u i l d ( ) ; tick = paralution time () ; mp. S o l v e ( rhs , &x ) ; 117 118 60 61 62 63 64 65 66 67 } CHAPTER 15. CODE EXAMPLES tack = paralution time ( ) ; s t d : : c o u t << ” S o l v e r e x e c u t i o n : ” << ( tack−t i c k ) /1000000 << ” s e c ” << s t d : : e n d l ; stop paralution () ; return 0; ”Mixed-precision PCG solver” This v e r s i o n o f PARALUTION i s r e l e a s e d under GPL. By downloading t h i s package you f u l l y a g r e e with t h e GPL l i c e n s e . Number o f CPU c o r e s : 2 Host t h r e a d a f f i n i t y p o l i c y − t h r e a d mapping on e v e r y c o r e PARALUTION v e r B0 . 8 . 0 PARALUTION p l a t f o r m i s i n i t i a l i z e d A c c e l e r a t o r backend : None OpenMP t h r e a d s : 2 ReadFileMTX : f i l e n a m e =/home/ d i m i t a r / m a t r i c e s / s m a l l /sym/ g r 3 0 3 0 . mtx ; r e a d i n g . . . ReadFileMTX : f i l e n a m e =/home/ d i m i t a r / m a t r i c e s / s m a l l /sym/ g r 3 0 3 0 . mtx ; done MixedPrecisionDC [ 6 4 b i t −32 b i t ] s o l v e r s t a r t s , with s o l v e r : PCG s o l v e r , with p r e c o n d i t i o n e r : M u l t i c o l o r e d ILU p r e c o n d i t i o n e r ( power ( q )−p a t t e r n method ) , ILU ( 0 , 1 ) number o f c o l o r s = 4 ; ILU nnz = 7744 I t e r a t i o n C o n t r o l c r i t e r i a : abs t o l =1e −15; r e l t o l =1e −06; d i v t o l =1e +08; max i t e r =1000000 I t e r a t i o n C o n t r o l i n i t i a l r e s i d u a l = 30 MixedPrecisionDC : s t a r t i n g t h e i n t e r n a l s o l v e r [ 3 2 b i t ] PCG s o l v e r s t a r t s , with p r e c o n d i t i o n e r : M u l t i c o l o r e d ILU p r e c o n d i t i o n e r ( power ( q )−p a t t e r n method ) , ILU ( 0 , 1 ) number o f c o l o r s = 4 ; ILU nnz = 7744 I t e r a t i o n C o n t r o l c r i t e r i a : abs t o l =1e −05; r e l t o l = 0 . 0 1 ; d i v t o l =1e +20; max i t e r =100000 I t e r a t i o n C o n t r o l i n i t i a l r e s i d u a l = 30 I t e r a t i o n C o n t r o l RELATIVE c r i t e r i a has been r e a c h e d : r e s norm = 0 . 2 2 7 8 2 5 ; r e l v a l = 0 . 
0 0 7 5 9 4 1 7 ; i t e r =11 PCG ends MixedPrecisionDC : d e f e c t c o r r e c t i n g on t h e h o s t [ 6 4 b i t ] MixedPrecisionDC : s t a r t i n g t h e i n t e r n a l s o l v e r [ 3 2 b i t ] PCG s o l v e r s t a r t s , with p r e c o n d i t i o n e r : M u l t i c o l o r e d ILU p r e c o n d i t i o n e r ( power ( q )−p a t t e r n method ) , ILU ( 0 , 1 ) number o f c o l o r s = 4 ; ILU nnz = 7744 I t e r a t i o n C o n t r o l c r i t e r i a : abs t o l =1e −05; r e l t o l = 0 . 0 1 ; d i v t o l =1e +20; max i t e r =100000 IterationControl i n i t i a l r e s i d u a l = 0.227811 I t e r a t i o n C o n t r o l RELATIVE c r i t e r i a has been r e a c h e d : r e s norm = 0 . 0 0 1 5 4 3 7 5 ; r e l v a l = 0 . 0 0 6 7 7 6 4 6 ; i t e r =8 PCG ends MixedPrecisionDC : d e f e c t c o r r e c t i n g on t h e h o s t [ 6 4 b i t ] MixedPrecisionDC : s t a r t i n g t h e i n t e r n a l s o l v e r [ 3 2 b i t ] PCG s o l v e r s t a r t s , with p r e c o n d i t i o n e r : M u l t i c o l o r e d ILU p r e c o n d i t i o n e r ( power ( q )−p a t t e r n method ) , ILU ( 0 , 1 ) number o f c o l o r s = 4 ; ILU nnz = 7744 I t e r a t i o n C o n t r o l c r i t e r i a : abs t o l =1e −05; r e l t o l = 0 . 0 1 ; d i v t o l =1e +20; max i t e r =100000 It er at io nCo nt ro l i n i t i a l r e s i d u a l = 0.00154375 I t e r a t i o n C o n t r o l RELATIVE c r i t e r i a has been r e a c h e d : r e s norm =1.46245 e −05; r e l v a l = 0 . 0 0 9 4 7 3 3 ; i t e r =14 PCG ends MixedPrecisionDC : d e f e c t c o r r e c t i n g on t h e h o s t [ 6 4 b i t ] I t e r a t i o n C o n t r o l RELATIVE c r i t e r i a has been r e a c h e d : r e s norm =1.46244 e −05; r e l 15.4. MIXED-PRECISION PCG SOLVER v a l =4.87482 e −07; i t e r =4 MixedPrecisionDC ends Solver execution :0.002661 sec ”Output of the program with matrix gr 30 30.mtx” 119 120 15.5 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 CHAPTER 15. CODE EXAMPLES PCG Solver with AMG #i n c l u d e #i n c l u d e #i n c l u d e

u s i n g namespace p a r a l u t i o n ; i n t main ( i n t argc , c h a r ∗ argv [ ] ) { d o u b l e t i c k , tack , s t a r t , end ; start = paralution time () ; i f ( a r g c == 1 ) { s t d : : c e r r << argv [ 0 ] << ” [Num t h r e a d s ] ” << s t d : : e n d l ; exit (1) ; } init paralution () ; i f ( argc > 2) s e t o m p t h r e a d s p a r a l u t i o n ( a t o i ( argv [ 2 ] ) ) ; info paralution () ; L o c a l V e c t o r x ; L o c a l V e c t o r r h s ; LocalMatrix mat ; mat . ReadFileMTX ( s t d : : s t r i n g ( argv [ 1 ] ) ) ; x . A l l o c a t e ( ”x” , mat . g e t n r o w ( ) ) ; r h s . A l l o c a t e ( ” r h s ” , mat . g e t n r o w ( ) ) ; r h s . Ones ( ) ; x . Zeros () ; tick = paralution time () ; CG, L o c a l V e c t o r , d o u b l e > l s ; l s . Verbose ( 0 ) ; // AMG P r e c o n d i t i o n e r AMG, L o c a l V e c t o r , d o u b l e > p ; p . InitMaxIter (1) ; p . Verbose ( 0 ) ; l s . SetPreconditioner (p) ; l s . S e t O p e r a t o r ( mat ) ; l s . Build ( ) ; tack = paralution time ( ) ; s t d : : c o u t << ” B u i l d i n g time : ” << ( tack−t i c k ) /1000000 << ” s e c ” << s t d : : e n d l ; // move a f t e r b u i l d i n g s i n c e AMG b u i l d i n g i s not s u p p o r t e d on GPU y e t mat . MoveToAccelerator ( ) ; x . MoveToAccelerator ( ) ; 15.5. PCG SOLVER WITH AMG 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 } 121 r h s . MoveToAccelerator ( ) ; l s . MoveToAccelerator ( ) ; mat . i n f o ( ) ; tick = paralution time () ; l s . I n i t ( 1 e −10 , 1 e −8, 1 e +8, 1 0 0 0 0 ) ; l s . Verbose ( 2 ) ; l s . S o l v e ( rhs , &x ) ; tack = paralution time ( ) ; s t d : : c o u t << ” S o l v e r e x e c u t i o n : ” << ( tack−t i c k ) /1000000 << ” s e c ” << s t d : : e n d l ; l s . Clear () ; stop paralution () ; end = p a r a l u t i o n t i m e ( ) ; s t d : : c o u t << ” T o t a l runtime : ” << ( end−s t a r t ) /1000000 << ” s e c ” << s t d : : e n d l ; return 0; ”PCG with AMG” This v e r s i o n o f PARALUTION i s r e l e a s e d under GPL. By downloading t h i s package you f u l l y a g r e e with t h e GPL l i c e n s e . Number o f CPU c o r e s : 2 Host t h r e a d a f f i n i t y p o l i c y − t h r e a d mapping on e v e r y c o r e PARALUTION v e r B0 . 8 . 0 PARALUTION p l a t f o r m i s i n i t i a l i z e d A c c e l e r a t o r backend : None OpenMP t h r e a d s : 2 ReadFileMTX : f i l e n a m e =/home/ d i m i t a r / m a t r i c e s / s m a l l /sym/ g r 3 0 3 0 . mtx ; r e a d i n g . . . ReadFileMTX : f i l e n a m e =/home/ d i m i t a r / m a t r i c e s / s m a l l /sym/ g r 3 0 3 0 . mtx ; done B u i l d i n g time : 0 . 0 0 1 9 0 1 s e c L o c a l M a t r i x name=/home/ d i m i t a r / m a t r i c e s / s m a l l /sym/ g r 3 0 3 0 . 
mtx ; rows =900; c o l s =900; nnz =7744; p r e c =64 b i t ; asm=no ; format=CSR ; h o s t backend={CPU(OpenMP) } ; a c c e l e r a t o r backend={None } ; c u r r e n t=CPU(OpenMP) PCG s o l v e r s t a r t s , with p r e c o n d i t i o n e r : AMG s o l v e r AMG number o f l e v e l s 2 AMG u s i n g smoothed a g g r e g a t i o n i n t e r p o l a t i o n AMG c o a r s e s t o p e r a t o r s i z e = 100 AMG c o a r s e s t l e v e l nnz = 816 AMG with smoother : Fixed Po in t I t e r a t i o n s o l v e r , with p r e c o n d i t i o n e r : M u l t i c o l o r e d Gauss−S e i d e l (GS) p r e c o n d i t i o n e r number o f c o l o r s = 4 I t e r a t i o n C o n t r o l c r i t e r i a : abs t o l =1e −10; r e l t o l =1e −08; d i v t o l =1e +08; max i t e r =10000 I t e r a t i o n C o n t r o l i n i t i a l r e s i d u a l = 30 I t e r a t i o n C o n t r o l i t e r =1; r e s i d u a l =13.3089 I t e r a t i o n C o n t r o l i t e r =2; r e s i d u a l =1.50267 I t e r a t i o n C o n t r o l i t e r =3; r e s i d u a l =0.0749297 I t e r a t i o n C o n t r o l i t e r =4; r e s i d u a l =0.00853667 I t e r a t i o n C o n t r o l i t e r =5; r e s i d u a l =0.000410892 I t e r a t i o n C o n t r o l i t e r =6; r e s i d u a l =2.70269 e −05 I t e r a t i o n C o n t r o l i t e r =7; r e s i d u a l =1.81103 e −06 I t e r a t i o n C o n t r o l i t e r =8; r e s i d u a l =2.39013 e −07 122 CHAPTER 15. CODE EXAMPLES I t e r a t i o n C o n t r o l RELATIVE c r i t e r i a has been r e a c h e d : r e s norm =2.39013 e −07; r e l v a l =7.96709 e −09; i t e r =8 PCG ends Solver execution :0.010904 sec T o t a l runtime : 0 . 0 2 4 3 0 3 s e c ”Output of the program with matrix gr 30 30.mtx” 15.6. AMG SOLVER WITH EXTERNAL SMOOTHERS 15.6 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 AMG Solver with External Smoothers #i n c l u d e #i n c l u d e #i n c l u d e

u s i n g namespace p a r a l u t i o n ; i n t main ( i n t argc , c h a r ∗ argv [ ] ) { d o u b l e t i c k , tack , s t a r t , end ; i f ( a r g c == 1 ) { s t d : : c e r r << argv [ 0 ] << ” [Num t h r e a d s ] ” << s t d : : e n d l ; exit (1) ; } init paralution () ; i f ( argc > 2) s e t o m p t h r e a d s p a r a l u t i o n ( a t o i ( argv [ 2 ] ) ) ; info paralution () ; L o c a l V e c t o r x ; L o c a l V e c t o r r h s ; LocalMatrix mat ; mat . ReadFileMTX ( s t d : : s t r i n g ( argv [ 1 ] ) ) ; x . A l l o c a t e ( ”x” , mat . g e t n r o w ( ) ) ; r h s . A l l o c a t e ( ” r h s ” , mat . g e t n r o w ( ) ) ; r h s . Ones ( ) ; x . Zeros () ; tick = paralution time () ; start = paralution time () ; // L i n e a r S o l v e r AMG, L o c a l V e c t o r , d o u b l e > l s ; l s . S e t O p e r a t o r ( mat ) ; // c o u p l i n g s t r e n g t h l s . SetCouplingStrength (0.001) ; // number o f unknowns on c o a r s e s t l e v e l l s . SetCoarsestLevel (300) ; // i n t e r p o l a t i o n type f o r g r i d t r a n s f e r o p e r a t o r s l s . S e t I n t e r p o l a t i o n ( SmoothedAggregation ) ; // R e l a x a t i o n parameter f o r smoothed i n t e r p o l a t i o n a g g r e g a t i o n l s . SetInterpRelax ( 2 . / 3 . ) ; // Manual s m oo t h er s l s . SetManualSmoothers ( t r u e ) ; // Manual c o u r s e g r i d s o l v e r l s . SetManualSolver ( true ) ; // g r i d t r a n s f e r s c a l i n g l s . SetScaling ( true ) ; 123 124 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 CHAPTER 15. CODE EXAMPLES l s . BuildHierarchy () ; i n t l e v e l s = l s . GetNumLevels ( ) ; // Smoother f o r each l e v e l I t e r a t i v e L i n e a r S o l v e r , L o c a l V e c t o r , d o u b l e > ∗∗sm = NULL; MultiColoredGS, L o c a l V e c t o r , d o u b l e > ∗∗ g s = NULL; // Coarse Grid S o l v e r CG, L o c a l V e c t o r , d o u b l e > c g s ; c g s . Verbose ( 0 ) ; sm = new I t e r a t i v e L i n e a r S o l v e r , L o c a l V e c t o r , d o u b l e >∗[ l e v e l s − 1 ] ; g s = new MultiColoredGS, L o c a l V e c t o r , d o u b l e >∗[ l e v e l s − 1 ] ; // P r e c o n d i t i o n e r // MultiColoredILU, L o c a l V e c t o r , d o u b l e > p ; // cgs−>S e t P r e c o n d i t i o n e r ( p ) ; f o r ( i n t i =0; i , L o c a l V e c t o r , d o u b l e > ∗ f p ; f p = new FixedPoint , L o c a l V e c t o r , d o u b l e >; fp−>S e t R e l a x a t i o n ( 1 . 3 ) ; sm [ i ] = f p ; g s [ i ] = new MultiColoredGS, L o c a l V e c t o r , d o u b l e >; g s [ i ]−>SetPrecondMatrixFormat (ELL) ; sm [ i ]−> S e t P r e c o n d i t i o n e r ( ∗ g s [ i ] ) ; sm [ i ]−>Verbose ( 0 ) ; } ls ls ls ls ls ls ls . SetOperatorFormat (CSR) ; . SetSmoother (sm) ; . SetSolver ( cgs ) ; . SetSmootherPreIter (1) ; . SetSmootherPostIter (2) ; . I n i t ( 1 e −10 , 1 e −8, 1 e +8, 1 0 0 0 0 ) ; . Verbose ( 2 ) ; l s . Build ( ) ; mat . MoveToAccelerator ( ) ; x . MoveToAccelerator ( ) ; r h s . MoveToAccelerator ( ) ; l s . MoveToAccelerator ( ) ; mat . i n f o ( ) ; tack = paralution time ( ) ; s t d : : c o u t << ” B u i l d i n g time : ” << ( tack−t i c k ) /1000000 << ” s e c ” << s t d : : e n d l ; tick = paralution time () ; l s . S o l v e ( rhs , &x ) ; tack = paralution time ( ) ; 15.6. 
AMG SOLVER WITH EXTERNAL SMOOTHERS 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 } 125 s t d : : c o u t << ” S o l v e r e x e c u t i o n : ” << ( tack−t i c k ) /1000000 << ” s e c ” << s t d : : e n d l ; l s . Clear () ; // Free for ( int delete delete } delete [ ] delete [ ] a l l a l l o c a t e d data i =0; i


Source Exif Data:
File Type                       : PDF
File Type Extension             : pdf
MIME Type                       : application/pdf
PDF Version                     : 1.5
Linearized                      : No
Page Count                      : 141
Page Mode                       : UseOutlines
Author                          : 
Title                           : 
Subject                         : 
Creator                         : LaTeX with hyperref package
Producer                        : pdfTeX-1.40.14
Create Date                     : 2016:01:19 08:13:01+01:00
Modify Date                     : 2016:01:19 08:13:01+01:00
Trapped                         : False
PTEX Fullbanner                 : This is pdfTeX, Version 3.1415926-2.5-1.40.14 (TeX Live 2013/Debian) kpathsea version 6.1.1
EXIF Metadata provided by EXIF.tools

Navigation menu