PARALUTION User Manual
User Manual, Version 1.1.0, January 19, 2016

This document is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License, http://creativecommons.org/licenses/by-nc-nd/3.0/legalcode

This project is funded by

Imprint:
PARALUTION Labs UG (haftungsbeschränkt) & Co. KG
Am Hasensprung 6, 76571 Gaggenau, Germany
Commercial register (Handelsregister): Amtsgericht Mannheim, HRA 706051
Represented by PARALUTION Labs Verwaltungs UG (haftungsbeschränkt)
Am Hasensprung 6, 76571 Gaggenau, Germany
Commercial register (Handelsregister): Amtsgericht Mannheim, HRB 721277
Managing directors: Dimitar Lukarski, Nico Trost
www.paralution.com
Copyright (c) 2015-2016

Contents

1 Introduction
  1.1 Overview
  1.2 Purpose of this User Manual
  1.3 API Documentation
  1.4 Dual License
    1.4.1 Basic Versions
    1.4.2 Commercial Versions
  1.5 Version Nomenclature
  1.6 Features Nomenclature
  1.7 Cite
  1.8 Website
2 Installation
  2.1 Linux/Unix-like OS
    2.1.1 Makefile
    2.1.2 CMake
    2.1.3 Shared Library
  2.2 Windows OS
    2.2.1 OpenMP backend
    2.2.2 CUDA backend
    2.2.3 OpenCL backend
  2.3 Mac OS
  2.4 Supported Compilers
  2.5 Simple Test
  2.6 Compilation and Linking
3 Basics
  3.1 Design and Philosophy
  3.2 Operators and Vectors
    3.2.1 Local Operators/Vectors
    3.2.2 Global Operators/Vectors
  3.3 Functionality on the Accelerators
  3.4 Initialization of the Library
    3.4.1 Thread-core Mapping
    3.4.2 OpenMP Threshold Size
    3.4.3 Disable the Accelerator
    3.4.4 MPI and Multi-accelerators
  3.5 Automatic Object Tracking
  3.6 Verbose Output
  3.7 Verbose Output and MPI
  3.8 Debug Output
  3.9 File Logging
  3.10 Versions
4 Single-node Computation
  4.1 Introduction
  4.2 Code Structure
  4.3 Value Type
  4.4 Complex Support
  4.5 Allocation and Free
  4.6 Matrix Formats
    4.6.1 Coordinate Format – COO
    4.6.2 Compressed Sparse Row/Column Format – CSR/CSC
    4.6.3 Diagonal Format – DIA
    4.6.4 ELL Format
    4.6.5 HYB Format
    4.6.6 Memory Usage
    4.6.7 Backend support
  4.7 I/O
  4.8 Access
  4.9 Raw Access to the Data
  4.10 Copy CSR Matrix Host Data
  4.11 Copy Data
  4.12 Object Info
  4.13 Copy
  4.14 Clone
  4.15 Assembling
  4.16 Check
  4.17 Sort
  4.18 Keying
  4.19 Graph Analyzers
  4.20 Basic Linear Algebra Operations
  4.21 Local Stencils
5 Multi-node Computation
  5.1 Introduction
  5.2 Code Structure
  5.3 Parallel Manager
  5.4 Global Matrices and Vectors
    5.4.1 Value Type and Formats
    5.4.2 Matrix and Vector Operations
    5.4.3 Asynchronous SpMV
  5.5 Communication Channel
  5.6 I/O
    5.6.1 File Organization
    5.6.2 Parallel Manager
    5.6.3 Matrices
    5.6.4 Vectors
6 Solvers
  6.1 Code Structure
  6.2 Iterative Linear Solvers
  6.3 Stopping Criteria
  6.4 Building and Solving Phase
  6.5 Clear function and Destructor
  6.6 Numerical Update
  6.7 Fixed-point Iteration
  6.8 Krylov Subspace Solvers
    6.8.1 CG
    6.8.2 CR
    6.8.3 GMRES
    6.8.4 FGMRES
    6.8.5 BiCGStab
    6.8.6 IDR
    6.8.7 FCG
    6.8.8 QMRCGStab
    6.8.9 BiCGStab(l)
  6.9 Deflated PCG
  6.10 Chebyshev Iteration
  6.11 Mixed-precision Solver
  6.12 Multigrid Solver
    6.12.1 Geometric Multigrid
    6.12.2 Algebraic Multigrid
  6.13 Direct Linear Solvers
  6.14 Eigenvalue Solvers
    6.14.1 AMPE-SIRA
7 Preconditioners
  7.1 Code Structure
  7.2 Jacobi
  7.3 Multi-colored (Symmetric) Gauss-Seidel and SOR
  7.4 Incomplete LU with levels – ILU(p)
  7.5 Incomplete Cholesky – IC
  7.6 Incomplete LU with threshold – ILUT(t,m)
  7.7 Power(q)-pattern method – ILU(p,q)
  7.8 Multi-Elimination ILU
  7.9 Diagonal Preconditioner for Saddle-Point Problems
  7.10 Chebyshev Polynomial
  7.11 FSAI(q)
  7.12 SPAI
  7.13 Block Preconditioner
  7.14 Additive Schwarz and Restricted Additive Schwarz – AS and RAS
  7.15 Truncated Neumann Series Preconditioner (TNS)
  7.16 Variable Preconditioner
  7.17 CPR Preconditioner
  7.18 MPI Preconditioners
8 Backends
  8.1 Backend and Accelerators
  8.2 Copy
  8.3 CloneBackend
  8.4 Moving Objects To and From the Accelerator
  8.5 Asynchronous Transfers
  8.6 Systems without Accelerators
9 Advanced Techniques
  9.1 Hardware Locking
  9.2 Time Restriction/Limit
  9.3 OpenCL Kernel Encryption
  9.4 Logging Encryption
  9.5 Encryption
  9.6 Memory Allocation
    9.6.1 Allocation Problems
    9.6.2 Memory Alignment
    9.6.3 Pinned Memory Allocation (CUDA)
10 Plug-ins
  10.1 FORTRAN
  10.2 OpenFOAM
  10.3 Deal.II
  10.4 Elmer
  10.5 Hermes / Agros2D
  10.6 Octave/MATLAB
11 Remarks
  11.1 Performance
  11.2 Accelerators
  11.3 Plug-ins
  11.4 Correctness
  11.5 Compilation
  11.6 Portability
12 Performance Benchmarks
  12.1 Single-Node Benchmarks
    12.1.1 Hardware Description
    12.1.2 BLAS1 and SpMV
    12.1.3 CG Solver
  12.2 Multi-Node Benchmarks
13 Graph Analyzers
14 Functionality Table
  14.1 Backend Support
  14.2 Matrix-Vector Multiplication Formats
  14.3 Local Matrices and Vectors
  14.4 Global Matrices and Vectors
  14.5 Local Solvers and Preconditioners
  14.6 Global Solvers and Preconditioners
15 Code Examples
  15.1 Preconditioned CG
  15.2 Multi-Elimination ILU Preconditioner with CG
  15.3 Gershgorin Circles + Power Method + Chebyshev Iteration + PCG with Chebyshev Polynomial
  15.4 Mixed-precision PCG Solver
  15.5 PCG Solver with AMG
  15.6 AMG Solver with External Smoothers
  15.7 Laplace Example File with 4 MPI Process
16 Change Log
Bibliography

Chapter 1 Introduction

1.1 Overview

PARALUTION is a sparse linear algebra library with a focus on exploiting fine-grained parallelism, targeting modern processors and accelerators including multi-/many-core CPU and GPU platforms. The main goal of this project is to provide a portable library for iterative sparse methods on state-of-the-art hardware. Figure 1.1 shows the PARALUTION framework as middleware between different parallel backends and application-specific packages.

[Figure 1.1: The PARALUTION library – middleware between hardware and problem-specific packages. It connects the C++/FORTRAN API and the plug-ins (MATLAB/Octave, Deal.II, OpenFOAM, Hermes, Elmer, ...) with the parallel backends: multi-core CPU (OpenMP), GPU (CUDA/OpenCL) and Xeon Phi (OpenMP).]

The major features and characteristics of the library are:

• Various backends:
  – Host – designed for multi-core CPU, based on OpenMP
  – GPU/CUDA – designed for NVIDIA GPU
  – OpenCL – designed for OpenCL-compatible devices (NVIDIA GPU, AMD GPU, CPU, Intel MIC)
  – OpenMP(MIC) – designed for Intel Xeon Phi/MIC
• Multi-node/accelerator support – the library supports multi-node and multi-accelerator configurations via an MPI layer.
• Easy to use – the syntax and structure of the library provide a gentle learning curve. With the help of the examples, anyone can try out the library – no knowledge of CUDA, OpenCL or OpenMP is required.
• No special hardware/library required – there are no hardware or library requirements to install and run PARALUTION. If a GPU device and CUDA, OpenCL, or Intel MKL are available, the library will use them.
• Most popular operating systems:
  – Unix/Linux systems (via cmake/Makefile and gcc/icc)
  – MacOS (via cmake/Makefile and gcc/icc)
  – Windows (via Visual Studio)
• Various iterative solvers:
  – Fixed-point iteration – Jacobi, Gauss-Seidel, symmetric Gauss-Seidel, SOR and SSOR
  – Krylov subspace methods – CR, CG, BiCGStab, BiCGStab(l), GMRES, IDR, QMRCGStab, flexible CG/GMRES
  – Deflated PCG
  – Mixed-precision defect-correction scheme
  – Chebyshev iteration
  – Multigrid – geometric and algebraic
• Various preconditioners:
  – Matrix splitting – Jacobi, (multi-colored) Gauss-Seidel, symmetric Gauss-Seidel, SOR, SSOR
  – Factorization – ILU(0), ILU(p) (based on levels), ILU(p,q) (power(q)-pattern method), multi-elimination ILU (nested/recursive), ILUT (based on threshold) and IC(0)
  – Approximate inverse – Chebyshev matrix-valued polynomial, SPAI, FSAI and TNS
  – Diagonal-based preconditioner for saddle-point problems
  – Block-type sub-preconditioners/solvers
  – Additive Schwarz and restricted additive Schwarz
  – Variable type preconditioners
• Generic and robust design – PARALUTION is based on a generic and robust design that allows expansion towards new solvers and preconditioners, as well as support for various hardware types. Furthermore, the design of the library allows all solvers to be used as preconditioners in other solvers. For example, you can easily define a CG solver with a multi-elimination preconditioner, whose last block is preconditioned with a Chebyshev iteration, which in turn is preconditioned with a multi-colored symmetric Gauss-Seidel scheme.
• Portable code and results – all code based on PARALUTION is portable and independent of GPU/CUDA, OpenCL or MKL. The code will compile and run everywhere. All solvers and preconditioners are based on a single source code, which delivers portable results across all supported backends (variations are possible due to different rounding modes on the hardware). The only difference you may observe when changing hardware is a variation in performance.
• Support for several sparse matrix formats – Compressed Sparse Row (CSR), Modified Compressed Sparse Row (MCSR), Dense (DENSE), Coordinate (COO), ELL, Diagonal (DIA), and the hybrid ELL/COO format (HYB).
• Plug-ins – the library provides a plug-in for the CFD package OpenFOAM [44] and for the finite element library Deal.II [8, 9]. The library also provides a FORTRAN interface and a plug-in for MATLAB/Octave [36, 2].

1.2 Purpose of this User Manual

The purpose of this document is to present the PARALUTION library step by step. This includes the installation process, the internal structure and design, the application programming interface (API) and examples. The change log of the software is also presented here. All related documentation (web site information, user manual, white papers, doxygen) follows the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License [12]. A copy of the license can be found in the library package.

1.3 API Documentation

The most important library functions are presented in this document. The library's full API reference is documented via the automatic documentation system doxygen [55]. The references are available on the PARALUTION web site.

1.4 Dual License

The PARALUTION library is distributed under a dual license model. Three different versions can be obtained – a GPLv3 version, which is free of charge, and two commercial versions.

[Figure 1.2: The three PARALUTION versions – the Basic GPLv3 version (single node), the single-node/accelerator commercial version (adds advanced solvers and full complex support), and the multi-node/accelerator commercial version (adds more advanced solvers and an MPI communication layer).]

1.4.1 Basic Versions

The basic version of the code is released under the GPL v3 license [18]. It provides the core components of the single-node functionality of the library. This version of the library is free.
Please note that, due to the GPLv3 license model, any code which uses the library must be released as open source and must comply with the GPLv3.

1.4.2 Commercial Versions

The user can purchase a commercial license, which grants the ability to embed the code into a (commercial) closed binary package. In addition, these versions come with more features and with support. The only restriction imposed on the user is that she/he is not allowed to further distribute the source code.

Single-node/accelerator Version

In addition to the basic version, the commercial license features:
• License to embed the code into a commercial/non-open-source product
• Full complex number support on Host, CUDA and OpenCL
• More iterative solvers – QMRCGStab, BiCGStab(l), FCG
• More AMG schemes – Ruge-Stüben, Pairwise
• More preconditioners – VariablePreconditioners

Multi-node/accelerator Version

In addition to the commercial basic version, the multi-node license features:
• License to embed the code into a commercial/non-open-source product
• Same functionality as the single-node/accelerator version
• MPI layer for HPC clusters
• Global AMG solver (Pairwise)

Note: Please note that the multi-node version does not support Windows/Visual Studio.

1.5 Version Nomenclature

Please note the following compatibility policy with respect to versioning. In the version number x.y.z, x is the major version (increases when modifications in the library have been made and/or a new API has been introduced), y is the minor version (increases when new functionality – algorithms, schemes, etc. – has been added, possibly with small or no modification of the API) and z is the revision (increases due to bug fixing or performance improvements). Alpha and beta versions are denoted with a and b; typically these are pre-release versions. As mentioned above, there are three versions of PARALUTION; each release has the same version numbering plus an additional suffix for each type.
They are abbreviated with "B" for the Basic version (free under GPLv3), "S" for the single-node/accelerator commercial version and "M" for the multi-node/accelerator commercial version.

1.6 Features Nomenclature

The functions described in this user manual follow this nomenclature:

• ValueType – the supported value types: double (D), float (F), integer (I) and complex (C, double and float). The particular bit representation (8/16/32/64 bit) depends on your compiler and operating system.
• Computation – the backends on which the computation can be performed: Host backend with OpenMP (H); CUDA backend (C); OpenCL backend (O); Xeon Phi backend with OpenMP (X).
• Available – the versions in which the functionality is available: Basic/GPLv3 (B); single-node/accelerator commercial version (S); multi-node/accelerator commercial version (M).

The following example states that the function has double, float, integer and complex support; that it can be performed on the Host backend with OpenMP, the CUDA backend, the OpenCL backend and the Xeon Phi backend with OpenMP; and that it is available in all three versions of the library.

  ValueType: D,F,I,C | Computation: H,C,O,X | Available: B,S,M

Solvers and preconditioners split the computation into a building phase and a solving phase, which can have different backend support. In the following example, the solver/preconditioner supports double, float and complex; the building phase can be performed only on the Host with OpenMP or on CUDA; the solving phase can be performed on all backends; and it is available only in the commercial versions of the library.

  ValueType: D,F,C | Building phase: H,C | Solving phase: H,C,O,X | Available: S,M

Note: Full complex support is available only in the commercial versions; see Section 4.4.

1.7 Cite

If you would like to cite the PARALUTION library, you can do so by citing our web site.
Please specify the version of the software and/or the date of accessing our web page, for example:

```bibtex
@misc{paralution,
  author = {{PARALUTION Labs}},
  title  = {{PARALUTION vX.Y.Z}},
  year   = {20XX},
  note   = {\url{http://www.paralution.com/}}
}
```

1.8 Website

The official web site of the library (including all free and commercial versions) is http://www.paralution.com

Chapter 2 Installation

PARALUTION can be compiled under Linux/Unix, Windows and Mac systems.

Note: Please check Section 11.5 for additional remarks.

2.1 Linux/Unix-like OS

After downloading and unpacking the library, the user needs to compile it. We provide two compilation configurations – cmake and Makefile.

2.1.1 Makefile

In this case, the user needs to modify the Makefile, which contains the information about the available compilers. By default, PARALUTION will only use the gcc [20] compiler (no GPU support). The user can switch to icc [23] with or without MKL [24] support. To compile with GPU support, the user needs to uncomment the corresponding CUDA [43] lines in the Makefile. The same procedure needs to be followed for the OpenCL [29] and OpenMP(MIC) backends.

Note: During the compilation only one backend can be selected, i.e. if a GPU is available the user needs to select either CUDA or OpenCL support.

The default installation process can be summarized in the following lines:

```shell
wget http://www.paralution.com/download/paralution-x.y.z.tar.gz
tar zxvf paralution-x.y.z.tar.gz
cd paralution-x.y.z/src
make all
make install
```

where x.y.z is the version of the library.

Note: Please note that the multi-node version of PARALUTION can only be compiled using CMake.

2.1.2 CMake

Compilation with cmake [30] is easier to handle – all compiler specifications are determined automatically.
The compilation process can be performed by:

```shell
wget http://www.paralution.com/download/paralution-x.y.z.tar.gz
tar zxvf paralution-x.y.z.tar.gz
cd paralution-x.y.z
mkdir build
cd build
cmake ..
make
```

where x.y.z is the version of the library. (NVIDIA CUDA, when mentioned, also includes CUBLAS and CUSPARSE.)

Advanced compilation can be performed with cmake -i ..; you need this option to compile the library with the OpenMP(MIC) backend. The priority of the backends during the compilation process is: CUDA, OpenCL, MIC. You can also choose specific options via the command line, for example for CUDA:

```shell
cd paralution-x.y.z
mkdir build
cd build
cmake -DSUPPORT_CUDA=ON -DSUPPORT_OMP=ON ..
make -j
```

For the Intel Xeon Phi, OpenMP(MIC) backend:

```shell
cd paralution-x.y.z
mkdir build
cd build
cmake -DSUPPORT_MIC=ON -DSUPPORT_CUDA=OFF -DSUPPORT_OCL=OFF ..
make -j
```

Note: A ptk file is generated in the build directory when using cmake.

2.1.3 Shared Library

Both compilation processes produce a shared library file libparalution.so. Ensure that the library object can be found in your library path. If you do not copy the library to a specific location, you can add the path under Linux to the LD_LIBRARY_PATH variable:

```shell
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/paralution-x.y.z/build/lib/
```

2.2 Windows OS

This section is a step-by-step guide to compiling and using the PARALUTION library in a Windows-based environment.

Note: Please note that the multi-node version does not support Windows/Visual Studio.

2.2.1 OpenMP backend

PARALUTION with the OpenMP backend comes with Visual Studio project files that support Visual Studio 2010, 2012 and 2013. In this step-by-step guide, PARALUTION is built as a static library.

• Open Visual Studio and navigate to File\Open Project.
Figure 2.1: Open an existing Visual Studio project.

• Load the corresponding PARALUTION project file, located in the visualstudio\paralution_omp directory. The PARALUTION and CG projects should appear.

• Right-click the PARALUTION project and start to build the library. Once finished, Visual Studio should report a successful build.

Figure 2.2: Build the PARALUTION library.

Figure 2.3: Visual Studio output for a successful build of the static PARALUTION library.

• Next, repeat the build procedure with the CG example project. Once finished, a successful build should be reported.

Figure 2.4: Build the PARALUTION CG example.

Figure 2.5: Visual Studio output for a successful build of the PARALUTION CG example.

• Finally, the CG executable should be located within the visualstudio\paralution_omp\Release directory.

Note For testing, Windows 7 and Visual Studio Express 2013 have been used.

Note OpenMP support can be enabled/disabled in the project properties. Navigate to C++\Language for the corresponding switch.

Note The current version of PARALUTION does not support MPI (i.e. the M-version of the library) under Windows.

2.2.2 CUDA backend

PARALUTION with the CUDA backend comes with Visual Studio project files that support Visual Studio 2010, 2012 and 2013. Please follow the same steps as for the OpenMP backend compilation, but use the visualstudio\paralution_cuda directory.

2.2.3 OpenCL backend

PARALUTION with the OpenCL backend comes with Visual Studio project files that support Visual Studio 2010, 2012 and 2013. Please follow the same steps as for the OpenMP backend compilation, but use the visualstudio\paralution_ocl directory. Additionally, the OpenCL include directory and OpenCL library directory need to be specified within Visual Studio, as illustrated in Figures 2.6 and 2.7.

Figure 2.6: Setting up Visual Studio OpenCL include directory.
2.3 Mac OS

To compile PARALUTION under Mac OS, please follow the Linux/Unix-like OS instructions for the CMake compilation.

2.4 Supported Compilers

The library has been tested with the following compilers:

    cmake            2.8.12.2; 3.0.2; 3.1.3; 3.2.0     B,S,M
    gcc/g++          4.4.7; 4.6.3; 4.8.2               B,S,M
    icc (MKL)        12.0; 13.1; 14.0.4; 15.0.0        B,S,M
    CUDA             5.0; 5.5; 6.0; 6.5; 7.0; 7.5      B,S,M
    Intel OpenCL     1.2                               B,S,M
    NVIDIA OpenCL    1.2                               B,S,M
    AMD OpenCL       1.2; 2.0                          B,S,M
    Visual Studio    2010, 2012, 2013                  B,S
    MPICH            3.1.3                             M
    OpenMPI          1.5.3; 1.6.3; 1.6.5; 1.8.4        M
    Intel MPI        4.1.2; 5.0.1                      M

Note Please note that CUDA has limited ICC support.

Note Please note that Intel MPI >= 5.0.0 is only supported by CMake >= 3.2.0.

Figure 2.7: Setting up Visual Studio OpenCL library directory.

2.5 Simple Test

You can test the installation by running a CG solver on a Laplace matrix [38]. After compiling the library you can perform the CG solver test by executing:

    cd paralution-x.y.z
    cd build/bin
    wget ftp://math.nist.gov/pub/MatrixMarket2/Harwell-Boeing/laplace/gr_30_30.mtx.gz
    gzip -d gr_30_30.mtx.gz
    ./cg gr_30_30.mtx

2.6 Compilation and Linking

After compiling the PARALUTION library, the user needs to specify the include and linker paths to compile a program.

    g++ -O3 -Wall -I/path/to/paralution-x.y.z/build/inc -c main.cpp -o main.o
    g++ -o main main.o -L/path/to/paralution-x.y.z/build/lib/ -lparalution

”Compilation and linking”

Before the execution of a program which has been compiled with PARALUTION, the library path needs to be added to the environment variables, similar to

    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/paralution-x.y.z/build/lib/

When compiling with MPI, both the library and the program need to be compiled using mpic++ or the corresponding MPI C++ compiler.
Chapter 3 Basics

3.1 Design and Philosophy

The main idea of the PARALUTION objects is that they are separated from the actual hardware specification. Once you declare a matrix, a vector or a solver, it is initially allocated on the host (CPU). Every object can then be moved to a selected accelerator by a simple move-to-accelerator function. The whole execution mechanism is based on run-time type information (RTTI), which allows you to select where and how you want to perform the operations at run time. This is in contrast to template-based libraries, which need this information at compile time.

The philosophy of the library is to abstract the hardware-specific functions and routines from the actual program that describes the algorithm. For most large simulation software based on sparse computation, it is hard, almost impossible, to adapt and port the implementation to every new technology. On the other hand, the new high-performance accelerators and devices have the capability to decrease the computational time significantly in many critical parts. This abstraction layer of the hardware-specific routines is the core of PARALUTION's design. It is built to exploit fine-grained parallelism suited for multi/many-core devices. This is in contrast to most of the available parallel sparse libraries, which are mainly based on domain decomposition techniques; thus, the design of the iterative solvers and the preconditioners is very different.

Another cornerstone of PARALUTION is the native support of accelerators – the memory allocation, transfers and specific hardware functions are handled internally in the library. PARALUTION helps you to use accelerator technologies but does not force you to use them. As you can see later in this chapter, even if you offload your algorithms and solvers to the accelerator device, the same source code can be compiled and executed in a system without any accelerators.
3.2 Operators and Vectors

The main objects in PARALUTION are linear operators and vectors. All objects can be moved to an accelerator at run time – a structure diagram is presented in Figure 3.1. Currently, we support GPUs via the CUDA (NVIDIA) and OpenCL (NVIDIA, AMD) backends, and we provide an OpenMP(MIC) backend for the Intel Xeon Phi.

Figure 3.1: Host and backends structure for different hardware – operators/vectors and solvers/preconditioners sit on top of a dynamic switch between the host and the accelerator backends (OpenMP for Intel MIC, OpenCL for GPUs/accelerators, CUDA for GPUs, plus new backends)

The linear operators are defined as local or global matrices (i.e. on a single node or distributed over multiple nodes) and local stencils (i.e. matrix-free linear operations).

Figure 3.2: Operator and vector classes – Operators: Local Matrix, Global Matrix, Local Stencil; Vectors: Local Vector, Global Vector

The only template parameter of the operators and vectors is the data type (ValueType). The operator data type can be float or double, while the vector data type can be float, double or int (int is used mainly for permutation vectors). In the current version, operations across objects of different ValueType are not supported. Each object contains a local copy of the hardware descriptor created by the init_platform() function. This allows the user to modify it according to his needs and to obtain two or more objects with different hardware specifications (e.g. different numbers of OpenMP threads, CUDA block sizes, etc.).

3.2.1 Local Operators/Vectors

By Local Operators/Vectors we refer to Local Matrices and Stencils, and to Local Vectors. By Local we mean that they stay on a single system. The system can contain several CPUs via a UMA or NUMA memory system, and it can contain an accelerator.

3.2.2 Global Operators/Vectors

By Global Operators/Vectors we refer to Global Matrices and to Global Vectors. By Global we mean that they can stay on a single node or on multiple nodes in a network.
For this type of computation, the communication is based on MPI.

3.3 Functionality on the Accelerators

Naturally, not all routines and algorithms can be performed efficiently on many-core systems (i.e. on accelerators). To provide full functionality, the library has internal mechanisms to check whether a particular routine is implemented on the accelerator. If not, the object is moved to the host and the routine is computed there. This guarantees that your code will run (maybe not in the most efficient way) with any accelerator, regardless of the functionality available for it.

3.4 Initialization of the Library

The body of a PARALUTION code is very simple: it should contain the header file and the namespace of the library. The program must contain an initialization call, which will check and allocate the hardware, and a finalizing call, which will release the allocated hardware.

    #include <paralution.hpp>

    using namespace paralution;

    int main(int argc, char* argv[]) {

      init_paralution();

      info_paralution();

      // ...

      stop_paralution();

    }

”Initialize and shutdown PARALUTION”

    This version of PARALUTION is released under GPL.
    By downloading this package you fully agree with the GPL license.

    Number of CPU cores: 32
    Host thread affinity policy - thread mapping on every core
    Number of GPU devices in the system: 2

    PARALUTION ver B0.8.0
    PARALUTION platform is initialized
    Accelerator backend: GPU(CUDA)
    OpenMP threads: 32
    Selected GPU device: 0
    ------------------------------------------------
    Device number: 0
    Device name: Tesla K20c
    totalGlobalMem: 4799 MByte
    clockRate: 705500
    compute capability: 3.5
    ECCEnabled: 1
    ------------------------------------------------
    ------------------------------------------------
    Device number: 1
    Device name: Tesla K20c
    totalGlobalMem: 4799 MByte
    clockRate: 705500
    compute capability: 3.5
    ECCEnabled: 1
    ------------------------------------------------

”An example output for info_paralution() on a GPU (CUDA) system”

    This version of PARALUTION is released under GPL.
    By downloading this package you fully agree with the GPL license.

    Number of CPU cores: 32
    Host thread affinity policy - thread mapping on every core
    Number of OpenCL devices in the system: 2

    PARALUTION ver B0.8.0
    PARALUTION platform is initialized
    Accelerator backend: OpenCL
    OpenMP threads: 32
    Selected OpenCL platform: 0
    Selected OpenCL device: 0
    ------------------------------------------------
    Platform number: 0
    Platform name: NVIDIA CUDA
    Device number: 0
    Device name: Tesla K20c
    Device type: GPU
    totalGlobalMem: 4799 MByte
    clockRate: 705
    OpenCL version: OpenCL 1.1 CUDA
    ------------------------------------------------
    ------------------------------------------------
    Platform number: 0
    Platform name: NVIDIA CUDA
    Device number: 1
    Device name: Tesla K20c
    Device type: GPU
    totalGlobalMem: 4799 MByte
    clockRate: 705
    OpenCL version: OpenCL 1.1 CUDA
    ------------------------------------------------

”An example output for info_paralution() on a GPU (OpenCL) system”

    This version of PARALUTION is released under GPL.
    By downloading this package you fully agree with the GPL license.
    Number of CPU cores: 16
    Host thread affinity policy - thread mapping on every core
    MIC backend is initialized

    PARALUTION ver B0.8.0
    PARALUTION platform is initialized
    Accelerator backend: MIC(OpenMP)
    OpenMP threads: 16
    Number of MIC devices in the system: 2
    Selected MIC devices: 0

”An example output for info_paralution() on an Intel Xeon Phi (OpenMP) system”

    This version of PARALUTION is released under GPL.
    By downloading this package you fully agree with the GPL license.

    Number of CPU cores: 2
    Host thread affinity policy - thread mapping on every core

    PARALUTION ver B0.8.0
    PARALUTION platform is initialized
    Accelerator backend: None
    OpenMP threads: 2

”An example output for info_paralution() on a system without accelerator”

The init_paralution() function defines a backend descriptor with information about the hardware and its specifications. All objects created after that contain a copy of this descriptor. If the specifications of the global descriptor are changed (e.g. a different number of threads is set) and new objects are created, only the new objects will use the new configuration. For control, the library provides the following functions:

• select_device_paralution(int dev) – a unified function which selects a specific device. If you have compiled the library with a backend, and several devices are available for this backend, you can use this function to select a particular one. This function works for all backends (CUDA, OpenCL, Xeon Phi).

• set_omp_threads_paralution(int nthreads) – with this function the user can set the number of OpenMP threads. This function has to be called after init_paralution().

• set_gpu_cuda_paralution(int ndevice) – in a multi-GPU system, the user can select a specific GPU with this function. This function has to be called before init_paralution().

• set_ocl_paralution(int nplatform, int ndevice) – in a multi-platform/accelerator system, the user can select a specific platform and device with this function. This function has to be called before init_paralution().

3.4.1 Thread-core Mapping

The number of threads which PARALUTION will use can be set via the set_omp_threads_paralution() function or by the global OpenMP environment variable (for Unix-like OS this is OMP_NUM_THREADS). During the initialization phase, the library provides affinity thread-core mapping based on the following rules:

• If the number of cores (including hyperthreading cores) is greater than or equal to twice the number of threads, then the threads occupy every second core ID (e.g. 0, 2, 4, ...). This avoids having two threads working on the same physical core when hyperthreading is enabled.

• If the number of threads is less than or equal to the number of cores (including hyperthreading), and the previous clause does not apply, then the threads occupy every core ID (e.g. 0, 1, 2, 3, ...).

• If none of the above criteria is matched, the default thread-core mapping is used (typically set by the OS).

Note The thread-core mapping is available only for Unix-like OS. For Windows OS, the thread-core mapping is selected by the operating system.

Note The user can disable the thread affinity by calling set_omp_affinity(false) (and enable it with set_omp_affinity(true)) before initializing the library (i.e. before init_paralution()).

3.4.2 OpenMP Threshold Size

When working on a small problem, you might observe that the OpenMP host backend is (slightly) slower than using no OpenMP. This is mainly attributed to the small amount of work every thread has to perform and the comparatively large overhead of forking/joining threads. This can be avoided via the OpenMP threshold size parameter in PARALUTION.
The default threshold is set to 10,000, which means that all matrices of size smaller than or equal to this will be processed by a single thread only (disregarding the number of OpenMP threads set in the system). The threshold can be modified via the function set_omp_threshold().

3.4.3 Disable the Accelerator

If you want to disable the accelerator (without recompiling the code), you need to call the disable_accelerator_paralution() function before init_paralution().

3.4.4 MPI and Multi-accelerators

When initializing the library with MPI, the user needs to pass the rank of the MPI process as well as the number of accelerators available on each node. In this way, the user specifies the mapping of MPI processes to accelerators – the allocated accelerator will be rank % num_dev_per_node. Thus, the user can run two MPI processes on a system with two GPUs by setting the number of devices to 2.

    #include <paralution.hpp>
    #include <mpi.h>

    using namespace paralution;

    int main(int argc, char* argv[]) {

      MPI_Init(&argc, &argv);
      MPI_Comm comm = MPI_COMM_WORLD;

      int num_processes;
      int rank;

      MPI_Comm_size(comm, &num_processes);
      MPI_Comm_rank(comm, &rank);

      init_paralution(rank, 2);

      info_paralution();

      // ...

      stop_paralution();

    }

”Initialize and shutdown PARALUTION with MPI”

    The default OS thread affinity configuration will be used
    Number of GPU devices in the system: 2
    The default OS thread affinity configuration will be used

    PARALUTION ver M0.8.0
    PARALUTION platform is initialized
    Accelerator backend: GPU(CUDA)
    OpenMP threads: 1
    Selected GPU device: 0
    ------------------------------------------------
    Device number: 0
    Device name: Tesla K20c
    totalGlobalMem: 4799 MByte
    clockRate: 705500
    compute capability: 3.5
    ECCEnabled: 1
    ------------------------------------------------
    ------------------------------------------------
    Device number: 1
    Device name: Tesla K20c
    totalGlobalMem: 4799 MByte
    clockRate: 705500
    compute capability: 3.5
    ECCEnabled: 1
    ------------------------------------------------
    MPI rank: 0
    MPI size: 2

”An example output for info_paralution() with 2 MPI processes”

3.5 Automatic Object Tracking

By default, after initialization, the library tracks all objects and releases the memory allocated in them when the library is stopped. This prevents large memory leaks when objects are allocated but not freed. The user can disable the tracking by editing src/utils/def.hpp.

3.6 Verbose Output

PARALUTION provides different levels of output messages. They can be set in the file src/utils/def.hpp before the compilation of the library.
By setting a higher level, the user obtains more detailed information about the internal calls and the data transfers to and from the accelerators.

3.7 Verbose Output and MPI

To avoid all MPI processes printing information to the screen, the default configuration is that only rank 0 outputs information. The user can change this rank, or allow all ranks to print, by modifying src/utils/def.hpp. If file logging is enabled, all ranks write into corresponding log files.

3.8 Debug Output

You can also enable a debugging output which prints almost every detail of the program flow, including object constructors/destructors, object addresses, memory allocations, data transfers, and all function calls for matrices, vectors, solvers and preconditioners. The debug flag can be set in src/utils/def.hpp. When enabled, additional assert()s are checked during the computation; this might decrease the performance of some operations.

3.9 File Logging

All output can be logged into a file named paralution-XXXX.log, where XXXX is a counter in milliseconds. To enable file logging you need to edit src/utils/def.hpp.

3.10 Versions

For checking the library version in your code you can use PARALUTION's pre-defined macros.

    PARALUTION_VER_MAJOR -- gives the version major value
    PARALUTION_VER_MINOR -- gives the version minor value
    PARALUTION_VER_REV   -- gives the version revision
    PARALUTION_VER_PRE   -- gives pre-releases (as "ALPHA" or "BETA")

The combined PARALUTION_VER gives the version as 10000 * major + 100 * minor + revision. The different version types (Basic/GPLv3; Single-node; Multi-node) are defined in PARALUTION_VER_TYPE as B for Basic/GPLv3, S for Single-node and M for Multi-node.

Chapter 4 Single-node Computation

4.1 Introduction

In this chapter we describe all base objects (matrices, vectors and stencils) for computation on single-node (shared-memory) systems.
A typical configuration is presented in Figure 4.1.

Figure 4.1: A typical single-node configuration – host (CPU) and accelerator (e.g. GPU) connected via a bus (PCI-E); gray boxes represent the cores, blue boxes represent the memory, arrows represent the bandwidth

The compute node contains none, one or more accelerators. The compute node can be any kind of shared-memory (single, dual, quad CPU) system. Note that the memory of the accelerator and of the host can be physically different.

4.2 Code Structure

The data of each object is held behind a pointer to the BaseMatrix class, which points to either a HostMatrix or an AcceleratorMatrix. The AcceleratorMatrix is instantiated by an implementation for each of the backends (CUDA, OpenCL, Xeon Phi) and each matrix format. Switching between host and accelerator matrix is performed in the LocalMatrix class. The LocalVector is organized in the same way. Each matrix format has its own class for the host and for each accelerator backend. All matrix classes are derived from BaseMatrix, which provides the base interface for computation as well as for data access, see Figure 4.4. The GPU (CUDA backend) matrix structure is presented in Figure 4.5; all other backends follow the same organization.

Figure 4.2: Local Matrices and Vectors – host data and accelerator data (OpenMP on Intel MIC, OpenCL, CUDA, new backends) behind a dynamic switch

Figure 4.3: LocalMatrix and LocalVector class hierarchy – paralution::ParalutionObj → paralution::BaseParalution<ValueType> → paralution::Operator<ValueType> / paralution::Vector<ValueType> → paralution::LocalMatrix<ValueType> / paralution::LocalVector<ValueType>

4.3 Value Type

The value (data) type of the vectors and the matrices is defined as a template parameter. The matrix can be of type float (32-bit), double (64-bit) or complex (64/128-bit). The vector can be float (32-bit), double (64-bit), complex (64/128-bit) or int (32/64-bit). The precision of the data type is shown in the Print() function.

Figure 4.4: BaseMatrix class hierarchy – paralution::BaseMatrix<ValueType> with paralution::HostMatrix<ValueType> (formats BCSR, COO, CSR, DENSE, DIA, ELL, HYB, MCSR) and paralution::AcceleratorMatrix<ValueType> (GPUAcceleratorMatrix, MICAcceleratorMatrix, OCLAcceleratorMatrix)

4.4 Complex Support

PARALUTION supports complex computation in all functions due to its internal template structure. The host implementation is based on std::complex. In binary, the data layout is the same for the CUDA and the OpenCL backends. The complex support of the backends with respect to the versions is presented in Table 4.1.

Figure 4.5: GPUAcceleratorMatrix (CUDA) class hierarchy – one paralution::GPUAcceleratorMatrix<ValueType> class per format (BCSR, COO, CSR, DENSE, DIA, ELL, HYB, MCSR)

    Version   Host   CUDA   OpenCL   Xeon Phi
    B         Yes    No     No       No
    S         Yes    Yes    Yes      No
    M         Yes    Yes    Yes      No

Table 4.1: Complex support

4.5 Allocation and Free

The allocation functions require a name for the object (for information purposes only) and the corresponding size description for vector and matrix objects.

    LocalVector<ValueType> vec;

    vec.Allocate("my vector", 100);
    vec.Clear();

”Vector allocation/free”

    LocalMatrix<ValueType> mat;

    mat.AllocateCSR("my csr matrix", 456, 100, 100); // nnz, rows, columns
    mat.Clear();

    mat.AllocateCOO("my coo matrix", 200, 100, 100); // nnz, rows, columns
    mat.Clear();

”Matrix allocation/free”

4.6 Matrix Formats

Matrices in which most of the elements are zero are called sparse. In most practical applications, the number of non-zero entries is proportional to the size of the matrix (e.g. typically, if the matrix A ∈ R^(N×N), then the number of non-zero elements is of order O(N)). To save memory, we can avoid storing the zero entries by introducing a structure corresponding to the non-zero elements of the matrix. PARALUTION supports sparse CSR, MCSR, COO, ELL, DIA, HYB and dense (DENSE) matrices. To illustrate the different formats, let us consider the following matrix in Figure 4.6. Here the matrix is A ∈ R^(5×5) with 11 non-zero entries.
The indexing in all formats described below is zero-based (i.e. the index values start at 0, not at 1).

Figure 4.6: A sparse matrix example –

        | 1  2  0 11  0 |
        | 0  3  4  0  0 |
    A = | 0  5  6  7  0 |
        | 0  0  0  8  0 |
        | 0  0  0  9 10 |

Note The functionality of every matrix object differs and depends on the matrix format. The CSR format provides the widest support of functions. For a few operations an internal conversion is performed; however, for many routines an error message is printed and the program is terminated.

Note In the current version, some of the conversions are performed on the host (disregarding the actual object allocation – host or accelerator).

    mat.ConvertToCSR();
    // Perform a matrix-vector multiplication in CSR format
    mat.Apply(x, &y);

    mat.ConvertToELL();
    // Perform a matrix-vector multiplication in ELL format
    mat.Apply(x, &y);

”Conversion between matrix formats”

    mat.ConvertTo(CSR);
    // Perform a matrix-vector multiplication in CSR format
    mat.Apply(x, &y);

    mat.ConvertTo(ELL);
    // Perform a matrix-vector multiplication in ELL format
    mat.Apply(x, &y);

”Conversion between matrix formats (alternative)”

4.6.1 Coordinate Format – COO

The most intuitive sparse format is the coordinate format (COO). It represents the non-zero elements of the matrix by their coordinates: we need to store two index arrays (one for row and one for column indexing) and the values. Thus, our example matrix has the following structure:

    row   0  0  0   1  1   2  2  2   3   4  4
    col   0  1  3   1  2   1  2  3   3   3  4
    val   1  2  11  3  4   5  6  7   8   9  10

Figure 4.7: Sparse matrix in COO format

4.6.2 Compressed Sparse Row/Column Format – CSR/CSC

One of the most popular formats in many scientific codes is the compressed sparse row (CSR) format. In this format, instead of storing the full row index array, we only save the offsets to the positions where each row starts.
Thus, we can easily jump to any row and access all its elements sequentially. However, this format does not allow sequential access to the column entries.

    row offset   0  3  5  8  9  11
    col          0  1  3   1  2   1  2  3   3   3  4
    val          1  2  11  3  4   5  6  7   8   9  10

Figure 4.8: Sparse matrix in CSR format

The analogue of this format is the compressed sparse column (CSC) format, in which we store the offsets by column – clearly, we can then traverse the column elements sequentially.

In many finite element (difference/volume) applications, the diagonal elements are non-zero. In such cases, we can store them at the beginning of the value array. This format is often referred to as modified compressed sparse row/column (MCSR/MCSC).

4.6.3 Diagonal Format – DIA

If all (or most) of the non-zero entries belong to a few diagonals of the matrix, we can store them together with the corresponding offsets. In our example, we have 4 diagonals: the main diagonal (denoted with offset 0) is fully occupied, while the others contain zero entries.

    dig   -1   0   1   3

    val    *   1   2  11
           0   3   4   0
           5   6   7   *
           0   8   0   *
           9  10   *   *

Figure 4.9: Sparse matrix in DIA format

Please note that the values in this format are stored as an array of size D × N_D, where D is the number of diagonals in the matrix and N_D is the number of elements in the main diagonal. Not all values in this array are occupied – the inaccessible entries are denoted with a star; they correspond to the offsets in the diagonal array (negative values represent offsets from the beginning of the array).

4.6.4 ELL Format

The ELL format can be seen as a modification of CSR in which the row offsets are not stored; instead, a fixed number of elements is stored per row.

4.6.5 HYB Format

As one can notice, DIA and ELL cannot represent a completely unstructured sparse matrix efficiently. To keep the memory footprint low, DIA requires the elements to belong to a few diagonals, and the ELL format requires a fixed number of elements per row.
For many applications this is too strong a restriction. A solution to this issue is to represent the more regular part of the matrix in such a format and the remaining part in COO format. The HYB format, implemented in PARALUTION, is a mix between ELL and COO, where the maximum number of elements per row for the ELL part is computed by nnz/num_rows.

    col   0  1  3      val    1   2  11
          1  2  *             3   4   *
          1  2  3             5   6   7
          3  *  *             8   *   *
          3  4  *             9  10   *

Figure 4.10: Sparse matrix in ELL format

4.6.6 Memory Usage

The memory footprint of the different matrix formats is presented in the following table. Here, we consider an N × N matrix where the number of non-zero entries is denoted by NNZ.

    Format   Structure     Values
    Dense    –             N × N
    COO      2 × NNZ       NNZ
    CSR      N + 1 + NNZ   NNZ
    ELL      M × N         M × N
    DIA      D             D × N_D

For the ELL matrix, M characterizes the maximal number of non-zero elements per row; for the DIA matrix, D defines the number of diagonals and N_D the size of the main diagonal.

4.6.7 Backend Support

    Format   Host     CUDA   OpenCL   MIC/Xeon Phi
    CSR      Yes      Yes    Yes      Yes
    COO      Yes(1)   Yes    Yes      Yes(1)
    ELL      Yes      Yes    Yes      Yes
    DIA      Yes      Yes    Yes      Yes
    HYB      Yes      Yes    Yes      Yes
    DENSE    Yes      Yes    Yes(2)   No
    BCSR     No       No     No       No

    (1) Serial version
    (2) Basic version

4.7 I/O

The user can read and write matrix files stored in the Matrix Market Format [37].

    LocalMatrix<ValueType> mat;

    mat.ReadFileMTX("my_matrix.mtx");
    mat.WriteFileMTX("my_matrix.mtx");

”I/O MTX Matrix”

Matrix files in binary format are also supported for the compressed sparse row storage format.

    LocalMatrix<ValueType> mat;

    mat.ReadFileCSR("my_matrix.csr");
    mat.WriteFileCSR("my_matrix.csr");

”I/O CSR Binary Matrix”

The binary format stores the CSR-relevant data as follows:

    out.write((char*) &nrow,      sizeof(IndexType));
    out.write((char*) &ncol,      sizeof(IndexType));
    out.write((char*) &nnz,       sizeof(IndexType));
    out.write((char*) row_offset, (nrow+1)*sizeof(IndexType));
    out.write((char*) col,        nnz*sizeof(IndexType));
    out.write((char*) val,        nnz*sizeof(ValueType));

”CSR Binary Format”

The vector can be read or written via ASCII formatted files.

    LocalVector<ValueType> vec;

    vec.ReadFileASCII("my_vector.dat");
    vec.WriteFileASCII("my_vector.dat");

”I/O ASCII Vector”

4.8 Access

    ValueType: D,F,I,C    Computation: H    Available: B,S,M

The elements of a vector can be accessed via the [] operators when the vector is allocated on the host. In the following example, a vector is allocated with 100 elements and initialized with 1 for all odd elements and −1 for all even elements.

    LocalVector<ValueType> vec;

    vec.Allocate("vector", 100);
    vec.Ones();

    for (int i=0; i<100; i=i+2)
      vec[i] = -1;

”Vector element access”

Note Accessing elements via the [] operators is slow. Use this for debugging purposes only.

There is no direct access to the elements of matrices due to the sparsity structure. Matrices can be imported and exported via copy functions; for a CSR matrix these are CopyFromCSR() and CopyToCSR().

    // Allocate the CSR matrix
    int* row_offsets = new int[100+1];
    int* col         = new int[345];
    ValueType* val   = new ValueType[345];

    // fill the CSR matrix
    ...

    // Create a PARALUTION matrix
    LocalMatrix<ValueType> mat;

    // Import matrix to PARALUTION
    mat.AllocateCSR("my_matrix", 345, 100, 100);
    mat.CopyFromCSR(row_offsets, col, val);

    // Export matrix from PARALUTION
    // the row_offsets, col, val have to be allocated
    mat.CopyToCSR(row_offsets, col, val);

”Matrix access”
4.9 Raw Access to the Data

ValueType: D,F,I,C
Computation: H,C
Available: B,S,M

For vector and matrix objects, you can have direct access to the raw data via pointers. Already allocated data can be set with the SetDataPtr() function.

LocalVector<ValueType> vec;

ValueType *ptr_vec = new ValueType[200];

vec.SetDataPtr(&ptr_vec, "vector", 200);

"Set allocated data to a vector"

// Allocate the CSR matrix
int *row_offsets = new int[100+1];
int *col         = new int[345];
ValueType *val   = new ValueType[345];

// fill the CSR matrix
// ...

// Create a PARALUTION matrix
LocalMatrix<ValueType> mat;

// Set the CSR matrix in PARALUTION
mat.SetDataPtrCSR(&row_offsets, &col, &val,
                  "my_matrix", 345, 100, 100);

"Set allocated data to a CSR matrix"

With LeaveDataPtr() you can obtain the raw data from the object. This will leave the object empty.

LocalVector<ValueType> vec;
ValueType *ptr_vec = NULL;

vec.Allocate("vector", 100);

vec.LeaveDataPtr(&ptr_vec);

"Get (steal) the data from a vector"

// Create a PARALUTION matrix
LocalMatrix<ValueType> mat;

// Allocate and fill the PARALUTION matrix
// ...

// Define the external CSR structure
int *row_offsets = NULL;
int *col         = NULL;
ValueType *val   = NULL;

mat.LeaveDataPtrCSR(&row_offsets, &col, &val);

"Get (steal) the data from a CSR matrix"

After calling the SetDataPtr*() functions (for vectors or matrices), the passed pointers are set to NULL.

Note: If the object is allocated on the host, the pointers passed to SetDataPtr() and obtained from LeaveDataPtr() will be host pointers; if the object is on the accelerator, the data pointers will be accelerator pointers.
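The SetDataPtr()/LeaveDataPtr() pair implements a simple ownership transfer: the object takes the caller's buffer and nulls the caller's pointer, and can later hand the buffer back, leaving itself empty. The standalone sketch below mimics these semantics with a minimal container class (an illustration of the idea, not PARALUTION code).

```cpp
#include <cassert>
#include <cstddef>

// Minimal stand-in for a vector object with SetDataPtr/LeaveDataPtr
// style ownership transfer (illustration only, not the library class).
class RawVector {
public:
  RawVector() : data_(NULL), size_(0) {}
  ~RawVector() { delete [] data_; }

  // Take ownership of an external buffer; the caller's pointer
  // is set to NULL, as with PARALUTION's SetDataPtr().
  void SetDataPtr(double **ptr, int size) {
    delete [] data_;
    data_ = *ptr;
    size_ = size;
    *ptr = NULL;
  }

  // Give the buffer back to the caller and leave the object empty,
  // as with PARALUTION's LeaveDataPtr().
  void LeaveDataPtr(double **ptr) {
    *ptr = data_;
    data_ = NULL;
    size_ = 0;
  }

  int size() const { return size_; }
  double *data() { return data_; }

private:
  double *data_;
  int size_;
};
```

After SetDataPtr() the original pointer must not be used any more; the buffer now belongs to the object, and the matching allocation/free routines must be used on both sides.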
Note: If the object is moved to or from the accelerator, the original pointer becomes invalid.

Note: Never rely on old pointers; hidden object movement to and from the accelerator will invalidate them.

Note: Whenever you pass pointers to or obtain pointers from a PARALUTION object, you need to use the matching memory allocation/free functions; please check the source code for details (for the host src/utils/allocate_free.cpp, and for host/CUDA src/base/gpu/gpu_allocate_free.cu).

4.10 Copy CSR Matrix Host Data

ValueType: D,F,C
Computation: H,C,O
Available: B,S,M

If the CSR matrix data pointers are only accessible as constant, the user can create a PARALUTION matrix object and pass const CSR host pointers using the CopyFromHostCSR() function. PARALUTION will then allocate the matrix and copy the CSR data on the backend where the object is located.

4.11 Copy Data

ValueType: D,F,I,C
Computation: H,C
Available: B,S,M

The user can copy data to and from a local vector via the CopyFromData() and CopyToData() functions. The vector must be allocated beforehand with the size corresponding to the data.

4.12 Object Info

Information about an object can be printed with the info() function

mat.info();
vec.info();

"Vector/Matrix information"

LocalMatrix name=G3_circuit.mtx; rows=1585478; cols=1585478; nnz=7660826; prec=64bit; asm=no; format=CSR; backends={CPU(OpenMP), GPU(CUDA)}; current=CPU(OpenMP)

LocalVector name=x; size=900; prec=64bit; host backend={CPU(OpenMP)}; accelerator backend={No Accelerator}; current=CPU(OpenMP)

"Vector/Matrix information"

In this example, the matrix has been loaded and stored in CSR format with double precision; the library is compiled with CUDA support and without MKL, and the matrix is located on the host. The vector information comes from another compilation of the library without OpenCL/CUDA/MKL support.
4.13 Copy

ValueType: D,F,I,C
Computation: H,C,O,X
Available: B,S,M

All matrix and vector objects provide a CopyFrom() and a CopyTo() function. The destination object should either have the same size or be empty. In the latter case, the object is allocated on the source platform.

Note: This function allows cross-platform copying – one of the objects may be allocated on the accelerator backend.

LocalVector<ValueType> vec1, vec2;

// Allocate and init vec1 and vec2
// ...

vec1.MoveToAccelerator();

// now vec1 is on the accelerator (if any)
// and vec2 is on the host

// we can copy vec1 to vec2 and vice versa

vec1.CopyFrom(vec2);

"Vector copy"

Note: For vectors, the user can specify source and destination offsets and thus copy only a part of one vector into another.

Note: When copying a matrix, the source and destination matrices should be in the same format.

4.14 Clone

ValueType: D,F,I,C
Computation: H,C,O,X
Available: B,S,M

The copy operators allow you to copy the values of an object to another object without changing the backend specification of the object. In many algorithms you might need auxiliary vectors or matrices. These objects can be cloned with the function CloneFrom().

LocalVector<ValueType> vec;

// allocate and init vec (host or accelerator)
// ...

LocalVector<ValueType> tmp;

// tmp will have the same values
// and it will be on the same backend as vec
tmp.CloneFrom(vec);

"Clone"

If only the backend (and not the data) of an object needs to be cloned, you can use the CloneBackend() function.

LocalVector<ValueType> vec;
LocalMatrix<ValueType> mat;

// allocate and init vec, mat (host or accelerator)
// ...

LocalVector<ValueType> tmp;

// tmp and vec will have the same backend as mat
tmp.CloneBackend(mat);
vec.CloneBackend(mat);

// the matrix-vector multiplication will be performed
// on the backend selected in mat
mat.Apply(vec, &tmp);

"Clone backend"

4.15 Assembling

ValueType: D,F,I,C
Computation: H (CSR-only)
Available: B,S,M

In many codes based on finite elements, finite differences or finite volumes, the user needs to assemble the matrix before solving the linear system. For this purpose we provide an assembling function – the user passes three arrays, representing the row indices, the column indices and the values. A similar function is provided for vector assembling. The assembled entries are

A_{i,j} = Σ v_{i,j}.   (4.1)

LocalMatrix<ValueType> mat;
LocalVector<ValueType> rhs;

// rows
int i[11] = {0, 1, 2, 3, 4, 5, 1, 2, 4, 3, 3};

// cols
int j[11] = {0, 1, 2, 3, 4, 5, 2, 2, 5, 2, 2};

// values
ValueType v[11] = {2.3, 3.5, 4.7, 6.3, 0.4, 4.3,
                   6.2, 4.4, 4.6, 0.7, 4.8};

mat.Assemble(i, j, v, // tuple (i, j, v)
             11,      // size of the input arrays
             "A");    // name of the matrix

rhs.Assemble(i, v,    // assembling data
             11,      // size of the input arrays
             "rhs");  // name of the vector

"Matrix assembling"

In this case the function will determine the matrix size and the number of non-zero elements, as well as the non-zero pattern, and it will assemble the matrix in CSR format. If the size of the matrix is known, the user can pass this information to the assembling function; this avoids a loop for checking the indices before the actual assembling procedure.

mat.Assemble(i, j, v, 11, "A", 6, 6);

"Matrix assembling with known size"

The matrix assembling function has two implementations – serial and OpenMP-parallel (host only).
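The assembling step takes (i, j, v) triplets, sums duplicate entries according to Equation (4.1), and builds a CSR matrix. The standalone sketch below reproduces that logic for illustration (it is not the library's implementation, which additionally handles OpenMP parallelism and automatic size detection).

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

// Assemble (i, j, v) triplets into CSR arrays, summing duplicates
// as in A_{i,j} = sum of all v with the same (i, j).
void assemble_csr(const int *i, const int *j, const double *v, int n,
                  int nrow,
                  std::vector<int> &row_offset,
                  std::vector<int> &col,
                  std::vector<double> &val) {
  // Collect and sum duplicates; std::map keeps (row, col) sorted,
  // which directly yields a sorted CSR structure.
  std::map<std::pair<int,int>, double> entries;
  for (int k = 0; k < n; ++k)
    entries[std::make_pair(i[k], j[k])] += v[k];

  row_offset.assign(nrow + 1, 0);
  col.clear();
  val.clear();

  for (std::map<std::pair<int,int>, double>::const_iterator it =
         entries.begin(); it != entries.end(); ++it) {
    ++row_offset[it->first.first + 1];
    col.push_back(it->first.second);
    val.push_back(it->second);
  }

  // Prefix sum turns per-row counts into CSR row offsets.
  for (int r = 0; r < nrow; ++r)
    row_offset[r + 1] += row_offset[r];
}
```

For the 11-entry example in the listing above, the two (2,2) contributions (4.7 and 4.4) and the two (3,2) contributions (0.7 and 4.8) are each summed into a single CSR entry, giving 9 stored non-zeros.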
In many cases, one might want to assemble the matrix in a loop while keeping the original index pattern and modifying only the values – typically for time-dependent/non-linear problems. In this case, the matrix does not need to be fully re-assembled; the user can call the AssembleUpdate() function to provide the new values.

LocalVector<ValueType> x, rhs;
LocalMatrix<ValueType> mat;

int i[11] = {0, 1, 2, 3, 4, 5, 1, 2, 4, 3, 3};
int j[11] = {0, 1, 2, 3, 4, 5, 2, 2, 5, 2, 2};
ValueType a[11] = {2.3, 3.5, 4.7, 6.3, 0.4, 4.3,
                   6.2, 4.4, 4.6, 0.7, 4.8};

mat.Assemble(i, j, a, 11, "A");
rhs.Assemble(i, a, 11, "rhs");
x.Allocate("x", mat.get_ncol());

mat.MoveToAccelerator();
rhs.MoveToAccelerator();
x.MoveToAccelerator();

GMRES<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> ls;

ls.SetOperator(mat);
ls.Build();

for (int k = 0; k < 10; k++) {

  mat.Zeros();
  x.Zeros();
  rhs.Zeros();

  // Modify the assembling data
  a[4]  = k*k * 3.3;
  a[8]  = k*k - 1.3;
  a[10] = k*k / 2.0;
  a[3]  = k;

  mat.AssembleUpdate(a);
  rhs.Assemble(i, a, 11, "rhs");

  ls.ResetOperator(mat);

  ls.Solve(rhs, &x);

}

x.MoveToHost();

"Several matrix assemblings with a static pattern"

If the assembling information is kept (so that the AssembleUpdate() function can be called), the info() function will print asm=yes.

For the implementation of the assembling function we use an adapted and modified code based on [16]. Performance results for various cases can be found in [17].

4.16 Check

ValueType: D,F,I,C
Computation: H (CSR-only)
Available: B,S,M

The Check() function checks whether the object contains valid data.
For vectors, the function checks that the values are neither infinity nor NaN (not a number). For matrices, this function checks the values and whether the structure of the matrix is correct.

4.17 Sort

ValueType: D,F,I,C
Computation: H (CSR-only)
Available: B,S,M

The Sort() function sorts the column entries within each row of a CSR matrix.

4.18 Keying

ValueType: D,F,I,C
Computation: H (CSR-only)
Available: S,M

Typically it is hard to tell whether two matrices are identical (in structure and values) or whether they just have the same structure. To do this, we provide a key function which generates three keys: for the row indices, for the column indices and for the values. An example is presented in the following listing.

LocalMatrix<ValueType> mat;

mat.ReadFileMTX(std::string(argv[1]));

long int row_key;
long int col_key;
long int val_key;

mat.Key(row_key, col_key, val_key);

std::cout << "Row key = " << row_key << std::endl
          << "Col key = " << col_key << std::endl
          << "Val key = " << val_key << std::endl;

"Generating a matrix key"

4.19 Graph Analyzers

ValueType: D,F,I,C
Computation: H (CSR-only)
Available: B,S,M

The following functions are available for analyzing the connectivity of the graph of the underlying sparse matrix.

• (R)CMK reordering
• Maximal independent set
• Multi-coloring
• ZeroBlockPermutation
• Connectivity ordering

All graph analyzing functions return a permutation vector (of type int) which is meant to be used with the Permute() and PermuteBackward() functions of the matrix and vector classes.

The CMK (Cuthill–McKee) and RCMK (Reverse Cuthill–McKee) orderings minimize the bandwidth of a given sparse matrix.

LocalVector<int> cmk;

mat.RCMK(&cmk);
mat.Permute(cmk);

"CMK ordering"

The Maximal Independent Set (MIS) algorithm finds a set of maximal size whose elements do not depend on any other elements in this set.
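A bandwidth-reducing ordering such as RCMK can be illustrated outside the library. The sketch below builds a reverse Cuthill–McKee permutation for a symmetric sparsity pattern given as an adjacency list, using a BFS with degree-sorted neighbors. It is a simplified standalone version (each component starts from the lowest-index unvisited node rather than a pseudo-peripheral node), not PARALUTION's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <queue>
#include <vector>

// Reverse Cuthill-McKee ordering for a symmetric graph.
// adj[u] lists the neighbors of node u. Returns the permutation:
// perm[k] is the original index placed at position k.
std::vector<int> rcmk(const std::vector<std::vector<int> > &adj) {
  int n = (int) adj.size();
  std::vector<int> perm;
  std::vector<bool> visited(n, false);

  // Comparator: sort candidate nodes by degree (classic CMK rule).
  struct ByDegree {
    const std::vector<std::vector<int> > *g;
    bool operator()(int a, int b) const {
      return (*g)[a].size() < (*g)[b].size();
    }
  };
  ByDegree by_degree = { &adj };

  for (int start = 0; start < n; ++start) {
    if (visited[start]) continue;
    // BFS over the component of the next unvisited node.
    std::queue<int> q;
    q.push(start);
    visited[start] = true;
    while (!q.empty()) {
      int u = q.front(); q.pop();
      perm.push_back(u);
      std::vector<int> nbrs;
      for (size_t k = 0; k < adj[u].size(); ++k)
        if (!visited[adj[u][k]]) {
          visited[adj[u][k]] = true;
          nbrs.push_back(adj[u][k]);
        }
      std::sort(nbrs.begin(), nbrs.end(), by_degree);
      for (size_t k = 0; k < nbrs.size(); ++k) q.push(nbrs[k]);
    }
  }

  // Reversing the CMK order yields RCMK.
  std::reverse(perm.begin(), perm.end());
  return perm;
}
```

Applying the returned permutation to the matrix (as mat.Permute() does) clusters the non-zeros around the diagonal, which shrinks the bandwidth of the reordered matrix.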
LocalVector<int> mis;
int size;

mat.MaximalIndependentSet(size, &mis);
mat.Permute(mis);

"MIS ordering"

The Multi-Coloring (MC) algorithm builds a permutation (a coloring of the matrix) in such a way that no two adjacent nodes in the sparse matrix have the same color.

LocalVector<int> mc;
int num_colors;
int *block_colors = NULL;

mat.MultiColoring(num_colors, &block_colors, &mc);
mat.Permute(mc);

"MC ordering"

For saddle-point problems (i.e. matrices with zero diagonal entries), the ZeroBlockPermutation maps all zero-diagonal elements to the last block of the matrix.

LocalVector<int> zbp;
int size;

mat.ZeroBlockPermutation(size, &zbp);
mat.Permute(zbp);

"Zero-Block-Permutation"

Connectivity ordering returns a permutation which sorts the matrix rows by the number of non-zero entries per row.

LocalVector<int> conn;

mat.ConnectivityOrder(&conn);
mat.Permute(conn);

"Connectivity ordering"

Visualizations of the graph analyzers can be found in Chapter 13.

4.20 Basic Linear Algebra Operations

There are further functions and routines which matrix and vector objects can perform – for full specifications and the API, check the doxygen documentation.
Matrix objects can also perform:

• Matrix-vector multiplication x := Ay – Apply()
• Matrix-vector multiplication and addition x := x + Ay – ApplyAdd()
• Symmetric permutation – Permute(); PermuteBackward()
• Sub-matrix / diagonal / L / U extractions – ExtractSubMatrix(); ExtractSubMatrices(); ExtractDiagonal(); ExtractInverseDiagonal(); ExtractU(); ExtractL(); ExtractColumnVector(); ExtractRowVector()
• Sparse triangular solvers (L, U, LL, LU) – LSolve(); USolve(); LLSolve(); LUSolve()
• Factorizations
  – ILU(0) – ILU with no fill-ins – ILU0()
  – ILU(p) – based on levels – ILUpFactorize()
  – ILU(p,q) – based on the power(q)-pattern method – via ILUpFactorize()
  – ILUT – based on thresholds – ILUTFactorize()
  – IC0 – Incomplete Cholesky with no fill-ins – ICFactorize()
• FSAI computation – FSAI()
• Symbolic power – compute the pattern of |A|^p, where |A| denotes the matrix of absolute values |a_{i,j}| – SymbolicPower()
• Sparse matrix-matrix addition (with the same or different sparsity patterns) – MatrixAdd()
• Sparse matrix-matrix multiplication – MatrixMult()
• Sparse matrix multiplication with a diagonal matrix (left and right multiplication) – DiagonalMatrixMultL(); DiagonalMatrixMultR()
• Computation of the Gershgorin circles (eigenvalue approximations) – Gershgorin()
• Compression (i.e. deletion of elements) below a specified threshold; this function also updates the structure – Compress()
• Transposition – Transpose()
• Creation of a restriction matrix via a restriction map – CreateFromMap()
• Scaling of all values / the diagonal / the off-diagonal entries with a scalar – Scale(); ScaleDiagonal(); ScaleOffDiagonal()
• Addition of a scalar to all values / the diagonal / the off-diagonal entries – AddScalar(); AddScalarDiagonal(); AddScalarOffDiagonal()

Vector objects can also perform:

• Dot product (scalar product) – Dot()
• Scaling – Scale()
• Vector updates of the types x := x + αy, x := y + αx, x := αx + βy, x := αx + βy + γz – ScaleAdd(); AddScale(); ScaleAddScale(); ScaleAdd2()
• L1, L2 and L∞ norms – Asum(); Norm(); Amax()
• Sum – Reduce()
• Point-wise multiplication of two vectors (element-wise multiplication) – PointWiseMult()
• Permutations (forward and backward) – Permute(); PermuteBackward()
• Copy under a specified permutation – CopyFromPermute(); CopyFromPermuteBackward()
• Restriction/prolongation via a map – Restriction(); Prolongation()
• Initialization of the vector with random values – SetRandom()
• Power – Power()
• Copy from a double or float vector – CopyFromDouble(); CopyFromFloat()

4.21 Local Stencils

Instead of representing the linear operator as a matrix, the user can use stencil operators. The stencil operator has to be defined manually for each class and backend. This is a necessary step in order to obtain good performance in terms of spatial-blocking techniques and compiler optimization.

[Figure 4.11: Code structure of a host LocalStencil – paralution::ParalutionObj → paralution::BaseParalution<ValueType> → paralution::Operator<ValueType> → paralution::LocalStencil<ValueType>]

#include <iostream>
#include <cstdlib>

#include <paralution.hpp>

using namespace paralution;

int main(int argc, char* argv[]) {

  init_paralution();

  info_paralution();

  LocalVector<double> x;
  LocalVector<double> rhs;

  LocalStencil<double> stencil(Laplace2D);

  stencil.SetGrid(100); // 100x100

  x.Allocate("x", stencil.get_nrow());
  rhs.Allocate("rhs", stencil.get_nrow());

  // Linear Solver
  CG<LocalStencil<double>, LocalVector<double>, double> ls;

  rhs.Ones();
  x.Zeros();

  ls.SetOperator(stencil);

  ls.Build();

  stencil.info();

  double tick, tack;
  tick = paralution_time();

  ls.Solve(rhs, &x);

  tack = paralution_time();
  std::cout << "Solver execution: " << (tack-tick)/1000000 << " sec" << std::endl;

  stop_paralution();

  return 0;
}

"Example of a 2D Laplace stencil"

Chapter 5 Multi-node Computation

5.1 Introduction

In this chapter we describe all base objects (matrices and vectors) for computation on multi-node (distributed-memory) systems. Two typical configurations are presented in Figure 5.1 and Figure 5.2.

[Figure 5.1: An example of a multi-node configuration, all nodes connected via network – single-socket systems with a single accelerator]

[Figure 5.2: An example of a multi-node configuration, all nodes connected via network – dual-socket systems with two accelerators attached to each node]

To each compute node, one or more accelerators can be attached. The compute node could be any kind of shared-memory (single, dual, quad CPU) system; details on a single node can be found in Figure 4.1. Note that the memory of the accelerator and of the host are physically different. All nodes can communicate with each other via the network. For the communication channel between the nodes (and between the accelerators on single or multiple nodes) we use the MPI library. Additional libraries for communication can be added on request.
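The Laplace2D stencil in the example represents the standard 5-point finite-difference operator, applied on the fly instead of being stored as a sparse matrix. The sketch below applies such a stencil to a vector on an n × n grid (a standalone illustration with zero values outside the grid; PARALUTION's LocalStencil is backend-specific and more general).

```cpp
#include <cassert>
#include <vector>

// Apply the 5-point Laplace stencil on an n x n grid:
// y(i,j) = 4*x(i,j) - x(i-1,j) - x(i+1,j) - x(i,j-1) - x(i,j+1),
// with out-of-grid neighbors treated as zero.
void laplace2d_apply(int n, const std::vector<double> &x,
                     std::vector<double> &y) {
  y.assign(n * n, 0.0);
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j) {
      double v = 4.0 * x[i*n + j];
      if (i > 0)     v -= x[(i-1)*n + j];
      if (i < n - 1) v -= x[(i+1)*n + j];
      if (j > 0)     v -= x[i*n + (j-1)];
      if (j < n - 1) v -= x[i*n + (j+1)];
      y[i*n + j] = v;
    }
}
```

Because the stencil pattern is fixed, no index arrays have to be loaded from memory, which is why a dedicated stencil operator can outperform a general sparse matrix for such problems.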
The PARALUTION library supports a non-overlapping type of distribution, where the computational domain is split into several sub-domains with the corresponding information about the boundary and ghost layers. An example is shown in Figure 5.3. The square box domain is distributed into four sub-domains, each belonging to one process – P0, P1, P2 and P3.

[Figure 5.3: An example of a domain distribution]

To perform a sparse matrix-vector (SpMV) multiplication, each process needs to multiply its own portion of the domain and update the corresponding ghost elements. For P0 this multiplication reads

Ax = y,                    (5.1)
A_I x_I + A_G x_G = y_I,   (5.2)

where I stands for interior and G stands for ghost; x_G is a vector with three sections coming from P1, P2 and P3. The ghost part of the global vector is used mainly for the SpMV product; it does not play any role in the computation of the vector-vector operations.

5.2 Code Structure

Each global object contains two local sub-objects. The global matrix stores the interior and the ghost matrix via local objects, and the global vector also stores its data via two local objects. In addition to the local data, the global objects hold information about the global communication via the parallel manager – see Figure 5.4.

[Figure 5.4: Global matrices and vectors – each global object consists of an interior local matrix/vector, a ghost local matrix/vector and the mapping information]

5.3 Parallel Manager

The parallel manager class handles the communication and the mapping of the global matrix. Each global matrix and vector needs to be initialized with a valid parallel manager in order to perform any operation. For many distributed simulations, the underlying matrix is already distributed.
This information needs to be passed to the parallel manager.

The required information is:

• The global size
• The local size of the interior/ghost for each rank
• The communication pattern (what information needs to be sent to whom)

5.4 Global Matrices and Vectors

The global matrices and vectors store their data via two local objects. For the global matrix, the interior and the ghost part can be accessed via the GetInterior() and GetGhost() functions, which point to two valid local matrices. The same function names exist for the global vector, where they point to two local vector objects.

5.4.1 Value Type and Formats

The value type and the formats can be set in a similar manner as for the local matrix. The supported formats are identical to the local case.

5.4.2 Matrix and Vector Operations

Most functions which are available for local vectors can be performed on global vector objects as well. For the global matrix, only the SpMV function is supported, but all other functions can be performed on the interior of the matrix (without the couplings via the ghost layer).

5.4.3 Asynchronous SpMV

To minimize latency and to increase scalability, the PARALUTION library supports asynchronous sparse matrix-vector multiplication. The implementation of the SpMV starts with an asynchronous transfer of the needed ghost buffers while at the same time it computes the interior part. When the computation of the interior SpMV is done, the ghost transfer is synchronized and the ghost SpMV is performed. To minimize the traffic over the PCI-E bus, the OpenCL and CUDA implementations provide a special packaging technique which transfers all ghost data in a contiguous memory buffer.

5.5 Communication Channel

For the communication channel we use MPI, but this can be extended, on request, with other communication mechanisms.

5.6 I/O

The user can store and load the full data from and to files.
For a solver, the needed data would be:

• The parallel manager
• The sparse matrix
• The vector

The reading/writing of the data can be done fully in parallel without any communication. To visualize the data, we use a 4 × 4 grid which is distributed onto 4 MPI processes (organized in 2 × 2); each local matrix stores the local unknowns (with local indexing) – see Figure 5.5. The data associated with RANK 0 is presented in Figure 5.6.

5.6.1 File Organization

When the parallel manager, the global matrix or the global vector is written to a file, the main file (passed as a file name to this function) will contain information about all files on all ranks.

pm.dat.rank.0
pm.dat.rank.1
pm.dat.rank.2
pm.dat.rank.3

"Parallel manager (main) file with 4 MPI ranks"

[Figure 5.5: An example of a 4 × 4 grid, distributed into 4 domains (2 × 2)]

[Figure 5.6: An example of 4 MPI processes and the data associated with RANK 0:
RANK = 0
GLOBAL_SIZE = 16
LOCAL_SIZE = 4
GHOST_SIZE = 4
BOUNDARY_SIZE = 4
NUMBER_OF_RECEIVERS = 2
NUMBER_OF_SENDERS = 2
RECEIVERS_RANK = {1, 2}
SENDERS_RANK = {1, 2}
RECEIVERS_INDEX_OFFSET = {0, 2, 4}
SENDERS_INDEX_OFFSET = {0, 2, 4}
BOUNDARY_INDEX = {1, 3, 2, 3}]

matrix.mtx.interior.rank.0
matrix.mtx.ghost.rank.0
matrix.mtx.interior.rank.1
matrix.mtx.ghost.rank.1
matrix.mtx.interior.rank.2
matrix.mtx.ghost.rank.2
matrix.mtx.interior.rank.3
matrix.mtx.ghost.rank.3

"Matrix (main) file with 4 MPI ranks"

rhs.dat.rank.0
rhs.dat.rank.1
rhs.dat.rank.2
rhs.dat.rank.3

"Vector (main) file with 4 MPI ranks"

Examples of the data in each file can be found in Section 15.7.
5.6.2 Parallel Manager

The data for each rank can be seen as two types of information. The first is the receiving information, presented in Figure 5.7: RANK 0 needs to know what data will be received, and from whom. The second is the sending information: RANK 0 needs to know what to send and where, see Figure 5.8.

[Figure 5.7: An example of 4 MPI processes, RANK 0 receives data (the associated data is marked in bold)]

[Figure 5.8: An example of 4 MPI processes, RANK 0 sends data (the associated data is marked in bold)]

Receiving data – RANK 0 requires:

• The total size of the received information (GHOST_SIZE – integer value).
• The number of MPI ranks which will send data to RANK 0 (NUMBER_OF_RECEIVERS – integer value).
• Which MPI ranks send the data (RECEIVERS_RANK – integer array).
• How the received data (from each rank) is stored in the ghost vector (RECEIVERS_INDEX_OFFSET – integer array); in this example the first 2 elements are received from P1 (positions [0, 2)) and the next 2 from P2 (positions [2, 4)).

Sending data – RANK 0 requires:
• The total size of the sent information (BOUNDARY_SIZE – integer value).
• The number of MPI ranks which will receive data from RANK 0 (NUMBER_OF_SENDERS – integer value).
• Which MPI ranks receive the data (SENDERS_RANK – integer array).
• How the data sent (to each rank) is stored in the sending buffer (SENDERS_INDEX_OFFSET – integer array); in this example the first 2 elements are sent to P1 (positions [0, 2)) and the next 2 to P2 (positions [2, 4)).
• The elements which need to be sent (BOUNDARY_INDEX – integer array). In this example, the data which needs to be sent to P1 and P2 is the ghost layer marked as ghost P0. The vertical stripe needs to be sent to P1 and the horizontal stripe to P2. The local indices of the unknowns for P1 (the vertical stripe) are stored first in BOUNDARY_INDEX (2 elements); after these, the 2 elements for P2 follow.

5.6.3 Matrices

For each rank, two matrices are used – the interior and the ghost matrix. They can be stored in separate files, one for each matrix. The file format can be Matrix Market (MTX) or binary.

5.6.4 Vectors

For each rank, the vector object needs only the local (interior) values. They are stored in a single file. The file can be ASCII or binary.

Chapter 6 Solvers

In this chapter, all linear solvers are presented. Most of the iterative solvers can operate on the linear operators LocalMatrix, LocalStencil and GlobalMatrix – i.e. the iterative solvers can run locally (on a shared-memory system) or in a distributed manner (on a cluster) via MPI. The only exception is the AMG (Algebraic Multigrid) solver, which has two versions (one for the Local and one for the Global type of computation). The only purely local solvers (the ones which do not support global/MPI operations) are the mixed-precision defect-correction solver, all direct solvers and the eigenvalue solvers.

All solvers need three template parameters – an Operator, a Vector and a Scalar type.
There are three possible combinations:

• LocalMatrix, LocalVector, float or double
• LocalStencil, LocalVector, float or double
• GlobalMatrix, GlobalVector, float or double

where the Operators/Vectors need to use the same ValueType as the scalar type of the solver.

6.1 Code Structure

[Figure 6.1: Solver class hierarchy – Iterative Linear Solvers (CG, BiCGStab, GMRES, IDR, CR; Chebyshev Iteration; Fixed-Iteration Schemes), Multigrid (AMG/GMG), Deflated PCG, Mixed-Precision Defect-Correction, MultiElimination ILU, Iteration Control, and Preconditioners (Block-Jacobi, AS/RAS, CPR, Saddle-point, ILU/IC, FSAI/SPAI, Chebyshev, MC-GS/SGS/SOR/SSOR, Power(q)-pattern ILU, Variable Preconditioner)]

The base class Solver is purely virtual; it provides an interface for:

• SetOperator() – set the operator A, i.e. the user can pass the matrix here
• Build() – build the solver (including preconditioners, sub-solvers, etc); the user needs to specify the operator first
• Solve() – solve the system Ax = b; the user needs to pass a right-hand side b and a vector x where the solution will be obtained
• Print() – show information about the solver
• ReBuildNumeric() – only re-build the solver numerically (if possible)
• MoveToHost() and MoveToAccelerator() – offload the solver (including preconditioners and sub-solvers) to the host/accelerator

The computation of the residual can be based on different norms (L1, L2 and L∞). This is controlled by the function SetResidualNorm().
[Figure 6.2: Iterative solvers – BiCGStab, BiCGStabl, CG, Chebyshev, CR, FCG, FGMRES, FixedPoint, GMRES, IDR and QMRCGStab, all derived from paralution::IterativeLinearSolver]

[Figure 6.3: Multi-grid – paralution::MultiGrid, derived from paralution::BaseMultiGrid]

[Figure 6.4: Direct solvers – Inversion, LU and QR, derived from paralution::DirectLinearSolver]

6.2 Iterative Linear Solvers

The iterative solvers are controlled by an iteration control object, which monitors the convergence properties of the solver – i.e. the maximum number of iterations, the relative tolerance, the absolute tolerance and the divergence tolerance. The iteration control can also record the residual history and store it in an ASCII file.
• Init(), InitMinIter(), InitMaxIter(), InitTol() – initialize the solver and set the stopping criteria
• RecordResidualHistory(), RecordHistory() – start recording the residual and store it in a file
• Verbose() – set the verbosity level of the solver (0 – no output, 2 – detailed output, including residual and iteration information)
• SetPreconditioner() – set the preconditioner

6.3 Stopping Criteria

All iterative solvers are controlled based on:

• the absolute stopping criterion, reached when |r_k|_Lp < eps_ABS
• the relative stopping criterion, reached when |r_k|_Lp / |r_1|_Lp ≤ eps_REL
• the divergence stopping criterion, reached when |r_k|_Lp / |r_1|_Lp ≥ eps_DIV
• the maximum number of iterations N, reached when k = N

where k is the current iteration, r_k is the residual of the current iteration k (i.e. r_k = b − A x_k) and r_1 is the starting residual (i.e. r_1 = b − A x_initguess). In addition, a minimum number of iterations M can be specified. In this case, the solver will not stop iterating before k ≥ M.

CG<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> ls;

ls.Init(1e-10,  // abs tol
        1e-8,   // rel tol
        1e+8,   // div tol
        10000); // max iter

ls.SetOperator(mat);
ls.Build();

ls.Solve(rhs, &x);

"Setting different stopping criteria in a CG solver"

Lp is the norm used for the computation, where p can be 1, 2 or ∞. The norm can be selected via the SetResidualNorm() function – 1 for L1, 2 for L2 and 3 for L∞. When computing with the L∞ norm, the index of the maximum value can be obtained via the GetAmaxResidualIndex() function. If this function is called while L∞ has not been selected, it returns −1.
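The three residual norms map to simple reductions. The sketch below computes the L1, L2 and L∞ norms of a residual vector and, for L∞, the index of the maximum entry (a standalone illustration of what Asum()/Norm()/Amax() and GetAmaxResidualIndex() report; the function names here are hypothetical).

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// L1 norm: sum of absolute values.
double norm_l1(const std::vector<double> &r) {
  double s = 0.0;
  for (size_t i = 0; i < r.size(); ++i) s += std::fabs(r[i]);
  return s;
}

// L2 norm: Euclidean norm.
double norm_l2(const std::vector<double> &r) {
  double s = 0.0;
  for (size_t i = 0; i < r.size(); ++i) s += r[i] * r[i];
  return std::sqrt(s);
}

// L-infinity norm: largest absolute value; also reports its index,
// similar to what GetAmaxResidualIndex() returns for the residual.
double norm_linf(const std::vector<double> &r, int &amax_index) {
  double m = 0.0;
  amax_index = -1;
  for (size_t i = 0; i < r.size(); ++i)
    if (std::fabs(r[i]) > m) {
      m = std::fabs(r[i]);
      amax_index = (int) i;
    }
  return m;
}
```

The choice of norm changes which entries dominate the stopping test: L∞ is sensitive to the single worst residual component, while L1 and L2 average over all of them.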
The reached criterion can be obtained via GetSolverStatus(), which returns:

• 0 – if no criterion has been reached yet
• 1 – if the absolute tolerance has been reached
• 2 – if the relative tolerance has been reached
• 3 – if the divergence tolerance has been reached
• 4 – if the maximum number of iterations has been reached

CG<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> ls;

ls.SetOperator(mat);
ls.SetResidualNorm(1);

ls.Build();
ls.Solve(rhs, &x);

"CG solver with L1 norm"

CG<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> ls;

ls.SetOperator(mat);
ls.SetResidualNorm(3);

ls.Build();
ls.Solve(rhs, &x);

std::cout << "Index of the L_inf = " << ls.GetAmaxResidualIndex() << std::endl;
std::cout << "Solver status = " << ls.GetSolverStatus() << std::endl;

"CG solver with L∞ norm"

6.4 Building and Solving Phase

Each iterative solver consists of a building step and a solving step. During the building step all necessary auxiliary data is allocated and the preconditioner is constructed. After that the user can call the solving procedure; the solving step can be called several times.

When the initial matrix associated with the solver is on the accelerator, the solver will try to build everything on the accelerator. However, some preconditioners and solvers (such as FSAI and AMG) need to be constructed on the host before they are transferred to the accelerator. If the initial matrix is on the host and we want to run the solver on the accelerator, then we need to move the solver to the accelerator, as well as the matrix, the right-hand side and the solution vector. Note that if a preconditioner is associated with the solver, it will be moved automatically to the accelerator when the solver is moved. In the following listings we present these two scenarios.
CG<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> ls;
MultiColoredILU<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> p;

mat.MoveToAccelerator();
rhs.MoveToAccelerator();
x.MoveToAccelerator();

ls.SetOperator(mat);
ls.SetPreconditioner(p);

// The solver and the preconditioner will be constructed on the accelerator
ls.Build();

// The solving phase is performed on the accelerator
ls.Solve(rhs, &x);

"An example for a preconditioned CG where the building phase and the solution phase are performed on the accelerator"

CG<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> ls;
MultiColoredILU<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> p;

ls.SetOperator(mat);
ls.SetPreconditioner(p);

// The solver and the preconditioner will be constructed on the host
ls.Build();

// Now we move all objects (including the solver and the preconditioner)
// to the accelerator
mat.MoveToAccelerator();
rhs.MoveToAccelerator();
x.MoveToAccelerator();

ls.MoveToAccelerator();

// The solving phase is performed on the accelerator
ls.Solve(rhs, &x);

"An example for a preconditioned CG where the building phase is performed on the host and the solution phase is performed on the accelerator"

6.5 Clear function and Destructor

The Clear() function clears all data held by the solver, including the associated preconditioner; afterwards the solver is no longer associated with this preconditioner. Note that the preconditioner object is not deleted (its destructor is not invoked), only its Clear() function is called. When the destructor of the solver class is called, it automatically calls the Clear() function.
Be careful when declaring your solver and preconditioner in different places: we highly recommend to manually call the Clear() function of the solver and not to rely on the destructor of the solver.

6.6 Numerical Update

ValueType: D,F,C | Building phase: H,C | Available: S,M

Some preconditioners require two phases in their construction: an algebraic phase (e.g. computing a pattern or structure) and a numerical phase (computing the actual values). In cases where the structure of the input matrix is constant (e.g. Newton-type methods), it is not necessary to fully re-construct the preconditioner. Instead, the user can apply a numerical update to the current preconditioner and pass the new operator. This function is called ReBuildNumeric(). If the preconditioner/solver does not support the numerical update, a full Clear() and Build() is performed.

6.7 Fixed-point Iteration

Fixed-point iteration is based on an additive splitting of the matrix A = M + N; the scheme reads

x_{k+1} = M^{-1} (b - N x_k).   (6.1)

It can also be reformulated as a weighted defect-correction scheme

x_{k+1} = x_k - omega M^{-1} (A x_k - b).   (6.2)

The inversion of M can be performed by preconditioners (Jacobi, Gauss-Seidel, ILU, etc.) or by any type of solver.

ValueType: D,F,C | Building phase: H,C,O,X | Solving phase: H,C,O,X | Available: B,S,M

FixedPoint<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> ls;
FixedPoint<LocalStencil<ValueType>, LocalVector<ValueType>, ValueType> ls;
FixedPoint<GlobalMatrix<ValueType>, GlobalVector<ValueType>, ValueType> ls;

"Fixed-Point declaration"

FixedPoint<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> fp;
Jacobi<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> p;

fp.SetOperator(mat);
fp.SetPreconditioner(p);
fp.SetRelaxation(1.3);

fp.Build();
fp.Solve(rhs, &x);

"Fixed-point iteration based on a Jacobi preconditioner - Jacobi iteration"

6.8 Krylov Subspace Solvers

Krylov subspace solvers are iterative methods based on projection.
The solvers implemented in PARALUTION are CG, CR, BiCGStab, BiCGStab(l), QMRCGStab, GMRES, as well as Flexible CG and Flexible GMRES. Details of the methods can be found in [46, 10, 15, 48, 54].

6.8.1 CG

CG (Conjugate Gradient) is a method for solving symmetric positive definite (SPD) matrices. The method can be preconditioned, in which case the approximation should also be SPD.

ValueType: D,F,C | Building phase: H,C,O,X | Solving phase: H,C,O,X | Available: B,S,M

CG<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> ls;
CG<LocalStencil<ValueType>, LocalVector<ValueType>, ValueType> ls;
CG<GlobalMatrix<ValueType>, GlobalVector<ValueType>, ValueType> ls;

"CG declaration"

6.8.2 CR

CR (Conjugate Residual) is a method for solving symmetric and semi-positive definite matrices. The method can be preconditioned, in which case the approximation should also be SPD or semi-SPD.

ValueType: D,F,C | Building phase: H,C,O,X | Solving phase: H,C,O,X | Available: B,S,M

CR<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> ls;
CR<LocalStencil<ValueType>, LocalVector<ValueType>, ValueType> ls;
CR<GlobalMatrix<ValueType>, GlobalVector<ValueType>, ValueType> ls;

"CR declaration"

6.8.3 GMRES

GMRES (Generalized Minimal Residual Method) is a method for solving non-symmetric problems. The pure GMRES solver is based on a restarting technique. The default size of the Krylov subspace for GMRES is 30; it can be modified via the SetBasisSize() function.

ValueType: D,F,C | Building phase: H,C,O,X | Solving phase: H,C,O,X | Available: B,S,M

GMRES<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> ls;
GMRES<LocalStencil<ValueType>, LocalVector<ValueType>, ValueType> ls;
GMRES<GlobalMatrix<ValueType>, GlobalVector<ValueType>, ValueType> ls;

"GMRES declaration"

6.8.4 FGMRES

FGMRES (Flexible Generalized Minimal Residual Method) is a method for solving non-symmetric problems. The Flexible GMRES solver is based on a window shifting of the Krylov subspace. The default size of the Krylov subspace is 30; it can be modified via the SetBasisSize() function.
ValueType: D,F,C | Building phase: H,C,O,X | Solving phase: H,C,O,X | Available: B,S,M

FGMRES<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> ls;
FGMRES<LocalStencil<ValueType>, LocalVector<ValueType>, ValueType> ls;
FGMRES<GlobalMatrix<ValueType>, GlobalVector<ValueType>, ValueType> ls;

"FGMRES declaration"

6.8.5 BiCGStab

BiCGStab (Bi-Conjugate Gradient Stabilized) is a method for solving non-symmetric problems.

ValueType: D,F,C | Building phase: H,C,O,X | Solving phase: H,C,O,X | Available: B,S,M

BiCGStab<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> ls;
BiCGStab<LocalStencil<ValueType>, LocalVector<ValueType>, ValueType> ls;
BiCGStab<GlobalMatrix<ValueType>, GlobalVector<ValueType>, ValueType> ls;

"BiCGStab declaration"

6.8.6 IDR

IDR (Induced Dimension Reduction) is a method for solving non-symmetric problems. The dimension of the shadow space for the IDR(s) method can be set via the SetShadowSpace() function; the default value is 4.

ValueType: D,F,C | Building phase: H,C,O,X | Solving phase: H,C,O,X | Available: B,S,M

IDR<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> ls;
IDR<LocalStencil<ValueType>, LocalVector<ValueType>, ValueType> ls;
IDR<GlobalMatrix<ValueType>, GlobalVector<ValueType>, ValueType> ls;

"IDR Solver"

Note The orthogonal system in the IDR method is based on random numbers; it is therefore normal to obtain a slightly different number of iterations every time you run the program.

6.8.7 FCG

FCG (Flexible Conjugate Gradient) is a method for solving symmetric positive definite (SPD) matrices. The method can be preconditioned, in which case the approximation should also be SPD. For additional information see [41].

ValueType: D,F,C | Building phase: H,C,O,X | Solving phase: H,C,O,X | Available: S,M

FCG<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> ls;
FCG<LocalStencil<ValueType>, LocalVector<ValueType>, ValueType> ls;
FCG<GlobalMatrix<ValueType>, GlobalVector<ValueType>, ValueType> ls;

"FCG declaration"

6.8.8 QMRCGStab

QMRCGStab is a method for solving non-symmetric problems. More details are given in [34].
ValueType: D,F,C | Building phase: H,C,O,X | Solving phase: H,C,O,X | Available: S,M

QMRCGStab<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> ls;
QMRCGStab<LocalStencil<ValueType>, LocalVector<ValueType>, ValueType> ls;
QMRCGStab<GlobalMatrix<ValueType>, GlobalVector<ValueType>, ValueType> ls;

"QMRCGStab declaration"

6.8.9 BiCGStab(l)

BiCGStab(l) (Bi-Conjugate Gradient Stabilized) is a method for solving non-symmetric problems. The degree l can be set via the SetOrder() function. More details can be found in [19].

ValueType: D,F,C | Building phase: H,C,O,X | Solving phase: H,C,O,X | Available: S,M

BiCGStabl<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> ls;
BiCGStabl<LocalStencil<ValueType>, LocalVector<ValueType>, ValueType> ls;
BiCGStabl<GlobalMatrix<ValueType>, GlobalVector<ValueType>, ValueType> ls;

"BiCGStab(l) declaration"

Example

CG<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> ls;
MultiColoredILU<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> p;

ls.SetOperator(mat);
ls.SetPreconditioner(p);

ls.Build();
ls.Solve(rhs, &x);

"Preconditioned CG solver with ILU(0,1)"

6.9 Deflated PCG

The Deflated Preconditioned Conjugate Gradient algorithm (DPCG) is a two-level preconditioned CG algorithm for iteratively solving ill-conditioned linear systems, see [51].

ValueType: D,F | Building phase: H | Solving phase: H,C,O,X | Available: B (GPLv3 only)

Deflation attempts to remove the bad eigenvalues from the spectrum of the preconditioned matrix M^{-1} A. The preconditioned system can be 'deflated' such that it is transformed into

M^{-1} P A x_hat = M^{-1} P b,   (6.3)

where P := I - AQ, Q := Z E^{-1} Z^T and E := Z^T A Z. Z is called the deflation sub-space matrix. Its columns are approximations of the eigenvectors corresponding to the bad eigenvalues remaining in the spectrum of the preconditioned system M^{-1} A x = M^{-1} b. There are many possible choices for Z. The current implementation uses a construction scheme tuned for matrices arising from discretizing the pressure-correction equation in two-phase flow; details are discussed in [50].

DPCG<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> ls;

// Set the number of deflation vectors
ls.SetNVectors(2);

ls.Build();
ls.Solve(rhs, &x);

"Deflated PCG"

Note The number of deflation vectors can be set via the SetNVectors() function.

Note The dimension of the problem is hard-coded in the MakeZ_COO() and MakeZ_CSR() functions in the solver code (dpcg.cpp, see function Build()).

6.10 Chebyshev Iteration

The Chebyshev iteration (also known as an acceleration scheme) is similar to the CG method, but it requires the minimum and the maximum eigenvalue of the matrix. Additional information can be found in [10].

ValueType: D,F,C | Building phase: H,C,O,X | Solving phase: H,C,O,X | Available: B,S,M

Chebyshev<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> ls;

ls.SetOperator(mat);
ls.Set(lambda_min, lambda_max);

ls.Build();
ls.Solve(rhs, &x);

"Chebyshev iteration"

6.11 Mixed-precision Solver

The library provides mixed-precision solvers based on a defect-correction scheme. The current implementation performs the correction on the host in double precision and the inner computation on the accelerator in single precision. The solver is based on the following scheme:

x_{k+1} = x_k + A^{-1} r_k,   (6.4)

where the computation of the residual r_k = b - A x_k and the update x_{k+1} = x_k + d_k are performed on the host in double precision, while the solution of the residual system A d_k = r_k is performed on the accelerator in single precision. In addition to the setup functions of the iterative solver, the user needs to specify an inner solver (for A d_k = r_k).
ValueType: D-F | Building phase: H,C,O,X | Solving phase: H,C,O,X | Available: B,S,M

MixedPrecisionDC<LocalMatrix<double>, LocalVector<double>, double,
                 LocalMatrix<float>, LocalVector<float>, float> mp;

CG<LocalMatrix<float>, LocalVector<float>, float> cg;
MultiColoredILU<LocalMatrix<float>, LocalVector<float>, float> p;

// set up a lower tolerance for the inner solver
cg.SetPreconditioner(p);
cg.Init(1e-5, 1e-2, 1e+20, 100000);

// set up the mixed-precision DC
mp.SetOperator(mat);
mp.Set(cg);

mp.Build();
mp.Solve(rhs, &x);

"Mixed-precision solver"

6.12 Multigrid Solver

The library provides algebraic multigrid as well as a skeleton for geometric multigrid methods. The BaseMultigrid class itself does not construct the data for the method; it contains the solution procedure for V-, W- and K-cycles, for details see [53]. The AMG comes in two versions, for Local (non-MPI) and for Global (MPI) computations.

6.12.1 Geometric Multigrid

ValueType: D,F,C | Building phase: - | Solving phase: H,C,O,X | Available: B,S,M

For the geometric multigrid, the user needs to pass all information for each level and for its construction. This includes the smoothing steps, prolongation/restriction, grid traversal and the coarsest-level solver. The following data needs to be passed to the solver:

• Restriction and prolongation operations – these can be performed in two ways, either based on the Restriction() and Prolongation() functions of the LocalVector class, or by matrix-vector multiplication. This is configured by a set function.
• Smoothers – these can be of any iterative linear solver type. Valid options are Jacobi, Gauss-Seidel, ILU, etc., used in a FixedPoint iteration scheme with a pre-defined number of iterations. A smoother can also be a solver such as CG, BiCGStab, etc.
• Coarse grid solver – can be of any iterative linear solver type.
The class also provides mechanisms to specify whether the coarse grid solver is performed on the host or on the accelerator. The coarse grid solver can be preconditioned.

• Grid scaling – computed based on an L2 norm ratio.
• Operator matrices – the operator matrix needs to be passed for each grid level.
• All objects need to be passed to the multigrid class already initialized.

6.12.2 Algebraic Multigrid

Plain and Smoothed Aggregation

ValueType: D,F,C | Building phase: H | Solving phase: H,C,O,X | Available: B,S,M

The algebraic multigrid solver (AMG) is based on the Multigrid class. The coarsening is obtained by different aggregation techniques. Currently, we support interpolation schemes based on aggregation [49] and smoothed aggregation [56]. The smoothers can be constructed inside or outside of the class. Detailed examples are given in the examples section. When building the AMG, if no additional information is set, the solver is built with its default values.

AMG<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> ls;

ls.SetOperator(mat);

ls.Build();

ls.Solve(rhs, &x);

"AMG as a standalone solver"

All parameters of the AMG can be set externally, including the smoothers and the coarse-grid solver; any type of solver can be used.

LocalMatrix<ValueType> mat;

// Linear Solver
AMG<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> ls;

ls.SetOperator(mat);

// coupling strength
ls.SetCouplingStrength(0.001);
// number of unknowns on coarsest level
ls.SetCoarsestLevel(300);
// interpolation type for grid transfer operators
ls.SetInterpolation(SmoothedAggregation);
// Relaxation parameter for smoothed interpolation aggregation
ls.SetInterpRelax(2./3.);
// Manual smoothers
ls.SetManualSmoothers(true);
// Manual coarse grid solver
ls.SetManualSolver(true);
// grid transfer scaling
ls.SetScaling(true);

ls.BuildHierarchy();

int levels = ls.GetNumLevels();

// Smoother for each level
IterativeLinearSolver<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> **sm = NULL;
MultiColoredGS<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> **gs = NULL;

// Coarse Grid Solver
CG<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> cgs;
cgs.Verbose(0);

sm = new IterativeLinearSolver<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType>*[levels-1];
gs = new MultiColoredGS<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType>*[levels-1];

// Preconditioner for the coarse grid solver
// MultiColoredILU<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> p;
// cgs->SetPreconditioner(p);

for (int i=0; i<levels-1; ++i) {
  FixedPoint<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> *fp;
  fp = new FixedPoint<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType>;
  fp->SetRelaxation(1.3);
  sm[i] = fp;

  gs[i] = new MultiColoredGS<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType>;
  gs[i]->SetPrecondMatrixFormat(ELL);

  sm[i]->SetPreconditioner(*gs[i]);
  sm[i]->Verbose(0);
}

ls.SetOperatorFormat(CSR);
ls.SetSmoother(sm);
ls.SetSolver(cgs);
ls.SetSmootherPreIter(1);
ls.SetSmootherPostIter(2);
ls.Init(1e-10, 1e-8, 1e+8, 10000);
ls.Verbose(2);

ls.Build();

ls.Solve(rhs, &x);

"AMG with manual settings"

The AMG can also be used as a preconditioner within a solver.

CG<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> ls;

// AMG Preconditioner
AMG<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> p;

p.InitMaxIter(2);
p.Verbose(0);

ls.SetPreconditioner(p);
ls.SetOperator(mat);

ls.Build();
ls.Solve(rhs, &x);

"AMG as a preconditioner"

Ruge-Stueben

The classic Ruge-Stueben coarsening algorithm is implemented following [49].

ValueType: D,F,C | Building phase: H | Solving phase: H,C,O,X | Available: S,M

The solver provides high efficiency in terms of solver complexity (i.e. number of iterations). However, it typically has a more expensive building step and a higher memory usage.

Pair-wise

The pairwise aggregation scheme is based on [42].

ValueType: D,F,C | Building phase: H | Solving phase: H,C,O,X | Available: S,M

The pairwise AMG delivers a very efficient building phase, which is well suited for Poisson-like equations. Most of the time it requires a K-cycle in the solving phase to keep the number of iterations low.

Global AMG (MPI)

The global AMG is based on a variation of the pairwise aggregation scheme, using the MPI standard for inter-node communication.

ValueType: D,F,C | Building phase: H | Solving phase: H,C,O,X | Available: M

Both the building and the solving phase are fully MPI-parallel. This solver works well with an updating technique suitable for time-dependent problems (i.e. time-stepping schemes), which decreases the cost of the building phase significantly.

6.13 Direct Linear Solvers

The library provides three direct methods: LU, QR and full inversion (based on the QR decomposition). The user can pass a sparse matrix; internally it will be converted to dense and then the selected method is applied. These methods are not optimal; because the matrix is converted to a dense format, they should be used only for very small matrices.

Inversion<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> ls;

ls.SetOperator(mat);

ls.Build();
ls.Solve(rhs, &x);

"A direct solver"

Note These methods work only for Local problems (no distributed problems).
6.14 Eigenvalue Solvers

6.14.1 AMPE-SIRA

The Shift-Invert Residual Arnoldi (SIRA) method is an eigenvalue solver which computes part of the spectrum of a matrix: the eigenvalues of a symmetric positive definite (SPD) matrix closest to zero. Details can be found in [27, 28, 32, 33].

ValueType: D,D-F | Building phase: H,C | Solving phase: H,C,O | Available: B (GPLv3 only)

SIRA<LocalMatrix<double>, LocalVector<double>, double> sira;

sira.SetOperator(mat);

// Set the number of target eigenvalues
sira.SetNumberOfEigenvalues(3);

// Set preconditioner of inner linear system
MultiColoredILU<LocalMatrix<double>, LocalVector<double>, double> p;
sira.SetInnerPreconditioner(p);

sira.Build();
sira.Solve(&eigenvalues);

"SIRA solver with 3 target eigenvalues"

An adaptive mixed-precision (AMP) scheme can be applied to the SIRA method, where the stopping criterion of the inner loop is inspired by Hochstenbach and Notay (HN criterion, see [22]); the declaration then takes two different value types.

SIRA<LocalMatrix<double>, LocalVector<double>, double,
     LocalMatrix<float>, LocalVector<float>, float> ampe_sira;

ampe_sira.SetOperator(mat);
ampe_sira.SetNumberOfEigenvalues(3);

// Set preconditioner of inner linear system
MultiColoredILU<LocalMatrix<double>, LocalVector<double>, double> p1;
MultiColoredILU<LocalMatrix<float>, LocalVector<float>, float> p2;
ampe_sira.SetInnerPreconditioner(p1, p2);

ampe_sira.Build();
ampe_sira.Init(1e-10, 200, 0.8);
ampe_sira.Solve(&eigenvalues);

"SIRA solver with adaptive mixed precision scheme"

The call Init(1e-10, 200, 0.8) after ampe_sira.Build() is optional and configures the iteration controller of the SIRA solver.
A parameter remains at its default when the corresponding value is set to zero. The first argument of Init() sets the absolute tolerance for the eigenvalues; the default value is 1e-7. The second argument sets the maximum number of outer-loop iterations for solving one eigenvalue; the default value is 300, and the final number is the given value multiplied by the number of target eigenvalues. The third argument sets the maximum number of inner-loop iterations; the default is 100000. If the value lies in (0, 1], the maximum number is the value multiplied by the dimension of the operator; if it lies in (1, infinity), it is taken directly as the maximum.

If there is a special starting vector for the SIRA solver, say vec, it can be passed as the first argument of Solve(), i.e. sira.Solve(vec, &eigenvalues).

Note The stopping criterion of SIRA without the mixed-precision scheme is by default a constant tolerance. To apply the HN criterion, add sira.SetInnerStoppingCriterion(1) before sira.Build().

Note The preconditioner of the inner linear system of SIRA is by default FSAI of level 1.

Chapter 7 Preconditioners

In this chapter, all preconditioners are presented. All preconditioners support local operators. They can be used as global preconditioners via a Block-Jacobi scheme, which works locally on each interior matrix. To provide a fast application, all preconditioners require extra memory to keep the approximated operator.

7.1 Code Structure

The preconditioners provide a solution of the system M z = r, where the solution z is either computed directly by the approximation scheme or obtained iteratively with a zero initial guess.
Variable Preconditioner Deflated PCG Block-Jacobi AS/RAS Solvers Multigrid AMG/GMG CPR Mixed-Precision Defect-Correction MultiElimination ILU Iterative Liner Solvers Chebyshev Iteration Fixed-Iteration Schemes Preconditioners Saddle-point Iteration Control ILU/IC CG, BiCGStab, GMRES, IDR, CR FSAI/SPAI Chebyshev MC-GS/SGS/SOR/SSOR Power(q)-pattern ILU Figure 7.1: Solver’s classes hierarchy 7.2 Jacobi The Jacobi preconditioner is the simplest parallel preconditioner, details can be found in [46, 10, 15]. ValueType D,F,C Building phase H,C Solving phase H,C,O,X Available B,S,M 63 64 CHAPTER 7. PRECONDITIONERS paralution::ParalutionObj paralution::Solver< OperatorType, VectorType, ValueType > paralution::Preconditioner< OperatorType, VectorType, ValueType > paralution::AIChebyshev< OperatorType, VectorType, ValueType > paralution::AS< OperatorType, VectorType, ValueType > paralution::BlockJacobi< OperatorType, VectorType, ValueType > paralution::BlockPreconditioner< OperatorType, VectorType, ValueType > paralution::CPR< OperatorType, VectorType, ValueType > paralution::DiagJacobiSaddlePointPrecond< OperatorType, VectorType, ValueType > paralution::FSAI< OperatorType, VectorType, ValueType > paralution::GS< OperatorType, VectorType, ValueType > paralution::IC< OperatorType, VectorType, ValueType > paralution::ILU< OperatorType, VectorType, ValueType > paralution::ILUT< OperatorType, VectorType, ValueType > paralution::Jacobi< OperatorType, VectorType, ValueType > paralution::MultiColored< OperatorType, VectorType, ValueType > paralution::MultiElimination< OperatorType, VectorType, ValueType > paralution::SGS< OperatorType, VectorType, ValueType > paralution::SPAI< OperatorType, VectorType, ValueType > paralution::TNS< OperatorType, VectorType, ValueType > paralution::VariablePreconditioner< OperatorType, VectorType, ValueType > Figure 7.2: Preconditioners 1 J a c o b i , L o c a l V e c t o r , ValueType > p ; 2 3 l s . SetPreconditioner (p) ; 4 l s . 
Build ( ) ; ”Jacobi preconditioner” 7.3 Multi-colored (Symmetric) Gauss-Seidel and SOR ValueType D,F,C Building phase H,C, Solving phase H,C,O,X Available B,S,M The additive preconditioners are based on the splitting of the original matrix. Higher parallelism in solving the forward and backward substitution is obtained by performing a multi-colored decomposition. Details can be found in [35, 46]. 1 MultiColoredSGS , L o c a l V e c t o r , ValueType > p ; 2 3 l s . SetPreconditioner (p) ; 4 l s . Build ( ) ; ”Multi-colored symmetric Gauss-Seidel preconditioner” 7.4. INCOMPLETE LU WITH LEVELS – ILU(P ) 1 2 3 4 5 6 65 MultiColoredGS , L o c a l V e c t o r , ValueType > p ; p . SetRelaxation (1 .6 ) ; l s . SetPreconditioner (p) ; l s . Build ( ) ; ”Multi-colored SOR with relaxation parameter 1.6” 7.4 Incomplete LU with levels – ILU(p) ILU(p) factorization based on p-levels. Details can be found in [46]. ValueType D,F,C 1 2 3 4 5 Building phase H Solving phase H,C Available B,S,M ILU , L o c a l V e c t o r , ValueType > p ; p . Set (1) ; l s . SetPreconditioner (p) ; l s . Build ( ) ; ”ILU(1) preconditioner - based on levels” 7.5 Incomplete Cholesky – IC IC factorization without additional fill-ins. Details are given in [46]. ValueType D,F,C Building phase H Solving phase H,C Available B,S,M 1 IC , L o c a l V e c t o r , ValueType > p ; 2 3 l s . SetPreconditioner (p) ; 4 l s . Build ( ) ; ”IC preconditioner” Note This implementation is still experimental and it is highly recommended to use the ILU preconditioner instead. 7.6 Incomplete LU with threshold – ILUT(t,m) Incomplete LU (ILUT(t,m)) factorization based on threshold (t) and maximum number of elements per row (m). Details can be found in [46]. The preconditioner can be initialized with the threshold value only or with threshold and maximum number of elements per row. ValueType D,F,C 1 2 3 4 5 Building phase H Solving phase H,C Available B,S,M ILUT , L o c a l V e c t o r , ValueType > p ; p . Set ( 0 . 
0 1 ) ; l s . SetPreconditioner (p) ; l s . Build ( ) ; ”ILUT(0.01) preconditioner” Note This implementation is still experimental. 66 CHAPTER 7. PRECONDITIONERS 7.7 Power(q)-pattern method – ILU(p,q) ILU(p,q) is based on the ILU(p) factorization with a power(q)-pattern method, the algorithm can be found in[35]. This method provides a higher degree of parallelism of forward and backward substitution compared to the standard ILU(p) preconditioner. ValueType D,F,C Building phase H,C Solving phase H,C,O,X Available B,S,M Note If the preconditioner is initialized with only the first argument (p) then q is taken to be p + 1. 1 2 3 4 5 MultiColoredILU , L o c a l V e c t o r , ValueType > p ; p . Set (1 , 2) ; l s . SetPreconditioner (p) ; l s . Build ( ) ; ”ILU(1 2) preconditioner - based on power(q)-pattern method” 7.8 Multi-Elimination ILU The multi-elimination ILU preconditioner is based on the following decomposition D F D F I 0 A= = × , E C ED−1 I 0  (7.1) where  = C − ED−1 F . To make the inversion of D easier, we permute the preconditioning before the factorization with a permutation P to obtain only diagonal elements in D. The permutation here is based on a maximal independent set. Details can be found in [46]. This procedure can be applied to the block matrix Â, in this way we can perform the factorization recursively. In the last level of the recursion, we need to provide a solution procedure. By the design of the library, this can be any kind of solver. In the following example we build a preconditioned CG solver with a multi-elimination preconditioner defined with 2 levels and without drop-off tolerance. The last block of preconditioner is solved using a Jacobi preconditioner. ValueType D,F,C 1 2 3 4 5 6 7 8 9 10 11 Building phase H,C Solving phase H,C,O Available B,S,M CG , L o c a l V e c t o r , ValueType > cg ; M u l t i E l i m i n a t i o n , L o c a l V e c t o r , ValueType > p ; J a c o b i , L o c a l V e c t o r , ValueType > j p ; p . 
Set ( j p , 2) ; cg . S e t O p e r a t o r ( mat ) ; cg . S e t P r e c o n d i t i o n e r ( p ) ; cg . B u i l d ( ) ; cg . S o l v e ( rhs , &x ) ; ”PCG solver with multi-elimination ILU preconditioner and Jacobi” 7.9 Diagonal Preconditioner for Saddle-Point Problems Consider the following saddle-point problem A= K E F 0 . (7.2) 7.10. CHEBYSHEV POLYNOMIAL 67 For such problems we can construct a diagonal Jacobi-like preconditioner of type K 0 P = 0 S (7.3) with S = ED−1 F , where D are the diagonal elements of K. The matrix S is fully constructed (via sparse matrix-matrix multiplication). The preconditioner needs to be initialized with two external solvers/preconditioners – one for the matrix K and one for the matrix S. ValueType D,F,C Building phase H,C Solving phase H,C,O,X Available B,S,M 1 GMRES , L o c a l V e c t o r , ValueType > gmres ; 2 3 D i a g J a c o b i S a d d l e P o i n t P r e c o n d , L o c a l V e c t o r , ValueType > p ; 4 MultiColoredILU , L o c a l V e c t o r , ValueType > p k ; 5 MultiColoredILU , L o c a l V e c t o r , ValueType > p s ; 6 7 p . Set ( p k , p s ) ; 8 9 gmres . S e t O p e r a t o r ( mat ) ; 10 gmres . S e t B a s i s S i z e ( 5 0 ) ; 11 gmres . S e t P r e c o n d i t i o n e r ( p ) ; 12 gmres . B u i l d ( ) ; 13 14 gmres . S o l v e ( rhs , &x ) ; ”GMRES solver with diagonal Jacobi-like preconditioner for Saddle-Point problems” 7.10 Chebyshev Polynomial The Chebyshev approximate inverse matrix is based on the work [14]. ValueType D,F,C 1 2 3 4 5 6 7 Building phase H,C Solving phase H,C,O,X Available B,S,M Chebyshev , L o c a l V e c t o r , ValueType > l s ; l s . S e t O p e r a t o r ( mat ) ; l s . S e t ( lambda min , lambda max ) ; l s . Build ( ) ; l s . 
Solve(rhs, &x);

"Chebyshev polynomial preconditioner"

7.11 FSAI(q)

The Factorized Sparse Approximate Inverse preconditioner computes a direct approximation of M^-1 by minimizing the Frobenius norm ||I - GL||_F, where L denotes the exact lower triangular part of A and G := M^-1. The FSAI(q) preconditioner is initialized by q, based on the sparsity pattern of |A|^q [35]. However, it is also possible to supply an external sparsity pattern in form of the LocalMatrix class. Further details on the algorithm are given in [31].

ValueType: D,F,C | Building phase: H | Solving phase: H,C,O,X | Available: B,S,M

FSAI<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> p;
p.Set(2);
ls.SetPreconditioner(p);
ls.Build();

"FSAI(2) preconditioner"

Note: The FSAI(q) preconditioner is only suited for symmetric positive definite (SPD) matrices.

7.12 SPAI

The SParse Approximate Inverse algorithm is an explicitly computed preconditioner for general sparse linear systems. In its current implementation, only the sparsity pattern of the system matrix is supported. The SPAI computation is based on the minimization of the Frobenius norm ||AM - I||_F. Details can be found in [21].

ValueType: D,F,C | Building phase: H | Solving phase: H,C,O,X | Available: B,S,M

SPAI<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> p;

ls.SetPreconditioner(p);
ls.Build();

"SPAI preconditioner"

Note: The SPAI implementation is still experimental. The current version is based on the original static matrix pattern (similar to ILU(0)).

7.13 Block Preconditioner

When handling vector fields, one can typically try to use different preconditioners/solvers for the different blocks. For such problems, the library provides a block-type preconditioner, which builds the following block matrix

  P = [ A_d  0    ...  0   ]
      [ B_1  B_d  ...  0   ]
      [ ...            ... ]
      [ Z_1  Z_2  ...  Z_d ]        (7.4)

The solution of P can be performed in two ways.
It can be solved by block-lower-triangular sweeps, i.e. with inversion of the diagonal blocks A_d, ..., Z_d and multiplication with the corresponding off-diagonal blocks; this scheme is set by the function SetLSolver() and is the default solution scheme. Alternatively, only the inverses of the diagonal blocks A_d, ..., Z_d can be used (a Block-Jacobi type scheme) by calling the function SetDiagonalSolver().

ValueType: D,F,C | Building phase: H,C | Solving phase: H,C,O,X | Available: B,S,M

GMRES<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> ls;

BlockPreconditioner<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> p;
Solver<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> **p2;

int n = 2;
int *size;
size = new int[n];

p2 = new Solver<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType>*[n];

for (int i = 0; i < n; ++i) {
  size[i] = mat.get_nrow() / n;

  MultiColoredILU<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> *mc;
  mc = new MultiColoredILU<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType>;

  p2[i] = mc;
}

p.Set(n, size, p2);

ls.SetOperator(mat);
ls.SetPreconditioner(p);

ls.Build();

ls.Solve(rhs, &x);

"Block Preconditioner for two MC-ILU"

GMRES<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> ls;

BlockPreconditioner<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> p;
Solver<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> **p2;

int n = 2;
int *size;
size = new int[n];

p2 = new Solver<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType>*[n];

// Block 0
size[0] = mat.get_nrow() / n;

MultiColoredILU<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> *mc;
mc = new MultiColoredILU<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType>;

p2[0] = mc;

// Block 1
size[1] = mat.
get_nrow() / n;

AMG<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> *amg;
amg = new AMG<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType>;
amg->InitMaxIter(2);
amg->Verbose(0);

p2[1] = amg;

p.Set(n, size, p2);
p.SetDiagonalSolver();

ls.SetOperator(mat);
ls.SetPreconditioner(p);

ls.Build();

ls.Solve(rhs, &x);

"Block Preconditioner for MC-ILU and AMG"

7.14 Additive Schwarz and Restricted Additive Schwarz – AS and RAS

As a preconditioning technique, we can decompose the linear system Ax = b into small sub-problems based on

  A_i = R_i^T A R_i,        (7.5)

where the R_i are restriction operators. Thus, we can define:

• Additive Schwarz (AS) – the restriction operators produce sub-matrices which overlap. This leads to contributions from two preconditioners on the overlapped area, see Figure 7.3 (left); these contributions are therefore scaled by 1/2.

• Restricted Additive Schwarz (RAS) – a mixture of the pure block-Jacobi and the additive Schwarz scheme. Again, the matrix A is decomposed into square sub-matrices. The sub-matrices are as large as in the additive Schwarz approach – they include overlapped areas from other blocks. After solving the preconditioning sub-matrix problems, solutions are provided only on the non-overlapped area, see Figure 7.3 (right).

ValueType: D,F,C | Building phase: H,C | Solving phase: H,C,O,X | Available: B,S,M

Figure 7.3: Example of a 4 block-decomposed matrix (blocks M1-M4) – Additive Schwarz (AS) with overlapping preconditioner (left) and Restricted Additive Schwarz (RAS) preconditioner (right)

Details can be found in [46, 11, 45, 47, 52]. The library provides Additive Schwarz (called AS) and Restricted Additive Schwarz (called RAS) preconditioners.
For both preconditioners, the user needs to specify the number of blocks, the size of the overlapping region, and the type of preconditioner to be used on the blocks.

// Linear Solver
GMRES<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> ls;

// Preconditioner
// AS<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> p;
RAS<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> p;

// Second level preconditioners
Solver<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> **p2;

int n = 3;
int *size;
size = new int[n];

p2 = new Solver<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType>*[n];

for (int i = 0; i < n; ++i) {
  size[i] = mat.get_nrow() / n;

  MultiColoredILU<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType> *mc;
  mc = new MultiColoredILU<LocalMatrix<ValueType>, LocalVector<ValueType>, ValueType>;

  p2[i] = mc;
}

p.Set(n, 20, p2);

ls.SetOperator(mat);
ls.SetPreconditioner(p);

ls.Build();

ls.Solve(rhs, &x);

"(Restricted) Additive Schwarz defined with 3 blocks and 10 overlapping elements"

7.15 Truncated Neumann Series Preconditioner (TNS)

The Truncated Neumann Series (TNS) preconditioner is based on

  M^-1 = K^T D^-1 K,        (7.6)

where K = I - LD^-1 + (LD^-1)^2, D is the diagonal matrix of A, and L is the strictly lower triangular part of A.

The preconditioner can be computed in two forms – explicitly and implicitly. In the explicit form, the full construction of M is performed via matrix-matrix operations, whereas in the implicit form, the application of the preconditioner is based on matrix-vector operations only. The matrix format for the stored matrices can be specified.

ValueType: D,F,C | Building phase: H,C | Solving phase: H,C,O,X | Available: B,S,M

CG<LocalMatrix<ValueType>, LocalVector<ValueType>