Mellanox HPC-X™ Software Toolkit Release Notes Rev 2.3
www.mellanox.com

NOTE: THIS HARDWARE, SOFTWARE OR TEST SUITE PRODUCT ("PRODUCT(S)") AND ITS RELATED DOCUMENTATION ARE PROVIDED BY MELLANOX TECHNOLOGIES "AS-IS" WITH ALL FAULTS OF ANY KIND AND SOLELY FOR THE PURPOSE OF AIDING THE CUSTOMER IN TESTING APPLICATIONS THAT USE THE PRODUCTS IN DESIGNATED SOLUTIONS. THE CUSTOMER'S MANUFACTURING TEST ENVIRONMENT HAS NOT MET THE STANDARDS SET BY MELLANOX TECHNOLOGIES TO FULLY QUALIFY THE PRODUCT(S) AND/OR THE SYSTEM USING IT. THEREFORE, MELLANOX TECHNOLOGIES CANNOT AND DOES NOT GUARANTEE OR WARRANT THAT THE PRODUCTS WILL OPERATE WITH THE HIGHEST QUALITY. ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT ARE DISCLAIMED. IN NO EVENT SHALL MELLANOX BE LIABLE TO CUSTOMER OR ANY THIRD PARTIES FOR ANY DIRECT, INDIRECT, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES OF ANY KIND (INCLUDING, BUT NOT LIMITED TO, PAYMENT FOR PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY FROM THE USE OF THE PRODUCT(S) AND RELATED DOCUMENTATION EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Mellanox Technologies
350 Oakmead Parkway, Suite 100
Sunnyvale, CA 94085 U.S.A.
www.mellanox.com
Tel: (408) 970-3400
Fax: (408) 970-3403

© Copyright 2018. Mellanox Technologies Ltd. All Rights Reserved.
Mellanox®, Mellanox logo, Connect-IB®, ConnectX®, CORE-Direct®, GPUDirect®, LinkX®, Mellanox Multi-Host®, Mellanox Socket Direct®, UFM®, and Virtual Protocol Interconnect® are registered trademarks of Mellanox Technologies, Ltd. For the complete and most updated list of Mellanox trademarks, visit http://www.mellanox.com/page/trademarks. All other trademarks are property of their respective owners.

Table of Contents
Release Update History
1 Overview
1.1 HPC-X™ Requirements
1.2 HPC-X™ Content
1.3 Important Notes
2 Changes and New Features
3 Known Issues
4 Bug Fixes History
5 Change Log History
5.1 HPC-X Toolkit Change Log History
5.2 FCA Change Log History
5.3 MXM Change Log History
5.4 HPC-X™ Open MPI/OpenSHMEM Change Log History

List of Tables
Table 1: Release Update History
Table 2: Changes and New Features
Table 3: Known Issues
Table 4: Bug Fixes History
Table 5: HPC-X Toolkit Change Log History
Table 6: FCA Change Log History
Table 7: MXM Change Log History
Table 8: HPC-X™ Open MPI/OpenSHMEM Change Log History

Release Update History

Table 1 - Release Update History
Release: Rev 2.3
Date: December 3, 2018
Description: Initial version of this HPC-X release.

1 Overview

These are the release notes for Mellanox HPC-X™ Rev 2.3. The Mellanox HPC-X™ Software Toolkit is a comprehensive software package that includes the Open MPI, OpenSHMEM, PGAS, MXM, UCX, and FCA tool suites for high-performance computing environments. HPC-X provides enhancements that significantly increase the scalability and performance of message communications in the network. HPC-X™ enables you to rapidly deploy and deliver maximum application performance without the complexity and costs of licensed third-party tools and libraries.

1.1 HPC-X™ Requirements

The platform and requirements for HPC-X are detailed in the following table:

Drivers and HCAs:
· OFED / MLNX_OFED: OFED 1.5.3; MLNX_OFED 4.3-x.x.x and above
· HCAs: ConnectX®-5 / ConnectX®-5 Ex, ConnectX®-4 / ConnectX®-4 Lx, ConnectX®-3 / ConnectX®-3 Pro, Connect-IB®

1.2 HPC-X™ Content

The following communications libraries and acceleration packages are part of the Mellanox HPC-X™ Rev 2.3 package:

Library/Acceleration Package — Version Number
· Open MPI — 4.0.x
· Mellanox Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) — 1.7.2
· HCOLL — 4.2
· UCX — 1.5
· Open SHMEM specification compliant — 1.4 (a)
a. Open SHMEM v1.4 compliance is at beta level.

1.3 Important Notes

When HPC-X is launched in an environment without a resource manager (e.g., Slurm, PBS) installed, or from a compute node, it uses Open MPI's default rsh/ssh-based launcher, which does not propagate the library path to the compute nodes. In such a case, pass the LD_LIBRARY_PATH variable as follows:

% mpirun -x LD_LIBRARY_PATH -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_c
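For illustration, a minimal sketch of such a launch, assuming a hypothetical installation prefix /opt/hpcx-v2.3 and two hosts named node01 and node02 (these names are placeholders, not part of the release):

% export LD_LIBRARY_PATH=/opt/hpcx-v2.3/ompi/lib:$LD_LIBRARY_PATH
% mpirun -x LD_LIBRARY_PATH -H node01,node02 -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_c

The -x flag tells mpirun to export the named environment variable to the processes launched on the remote nodes; without it, those processes may fail to find the HPC-X shared libraries.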
2 Changes and New Features

HPC-X™ Rev 2.3 provides the following changes and new features:

Table 2 - Changes and New Features

HPC-X Content:
· Updated the following communications libraries and acceleration packages versions:
  · Open MPI version 4.0.x
  · Mellanox Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) version 1.7.2
  · HCOLL version 4.2
  · UCX version 1.5
  · OpenSHMEM version 1.4

UCX:
· UCX is now compiled without Java bindings.
· Added support for running UCX over rdma-core, for the DC transport and direct verbs.
· Emulation layer: added the ability to run UCX over software emulation of remote memory access and atomic operations. This provides full support of SHMEM and MPI-RMA over shared memory, TCP, and older RDMA hardware, such as ConnectX-3 HCAs.

HCOLL:
· HCOLL and Mellanox SHARP are now compiled with CUDA support.
· Added support for CUDA buffers in the SRA allreduce algorithm.

MXM:
· Removed support for the MXM library.

OpenMPI:
· Added the following configuration options to OMPI (see the example after this table):
  · --with-libevent=internal
  · --enable-mpi1-compatibility
· Updated the platform/mellanox/optimized configuration file in OMPI upstream by removing openib BTL and UCT support and removing links to MXM/FCA usage.
· Removed PMI2 support.
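As context for the new OMPI configuration options listed above, a minimal sketch of how they would be passed to Open MPI's configure script when rebuilding from source (the installation prefix and the UCX/HCOLL paths are placeholders, not part of HPC-X):

% ./configure --prefix=/opt/ompi-4.0 \
      --with-ucx=/opt/hpcx-v2.3/ucx \
      --with-hcoll=/opt/hpcx-v2.3/hcoll \
      --with-libevent=internal \
      --enable-mpi1-compatibility
% make -j install

--with-libevent=internal builds Open MPI against its bundled libevent instead of a system copy, and --enable-mpi1-compatibility re-enables removed MPI-1 symbols (such as MPI_Address) that some older applications still require.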
3 Known Issues

The following is a list of general limitations and known issues of the various components of this HPC-X release.

Table 3 - Known Issues

Internal Ref.: 1582208
Description: Sending data over multiple SHMEM contexts may lead to memory corruption or a segmentation fault.
Workaround: Add the command line argument "-x UCX_SELF_DEVICES=" when running oshrun, or set the variable before the run.
Keywords: Open SHMEM, segmentation fault
Discovered in Version: 2.3 (Open MPI v4.0.x, OpenSHMEM v1.4)

Internal Ref.: -
Description: The OSC UCX module is not selected by default on ConnectX-4/ConnectX-5 HCAs.
Workaround: Add the command line argument "-mca osc ucx" when running mpirun.
Keywords: OSC UCX, one-sided, Open MPI
Discovered in Version: 2.3 (Open MPI v4.0.x, OpenSHMEM v1.4)

Internal Ref.: -
Description: Zero-length OpenSHMEM collectives may fail due to an incomplete implementation.
Workaround: N/A
Keywords: OpenSHMEM atomic, Open MPI
Discovered in Version: 2.3 (Open MPI v4.0.x, OpenSHMEM v1.4)

Internal Ref.: -
Description: The OpenSHMEM atomic operations AND/OR/XOR for datatypes int32/int64/uint32/uint64 are not implemented, which may cause build failures.
Workaround: N/A
Keywords: OpenSHMEM atomic, Open MPI
Discovered in Version: 2.3 (Open MPI v4.0.x, OpenSHMEM v1.4)

Internal Ref.: 2934
Description: Open MPI and OpenSHMEM applications may hang with the DC transport. (GitHub issue: https://github.com/openucx/ucx/issues/2934)
Workaround: Use RC or UD instead of DC. To achieve that, make sure the UCX pml (or UCX osc/spml for one-sided applications) is used, and that the UCX_TLS variable is set, for example UCX_TLS=rc,self,sm (see the example after this table).
Keywords: UCX, Open MPI, DC
Discovered in Version: 2.3 (Open MPI v4.0.x, OpenSHMEM v1.4)

Internal Ref.: 1307243
Description: One-sided tests may fail with a segmentation fault.
Workaround: In order to run one-sided tests, make sure to add "-mca osc ucx" to the command line.
Keywords: OSC UCX, Open MPI, one-sided
Discovered in Version: 2.1 (Open MPI 3.1.x)

Internal Ref.: -
Description: On ConnectX-4 and Connect-IB HCAs, when the DC transport is used at large scale, "Retry exceeded" messages may be printed by UCX.
Workaround: Configure SL2VL on the OpenSM in the fabric and make UCX use SL=1 for the InfiniBand transports via "-x UCX_IB_SL=1".
Keywords: UCX, DC transport, ConnectX-4, Connect-IB
Discovered in Version: 2.1 (UCX 1.3)

Internal Ref.: -
Description: When UCX requires more memory than the space defined in the /proc/sys/kernel/shmmni file, the following message is printed by UCX: "... total number of segments in the system (%lu) would exceed the limit in /proc/sys/kernel/shmmni (=%lu)... please check shared memory limits by 'ipcs -l'".
Workaround: Follow the instructions in the error message above and increase the value of shared memory segments in the /proc/sys/kernel/shmmni file.
Keywords: UCX, memory
Discovered in Version: 2.1 (UCX 1.3)

Internal Ref.: 2226
Description: The following assertion may fail in certain cases: "Assertion `ep->rx.ooo_pkts.head_sn == neth->psn' failed". (GitHub issue: https://github.com/openucx/ucx/issues/2226)
Workaround: Set the DC transport using the UCX_TLS parameter.
Keywords: UCX, assertion
Discovered in Version: 2.1 (UCX 1.3)

Internal Ref.: -
Description: The Mellanox SHARP library is not available in HPC-X for the Community OFED and Inbox OFED.
Workaround: N/A
Keywords: Mellanox SHARP library
Discovered in Version: 2.0

Internal Ref.: 1162
Description: UCX currently does not support canceling send requests. (GitHub issue: https://github.com/openucx/ucx/issues/1162)
Workaround: N/A
Keywords: UCX
Discovered in Version: 2.0

Internal Ref.: -
Description: UCX jobs may hang with Socket Direct / Multi-Host / SR-IOV.
Workaround: Set UCX_IB_ADDR_TYPE=ib_global.
Keywords: UCX

Internal Ref.: -
Description: Since the UCX embedded in HPC-X is compiled with AVX support, UCX cannot run on hosts without AVX support. If AVX is not available, recompile the UCX that ships with HPC-X with the option --with-avx=no.
Workaround: Recompile UCX with AVX disabled:
$ ./utils/hpcx_rebuild.sh --rebuild-ucx --ucx-extra-config "--with-avx=no"
Keywords: UCX
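Several of the workarounds above come down to selecting Open MPI components and UCX environment variables explicitly on the command line. A combined sketch (the host list and application name are placeholders; pick only the options relevant to your issue):

% mpirun -np 64 -H node01,node02 \
      -mca pml ucx -mca osc ucx \
      -x UCX_TLS=rc,self,sm \
      -x UCX_IB_SL=1 \
      ./my_mpi_app

Here "-mca pml ucx" and "-mca osc ucx" force the UCX point-to-point and one-sided components, "-x UCX_TLS=rc,self,sm" restricts UCX to the RC, self, and shared-memory transports (avoiding DC), and "-x UCX_IB_SL=1" selects InfiniBand service level 1 once SL2VL is configured on the subnet manager.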
4 Bug Fixes History

Table 4 lists the bugs fixed in this release and in previous releases.

Table 4 - Bug Fixes History

Internal Ref.: -
Description: Fixed an issue where using UCX on ARM hosts could result in hangs, due to a known Open MPI issue when running on ARM.
Keywords: UCX
Discovered in Version: 1.3 (Open MPI 1.8.2)
Fixed in Version: 2.3 (Open MPI 4.0.x)

Internal Ref.: -
Description: The MCA options rmaps_dist_device and rmaps_base_mapping_policy are now functional.
Keywords: Process binding policy, NUMA/HCA locality
Discovered in Version: 2.0 (Open MPI 3.0.0)
Fixed in Version: 2.3 (Open MPI 4.0.x)

Internal Ref.: 2111
Description: Fixed an issue where, when UCX was used in multi-threaded mode, the osu_latency_mt test could take a long time to complete. (GitHub issue: https://github.com/openucx/ucx/issues/2111)
Keywords: UCX, multi-threaded
Discovered in Version: 2.1 (UCX 1.3)
Fixed in Version: 2.3 (UCX 1.5)

Internal Ref.: 2267
Description: Fixed an issue where the following error message could appear when running at a scale of 256 ranks with the RC transport, with UD used for wireup only: "Fatal: send completion with error: Endpoint timeout". (GitHub issue: https://github.com/openucx/ucx/issues/2267)
Keywords: UCX
Discovered in Version: 2.1 (UCX 1.3)
Fixed in Version: 2.3 (UCX 1.5)

Internal Ref.: 2702
Description: Fixed an issue where, when using the Hardware Tag Matching feature, the following error messages could be printed:
· "rcache.c:481 UCX WARN failed to register region 0xdec25a0 [0x2b7139ae0020..0x2b7139ae2020]: Input/output error"
· "ucp_mm.c:105 UCX ERROR failed to register address 0x2b7139ae0020 length 8192 on md[1]=ib/mlx5_0: Input/output error"
· "ucp_request.c:259 UCX ERROR failed to register user buffer datatype 0x20 address 0x2b7139ae0020 len 8192: Input/output error"
(GitHub issue: https://github.com/openucx/ucx/issues/2702)
Keywords: Hardware Tag Matching
Discovered in Version: 2.2 (UCX 1.4)
Fixed in Version: 2.3 (UCX 1.5)

Internal Ref.: 2454
Description: Fixed an issue where some one-sided benchmarks could hang when using "osc ucx", for example osu-micro-benchmarks-5.3.2/osu_get_acc_latency (Latency Test for accumulate with Active/Passive Synchronization). (GitHub issue: https://github.com/openucx/ucx/issues/2454)
Keywords: UCX, one-sided
Discovered in Version: 2.2 (UCX 1.4)
Fixed in Version: 2.3 (UCX 1.5)

Internal Ref.: 2670
Description: Fixed an issue where, when the Hardware Tag Matching feature was enabled at large scale, the following error message could be printed due to the increased threshold for BCOPY messages: "mpool.c:177 UCX ERROR Failed to allocate memory pool chunk: Out of memory." (GitHub issue: https://github.com/openucx/ucx/issues/2670)
Keywords: Hardware Tag Matching
Discovered in Version: 2.2 (UCX 1.4)
Fixed in Version: 2.3 (UCX 1.5)

Internal Ref.: 1295679
Description: Fixed an issue where the OpenSHMEM group cache had a default limit of 100 entries, which could result in an OpenSHMEM application exiting with the following message: "group cache overflow on rank xxx: cache_size = 100".
Keywords: OpenSHMEM, Open MPI
Discovered in Version: 2.1 (Open MPI 3.1.x)
Fixed in Version: 2.2 (Open MPI 3.1.x)

Internal Ref.: -
Description: Fixed an issue where UCX did not work out-of-the-box with CUDA support.
Keywords: UCX, CUDA
Discovered in Version: 2.2 (UCX 1.4)
Fixed in Version: 2.1 (UCX 1.3)

Internal Ref.: 1926
Description: Fixed an issue where, when using multiple transports, invalid data was sent out of sync with Hardware Tag Matching traffic. (GitHub issue: https://github.com/openucx/ucx/issues/1926)
Keywords: Hardware Tag Matching
Discovered in Version: 2.1 (UCX 1.3)
Fixed in Version: 2.2 (UCX 1.4)

Internal Ref.: 1949
Description: Fixed an issue where Hardware Tag Matching might not have functioned properly with UCX over the DC transport. (GitHub issue: https://github.com/openucx/ucx/issues/1949)
Keywords: UCX, Hardware Tag Matching, DC transport
Discovered in Version: 2.0
Fixed in Version: 2.1

Internal Ref.: 884482
Description: Fixed job data transfer from SD to libsharp.
Keywords: Mellanox SHARP library
Discovered in Release: 1.9
Fixed in Release: 1.9.7

Internal Ref.: 884508
Description: Fixed internal HCOLL datatype mapping.
Keywords: HCOLL, FCA
Discovered in Release: 1.7.405
Fixed in Release: 1.7.406

Internal Ref.: 884490
Description: Fixed internal HCOLL datatype lower bound calculation.
Keywords: HCOLL, FCA
Discovered in Release: 1.7.405
Fixed in Release: 1.7.406

Internal Ref.: -
Description: Fixed allgather unpacking issues.
Keywords: HCOLL, FCA
Discovered in Release: 1.7.405
Fixed in Release: 1.7.406

Internal Ref.: 885009
Description: Fixed a wrong answer in alltoallv.
Keywords: HCOLL, FCA
Discovered in Release: 1.7.405
Fixed in Release: 1.7.406

Internal Ref.: 882193
Description: Fixed an mcast group leak in HCOLL.
Keywords: HCOLL, FCA
Discovered in Release: 1.7.405
Fixed in Release: 1.7.406

Internal Ref.: -
Description: Added IN_PLACE support for alltoall, alltoallv, and allgatherv.
Keywords: HCOLL, FCA
Discovered in Release: 1.7.405
Fixed in Release: 1.7.406

Internal Ref.: -
Description: Fixed an issue related to multi-threaded MPI_Bcast.
Keywords: HCOLL, FCA
Discovered in Release: 1.7.405
Fixed in Release: 1.7.406

Internal Ref.: Salesforce 316541
Description: Fixed a memory barrier issue in MPI_Barrier on Power PPC systems.
Keywords: HCOLL, FCA
Discovered in Release: 1.7.405
Fixed in Release: 1.7.406

Internal Ref.: Salesforce 316547
Description: Fixed multi-threaded MPI_COMM_DUP and MPI_COMM_SPLIT hanging issues.
Keywords: HCOLL, FCA
Discovered in Release: 1.7.405
Fixed in Release: 1.7.406

Internal Ref.: 894346
Description: Fixed Quantum Espresso hanging issues.
Keywords: HCOLL, FCA
Discovered in Release: 1.7.405
Fixed in Release: 1.7.406

Internal Ref.: 898283
Description: Fixed an issue which caused CP2K applications to hang when HCOLL was enabled.
Keywords: HCOLL, FCA
Discovered in Release: 1.7.405
Fixed in Release: 1.7.406

Internal Ref.: 906155
Description: Fixed an issue which caused VASP applications to hang in MPI_Allreduce.
Keywords: HCOLL, FCA
Discovered in Release: 1.6
Fixed in Release: 1.7.406
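Several of the older fixes above concern applications that hung only when HCOLL was enabled. As a triage sketch (not part of these release notes), HCOLL collective offload can be turned off for a single run, assuming the HCOLL component's coll_hcoll_enable MCA parameter is available in your build:

% mpirun -np 32 -mca coll_hcoll_enable 0 ./my_mpi_app

If the hang disappears with HCOLL disabled, that points at the collective offload path rather than the application itself.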
5 Change Log History

5.1 HPC-X Toolkit Change Log History

Table 5 - HPC-X Toolkit Change Log History

Rev 2.2
· Updated the following communications libraries and acceleration packages versions:
  · Mellanox Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) version 1.7
  · HCOLL version 4.1
  · UCX version 1.4
· Added support for Singularity containerization. For further information, please refer to the HPC-X User Manual.
· "osc ucx" is no longer the default one-sided component in Open MPI.
· Removed the KNEM library from the HPC-X package. UCX will use the KNEM available in MLNX_OFED.
· Open MPI and HCOLL are no longer compiled with MXM. Both are compiled with UCX only and use it by default.
· Added support for the following UCX features:
  · A new API for establishing client-server connections.
  · Out-of-box support for Memory In Chip (MEMIC) on ConnectX-5 HCAs.
· Added support for HPC-X on Huawei ARM architecture.
· HCOLL: Improved performance by utilizing zero-copy messaging for MPI Bcast.
Rev 2.1
· HPC-X Content: Updated the following communications libraries and acceleration packages versions:
  · Open MPI version 3.1.x
  · Mellanox Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) version 1.5
  · HCOLL version 4.0
  · MXM version 3.7
  · UCX version 1.3
  · OpenSHMEM v1.3 specification compliant
· UCX:
  · UCX is now the default pml layer for Open MPI, the default spml layer for OpenSHMEM, and the default OSC component for MPI RMA.
  · Added support for GPU memory in the UCX communication libraries.
  · Added support for the Multi-Rail protocol.
· MXM: The UD_RNDV_ZCOPY parameter is set to 'no' by default. This means that the zcopy mechanism for the UD transport is disabled when using the Rendezvous protocol.
· HCOLL:
  · UCX is now the default p2p transport in HCOLL.
  · Improved multi-threaded performance.
  · Improved shared memory performance.
  · Added support for Mellanox Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) v1.5.
  · Added support for Mellanox SHARP software multi-channel/multi-rail capable algorithms.
  · Improved the Allreduce large message algorithm.
  · Improved the AlltoAll algorithm.
· Profiling IB verbs API (ibprof): Removed the ibprof tool from the HPC-X toolkit.
· UPC: Removed UPC from the HPC-X toolkit.

Rev 2.0
· HPC-X Content: Updated the following communications libraries and acceleration packages versions:
  · Open MPI version 3.0.0
  · Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) version 1.4
  · HCOLL version 3.9
  · UCX version 1.3
· UCX:
  · UCX is now at GA level.
  · [ConnectX-5 only] Added support for hardware Tag Matching with the DC transport.
  · [ConnectX-5 only] Added support for out-of-order RDMA RC and DC, to support adaptive routing with true RDMA.
  · Hardware Tag Matching (see section "Hardware Tag Matching" in the User Manual).
  · SR-IOV support (see section "SR-IOV Support" in the User Manual).
  · Adaptive Routing (AR) (see section "Adaptive Routing" in the User Manual).
  · Error handling (see section "Error Handling" in the User Manual).
· HCOLL:
  · Added support for Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) v1.4.
  · Added support for NCCL on-host GPU-based collectives.
  · Added support for hierarchical GPU-based allreduce using NCCL for scale-in and MXM/UCX for scale-out.
  · Improved shared memory performance for allreduce, barrier, and broadcast, targeting high thread count systems, e.g. Power9.
  · Improved large message allreduce (multi-radix, zero-copy fragmentation, CPU vectorization).
  · Added a new and improved AlltoAllv algorithm: hybrid logarithmic pairwise exchange.
  · Added support for on-demand HCOLL memory, improving HCOLL's memory footprint on high thread count systems, e.g. Power9.
  · Added a high performance multithreaded implementation to support MPI_THREAD_MULTIPLE applications, designed specifically for high thread count systems, e.g. Power9.
  · HCOLL startup improvements.
· Open MPI / OpenSHMEM:
  · Added support for Open MPI 3.0.0.
  · Added support for the xpmem kernel module.
  · Added a high performance implementation of shmem_ptr() with the UCX SPML.
  · Added a UCX allocator. The UCX allocator optimizes intra-node communication by allowing direct access to the memory of processes on the same node. The UCX allocator can only be used with the UCX SPML (see the sketch following this list).
  · Added a UCX one-sided component to support MPI RMA operations.
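The Rev 2.0 items above reference the UCX SPML and its allocator. As a sketch (the application name is a placeholder), the UCX SPML can be requested explicitly through the standard OSHMEM MCA mechanism, which is the mode in which the shmem_ptr() fast path and the UCX allocator described above apply:

% oshrun -np 2 -mca spml ucx ./shmem_app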
Rev 1.9.7
· Bug fixes; see Section 4, "Bug Fixes History".

Rev 1.9
· HPC-X Content: Updated the following communications libraries and acceleration packages versions:
  · Open MPI version 2.1.2a1
  · Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) version 1.3.1
  · HCOLL version 3.8.1652
  · MXM version 3.6.3103
  · UCX version 1.2.2947
· UCX:
  · Point-to-point communication API with tag matching, remote memory access, and atomic operations. It can be used to implement MPI, PGAS, and Big Data libraries and applications.
  · IB transport: a cleaner API with lower software overhead, providing better performance, especially for small messages.
  · Support for a multitude of InfiniBand transports and Mellanox offloads to optimize data transfer performance: RDMA, DC, out-of-order, HW tag matching offload, registration cache, ODP.
  · Shared memory communications for optimal intra-node data transfer: SysV, posix, knem, CMA, xpmem.
· MXM: Enabled Adaptive Routing for all the transport layers (UD/RC/DC). Memory registration optimization.
· Scalable Hierarchical Aggregation and Reduction Protocol (SHARP): Improved the out-of-the-box performance of SHARP.
· Shared memory: Improved the intra-node performance of allreduce and barrier.
· Configuration: Changed many default parameter settings in order to achieve the best out-of-the-box experience for several applications, including CP2K, miniDFT, VASP, DL-POLY, Amber, Fluent, GAMES-UK, and LS-DYNA.
· FCA: As of HPC-X v1.9, FCA v2.5 is no longer included in the HPC-X package. Improved the AlltoAllv algorithm, large data allreduce, and the UCX BCOL.
· OS architecture: Added support for ARM architecture.

Rev 1.8.2
· MXM: Updated MXM version to 3.6.2098, which includes a memory registration optimization.

Rev 1.8
· Cross Channel (CC):
  · Added Cross Channel (CC) AlltoAllv.
  · Added CC zcpy Ring Bcast.
· Scalable Hierarchical Aggregation and Reduction Protocol (SHARP): Added SHARP non-blocking collectives.
· Shared memory POWER: Added shared memory POWER optimizations for allreduce and Barrier.
· Mixed data types: Added support for mixed data types.
· Non-contiguous Bcast: Added support for non-contiguous Bcast with UMR or SGE in CC.
· UMR: Added UMR support in the CC bcol.
· Unified Communication X Framework (UCX): A new acceleration library, integrated into Open MPI (as a pml layer) and available as part of HPC-X. It is an open-source communication library designed to achieve the highest performance for HPC applications.
· HPC-X Content: Updated the following communications libraries and acceleration packages versions:
  · HCOLL updated to v3.7.
  · Open MPI updated to v2.10.
· FCA: FCA 2.x is no longer the default FCA used in HPC-X. As of HPC-X v1.8, FCA 3.x (HCOLL) is the default FCA and replaces FCA v2.x.
· Bug Fixes: See Section 4, "Bug Fixes History".

Rev 1.7
· MXM: Updated MXM version to 3.6.
· FCA Collective:
  · Added Cross-Channel based Allgather, Bcast, and 8-byte Allreduce.
  · Added MPI datatype support.
  · Added optimizations for PPC platforms.
· FCA:
  · Added support for multiple Mellanox SHARP technology leaders on a single host.
  · Added support for collecting Mellanox SHARP technology usage statistics.
  · Exposed cross-channel non-blocking collectives to the MPI level.

Rev 1.6
· MXM v3.5: See Section 5.3, "MXM Change Log History".
· IB-Router: Allows hosts that are located on different IB subnets to communicate with each other. This support is currently available when using the 'openib btl' in Open MPI.
  Note: When using 'openib btl', RoCE and IB router are mutually exclusive. The Open MPI inside HPC-X 1.6 is not compiled with ib-router support and therefore supports RoCE out-of-the-box.
· FCA v3.5: See Section 5.2, "FCA Change Log History".

Rev 1.5
· HPC-X Content: Updated the following communications libraries and acceleration packages versions:
  · Open MPI updated to v1.10
  · UPC updated to 2.22.0
  · MXM updated to v3.4.369
  · FCA updated to v3.4.799
· MXM v3.4.369: See Section 5.3, "MXM Change Log History".
· FCA v3.4.799: See Section 5.2, "FCA Change Log History".

Rev 1.4
· FCA v3.3: See Section 5.2, "FCA Change Log History".
· MXM v3.4: See Section 5.3, "MXM Change Log History".

Rev 1.3
· MLNX_OFED: Added support for OFED Inbox drivers.
· CPU Architecture: Added support for PPC architecture.
· LID Mask Control (LMC): Added support for using multiple LIDs when the LMC in the fabric is higher than zero. MXM will use multiple LIDs to distribute traffic across multiple links and achieve better resource utilization.
· Performance: Performance improvements for all transport layers.
· Adaptive Routing: Enhanced support for Adaptive Routing for the UD transport layer. For further information, please refer to the HPC-X User Manual section "Adaptive Routing for UD Transport".
· UD zero copy: UD zero copy support on the receiver side to achieve better bandwidth utilization and reduce CPU usage.

5.2 FCA Change Log History

Table 6 - FCA Change Log History

Rev 3.5
· FCA Collective: Added MPI Allgatherv and MPI Reduce.
· FCA:
  · Added support for the Mellanox SHARP library (including SHARP allreduce, reduce, and barrier).
  · Enhanced scalability for CORE-Direct based collectives.
  · Added support for complex data types.

Rev 3.4
· General: UCX support.
· Collectives:
  · Communicator caching scheme with eviction: improves job start and communicator creation time.
  · Added Alltoallv and Alltoall small message algorithms.
Rev 3.3
· General:
  · Ported to PowerPC.
  · Thread safety added.
· Collectives:
  · Improved large message allreduce algorithm (enabled by default).
  · Beta version of network topology awareness (enabled by default).

Rev 3.0
· MPI collectives:
  · Offload collectives communication from the MPI process onto Mellanox interconnect hardware.
  · Efficient collectives communication flow optimized to job and topology.
  · Significantly reduced MPI collectives runtime.
· MPI-3: Native support for MPI-3.
· Blocking and non-blocking collectives: Support for blocking and non-blocking collectives.
· HCOLL: Supports hierarchical communication algorithms (HCOLL).
· Collective algorithm: Supports multiple optimizations within a single collective algorithm.
· Performance: Increased CPU availability and efficiency for increased application performance.
· MPI libraries: Seamless integration with MPI libraries and job schedulers.

Rev 2.5
· Multicast Group: Added an MCG (Multicast Group) cleanup tool.
· Performance: Performance improvements.

Rev 2.2
· Performance: Performance improvements.
· Dynamic offloading rules: Enabled dynamic offloading rules configuration based on the data type and reduce operations.
· Mixed MTU: Added support for mixed MTU.

Rev 2.1.1
· AMD/Interlagos CPUs: Added support for AMD/Interlagos CPUs.

Rev 2.1
· Core-Direct®: Added support for Mellanox Core-Direct® technology (enables offloading collective operations to the HCA).
· Non-contiguous data layouts: Added support for non-contiguous data layouts.
· PGI compilers: Added support for PGI compilers.

5.3 MXM Change Log History

Table 7 - MXM Change Log History

Rev 3.6
· General: Updated MXM version to 3.6.

Rev 3.5
· Performance: Performance improvements.

Rev 3.4.369
· Initialization: Job startup performance optimization.
· Supported Transports: DC enhancements and startup optimizations.

Rev 3.4
· Supported Transports: Optimizations for the DC transport at scale.

Rev 3.3
· LID Mask Control (LMC): Added support for using multiple LIDs when the LMC in the fabric is higher than zero. MXM will use multiple LIDs to distribute traffic across multiple links and achieve better resource utilization.
· Adaptive Routing: Enhanced support for Adaptive Routing for the UD transport layer.
· UD zero copy: UD zero copy support on the receiver side to achieve better bandwidth utilization and reduce CPU usage.

Rev 3.2
· Atomic Operations: Added hardware atomic operations support in the RC and DC transport layers for 32-bit and 64-bit operands. This feature is set to ON by default. To disable it, run:
  oshrun -x MXM_CIB_USE_HW_ATOMICS=n ...
  Note: If hardware atomic operations are disabled, software atomics are used instead.
· MXM API: Added two additional functions (mxm_ep_wireup() and mxm_ep_powerdown()) to the MXM API to allow pre-connection establishment for MXM (rather than on-demand). For further information, please refer to the HPC-X User Manual section "MXM Performance Tuning".
· Event Interrupt: Added a solicited event interrupt for the rendezvous protocol for a potential performance improvement. For further information, please refer to the HPC-X User Manual section "MXM Performance Tuning".
· Performance: Performance improvements for the DC transport layer.
· Adaptive Routing: Added Adaptive Routing for the UD transport layer. For further information, please refer to the HPC-X User Manual section "Adaptive Routing for UD Transport".
Rev 3.0
· Service Level: Service Level support (at Alpha level).
· Adaptive Routing: Adaptive Routing support in UD transport layers.
· Supported Transports: Dynamically Connected Transport (DC) (at GA level).
· Performance: Performance optimizations.

Rev 2.1
· Supported Transports:
  · Dynamically Connected Transport (DC) (at Beta level).
  · RC is currently fully supported.
  · KNEM support for intra-node communication.
· Performance: Performance optimizations.

Rev 2.0
· Reliable Connected: Added Reliable Connection (RC) support (at Beta level).
· MXM Binding: An MXM process can be pinned to a specific HCA port. MXM supports the following binding policies:
  · static - the user can specify a process-to-port map
  · cpu affinity based - the HCA port is bound automatically based on process affinity
· On-demand connection establishment: Added on-demand connection establishment between the processes.
· Performance: Performance tuning improvements.

Rev 1.5
· MXM over Ethernet: Added Ethernet support.
· Multi-Rail: Added Multi-Rail support.

5.4 HPC-X™ Open MPI/OpenSHMEM Change Log History

Table 8 - HPC-X™ Open MPI/OpenSHMEM Change Log History

Rev 2.2
· Performance: Added Sandy Bridge performance optimizations.
· memheap: Allocated memheap using contiguous memory provided by the HCA.
· ptmalloc allocator: Replaced the buddy memheap with the ptmalloc allocator.
· multiple pSync arrays: Added the option of using multiple pSync arrays instead of barrier synchronization between collective routines (fcollect, reduction routines).
· spml yoda: Optimized small-size puts.
· Performance: Performance optimization.
· Memory footprint optimizations: Added memory footprint optimizations.

Rev 1.8.2
· Acceleration Packages: Added support for new MXM, FCA, and HCOLL versions.
· Job start optimization: Added job start optimization.
· Performance: Performance improvements.