HPC-X Toolkit Release Notes v2.3
Mellanox HPC-X™ Software Toolkit Release Notes
Rev 2.3

www.mellanox.com

Mellanox Technologies

NOTE: THIS HARDWARE, SOFTWARE OR TEST SUITE PRODUCT ("PRODUCT(S)") AND ITS RELATED DOCUMENTATION ARE PROVIDED BY MELLANOX TECHNOLOGIES "AS-IS" WITH ALL FAULTS OF ANY KIND AND SOLELY FOR THE PURPOSE OF AIDING THE CUSTOMER IN TESTING APPLICATIONS THAT USE THE PRODUCTS IN DESIGNATED SOLUTIONS. THE CUSTOMER'S MANUFACTURING TEST ENVIRONMENT HAS NOT MET THE STANDARDS SET BY MELLANOX TECHNOLOGIES TO FULLY QUALIFY THE PRODUCT(S) AND/OR THE SYSTEM USING IT. THEREFORE, MELLANOX TECHNOLOGIES CANNOT AND DOES NOT GUARANTEE OR WARRANT THAT THE PRODUCTS WILL OPERATE WITH THE HIGHEST QUALITY. ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT ARE DISCLAIMED. IN NO EVENT SHALL MELLANOX BE LIABLE TO CUSTOMER OR ANY THIRD PARTIES FOR ANY DIRECT, INDIRECT, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES OF ANY KIND (INCLUDING, BUT NOT LIMITED TO, PAYMENT FOR PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY FROM THE USE OF THE PRODUCT(S) AND RELATED DOCUMENTATION EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Mellanox Technologies 350 Oakmead Parkway Suite 100 Sunnyvale, CA 94085 U.S.A. www.mellanox.com Tel: (408) 970-3400 Fax: (408) 970-3403

© Copyright 2018. Mellanox Technologies Ltd. All Rights Reserved.
Mellanox®, Mellanox logo, Connect-IB®, ConnectX®, CORE-Direct®, GPUDirect®, LinkX®, Mellanox Multi-Host®, Mellanox Socket Direct®, UFM®, and Virtual Protocol Interconnect® are registered trademarks of Mellanox Technologies, Ltd.
For the complete and most updated list of Mellanox trademarks, visit http://www.mellanox.com/page/trademarks.
All other trademarks are property of their respective owners.


Table of Contents
List Of Tables
Release Update History
Chapter 1 Overview
  1.1 HPC-X™ Requirements
  1.2 HPC-X™ Content
  1.3 Important Notes
Chapter 2 Changes and New Features
Chapter 3 Known Issues
Chapter 4 Bug Fixes History
Chapter 5 Change Log History
  5.1 HPC-X Toolkit Change Log History
  5.2 FCA Change Log History
  5.3 MXM Change Log History
  5.4 HPC-X™ Open MPI/OpenSHMEM Change Log History


List Of Tables

Table 1: Release Update History
Table 2: Changes and New Features
Table 3: Known Issues
Table 4: Bug Fixes History
Table 5: HPC-X Toolkit Change Log History
Table 6: FCA Change Log History
Table 7: MXM Change Log History
Table 8: HPC-X™ Open MPI/OpenSHMEM Change Log History


Release Update History

Table 1 - Release Update History

Release: Rev 2.3
Date: December 3, 2018
Description: Initial release of this HPC-X version.


1 Overview
These are the release notes for Mellanox HPC-X™ Rev 2.3. The Mellanox HPC-X™ Software Toolkit is a comprehensive software package that includes Open MPI, OpenSHMEM, PGAS, MXM, UCX, and the FCA tool suite for high-performance computing environments. HPC-X provides enhancements that significantly increase the scalability and performance of message communications in the network. HPC-X™ enables you to rapidly deploy and deliver maximum application performance without the complexity and costs of licensed third-party tools and libraries.

1.1 HPC-X™ Requirements

The platform and requirements for HPC-X are detailed in the following table:

Platform: Drivers and HCAs

OFED / MLNX_OFED:
· OFED 1.5.3
· MLNX_OFED 4.3-x.x.x and above

HCAs:
· ConnectX®-5 / ConnectX®-5 Ex
· ConnectX®-4 / ConnectX®-4 Lx
· ConnectX®-3 / ConnectX®-3 Pro
· Connect-IB®

1.2 HPC-X™ Content

The following communications libraries and acceleration packages are part of the Mellanox HPC-X™ Rev 2.3 package:

Library/Acceleration Package: Version Number
· Open MPI: 4.0.x
· Mellanox Scalable Hierarchical Aggregation and Reduction Protocol (SHARP): 1.7.2
· HCOLL: 4.2
· UCX: 1.5
· Open SHMEM specification compliant: 1.4 (a)

a. Open SHMEM v1.4 compliance is at beta level.

1.3 Important Notes
When HPC-X is launched in an environment without a resource manager (Slurm, PBS, etc.) installed, or from a compute node, it uses the Open MPI default rsh/ssh-based launcher, which does not propagate the library path to the compute nodes.
In such a case, pass the LD_LIBRARY_PATH variable as follows:
% mpirun -x LD_LIBRARY_PATH -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_c
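OpenSHMEM programs launched with oshrun are affected in the same way. A minimal sketch of the equivalent launch; the application name and rank count are placeholders, not part of the HPC-X package:

% oshrun -x LD_LIBRARY_PATH -np 2 ./my_oshmem_app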


2 Changes and New Features

HPC-X™ Rev 2.3 provides the following changes and new features:

Table 2 - Changes and New Features

HPC-X Content:
· Updated the following communications libraries and acceleration packages versions:
  · Open MPI version 4.0.x
  · Mellanox Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) version 1.7.2
  · HCOLL version 4.2
  · UCX version 1.5
  · OpenSHMEM version 1.4

UCX:
· UCX is now compiled without Java bindings.
· Added support for running UCX over rdma-core, for DC transport and direct verbs.
· Emulation layer: added the ability to run UCX over software emulation of remote memory access and atomic operations. This provides full support of SHMEM and MPI-RMA over shared memory, TCP, and older RDMA hardware, such as the ConnectX-3 HCA.

HCOLL:
· HCOLL and Mellanox SHARP are now compiled with CUDA support.
· Added support for CUDA buffers over the SRA allreduce algorithm.

MXM:
· Removed support for the MXM library.

OpenMPI:
· Added the following configuration options to OMPI (a hedged configure sketch follows this table):
  · --with-libevent=internal
  · --enable-mpi1-compatibility
· Updated the platform/mellanox/optimized configuration file in the OMPI upstream by removing BTL openib and UCT support and removing links to MXM/FCA usage.
· Removed PMI2 support.
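For reference, a minimal sketch of how these options would be passed if the bundled Open MPI were rebuilt from a generic Open MPI 4.0.x source tree; the installation prefix and job count are placeholders:

% ./configure --with-libevent=internal --enable-mpi1-compatibility --prefix=/opt/openmpi-4.0
% make -j8 all && make install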


3 Known Issues
The following is a list of general limitations and known issues of the various components of this HPC-X release.

Table 3 - Known Issues

Internal Ref.: 1582208
Description: Sending data over multiple SHMEM contexts may lead to memory corruption or a segmentation fault.
Workaround: Add "-x UCX_SELF_DEVICES=" to the oshrun command line, or set the variable before the run (see the example commands below).
Keywords: Open SHMEM, segmentation fault
Discovered in Version: 2.3 (Open MPI v4.0.x, OpenSHMEM v1.4)

Internal Ref.: -
Description: The OSC UCX module is not selected by default on ConnectX-4/ConnectX-5 HCAs.
Workaround: Add the command line argument "-mca osc ucx" when running mpirun (see the example commands below).
Keywords: OSC UCX, one-sided, Open MPI
Discovered in Version: 2.3 (Open MPI v4.0.x, OpenSHMEM v1.4)

Internal Ref.: -
Description: Zero-length OpenSHMEM collectives may fail due to an incomplete implementation.
Workaround: N/A
Keywords: OpenSHMEM atomic, Open MPI
Discovered in Version: 2.3 (Open MPI v4.0.x, OpenSHMEM v1.4)

Internal Ref.: -
Description: OpenSHMEM atomic operations AND/OR/XOR for the datatypes int32/int64/uint32/uint64 are not implemented, which may cause build failures.
Workaround: N/A
Keywords: OpenSHMEM atomic, Open MPI
Discovered in Version: 2.3 (Open MPI v4.0.x, OpenSHMEM v1.4)

Internal Ref.: 2934
Description: Open MPI and OpenSHMEM applications may hang with the DC transport. (GitHub issue: https://github.com/openucx/ucx/issues/2934)
Workaround: Use RC or UD instead of DC. To achieve that, make sure the UCX pml (or the UCX osc/spml for one-sided applications) is used, and that the UCX_TLS variable is set, for example UCX_TLS=rc,self,sm (see the example commands below).
Keywords: UCX, Open MPI, DC
Discovered in Version: 2.3 (Open MPI v4.0.x, OpenSHMEM v1.4)
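For convenience, the workarounds in the entries above expressed as complete launch commands. This is a sketch only: the application names and rank counts are placeholders.

% oshrun -x UCX_SELF_DEVICES= -np 2 ./my_oshmem_app              # disable the self device for SHMEM contexts
% mpirun -mca osc ucx -np 2 ./my_one_sided_app                   # explicitly select the OSC UCX module
% mpirun -mca pml ucx -x UCX_TLS=rc,self,sm -np 2 ./my_mpi_app   # avoid the DC transport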



Internal Ref.: 1307243
Description: One-sided tests may fail with a segmentation fault.
Workaround: In order to run one-sided tests, make sure to add "-mca osc ucx" to the command line.
Keywords: OSC UCX, Open MPI, one-sided
Discovered in Version: 2.1 (Open MPI 3.1.x)

Internal Ref.: -
Description: On ConnectX-4 and Connect-IB HCAs, when the DC transport is used at large scale, "Retry exceeded" messages may be printed from UCX.
Workaround: Configure SL2VL on the OpenSM in the fabric and make UCX use SL=1 for the InfiniBand transports via '-x UCX_IB_SL=1' (see the example commands below).
Keywords: UCX, DC transport, ConnectX-4, Connect-IB
Discovered in Version: 2.1 (UCX 1.3)

Internal Ref.: -
Description: When UCX requires more memory than the space defined in the /proc/sys/kernel/shmmni file, the following message is printed from UCX: "... total number of segments in the system (%lu) would exceed the limit in /proc/sys/kernel/shmmni (=%lu)... please check shared memory limits by 'ipcs -l'".
Workaround: Follow the instructions in the error message and increase the number of shared memory segments in the /proc/sys/kernel/shmmni file (see the example commands below).
Keywords: UCX, memory
Discovered in Version: 2.1 (UCX 1.3)

Internal Ref.: 2226
Description: The following assertion may fail in certain cases: Assertion `ep->rx.ooo_pkts.head_sn == neth->psn' failed. (GitHub issue: https://github.com/openucx/ucx/issues/2226)
Workaround: Set the DC transport using the UCX_TLS parameter.
Keywords: UCX, assertion
Discovered in Version: 2.1 (UCX 1.3)

Internal Ref.: -
Description: The Mellanox SHARP library is not available in HPC-X for the Community OFED and Inbox OFED.
Workaround: N/A
Keywords: Mellanox SHARP library
Discovered in Version: 2.0

Internal Ref.: 1162
Description: UCX currently does not support canceling send requests. (GitHub issue: https://github.com/openucx/ucx/issues/1162)
Workaround: N/A
Keywords: UCX
Discovered in Version: 2.0
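Two of the workarounds above involve runtime settings; as commands they might look like the following sketch. The rank count and the shmmni value of 8192 are arbitrary examples and must be chosen to fit the system.

% mpirun -x UCX_IB_SL=1 -np 2 ./my_mpi_app     # use SL=1 for the InfiniBand transports
% sudo sysctl -w kernel.shmmni=8192            # raise the shared memory segment limit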


Internal Ref.: -
Description: UCX jobs may hang when SocketDirect/MultiHost/SR-IOV is used.
Workaround: Set UCX_IB_ADDR_TYPE=ib_global (see the example command below).
Keywords: UCX

Internal Ref.: -
Description: Since the UCX embedded in HPC-X is compiled with AVX support, UCX cannot run on hosts without AVX support. If AVX is not available, recompile the UCX that is available in HPC-X with the option --with-avx=no.
Workaround: Recompile UCX with AVX disabled:
$ ./utils/hpcx_rebuild.sh --rebuild-ucx --ucx-extra-config "--with-avx=no"
Keywords: UCX
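The SocketDirect/MultiHost/SR-IOV workaround as a launch command (a sketch; the application name and rank count are placeholders):

% mpirun -x UCX_IB_ADDR_TYPE=ib_global -np 2 ./my_mpi_app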


4 Bug Fixes History
Table 4 lists the bugs fixed in this release.

Table 4 - Bug Fixes History

Internal Ref.: -
Description: Fixed an issue where using UCX on ARM hosts could result in hangs, due to a known Open MPI issue when running on ARM.
Keywords: UCX
Discovered in Version: 1.3 (Open MPI 1.8.2)
Fixed in Version: 2.3 (Open MPI 4.0.x)

Internal Ref.: -
Description: The MCA options rmaps_dist_device and rmaps_base_mapping_policy are now functional.
Keywords: Process binding policy, NUMA/HCA locality
Discovered in Version: 2.0 (Open MPI 3.0.0)
Fixed in Version: 2.3 (Open MPI 4.0.x)

Internal Ref.: 2111
Description: Fixed an issue where, when UCX was used in multi-threaded mode, the osu_latency_mt test could take a long time to complete. (GitHub issue: https://github.com/openucx/ucx/issues/2111)
Keywords: UCX, multi-threaded
Discovered in Version: 2.1 (UCX 1.3)
Fixed in Version: 2.3 (UCX 1.5)

Internal Ref.: 2267
Description: Fixed an issue where the following error message could appear when running at a scale of 256 ranks with the RC transport, when UD is used for wireup only: "Fatal: send completion with error: Endpoint timeout". (GitHub issue: https://github.com/openucx/ucx/issues/2267)
Keywords: UCX
Discovered in Version: 2.1 (UCX 1.3)
Fixed in Version: 2.3 (UCX 1.5)


Internal Ref.: 2702
Description: Fixed an issue where, when using the Hardware Tag Matching feature, the following error messages could be printed:
· "rcache.c:481 UCX WARN failed to register region 0xdec25a0 [0x2b7139ae0020..0x2b7139ae2020]: Input/output error"
· "ucp_mm.c:105 UCX ERROR failed to register address 0x2b7139ae0020 length 8192 on md[1]=ib/mlx5_0: Input/output error"
· "ucp_request.c:259 UCX ERROR failed to register user buffer datatype 0x20 address 0x2b7139ae0020 len 8192: Input/output error"
(GitHub issue: https://github.com/openucx/ucx/issues/2702)
Keywords: Hardware Tag Matching
Discovered in Version: 2.2 (UCX 1.4)
Fixed in Version: 2.3 (UCX 1.5)

Internal Ref.: 2454
Description: Fixed an issue where some one-sided benchmarks could hang when using "osc ucx", for example osu-micro-benchmarks-5.3.2/osu_get_acc_latency (Latency Test for accumulate with Active/Passive Synchronization). (GitHub issue: https://github.com/openucx/ucx/issues/2454)
Keywords: UCX, one_sided
Discovered in Version: 2.2 (UCX 1.4)
Fixed in Version: 2.3 (UCX 1.5)

Internal Ref.: 2670
Description: Fixed an issue where, when the Hardware Tag Matching feature was enabled at large scale, the following error message could be printed due to the increased threshold for BCOPY messages: "mpool.c:177 UCX ERROR Failed to allocate memory pool chunk: Out of memory." (GitHub issue: https://github.com/openucx/ucx/issues/2670)
Keywords: Hardware Tag Matching
Discovered in Version: 2.2 (UCX 1.4)
Fixed in Version: 2.3 (UCX 1.5)

Internal Ref.: 1295679
Description: Fixed an issue where the OpenSHMEM group cache had a default limit of 100 entries, which could cause an OpenSHMEM application to exit with the following message: "group cache overflow on rank xxx: cache_size = 100".
Keywords: OpenSHMEM, Open MPI
Discovered in Version: 2.1 (Open MPI 3.1.x)
Fixed in Version: 2.2 (Open MPI 3.1.x)


Internal Ref.: -
Description: Fixed an issue where UCX did not work out-of-the-box with CUDA support.
Keywords: UCX, CUDA
Discovered in Version: 2.2 (UCX 1.4)
Fixed in Version: 2.1 (UCX 1.3)

Internal Ref.: 1926
Description: Fixed an issue where, when using multiple transports, invalid data was sent out-of-sync with Hardware Tag Matching traffic. (GitHub issue: https://github.com/openucx/ucx/issues/1926)
Keywords: Hardware Tag Matching
Discovered in Version: 2.1 (UCX 1.3)
Fixed in Version: 2.2 (UCX 1.4)

Internal Ref.: 1949
Description: Fixed an issue where Hardware Tag Matching might not have functioned properly with UCX over the DC transport. (GitHub issue: https://github.com/openucx/ucx/issues/1949)
Keywords: UCX, Hardware Tag Matching, DC transport
Discovered in Version: 2.0
Fixed in Version: 2.1

Internal Ref.: -
Description: Fixed job data transfer from SD to libsharp.
Keywords: Mellanox SHARP library
Discovered in Release: 1.9
Fixed in Release: 1.9.7

Internal Ref.: 884482
Description: Fixed internal HCOLL datatype mapping.
Keywords: HCOLL, FCA
Discovered in Release: 1.7.405
Fixed in Release: 1.7.406

Internal Ref.: 884508
Description: Fixed internal HCOLL datatype lower bound calculation.
Keywords: HCOLL, FCA
Discovered in Release: 1.7.405
Fixed in Release: 1.7.406

Internal Ref.: 884490
Description: Fixed allgather unpacking issues.
Keywords: HCOLL, FCA
Discovered in Release: 1.7.405
Fixed in Release: 1.7.406


Internal Ref.: 885009
Description: Fixed wrong answer in alltoallv.
Keywords: HCOLL, FCA
Discovered in Release: 1.7.405
Fixed in Release: 1.7.406

Internal Ref.: 882193
Description: Fixed mcast group leak in HCOLL.
Keywords: HCOLL, FCA
Discovered in Release: 1.7.405
Fixed in Release: 1.7.406

Internal Ref.: -
Description: Added IN_PLACE support for alltoall, alltoallv, and allgatherv.
Keywords: HCOLL, FCA
Discovered in Release: 1.7.405
Fixed in Release: 1.7.406

Internal Ref.: -
Description: Fixed an issue related to multi-threaded MPI_Bcast.
Keywords: HCOLL, FCA
Discovered in Release: 1.7.405
Fixed in Release: 1.7.406

Internal Ref.: Salesforce: 316541
Description: Fixed a memory barrier issue in MPI_Barrier on Power PPC systems.
Keywords: HCOLL, FCA
Discovered in Release: 1.7.405
Fixed in Release: 1.7.406

Internal Ref.: Salesforce: 316547
Description: Fixed multi-threaded MPI_COMM_DUP and MPI_COMM_SPLIT hanging issues.
Keywords: HCOLL, FCA
Discovered in Release: 1.7.405
Fixed in Release: 1.7.406

Internal Ref.: 894346
Description: Fixed Quantum Espresso hanging issues.
Keywords: HCOLL, FCA
Discovered in Release: 1.7.405
Fixed in Release: 1.7.406

Internal Ref.: 898283
Description: Fixed an issue which caused CP2K applications to hang when HCOLL was enabled.
Keywords: HCOLL, FCA
Discovered in Release: 1.7.405
Fixed in Release: 1.7.406


Internal Ref.: 906155
Description: Fixed an issue which caused VASP applications to hang in MPI_Allreduce.
Keywords: HCOLL, FCA
Discovered in Release: 1.6
Fixed in Release: 1.7.406


5 Change Log History

5.1 HPC-X Toolkit Change Log History
Table 5 - HPC-X Toolkit Change Log History

Rev 2.2

HPC-X Content:
· Updated the following communications libraries and acceleration packages versions:
  · Mellanox Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) version 1.7
  · HCOLL version 4.1
  · UCX version 1.4
· Added support for Singularity containerization. For further information, please refer to the HPC-X User Manual.
· "osc ucx" is no longer the default one-sided component in Open MPI.
· Removed the KNEM library from the HPC-X package. UCX will use the KNEM available in MLNX_OFED.

MXM Support:
· Open MPI and HCOLL are not compiled with MXM anymore. Both are compiled with UCX only and use it by default.

UCX:
· Added support for the following UCX features:
  · New API for establishing client-server connections.
  · Out-of-box support for Memory In Chip (MEMIC) on ConnectX-5 HCAs.

HPC-X Setup:
· Added support for HPC-X to work on the Huawei ARM architecture.

HCOLL:
· Improved performance by utilizing zero-copy messaging for MPI Bcast.

Rev 2.1

HPC-X Content:
· Updated the following communications libraries and acceleration packages versions:
  · Open MPI version 3.1.x
  · Mellanox Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) version 1.5
  · HCOLL version 4.0
  · MXM version 3.7
  · UCX version 1.3
  · OpenSHMEM v1.3 specification compliant

UCX:
· UCX is now the default pml layer for Open MPI, the default spml layer for OpenSHMEM, and the default OSC component for MPI RMA.
· Added the following UCX features:
  · Added support for GPU memory in UCX communication libraries
  · Added support for the Multi-Rail protocol

MXM:
· The UD_RNDV_ZCOPY parameter is set to 'no' by default. This means that the zcopy mechanism for the UD transport is disabled when using the Rendezvous protocol.



HCOLL:
· UCX is now the default p2p transport in HCOLL
· Improved multi-threaded performance
· Improved shared memory performance
· Added support for Mellanox Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) v1.5
· Added support for Mellanox SHARP software multi-channel/multi-rail capable algorithms
· Improved Allreduce large message algorithm
· Improved AlltoAll algorithm

Profiling IB verbs API (ibprof):
· Removed the ibprof tool from the HPC-X toolkit.

UPC:
· Removed UPC from the HPC-X toolkit.

Rev 2.0

HPC-X Content:
· Updated the following communications libraries and acceleration packages versions:
  · Open MPI version 3.0.0
  · Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) version 1.4
  · HCOLL version 3.9
  · UCX version 1.3

UCX:
· UCX is now at GA level.
· Added the following UCX features:
  · [ConnectX-5 only] Added support for hardware Tag Matching with the DC transport.
  · [ConnectX-5 only] Added support for out-of-order RDMA RC and DC to support adaptive routing with true RDMA.
  · Hardware Tag Matching (see section Hardware Tag Matching in the User Manual)
  · SR-IOV Support (see section SR-IOV Support in the User Manual)
  · Adaptive Routing (AR) (see section Adaptive Routing in the User Manual)
  · Error Handling (see section Error Handling in the User Manual)



HCOLL:
· Added support for Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) v1.4
· Added support for NCCL on-host GPU based collectives.
· Added support for hierarchical GPU based allreduce using NCCL for scale-in and MXM/UCX for scale-out.
· Improved shared memory performance for allreduce, barrier, and broadcast, targeting high thread count systems, e.g. Power9.
· Improved large message allreduce (multi-radix, zero-copy fragmentation, CPU vectorization).
· Added a new and improved AlltoAllv algorithm: hybrid logarithmic pairwise exchange.
· Added support for on-demand HCOLL memory, which improves HCOLL's memory footprint on high thread count systems, e.g. Power9.
· Added a high performance multithreaded implementation to support MPI_THREAD_MULTIPLE applications, designed specifically for high thread count systems, e.g. Power9.
· HCOLL startup improvements.

Open MPI / OpenSHMEM:
· Added support for Open MPI 3.0.0.
· Added support for the xpmem kernel module.
· Added a high performance implementation of shmem_ptr() with the UCX SPML.
· Added a UCX allocator. The UCX allocator optimizes intra-node communication by allowing direct access to the memories of processes on the same node. The UCX allocator can only be used with the UCX SPML.
· Added a UCX one-sided component to support MPI RMA operations.

Rev 1.9.7

Scalable Hierarchical Aggregation and Reduction Protocol (SHARP):
· Bug fixes, see Section 4, "Bug Fixes History", on page 11

Rev 1.9

HPC-X Content:
· Updated the following communications libraries and acceleration packages versions:
  · Open MPI version 2.1.2a1
  · Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) version 1.3.1
  · HCOLL version 3.8.1652
  · MXM version 3.6.3103
  · UCX version 1.2.2947



UCX:
· Point-to-point communication API with tag matching, remote memory access, and atomic operations, over the InfiniBand transport. This can be used to implement MPI, PGAS, and Big Data libraries and applications.
· A cleaner API with lower software overhead, which provides better performance, especially for small messages.
· Support for a multitude of InfiniBand transports and Mellanox offloads to optimize data transfer performance:
  · RDMA
  · DC
  · Out-of-order
  · HW tag matching offload
  · Registration cache
  · ODP
· Shared memory communications for optimal intra-node data transfer:
  · SysV
  · posix
  · knem
  · CMA
  · xpmem

MXM:
· Enabled Adaptive Routing for all the transport layers (UD/RC/DC).
· Memory registration optimization.

Scalable Hierarchical Aggregation and Reduction Protocol (SHARP):
· Improved the out-of-the-box performance of Scalable Hierarchical Aggregation and Reduction Protocol (SHARP).

Shared memory:
· Improved the intra-node performance of allreduce and barrier.

Configuration:
· Changed many default parameter settings in order to achieve the best out-of-the-box experience for several applications, including CP2K, miniDFT, VASP, DL-POLY, Amber, Fluent, GAMES-UK, and LS-DYNA.

FCA:
· As of HPC-X v1.9, FCA v2.5 is no longer included in the HPC-X package.
· Improved AlltoAllv algorithm.
· Improved large data allreduce.
· Improved UCX BCOL.

OS architecture:
· Added support for ARM architecture.

Rev 1.8.2

MXM:
· Updated MXM version to 3.6.2098, which includes memory registration optimization.

Rev 1.8

Cross Channel (CC):
· Added Cross Channel (CC) AlltoAllv
· Added CC zcpy Ring Bcast



Scalable Hierarchical Aggregation and Reduction Protocol (SHARP):
· Added Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) non-blocking collectives

Shared memory POWER:
· Added shared memory POWER optimizations for allreduce
· Added shared memory POWER optimizations for Barrier

Mixed data types:
· Added support for mixed data types

Non-contiguous Bcast:
· Added support for non-contiguous Bcast with UMR or SGE in CC

UMR:
· Added UMR support in the CC bcol

Unified Communication X Framework (UCX):
· A new acceleration library, integrated into Open MPI (as a pml layer) and available as part of HPC-X. It is an open source communication library designed to achieve the highest performance for HPC applications.

HPC-X Content:
· Updated the following communications libraries and acceleration packages versions:
  · HCOLL updated to v3.7
  · Open MPI updated to v2.10

FCA:
· FCA 2.x is no longer the default FCA used in HPC-X. As of HPC-X v1.8, FCA 3.x (HCOLL) is the default FCA used and it replaces FCA v2.x.

Bug Fixes:
· See Section 4, "Bug Fixes History", on page 11

Rev 1.7

MXM:
· Updated MXM version to 3.6

FCA Collective:
· Added Cross-Channel based Allgather, Bcast, and 8-byte Allreduce.
· Added MPI datatype support.
· Added optimizations for PPC platforms.

FCA:
· Added support for multiple Mellanox SHARP technology leaders on a single host.
· Added support for collecting Mellanox SHARP technology usage statistics.
· Exposed cross-channel non-blocking collectives to the MPI level.

Rev 1.6

MXM v3.5:
· See Section 5.3, "MXM Change Log History", on page 23

IB-Router:
· Allows hosts that are located on different IB subnets to communicate with each other. This support is currently available when using the 'openib btl' in Open MPI. Note: When using 'openib btl', RoCE and IB router are mutually exclusive. The Open MPI inside HPC-X 1.6 is not compiled with ib-router support, therefore it supports RoCE out-of-the-box.

FCA v3.5:
· See Section 5.2, "FCA Change Log History", on page 21

Rev 1.5



HPC-X Content:
· Updated the following communications libraries and acceleration packages versions:
  · Open MPI updated to v1.10
  · UPC updated to 2.22.0
  · MXM updated to v3.4.369
  · FCA updated to v3.4.799

MXM v3.4.369: See Section 5.3, "MXM Change Log History", on page 23
FCA v3.4.799: See Section 5.2, "FCA Change Log History", on page 21

Rev 1.4

FCA v3.3: See Section 5.2, "FCA Change Log History", on page 21
MXM v3.4: See Section 5.3, "MXM Change Log History", on page 23

Rev 1.3

MLNX_OFED: Added support for OFED Inbox drivers
CPU Architecture: Added support for PPC architecture
LID Mask Control (LMC): Added support for multiple LIDs usage when the LMC in the fabric is higher than zero. MXM will use multiple LIDs to distribute traffic across multiple links and achieve better resource utilization.
Performance: Performance improvements for all transport layers.
Adaptive Routing: Enhanced support for Adaptive Routing for the UD transport layer. For further information, please refer to the HPC-X User Manual section "Adaptive Routing for UD Transport".
UD zero copy: UD zero copy support on the receiver side to achieve better bandwidth utilization and reduce CPU usage.

5.2 FCA Change Log History

Table 6 - FCA Change Log History

Rev 3.5

FCA Collective: Added MPI Allgatherv and MPI reduce
FCA:
· Added support for the Mellanox SHARP library (including SHARP allreduce, reduce, and barrier)
· Enhanced scalability for CORE-Direct based collectives
· Added support for complex data types

Rev 3.4


General:
· UCX support
· Communicator caching scheme with eviction: improves job start and communicator creation time
Collectives: Added Alltoallv and Alltoall small message algorithms.

Rev 3.3

General:
· Ported to PowerPC
· Thread safety added
Collectives:
· Improved large message allreduce algorithm (enabled by default)
· Beta version of network topology awareness (enabled by default)

Rev 3.0

Collectives:
· Offload collectives communication from the MPI process onto Mellanox interconnect hardware
· Efficient collectives communication flow optimized to job and topology
MPI collectives: Significantly reduce MPI collectives runtime
MPI-3: Native support for MPI-3
Blocking and Non-blocking collectives: Support for blocking and non-blocking collectives
HCOLL: Supports hierarchical communication algorithms (HCOLL)
Collective algorithm: Supports multiple optimizations within a single collective algorithm
Performance: Increase CPU availability and efficiency for increased application performance
MPI libraries: Seamless integration with MPI libraries and job schedulers

Rev 2.5

Multicast Group: Added MCG (Multicast Group) cleanup tool
Performance: Performance improvements

Rev 2.2

Performance: Performance improvements
Dynamic offloading rules: Enabled dynamic offloading rules configuration based on the data type and reduce operations
Mixed MTU: Added support for mixed MTU

Rev 2.1.1

AMD/Interlagos CPUs: Added support for AMD/Interlagos CPUs

Rev 2.1

Core-Direct®: Added support for Mellanox Core-Direct® technology (enables offloading collective operations to the HCA)


Non-contiguous data layouts: Added support for non-contiguous data layouts
PGI compilers: Added support for PGI compilers

5.3 MXM Change Log History

Table 7 - MXM Change Log History

Rev 3.6

General: Updated MXM version to 3.6

Rev 3.5

Performance: Performance improvements

Rev 3.4.369

Initialization: Job startup performance optimization
Supported Transports: DC enhancements and startup optimizations

Rev 3.4

Supported Transports: Optimizations for the DC transport at scale

Rev 3.3

LID Mask Control (LMC): Added support for multiple LIDs usage when the LMC in the fabric is higher than zero. MXM will use multiple LIDs to distribute traffic across multiple links and achieve better resource utilization.
Adaptive Routing: Enhanced support for Adaptive Routing for the UD transport layer.
UD zero copy: UD zero copy support on the receiver side to achieve better bandwidth utilization and reduce CPU usage.

Rev 3.2

Atomic Operations: Added hardware atomic operations support in the RC and DC transport layers for 32-bit and 64-bit operands. This feature is set to ON by default. To disable it, run: oshrun -x MXM_CIB_USE_HW_ATOMICS=n ... Note: If hardware atomic operations are disabled, the software atomic is used instead.
MXM API: Added two additional functions (mxm_ep_wireup() and mxm_ep_powerdown()) to the MXM API to allow pre-connection establishment for MXM (rather than on-demand). For further information, please refer to the HPC-X User Manual section "MXM Performance Tuning".


Event Interrupt: Added solicited event interrupt for the rendezvous protocol for potential performance improvement. For further information, please refer to the HPC-X User Manual section "MXM Performance Tuning".
Performance: Performance improvements for the DC transport layer.
Adaptive Routing: Added Adaptive Routing for the UD transport layer. For further information, please refer to the HPC-X User Manual section "Adaptive Routing for UD Transport".

Rev 3.0

Service Level: Service Level support (at Alpha level)
Adaptive Routing: Adaptive Routing support in UD transport layers
Supported Transports: Dynamically Connected Transport (DC) (at GA level)
Performance: Performance optimizations

Rev 2.1

Supported Transports:
· Dynamically Connected Transport (DC) (at Beta level)
· RC is currently fully supported
· KNEM support for intra-node communication
Performance: Performance optimizations

Rev 2.0

Reliable Connected: Added Reliable Connection (RC) support (at beta level)
MXM Binding: The MXM process can be pinned to a specific HCA port. MXM supports the following binding policies:
· static - user can specify a process-to-port map
· cpu affinity based - the HCA port will be bound automatically based on process affinity
On-demand connection establishment: Added on-demand connection establishment between the processes
Performance: Performance tuning improvements

Rev 1.5

MXM over Ethernet: Added Ethernet support
Multi-Rail: Added Multi-Rail support


5.4 HPC-X™ Open MPI/OpenSHMEM Change Log History

Table 8 - HPC-X™ Open MPI/OpenSHMEM Change Log History

Rev 2.2

Performance: Added Sandy Bridge performance optimizations.
memheap: Allocated memheap using contiguous memory provided by the HCA.
ptmalloc allocator: Replaced the buddy memheap by the ptmalloc allocator.
multiple pSync arrays: Added the option of using multiple pSync arrays instead of barrier synchronization between collective routines (fcollect, reduction routines).
spml yoda: Optimized small size puts.
Performance: Performance optimization.
Memory footprint optimizations: Added memory footprint optimizations.

Rev 1.8.2

Acceleration Packages: Added support for new MXM, FCA, HCOLL versions.
Job start optimization: Added job start optimization.
Performance: Performance improvements.
