AIRI™ Configuration Guide with Cisco Nexus 9300 Switch
AIRI CONFIGURATION GUIDE
WHITE PAPER: SCALE-OUT AI-READY INFRASTRUCTURE ARCHITECTED BY PURE STORAGE AND NVIDIA WITH CISCO NEXUS 9300 SWITCH

TABLE OF CONTENTS

INTRODUCTION
HOW TO USE THIS GUIDE
SYSTEM ARCHITECTURE
PREREQUISITES
    Install Additional Package Dependencies
NETWORK CONFIGURATION
    Switch Configuration
    DGX-1 Configuration
    Configure Network Interfaces
    TCP Traffic Evaluation
    RDMA Traffic Evaluation
    FlashBlade Configuration
ADDITIONAL RESOURCES
APPENDIX A: HOROVOD CONTAINER IMAGE

INTRODUCTION

Recent advances in AI and deep learning hold tremendous promise for a new wave of innovation for enterprises, turning their data into intelligent applications and products. Researchers have fueled these advancements using high-performance GPUs and leveraging massive growth in available datasets. As enterprises seek to get started on their AI journey, however, they are often stuck with legacy technologies such as CPUs and spinning disk, and with the complexities of building an AI-ready infrastructure on them. Designing, configuring, and maintaining infrastructure that satisfies the challenges of large-scale deep learning requires a significant investment of time and resources to avoid unforeseen delays, bottlenecks, or downtime.

Engineers at NVIDIA® and Pure Storage® collaborated to address many of the complexities that come with scale-out infrastructure for AI workloads. AIRI™ is the industry's first complete AI-ready infrastructure, architected by Pure Storage and NVIDIA to extend the power of NVIDIA® DGX™ systems, enabling AI-at-scale for every enterprise. AIRI is a converged infrastructure stack, purpose-built for large-scale deep learning environments. The entire stack is configured and tested as a complete solution, avoiding the intricate configuration and tuning otherwise required. To learn more, please refer to the AIRI Cisco Reference Architecture.

FIGURE 1. "AIRI Mini" and AIRI with 100GbE Cisco Nexus 9000 Switches

HOW TO USE THIS GUIDE

This guide describes how to configure an AIRI system containing 4x NVIDIA DGX-1 servers and a Pure Storage FlashBlade. To optimize DGX-1 node-to-node communication, the configuration uses an RDMA over Converged Ethernet (RoCE) fabric.
The same high-performance fabric carries storage traffic from FlashBlade to the DGX-1 servers, simplifying system configuration and deployment. An Installation Engineer is available to assist with installing and configuring the system and to answer questions during setup.

SYSTEM ARCHITECTURE

The architecture for AIRI Mini and AIRI is shown below:

FIGURE 2. AIRI architecture

The AIRI architecture is designed for scale-out deep learning workloads and is not restricted to these sizes. As datasets and workload requirements scale, additional DGX-1 servers can be provisioned and instantly access all available data. Similarly, as storage capacity or performance demands grow, additional blades can be added to the FlashBlade system with zero downtime or reconfiguration.

PREREQUISITES

Configure the DGX-1 servers and FlashBlade according to their respective setup guides. For the initial setup, configure the 10Gb/s connections on the DGX-1 servers for management, and create a subnet and VIP on FlashBlade for this network. For the remainder of this guide, we assume four DGX-1 servers have been configured and given names (via /etc/hosts) dgx-X, where X is 1-4.

Install Additional Package Dependencies

These packages are necessary for the subsequent configuration of VLAN networking and TCP/IP network testing.

$ sudo apt-get update
$ sudo apt-get install -y vlan iperf3

NETWORK CONFIGURATION

The diagram below shows the overall network topology of AIRI, excluding a separate management network. In the diagram, FM-1/2 are the two internal fabric modules of FlashBlade, SW-1/2 are dual 100G switches, and DGX-1 (a), DGX-1 (b), DGX-1 (c), and DGX-1 (d) are the DGX-1 servers. Each of the numbers on SW-1/2 indicates the Ethernet port number, and the numbers on the DGX-1 label which of the Ethernet devices (enp5s0, enp12s0, etc.) is connected. The network topology for AIRI Mini is the same except that there are only two DGX-1 servers (DGX-1 (a) and DGX-1 (b)).

FIGURE 3. Overall network topology

Our reference configuration used Cisco 100G, 36-port Ethernet switches (Nexus 9336C-FX2, running NX-OS 7.0(3)I7(3)), but other switch models may be used in the future. We configure the network to support two classes of traffic over the unified fabric: a VLAN for TCP flows, which includes NFS access from FlashBlade, and the RDMA-over-Converged-Ethernet (RoCE) traffic used between DGX-1 servers during distributed deep learning. For more details on the network architecture and configuration, please refer to the AIRI Cisco Configuration Guide.

Each DGX-1 is connected into the overall topology by four distinct 100Gb/s links, with two ports connected to each of the two switches. The FlashBlade fabric modules have all 8 of their QSFP uplinks connected, with each fabric module connecting two ports into each switch. The two switches SW-1 and SW-2 have 4x 100Gb/s connections between them in an MLAG for inter-switch traffic.

There are two logical networks configured on this fabric. First, the DGX-1s use an untagged VLAN (id 1) to carry RoCE traffic. All four 100G ports on each DGX-1 are configured on this VLAN and participate in node-to-node RoCE communications. Second, there is a TCP network that runs in VLAN 3000. On each DGX-1, two of the 100G ports – enp5s0 and enp132s0 – are selected to carry VLAN 3000. These are placed in an active-backup bond in the Linux bonding driver. The RoCE traffic is configured in a priority-flow-control class on both the server endpoints and switches, allowing the more latency-sensitive traffic between DGX-1 servers to take priority over the TCP traffic used for storage access.

Switch Configuration

For each of the 100Gb/s data switches (Nexus 9336C-FX2), log into the switch console and issue the following commands:

! QOS class-maps for matching marked ROCE & control frames
class-map type qos match-all ROCE-class
  match cos 3
  match dscp 24-31
class-map type qos match-all control-class
  match cos 5-7
  match dscp 40-63
! QOS policy-map setting qos-group for each class
policy-map type qos marking-policy
  class control-class
    set qos-group 7
  class ROCE-class
    set qos-group 3
  class class-default
    set qos-group 0
! Queuing policy defining priority queue & DWRR queues
policy-map type queuing queuing-policy
  class type queuing c-out-8q-q7
    priority level 1
  class type queuing c-out-8q-q6
    bandwidth remaining percent 0
  class type queuing c-out-8q-q5
    bandwidth remaining percent 0
  class type queuing c-out-8q-q4
    bandwidth remaining percent 0
  class type queuing c-out-8q-q3
    bandwidth remaining percent 80
  class type queuing c-out-8q-q2
    bandwidth remaining percent 0
  class type queuing c-out-8q-q1
    bandwidth remaining percent 0
  class type queuing c-out-8q-q-default
    bandwidth remaining percent 20
! Network QOS policy identifying PFC-eligible traffic
policy-map type network-qos ROCE-NQ-policy
  class type network-qos c-8q-nq7
    mtu 1500
  class type network-qos c-8q-nq3
    pause pfc-cos 3
    mtu 4200
  class type network-qos c-8q-nq-default
    mtu 1500
! Applies network QOS and queuing globally to the switch
system qos
  service-policy type network-qos ROCE-NQ-policy
  service-policy type queuing output queuing-policy
! Enables PFC and applies QOS marking policy on interface(s)
interface Ethernet x/x
  priority-flow-control mode auto
  mtu 9216
  service-policy type qos input marking-policy

In addition, add vlan 3000 to the switchport trunk allowed list for the port-channel connected to FlashBlade.
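With the fabric MTU set to 9216 on the switches and 9000 on the hosts (configured in the next section), the jumbo-frame path can be sanity-checked by sending unfragmentable pings between DGX-1 servers. The helper below is a minimal sketch, not part of the AIRI tooling; the peer address and default MTU are illustrative:

```shell
#!/usr/bin/env bash
# Sketch: build a ping command that verifies jumbo frames survive the fabric.
# ICMP payload = MTU - 20 (IP header) - 8 (ICMP header); -M do forbids fragmentation.
jumbo_check_cmd() {
  local mtu=${1:-9000}
  local peer=${2:-198.18.0.12}   # illustrative peer DGX-1 address
  echo "ping -c 3 -M do -s $((mtu - 28)) ${peer}"
}

# Run the generated command against a peer, e.g.:
#   $(jumbo_check_cmd 9000 198.18.0.12)
```

If the pings succeed without "Frag needed" errors, every hop in the path accepts jumbo frames.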
DGX-1 Configuration

To switch the VPI cards from InfiniBand to Ethernet mode, first start the Mellanox Software Tools (MST) driver set:

$ mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
Loading MST PCI configuration module - Success
Create devices
Unloading MST PCI module (unused) - Success

Then, change the port type to Ethernet (LINK_TYPE = 2):

$ mlxconfig -d /dev/mst/mt4115_pciconf0 set LINK_TYPE_P1=2

Device #1:
----------
Device type:  ConnectX4
PCI device:   /dev/mst/mt4115_pciconf0

Configurations:      Current  New
  LINK_TYPE_P1       1        2

Apply new Configuration? (y/n) [n] : y
Applying... Done!
-I- Please reboot machine to load new configurations.

Repeat for the other three devices:

$ mlxconfig -d /dev/mst/mt4115_pciconf1 set LINK_TYPE_P1=2
$ mlxconfig -d /dev/mst/mt4115_pciconf2 set LINK_TYPE_P1=2
$ mlxconfig -d /dev/mst/mt4115_pciconf3 set LINK_TYPE_P1=2

Reboot the server to complete the change. After rebooting, run:

$ ibv_devinfo

and confirm that all four devices show link_layer: Ethernet.

Configure Network Interfaces

In the following, we use iproute2 commands (e.g., ip link add ...). These are fast to execute and easy to debug, but are not persistent across reboots. Create a persistent version of the network setup either by writing a script that executes the following commands on startup, or by creating the equivalent configuration in /etc/network/interfaces (man 5 interfaces).

The commands below will configure:
• VLAN 3000 virtual interfaces on NICs enp5s0 and enp132s0
• A bond0 active/backup pair over the VLAN 3000 interfaces
• A private 198.18.0.0/24 address for bond0
• MTU 9000 (jumbo frames) on all interfaces

Note that the example below uses 198.18.0.11 for one server. Substitute unique addresses for each server in the cluster. All other settings are common across servers. All commands are assumed to be run as root.
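As noted above, the iproute2 setup is not persistent across reboots; one option is a boot-time script. The sketch below is an illustrative generator, not part of the AIRI tooling: it derives each server's unique bond0 address from a host index (1-4, matching the dgx-X naming) and only prints the commands, so the output can be reviewed before being run as root:

```shell
#!/usr/bin/env bash
# Sketch: emit the per-server network setup commands for DGX-1 host <idx>.
# Interface names match this guide; the 198.18.0.1X addressing is the guide's scheme.
emit_net_setup() {
  local idx=$1   # host index 1..4 -> 198.18.0.11 .. 198.18.0.14
  cat <<EOF
modprobe bonding mode=active-backup miimon=100
ip link add name enp5s0.3000 link enp5s0 type vlan id 3000
ip link add name enp132s0.3000 link enp132s0 type vlan id 3000
ip address add 198.18.0.$((10 + idx))/24 dev bond0
ip link set dev enp5s0 mtu 9000
ip link set dev enp12s0 mtu 9000
ip link set dev enp132s0 mtu 9000
ip link set dev enp139s0 mtu 9000
ip link set dev enp5s0.3000 mtu 9000
ip link set dev enp132s0.3000 mtu 9000
ip link set dev enp5s0 up
ip link set dev enp12s0 up
ip link set dev enp132s0 up
ip link set dev enp139s0 up
ip link set dev enp5s0.3000 up
ip link set dev enp132s0.3000 up
ip link set dev bond0 up
ifenslave bond0 enp5s0.3000 enp132s0.3000
EOF
}

# Review, then apply as root on dgx-1:
#   emit_net_setup 1 | sudo sh
```

Printing the commands first keeps the setup auditable; the same output can be dropped into a startup script for persistence.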
$ modprobe bonding mode=active-backup miimon=100
$ ip link add name enp5s0.3000 link enp5s0 type vlan id 3000
$ ip link add name enp132s0.3000 link enp132s0 type vlan id 3000
$ ip address add 198.18.0.11/24 dev bond0   # unique address per DGX-1
$ ip link set dev enp5s0 mtu 9000
$ ip link set dev enp12s0 mtu 9000
$ ip link set dev enp132s0 mtu 9000
$ ip link set dev enp139s0 mtu 9000
$ ip link set dev enp5s0.3000 mtu 9000
$ ip link set dev enp132s0.3000 mtu 9000
$ ip link set dev enp5s0 up
$ ip link set dev enp12s0 up
$ ip link set dev enp132s0 up
$ ip link set dev enp139s0 up
$ ip link set dev enp5s0.3000 up
$ ip link set dev enp132s0.3000 up
$ ip link set dev bond0 up
$ ifenslave bond0 enp5s0.3000 enp132s0.3000

Configure the VLAN interfaces used for TCP traffic as priority 0 (low) egress traffic:

$ for I in {0..7}; do vconfig set_egress_map enp5s0.3000 $I 0; done
$ for I in {0..7}; do vconfig set_egress_map enp132s0.3000 $I 0; done

And configure the raw VPI devices to use PFC priority 3:

$ mlnx_qos -i enp5s0 --pfc 0,0,0,1,0,0,0,0
$ tc_wrap.py -i enp5s0 -u 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3
$ mlnx_qos -i enp12s0 --pfc 0,0,0,1,0,0,0,0
$ tc_wrap.py -i enp12s0 -u 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3
$ mlnx_qos -i enp132s0 --pfc 0,0,0,1,0,0,0,0
$ tc_wrap.py -i enp132s0 -u 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3
$ mlnx_qos -i enp139s0 --pfc 0,0,0,1,0,0,0,0
$ tc_wrap.py -i enp139s0 -u 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3

TCP Traffic Evaluation

We use iperf3 to exercise the 100Gb TCP VLAN (3000) and ensure all DGX-DGX pairs are operating at full network bandwidth for TCP traffic. Because iperf3 is a single-threaded process, we must use multiple client-server pairs to saturate the 100Gb link.

From one of the DGX-1 systems, run two server processes as daemons. These commands start two listener processes on ports 5201 and 5202. For the example below, assume we run these on the DGX-1 with IP address 198.18.0.11.
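The listener invocations that follow can also be generated for an arbitrary number of parallel streams. This is a small illustrative helper, not part of the guide's tooling; the port numbering is an assumption matching the 5201/5202 example:

```shell
#!/usr/bin/env bash
# Sketch: print N daemonized iperf3 listener commands on consecutive ports.
emit_iperf_servers() {
  local base=${1:-5201} count=${2:-2} p
  for ((p = base; p < base + count; p++)); do
    echo "iperf3 -s -p ${p} -D"
  done
}

# e.g. start two listeners (as in this guide):
#   emit_iperf_servers 5201 2 | sh
```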
$ iperf3 -s -p 5201 -D
$ iperf3 -s -p 5202 -D

On another DGX-1, run the following commands concurrently:

$ iperf3 -c 198.18.0.11 -t 30 -l 1M -p 5201 -R
$ iperf3 -c 198.18.0.11 -t 30 -l 1M -p 5202 -R

These client processes will report total bandwidth, similar to the following:

[ ID] Interval         Transfer    Bandwidth        Retr
[  4] 0.00-30.00 sec   161 GBytes  46.2 Gbits/sec   96     sender
[  4] 0.00-30.00 sec   161 GBytes  46.2 Gbits/sec          receiver

[ ID] Interval         Transfer    Bandwidth        Retr
[  4] 0.00-30.00 sec   174 GBytes  49.9 Gbits/sec   15     sender
[  4] 0.00-30.00 sec   174 GBytes  49.9 Gbits/sec          receiver

As shown, the total bandwidth is 46.2 + 49.9 = 96.1 Gb/s. Repeating this test across all pairs confirms the TCP network is operational and able to saturate the bond as configured.

RDMA Traffic Evaluation

The DGX-1 ships with the OpenFabrics perftest suite for evaluating native IB verbs performance. These tools run exclusively on RDMA-capable networks, supporting both InfiniBand and RDMA-over-Converged-Ethernet physical layers.

On one of the DGX-1 systems – in the example below, this is "dgx-1" – start a server process to run a bandwidth test:

$ ib_send_bw --report_gbits -aF -d mlx5_0

On a second DGX-1 system, run the client:

$ ib_send_bw --report_gbits -aF -d mlx5_0 dgx-1
---------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF          Device          : mlx5_0
 Number of qps   : 1            Transport type  : IB
 Connection type : RC           Using SRQ       : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 1
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------
 local address:  LID 0000 QPN 0x1d26 PSN 0xae3ea9
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:18:01:02
 remote address: LID 0000 QPN 0x2589 PSN 0x554df6
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:18:01:01
---------------------------------------------------------------------------
 #bytes   #iterations  BW peak[Gb/sec]  BW average[Gb/sec]  MsgRate[Mpps]
 2        1000         0.12             0.12                7.411412
 4        1000         0.25             0.23                7.068284
 8        1000         0.52             0.46                7.141944
 16       1000         1.03             0.91                7.140323
 32       1000         1.32             1.31                5.099323
 64       1000         3.97             3.62                7.067015
 128      1000         8.14             8.12                7.930033
 256      1000         15.88            15.24               7.439436
 512      1000         31.55            31.40               7.667043
 1024     1000         54.00            49.16               6.001562
 2048     1000         78.86            73.32               4.475048
 4096     1000         91.97            91.61               2.795752
 8192     1000         93.04            92.99               1.418913
 16384    1000         95.80            95.77               0.730660
 32768    1000         96.57            96.57               0.368391
 65536    1000         97.04            96.85               0.184723
 131072   1000         97.25            97.24               0.092740
 262144   1000         97.43            97.42               0.046453
 524288   1000         97.48            97.48               0.023240
 1048576  1000         97.52            97.52               0.011625
 2097152  1000         97.69            97.69               0.005823
 4194304  1000         97.81            97.80               0.002915
 8388608  1000         97.81            97.81               0.001457
---------------------------------------------------------------------------

The previous pair of commands can be repeated for each of the four devices on each DGX-1 host: mlx5_0, mlx5_1, mlx5_2, mlx5_3. Furthermore, all DGX-1 nodes can be tested pairwise to confirm full RoCE connectivity.
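Enumerating every host pair and device by hand is tedious; a short script can print the full test matrix. This is a sketch only: host names follow the dgx-X convention from the prerequisites, and each printed pair corresponds to one server/client ib_send_bw run as shown above:

```shell
#!/usr/bin/env bash
# Sketch: list every pairwise RoCE bandwidth test across hosts and devices.
hosts=(dgx-1 dgx-2 dgx-3 dgx-4)
devices=(mlx5_0 mlx5_1 mlx5_2 mlx5_3)

list_rdma_tests() {
  local server client dev
  for server in "${hosts[@]}"; do
    for client in "${hosts[@]}"; do
      [ "$server" = "$client" ] && continue
      for dev in "${devices[@]}"; do
        # For each line: run "ib_send_bw --report_gbits -aF -d $dev" on $server,
        # then "ib_send_bw --report_gbits -aF -d $dev $server" on $client.
        echo "$client -> $server ($dev)"
      done
    done
  done
}
```

With four hosts and four devices this yields 48 directed runs; trimming to one direction per pair halves that if time is short.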
FlashBlade Configuration

On the Settings > Network tab of the FlashBlade UI, or using the corresponding CLI commands, create a subnet for 198.18.0.0/24 as follows:

Name: data-net1
Prefix: 198.18.0.0/24
VLAN: 3000
Gateway: (leave empty)
MTU: 9000

Add an interface for data traffic in 198.18.0.0/24 as follows:

Name: data1
Address: 198.18.0.100

From the Storage tab, or using the corresponding CLI commands, create a filesystem to use for storing datasets:

Name: datasets
Protocols: enable NFS

On each of the DGX-1 servers, mount FlashBlade using the following commands:

$ mkdir -p /mnt/datasets
$ mount -t nfs 198.18.0.100:/datasets /mnt/datasets

Note that all NFS mount options are left at the OS defaults, which correspond to a 512kB read and write size with FlashBlade and the local file caching option (fsc) disabled.

ADDITIONAL RESOURCES

• AIRI Github Site
• AIRI Reference Architecture
• AIRI Product Page

APPENDIX A: HOROVOD CONTAINER IMAGE

The following Dockerfile is used to create an image, derived from the base NVIDIA container for TensorFlow, to run Horovod in an OpenMPI environment.

FROM nvcr.io/nvidia/tensorflow:17.12

RUN mkdir /build
WORKDIR /build

RUN apt-get update && apt-get install -y --no-install-recommends \
      libibverbs1 \
      libibverbs-dev \
      libmlx5-1 \
      librdmacm-dev \
      librdmacm1 \
      openssh-client \
      openssh-server \
      file \
    && rm -rf /var/lib/apt/lists/*

ENV OPENMPI_VERSION 3.0.0
ENV OPENMPI_TAR openmpi-${OPENMPI_VERSION}.tar.gz
ENV OPENMPI_URL https://www.open-mpi.org/software/ompi/v3.0/downloads

RUN wget -q -O - ${OPENMPI_URL}/${OPENMPI_TAR} | tar -xzf - && \
    cd openmpi-${OPENMPI_VERSION} && \
    ./configure --enable-orterun-prefix-by-default \
                --with-cuda --with-verbs \
                --prefix=/usr/local/mpi --disable-getpwuid && \
    make -j"$(nproc)" install && \
    cd .. && rm -rf openmpi-${OPENMPI_VERSION}

ENV PATH /usr/local/mpi/bin:$PATH

RUN mkdir -p /var/run/sshd && \
    mkdir -p /root/.ssh && \
    echo "StrictHostKeyChecking no" >> /etc/ssh/ssh_config && \
    echo "UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config && \
    echo "LogLevel quiet" >> /etc/ssh/ssh_config && \
    sed -i 's/^Port 22/Port 2222/' /etc/ssh/sshd_config && \
    echo "Host *" > /root/.ssh/config && \
    echo "Port 2222" >> /root/.ssh/config && \
    ssh-keygen -t rsa -b 4096 -f /root/.ssh/id_rsa -N "" && \
    cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys && \
    chmod 700 /root/.ssh && \
    chmod 600 /root/.ssh/*

RUN export HOROVOD_GPU_ALLREDUCE=NCCL && \
    export HOROVOD_NCCL_INCLUDE=/usr/include && \
    export HOROVOD_NCCL_LIB=/usr/lib/x86_64-linux-gnu && \
    ln -s /usr/local/cuda/lib64/stubs/libcuda.so ./libcuda.so.1 && \
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$PWD && \
    pip2 install --no-cache-dir horovod && \
    rm ./libcuda.so.1

RUN ldconfig

EXPOSE 2222

© 2018 Pure Storage, Inc. All rights reserved. AIRI, the AIRI logo, Pure Storage, the P Logo, and FlashBlade are trademarks or registered trademarks of Pure Storage, Inc. in the U.S. and other countries. NVIDIA, DGX-1, and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation. All other trademarks are registered marks of their respective owners.

The Pure Storage and NVIDIA products and programs described in this documentation are distributed under a license agreement restricting the use, copying, distribution, and decompilation/reverse engineering of the products. No part of this documentation may be reproduced in any form by any means without prior written authorization from Pure Storage, Inc. and its licensors, if any. Pure Storage and NVIDIA may make improvements and/or changes in the Pure Storage and NVIDIA products and/or the programs described in this documentation at any time without notice.
THIS DOCUMENTATION IS PROVIDED "AS IS" AND ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS AND WARRANTIES, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT, ARE DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE LEGALLY INVALID. PURE STORAGE SHALL NOT BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES IN CONNECTION WITH THE FURNISHING, PERFORMANCE, OR USE OF THIS DOCUMENTATION. THE INFORMATION CONTAINED IN THIS DOCUMENTATION IS SUBJECT TO CHANGE WITHOUT NOTICE.