Onload User Guide
SF-104474-CD Issue 15 (2013)
Copyright © 2013 SOLARFLARE Communications, Inc. All rights reserved. The software and hardware as applicable (the “Product”) described in this document, and this document, are protected by copyright laws, patents and other intellectual property laws and international treaties. The Product described in this document is provided pursuant to a license agreement, evaluation agreement and/or non-disclosure agreement. The Product may be used only in accordance with the terms of such agreement. The software as applicable may be copied only in accordance with the terms of such agreement.

Onload is licensed under the GNU General Public License (Version 2, June 1991). See the LICENSE file in the distribution for details. The Onload Extensions Stub Library is licensed under the BSD 2-Clause License.

Onload contains algorithms and uses hardware interface techniques which are subject to Solarflare Communications Inc patent applications. Parties interested in licensing Solarflare's IP are encouraged to contact Solarflare's Intellectual Property Licensing Group at:

Director of Intellectual Property Licensing
Intellectual Property Licensing Group
Solarflare Communications Inc
7505 Irvine Center Drive, Suite 100
Irvine, California 92618

You will not disclose to a third party the results of any performance tests carried out using Onload or EnterpriseOnload without the prior written consent of Solarflare.

The furnishing of this document to you does not give you any rights or licenses, express or implied, by estoppel or otherwise, with respect to any such Product, or any copyrights, patents or other intellectual property rights covering such Product, and this document does not contain or represent any commitment of any kind on the part of SOLARFLARE Communications, Inc. or its affiliates. The only warranties granted by SOLARFLARE Communications, Inc.
or its affiliates in connection with the Product described in this document are those expressly set forth in the license agreement, evaluation agreement and/or non-disclosure agreement pursuant to which the Product is provided. EXCEPT AS EXPRESSLY SET FORTH IN SUCH AGREEMENT, NEITHER SOLARFLARE COMMUNICATIONS, INC. NOR ITS AFFILIATES MAKE ANY REPRESENTATIONS OR WARRANTIES OF ANY KIND (EXPRESS OR IMPLIED) REGARDING THE PRODUCT OR THIS DOCUMENTATION AND HEREBY DISCLAIM ALL IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT, AND ANY WARRANTIES THAT MAY ARISE FROM COURSE OF DEALING, COURSE OF PERFORMANCE OR USAGE OF TRADE.

Unless otherwise expressly set forth in such agreement, to the extent allowed by applicable law (a) in no event shall SOLARFLARE Communications, Inc. or its affiliates have any liability under any legal theory for any loss of revenues or profits, loss of use or data, or business interruptions, or for any indirect, special, incidental or consequential damages, even if advised of the possibility of such damages; and (b) the total liability of SOLARFLARE Communications, Inc. or its affiliates arising from or relating to such agreement or the use of this document shall not exceed the amount received by SOLARFLARE Communications, Inc. or its affiliates for that copy of the Product or this document which is the subject of such liability. The Product is not intended for use in medical, life saving, life sustaining, critical control or safety systems, or in nuclear facility applications.

SF-104474-CD Issue 15 © Solarflare Communications 2013

Table of Contents

Chapter 1: What’s New ... 1
Chapter 2: Low Latency Quickstart Guide ... 4
Chapter 3: Background ... 12
    3.1 Introduction ... 12
Chapter 4: Installation ... 16
    4.1 Introduction ... 16
    4.2 Onload Distributions ... 16
    4.3 Hardware and Software Supported Platforms ... 17
    4.4 Onload and the Network Adapter Driver ... 17
    4.5 Pre-install Notes ... 17
    4.6 EnterpriseOnload - Build and Install from SRPM ... 18
    4.7 OpenOnload DKMS Installation ... 19
    4.8 Build OpenOnload Source RPM ... 19
    4.9 OpenOnload - Installation ... 20
    4.10 Onload Kernel Modules ... 22
    4.11 Configuring the Network Interfaces ... 22
    4.12 Installing Netperf ... 23
    4.13 How to run Onload ... 23
    4.14 Testing the Onload Installation ... 23
    4.15 Apply an Onload Patch ... 23
Chapter 5: Tuning Onload ... 22
    5.1 Introduction ... 22
    5.2 System Tuning ... 22
    5.3 Standard Tuning ... 26
    5.4 Performance Jitter ... 28
    5.5 Advanced Tuning ... 30
Chapter 6: Features & Functionality ... 34
    6.1 Introduction ... 34
    6.2 Onload Transparency ... 34
    6.3 Onload Stacks ... 34
    6.4 Virtual Network Interface (VNIC) ... 34
    6.5 Functional Overview ... 35
    6.6 Onload with Mixed Network Adapters ... 35
    6.7 Maximum Number of Network Interfaces ... 35
    6.8 Onloaded PIDs ... 35
    6.9 Onload and File Descriptors, Stacks and Sockets ... 36
    6.10 System calls intercepted by Onload ... 36
    6.11 Linux Sysctls ... 36
    6.12 Changing Onload Control Plane Table Sizes ... 38
    6.13 TCP Operation ... 39
    6.14 UDP Operation ... 48
    6.15 User Level recvmmsg for UDP ... 52
    6.16 User-Level sendmmsg for UDP ... 52
    6.17 Multiplexed I/O ... 52
    6.18 Stack Sharing ... 55
    6.19 Multicast Replication ... 56
    6.20 Multicast Operation and Stack Sharing ... 56
    6.21 Multicast Loopback ... 59
    6.22 Bonding, Link aggregation and Failover ... 59
    6.23 VLANS ... 60
    6.24 Accelerated pipe() ... 60
    6.25 Zero-Copy API ... 61
    6.26 Receive Filtering ... 61
    6.27 Packet Buffers ... 61
    6.28 Programmed I/O ... 70
    6.29 Templated Sends ... 71
    6.30 Debug and Logging ... 72
Chapter 7: Limitations ... 73
    7.1 Introduction ... 73
    7.2 Changes to Behavior ... 73
    7.3 Limits to Acceleration ... 74
    7.4 epoll - Known Issues ... 78
    7.5 Configuration Issues ... 80
Chapter 8: Change History ... 85
    8.1 Features ... 85
    8.2 Environment Variables ... 87
    8.3 Module Options ... 91
Appendix A: Parameter Reference ... 93
Appendix B: Meta Options ... 121
Appendix C: Build Dependencies ... 122
Appendix D: Onload Extensions API ... 112
Appendix E: onload_stackdump ... 137
Appendix F: Solarflare sfnettest ... 156
Appendix G: onload_tcpdump ... 164
Appendix H: ef_vi ... 166
Appendix I: onload_iptables ... 174
Appendix J: Solarflare efpio Test Application ... 180

Chapter 1: What’s New

This section identifies new features introduced in the OpenOnload 201310 release, the OpenOnload 201310-u1 release, and the latest additions to this user guide. This version of the user guide introduces the Solarflare Flareon™ Ultra SFN7122F and SFN7322F Dual-Port 10GbE PCIe 3.0 Server I/O Adapters. This user guide is also applicable to EnterpriseOnload v2.1.0.3 - refer to Change History on page 85 to confirm feature availability in the Enterprise release.
For a complete list of features and enhancements refer to the release notes and the release change log available from http://www.openonload.org/download.html.

New Features 201310-u1

Receive Packet Hardware Timestamping

The SO_TIMESTAMPING socket option allows an application to recover hardware-generated timestamps for received packets. This is supported for Onload applications running on the Solarflare SFN7000 series adapters. For details refer to SO_TIMESTAMPING (hardware timestamps) on page 50. The Onload environment variable EF_RX_TIMESTAMPING controls whether hardware timestamps are enabled on all received packets. Refer to Appendix A: Parameter Reference on page 93 for details.

Low Latency Transmit

The Onload environment variable EF_TX_PUSH_THRESHOLD is used together with EF_TX_PUSH to improve the performance of the low-latency transmit feature. This feature is supported on SFN7000 series adapters. For details of this variable refer to Appendix A: Parameter Reference on page 93.

Onload Extensions API - Java Native Interface

The JNI wrapper provides access to the Onload Extensions API from within Java. Users should refer to the README file and Java source files in the /onload-/src/tools/jni directory.

Onload Extensions API - MSG_WARM Detection

Onload versions before onload-201310 do not support the ONLOAD_MSG_WARM flag, and using the flag on TCP send calls will cause the ‘warm’ packet to actually be sent on the wire. The onload_fd_check_feature() function can be called before using the ONLOAD_MSG_WARM flag on TCP send calls to ensure that the Onload socket being used does support the ONLOAD_MSG_WARM feature. For details refer to onload_fd_check_feature() on page 114.

Adapter Net Driver

The OpenOnload 201310-u1 release includes an updated net driver. The v4.0.2.6628 driver supports all Solarflare adapters.
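The Onload environment variables introduced above (EF_RX_TIMESTAMPING, EF_TX_PUSH, EF_TX_PUSH_THRESHOLD) are set in the environment of the shell that launches the accelerated application. The fragment below is an illustrative sketch only: the application name ./my_app is hypothetical, and the values shown are examples rather than recommendations (see Appendix A: Parameter Reference for defaults).

```shell
# Illustrative only: enable hardware receive timestamps and the
# low-latency transmit push feature for one accelerated application.
# Variable values are examples; consult Appendix A before tuning.
export EF_RX_TIMESTAMPING=1
export EF_TX_PUSH=1
onload --profile=latency ./my_app
```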
New Features 201310

The OpenOnload 201310 release includes new features to support the Solarflare Flareon Ultra SFN7122F and SFN7322F adapters.

Multicast Replication

The SFN7000 series adapters support multicast replication, where packets are replicated in hardware and delivered to multiple receive queues. This feature allows any number of Onload clients listening to the same multicast data stream to receive a copy of the received packets without the need to share Onload stacks. See Multicast Replication on page 56 for details.

Large Buffer Table Support

The SFN7000 series adapters support large buffer table entries which, for many applications, can help eliminate the limitations on the number of packet buffers that existed on previous generation Solarflare adapters. Refer to Large Buffer Table Support on page 62 for further details.

PIO

The SFN7000 series adapters support programmed input/output (PIO) on the transmit path, where packets are pushed directly to the adapter by the CPU. See Programmed I/O on page 70.

Templated Sends

Building on top of PIO, the templated sends extensions API can further improve transmit latency using a packet template which contains the bulk of the data to be sent. The template is created on the adapter and populated with the remaining data immediately before sending. For more information about templated sends and the use of packet templates see Templated Sends on page 71.

ONLOAD_MSG_WARM

The ONLOAD_MSG_WARM flag, when used on TCP send calls, will ‘exercise’ the TCP fast send path to maintain low latency performance on TCP connections that rarely send data. See ONLOAD_MSG_WARM on page 46 for details.

Onload Handling dup3()

A system call to dup3() is now intercepted in Onload-201310.

Adapter Net Driver

The OpenOnload 201310 release includes an updated net driver. The v4.0.0.6585 driver supports all Solarflare adapters.
Change History

The Change History section is updated with every revision of this document to include the latest Onload features, changes or additions to environment variables, and changes or additions to onload module options. Refer to Change History on page 85.

Documentation Changes

Appendix J: Solarflare efpio Test Application on page 180 describes the Solarflare layer 2 benchmark efpio, with command line examples and options.

Chapter 2: Low Latency Quickstart Guide

Introduction

This section demonstrates how to achieve very low latency coupled with minimum jitter on a system fitted with the Solarflare SFN7122F network adapter and using Solarflare’s kernel-bypass network acceleration middleware, OpenOnload. The procedure focuses on the performance of the network adapter for TCP and UDP applications running on Linux, using the industry-standard Netperf network benchmark application and the Solarflare supplied open source sfnettest network benchmark suite. Please read the Solarflare LICENSE file regarding the disclosure of benchmark test results.

Software Installation

Before running low latency benchmark tests, ensure that the correct driver and firmware versions are installed (minimum driver and firmware versions are shown):

[root@server-N]# ethtool -i eth3
driver: sfc
version: 4.0.0.6585
firmware-version: 4.0.0.6585

Netperf

Netperf can be downloaded from http://www.netperf.org/netperf/

Unpack the compressed tar file using the tar command:

[root@system-N]# tar -zxvf netperf-<version>.tar.gz

This creates a sub-directory called netperf-<version> from which the configure and make commands can be run (as root):

./configure
make install

Following installation, the netperf and netserver applications are located in the src subdirectory.
Solarflare sfnettest

Download the sfnettest-<version>.tgz source file from www.openonload.org

Unpack the tar file using the tar command:

[root@system-N]# tar -zxvf sfnettest-<version>.tgz

Run the make utility from the sfnettest-<version>/src subdirectory to build the sfnt-pingpong application.

Solarflare Onload

Before the Onload network and kernel drivers can be built and installed, the system must support a build environment capable of compiling kernel modules. Refer to Appendix C: Build Dependencies in the Onload User Guide for more details.

Download the onload-<version>.tgz file from www.openonload.org

Unpack the tar file using the tar command:

[root@system-N]# tar -zxvf onload-<version>.tgz

Run the onload_install command from the onload-<version>/scripts subdirectory:

[root@system-N]# ./onload_install

Test Setup

The diagram below identifies the required physical configuration of two servers equipped with Solarflare network adapters, connected back-to-back in order to measure the latency of the adapter, drivers and acceleration middleware. If required, tests can be repeated with a 10G switch on the link to measure the additional latency delta introduced by a particular switch.

Requirements:
• Two servers are equipped with Solarflare network adapters and connected with a single cable between the Solarflare interfaces.
• The Solarflare interfaces are configured with an IP address so that traffic can pass between them. Use ping to verify the connection.
• Onload, netperf and sfnettest are installed on both machines.

Pre-Test Configuration

On both machines:

1 Isolate the CPU cores that will be used from the general SMP balancing and scheduler algorithms. Add the following option to the kernel line in /boot/grub/grub.conf:

isolcpus=

2 Stop the cpuspeed service to prevent power saving modes from reducing CPU clock speed.
[root@system-N]# service cpuspeed stop

3 Stop the irqbalance service to prevent the OS from rebalancing interrupts between available CPU cores.

[root@system-N]# service irqbalance stop

4 Stop the iptables service to eliminate overheads incurred by the firewall. Solarflare recommend this step on RHEL6 for improved latency when using the kernel network driver.

[root@system-N]# service iptables stop

5 Disable interrupt moderation.

[root@system-N]# ethtool -C ethN rx-usecs 0 adaptive-rx off

where N is the identifier of the Solarflare adapter ethernet interface, e.g. eth4.

6 Refer to the Reference System Specification below for BIOS features.

Reference System Specification

The following latency measurements were recorded on twin Intel® Sandy Bridge servers. The specification of the test systems is as follows:

• DELL PowerEdge R210 servers equipped with Intel® Xeon® CPU E3-1280 @ 3.60GHz, 2 x 2GB DIMMs.
• BIOS: Turbo mode ENABLED, cstates DISABLED, IOMMU DISABLED.
• Red Hat Enterprise Linux V6.4 (x86_64 kernel, version 2.6.32-358.el6.x86_64).
• Solarflare SFN7122F NIC (driver and firmware - see Software Installation). Direct attach cable at 10G.
• OpenOnload distribution: openonload-201310.

It is expected that similar results will be achieved on any Intel based, PCIe Gen 3 server or compatible system.

UDP Latency: Netperf

Run the netserver application on system-1:

[root@system-1]# pkill -f netserver
[root@system-1]# onload --profile=latency taskset -c 0 ./netserver

Run the netperf application on system-2:

[root@system-2]# onload --profile=latency taskset -c 0 ./netperf -t UDP_RR -H <system-1-ip> -l 10 -- -r 32

Socket Size      Request  Resp.  Elapsed  Trans.
Send     Recv    Size     Size   Time     Rate
bytes    Bytes   bytes    bytes  secs.    per sec
229376   229376  32       32     10.00    294170.83

294170 transactions/second means that each transaction takes 1/294170 seconds, resulting in an RTT/2 latency of (1/294170)/2, or 1.69μs.

UDP Latency: sfnt-pingpong

Run the sfnt-pingpong application on both systems:

[root@system-1]# onload --profile=latency taskset -c 0 ./sfnt-pingpong
[root@system-2]# onload --profile=latency taskset -c 0 ./sfnt-pingpong --affinity "0;0" udp <system-1-ip>

# size  mean  min   median  max    %ile  stddev  iter
0       1648  1603  1642    12340  1742  48      904000
1       1643  1598  1638    14532  1732  46      907000
2       1644  1600  1638    9641   1734  42      907000
4       1643  1598  1638    13210  1731  43      907000
8       1646  1600  1640    8707   1736  44      905000
16      1635  1591  1629    9753   1724  41      911000
32      1672  1612  1666    11198  1756  48      891000
64      1706  1649  1701    9638   1796  42      873000
128     1824  1744  1820    8620   1938  53      817000
256     1981  1911  1975    8238   2090  48      753000

The output identifies the mean, minimum, median and maximum (nanosecond) RTT/2 latency for increasing UDP message sizes, including the 99th percentile and standard deviation for these results. A message size of 32 bytes has a mean latency of 1.67μs, with a 99%ile latency under 1.8μs.

TCP Latency: Netperf

Run the netserver application on system-1:

[root@system-1]# pkill -f netserver
[root@system-1]# onload --profile=latency taskset -c 0 ./netserver

Run the netperf application on system-2:

[root@system-2]# onload --profile=latency taskset -c 0 ./netperf -t TCP_RR -H <system-1-ip> -l 10 -- -r 32

Socket Size      Request  Resp.  Elapsed  Trans.
Send     Recv    Size     Size   Time     Rate
bytes    Bytes   bytes    bytes  secs.    per sec
16384    87380   32       32     10.00    271057.10

271057 transactions/second means that each transaction takes 1/271057 seconds, resulting in an RTT/2 latency of (1/271057)/2, or 1.84μs.
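The arithmetic used above to turn a netperf transaction rate into one-way latency can be checked with a small helper (a sketch; the rates are the values reported by netperf in this section):

```shell
# Convert a netperf request/response transaction rate (transactions
# per second) into RTT/2 in microseconds: one transaction is one
# full round trip, so halve the per-transaction time.
half_rtt_us() {
    awk -v rate="$1" 'BEGIN { printf "%.2f\n", 1e6 / (rate * 2) }'
}
half_rtt_us 294170.83   # UDP_RR rate from above
half_rtt_us 271057.10   # TCP_RR rate from above
```

Rounded to two decimal places this gives 1.70μs for the UDP case; the 1.69μs quoted above truncates rather than rounds the third decimal.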
TCP Latency: sfnt-pingpong

Run the sfnt-pingpong application on both systems:

[root@system-1]# onload --profile=latency taskset -c 0 ./sfnt-pingpong
[root@system-2]# onload --profile=latency taskset -c 0 ./sfnt-pingpong --affinity "0;0" tcp <system-1-ip>

# size  mean  min   median  max     %ile  stddev  iter
1       1810  1744  1795    15775   1959  62      823000
2       1809  1744  1793    11080   1961  60      824000
4       1812  1743  1797    12595   1964  60      822000
8       1810  1745  1794    11773   1959  58      824000
16      1816  1746  1801    9081    1967  56      821000
32      1835  1767  1820    8049    1989  58      812000
64      1900  1804  1886    9764    2065  67      785000
128     2000  1907  1985    8440    2167  67      746000
256     2139  2057  2123    482577  2304  578     698000

The output identifies the mean, minimum, median and maximum (nanosecond) RTT/2 latency for increasing TCP packet sizes, including the 99th percentile and standard deviation for these results. A message size of 32 bytes has a mean latency of 1.83μs, with a 99%ile latency under 2.0μs.

Layer 2 ef_vi Latency

The efpio UDP test application, supplied with the onload-201310 package, can be used to measure the latency of the Solarflare ef_vi layer 2 API. efpio uses PIO. Using the same back-to-back configuration described above, efpio latency tests were recorded on DELL PowerEdge R210 servers:

# ef_vi_version_str: 201306-7122preview2
# udp payload len: 28
# iterations: 100000
# frame len: 70
round-trip time: 2.65 μs (1.32 RTT/2)

Appendix J: Solarflare efpio Test Application on page 180 describes the efpio application and its command line options, and provides example command lines.

Comparative Data

Dual Package Server

The latency tests recorded in this document were conducted on a single CPU package server. Results may differ between server types, and may be different on servers having multiple NUMA nodes. Many factors influence the latency on a server, so some experimentation is required to identify the CPU core producing the lowest latency.
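One way to carry out that experimentation is to sweep the core used for pinning. The loop below is a dry-run sketch that only prints candidate command lines (the sfnt-pingpong invocation is the one used earlier in this chapter; the core list is a hypothetical example and should match the cores isolated with isolcpus):

```shell
# Dry run: print one benchmark command per candidate CPU core so each
# can be tried in turn and the lowest-latency core identified.
for core in 0 1 2 3; do
    echo "onload --profile=latency taskset -c $core ./sfnt-pingpong"
done
```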
To enable comparison, the latency benchmark tests were repeated with a pair of dual CPU package servers having the following specification:

• DELL PowerEdge R620 servers equipped with Intel® Xeon® CPU E5-2690 @ 2.90GHz, 4 x 8GB DIMMs.
• BIOS: Turbo mode ENABLED, cstates DISABLED, IOMMU DISABLED.
• Red Hat Enterprise Linux V6.4 (x86_64 kernel, version 2.6.32-358.el6.x86_64).
• Solarflare SFN7122F NIC (see Software Installation above). Direct attach cable at 10G.
• OpenOnload distribution: openonload-201310.

Table 1: Dual CPU Server Data

Application     UDP μs  TCP μs
Netperf         1.83    1.99
efpio           1.42    N/A
sfnt-pingpong   1.78    1.97

Adapter Comparison

The following table shows a comparison between latency tests conducted on the SFN6122F and the SFN7122F adapters. Values shown are RTT/2 in microseconds.

Table 2: Latency Tests - Comparative Data

Test                    SFN6122F  SFN7122F  Latency gain
UDP                     2.2       1.6       27%
TCP                     2.4       1.8       25%
ef_vi UDP (efpingpong)  -         2.0       -
ef_vi UDP (efpio)       -         1.3       40%

Testing Without Onload

The benchmark performance tests can be run without Onload using the regular kernel network drivers. To do this, remove the onload --profile=latency part from the command line. To get the best response and comparable latency results using kernel drivers, Solarflare recommend setting interrupt affinity such that interrupts and the application run on different CPU cores but on the same processor package - examples below.

Use the following command to identify the receive queues created for an interface, e.g:

# cat /proc/interrupts | grep eth2
33: 0 0 0 0 IR-PCI-MSI-edge eth2-0
34: 0 0 0 0 IR-PCI-MSI-edge eth2-1

Direct IRQ 33 to CPU core 0 and IRQ 34 to CPU core 1:

# echo 1 > /proc/irq/33/smp_affinity
# echo 2 > /proc/irq/34/smp_affinity

Latency figures are shown in microseconds.
          RHEL 6.4 (TCP)    RHEL 6.4 (UDP)
          Mean    99%ile    Mean    99%ile
          6.3     7.0       5.4     5.8

Further Information

For installation of Solarflare adapters and performance tuning of the network driver when not using Onload, refer to the Solarflare Server Adapter User Guide (SF-103837-CD) available from https://support.solarflare.com/

The Onload feature set and detailed performance tuning information can be found in the Onload User Guide (SF-104474-CD) available from https://support.solarflare.com/

Questions regarding Solarflare products, Onload and this user guide can be emailed to support@solarflare.com.

Chapter 3: Background

3.1 Introduction

NOTE: This guide should be read in conjunction with the Solarflare Server Adapter User Guide, SF-103837-CD, which describes procedures for hardware and software installation of Solarflare network interface cards, network device drivers and related software.

NOTE: Throughout this user guide the term Onload refers to both OpenOnload and EnterpriseOnload unless otherwise stated.

Onload is the Solarflare accelerated network middleware. It is an implementation of TCP and UDP over IP which is dynamically linked into the address space of user-mode applications, and granted direct (but safe) access to the network-adapter hardware. The result is that data can be transmitted to and received from the network directly by the application, without involvement of the operating system. This technique is known as 'kernel bypass'. Kernel bypass avoids disruptive events such as system calls, context switches and interrupts, and so increases the efficiency with which a processor can execute application code. This also directly reduces the host processing overhead, typically by a factor of two, leaving more CPU time available for application processing.
This effect is most pronounced for applications which are network intensive, such as:

• Market-data and trading applications
• Computational fluid dynamics (CFD)
• HPC (High Performance Computing)
• HPMPI (High Performance Message Passing Interface); Onload is compatible with MPICH1 and 2, HPMPI, OpenMPI and SCALI
• Other physical models which are moderately parallelizable
• High-bandwidth video-streaming
• Web-caching, load-balancing and Memcached applications
• Other system hot-spots such as distributed lock managers or forced serialization points

The Onload library dynamically links with the application at runtime using the standard BSD sockets API, meaning that no modifications are required to the application being accelerated. Onload is the first and only product to offer full kernel bypass for POSIX socket-based applications over TCP/IP and UDP/IP protocols.

Contrasting with Conventional Networking

When using conventional networking, an application calls on the OS kernel to send and receive data to and from the network. Transitioning from the application to the kernel is an expensive operation, and can be a significant performance barrier. When an application accelerated using Onload needs to send or receive data, it need not access the operating system, but can directly access a partition on the network adapter. The two schemes are shown in Figure 1.

Figure 1: Contrast with Conventional Networking

An important feature of the conventional model is that applications do not get direct access to the networking hardware and so cannot compromise system integrity. Onload is able to preserve system integrity by partitioning the NIC at the hardware level into many protected 'Virtual NICs' (VNICs). An application can be granted direct access to a VNIC without the ability to access the rest of the system (including other VNICs or memory that does not belong to the application).
Thus Onload with a Solarflare NIC allows optimum performance without compromising security or system integrity.

In summary, Onload can significantly reduce network processing overheads. As an example, the table below shows application-to-application latency (1/2 RTT) being reduced by more than 60% relative to the regular kernel stack.

Back-to-Back 1/2 RTT latency    TCP (us)    UDP (us)
Solarflare (Onload)             1.83        1.67
Solarflare (Kernel)             6.3         5.4

How Onload Increases Performance

Onload can significantly reduce the costs associated with networking by reducing CPU overheads and improving performance for latency, bandwidth and application scalability.

Overhead

Transitioning into and out of the kernel from a user-space application is a relatively expensive operation: the equivalent of hundreds or thousands of instructions. With conventional networking such a transition is required every time the application sends and receives data. With Onload, the TCP/IP processing can be done entirely within the user process, eliminating expensive application/kernel transitions, i.e. system calls. In addition, the Onload TCP/IP stack is highly tuned, offering further overhead savings. The overhead savings of Onload mean more of the CPU's computing power is available to the application to do useful work.

Latency

Conventionally, when a server application is ready to process a transaction it calls into the OS kernel to perform a 'receive' operation, where the kernel puts the calling thread 'to sleep' until a request arrives from the network. When such a request arrives, the network hardware 'interrupts' the kernel, which receives the request and 'wakes' the application. All of this overhead takes CPU cycles as well as increasing cache and translation lookaside buffer (TLB) footprint. With Onload, the application can remain at user level, waiting for requests to arrive at the network adapter, and process them directly.
The elimination of a kernel-to-user transition, an interrupt, and a subsequent user-to-kernel transition can significantly reduce latency. In short, reduced overheads mean reduced latency.

Bandwidth

Because Onload imposes less overhead, it can process more bytes of network traffic every second. Along with specially tuned buffering and algorithms designed for 10 gigabit networks, Onload allows applications to achieve significantly improved bandwidth.

Scalability

Modern multi-core systems are capable of running many applications simultaneously. However, the advantages can be quickly lost when the multiple cores contend on a single resource, such as locks in a kernel network stack or device driver. These problems are compounded on modern systems with multiple caches across many CPU cores and Non-Uniform Memory Architectures. With Onload, the network adapter is partitioned and each partition is accessed by an independent copy of the TCP/IP stack. The result is that with Onload, doubling the cores really can result in doubled throughput, as demonstrated by Figure 2.

Figure 2: Onload Partitioned Network Adapter

Further Information

For detailed information about Onload operation and functionality refer to Features & Functionality on page 34.
Chapter 4: Installation

4.1 Introduction

This chapter covers the following topics:

• Onload Distributions...Page 16
• Hardware and Software Supported Platforms...Page 17
• Onload and the Network Adapter Driver...Page 17
• Pre-install Notes...Page 17
• EnterpriseOnload - Build and Install from SRPM...Page 18
• OpenOnload DKMS Installation...Page 19
• Build OpenOnload Source RPM...Page 19
• OpenOnload - Installation...Page 20
• Onload Kernel Modules...Page 22
• Configuring the Network Interfaces...Page 22
• Installing Netperf...Page 23
• Testing the Onload Installation...Page 23
• Apply an Onload Patch...Page 23

4.2 Onload Distributions

Onload is available in two distributions:

• "OpenOnload" is a free version of Onload, available from http://www.openonload.org/ and distributed as a source tarball under the GPLv2 license. OpenOnload is subject to a linear development cycle where major releases every 3-4 months include the latest development features.

• "EnterpriseOnload" is a commercial enterprise version of Onload, distributed as a source RPM under the GPLv2 license. EnterpriseOnload differs from OpenOnload in that it is offered as a mature commercial product, downstream from OpenOnload, which has undergone a comprehensive software product test cycle resulting in tested, hardened and validated code.

The Solarflare product range offers a flexible and broad range of support options; users should consult their reseller for details and refer to the Solarflare Enterprise Service and Support information at http://www.solarflare.com/Enterprise-Service-Support.

4.3 Hardware and Software Supported Platforms

• Onload is supported for all Solarflare Flareon Adapters, Onload Network Adapters, Solarflare mezzanine adapters and the SFA6902F ApplicationOnload™ Engine.
Refer to the Solarflare Server Adapter User Guide, 'Product Specifications', for details.

• Onload can be installed on the following x86 32-bit and 64-bit platforms:

  Red Hat Enterprise Linux 5, 6 and Red Hat MRG, MRG 2 update 3
  SUSE Linux Enterprise Server 10, 11 and SLERT

  While the Onload QA test cycle predominantly focuses on the Linux OS versions documented above, Solarflare are not aware of any issues preventing Onload installation on other Linux variants such as Ubuntu, CentOS, Gentoo, Fedora and Debian.

• All Intel and AMD x86 processors.

4.4 Onload and the Network Adapter Driver

The Solarflare network adapter driver, the "net driver", is generally available from three sources:

• Download as source RPM from support.solarflare.com.
• Packaged 'in box' in many Linux distributions, e.g. Red Hat Enterprise Linux.
• Packaged in the OpenOnload distribution.

When using Onload you must use the adapter driver distributed with that version of Onload.

4.5 Pre-install Notes

NOTE: If Onload is to accelerate a 32-bit application on a 64-bit architecture, the 32-bit libc development headers should be installed before building Onload. Refer to Appendix C for install instructions.

NOTE: You must remove any existing Solarflare RPM driver packages before installing Onload.

NOTE: When migrating between Onload versions or between OpenOnload and EnterpriseOnload, a previously installed version must first be removed using the onload_uninstall command.

NOTE: The Solarflare drivers are currently classified as unsupported in SLES 11; the certification process is underway. To overcome this, add 'allow_unsupported_modules 1' to the /etc/modprobe.d/unsupported-modules file.

4.6 EnterpriseOnload - Build and Install from SRPM

The following steps identify the procedures to build and install EnterpriseOnload. SRPMs can be built by the 'root' or 'non-root' user, but the user must have superuser privileges to install RPMs.
Customers should contact their Solarflare customer representative for access to the EnterpriseOnload SRPM resources.

Build the RPM

NOTE: Refer to Appendix C for details of build dependencies.

As root:

rpmbuild --rebuild enterpriseonload-<version>.src.rpm

Or as a non-root user: it is advised to use _topdir to ensure that RPMs are built into a directory to which the user has permissions. The directory structure must pre-exist for the rpmbuild command to succeed.

mkdir -p /tmp/myrpm/{SOURCES,BUILD,RPMS,SRPMS}
rpmbuild --define "_topdir /tmp/myrpm" \
  --rebuild enterpriseonload-<version>.src.rpm

NOTE: On some non-standard kernels the rpmbuild might fail because of build dependencies. In this event retry, adding the --nodeps option to the command line.

Building the source RPM will produce 2 binary RPM files which can be found in:

• the /usr/src/*/RPMS/ directory
• or, when built by a non-root user, in _topdir/RPMS
• or, when _topdir was defined in the rpmbuild command line, in /tmp/myrpm/RPMS/x86_64/

For example, the EnterpriseOnload user-space components:

/usr/src/redhat/RPMS/x86_64/enterpriseonload-<version>.x86_64.rpm

and the EnterpriseOnload kernel components:

/usr/src/redhat/RPMS/x86_64/enterpriseonload-kmod-2.6.18-92.el5-<version>.x86_64.rpm

Install the EnterpriseOnload RPM

The EnterpriseOnload RPM and the kernel RPM must be installed for EnterpriseOnload to function correctly:

rpm -ivh enterpriseonload-<version>.x86_64.rpm
rpm -ivh enterpriseonload-kmod-2.6.18-92.el5-<version>.x86_64.rpm

NOTE: EnterpriseOnload is now installed but the kernel modules are not yet loaded.

NOTE: The EnterpriseOnload kmod filename is specific to the kernel that it is built for.

Installing the EnterpriseOnload Kernel Module

This will load the EnterpriseOnload kernel driver and other driver dependencies and create any device nodes needed for EnterpriseOnload drivers and utilities. The command should be run as root.
/etc/init.d/openonload start

Following successful execution this command produces no output, but the 'onload' script will identify that the kernel module is now loaded:

onload
EnterpriseOnload
Copyright 2006-2013 Solarflare Communications, 2002-2005 Level 5 Networks
Built: Oct 15 2013 09:19:23 (release)
Kernel module:

NOTE: At this point EnterpriseOnload is loaded, but until the network interface has been configured and brought into service EnterpriseOnload will be unable to accelerate traffic.

4.7 OpenOnload DKMS Installation

OpenOnload from version 201210 is available in DKMS RPM format. The OpenOnload DKMS distribution package is not available directly from www.openonload.org; users who wish to install from DKMS should send an email to support@solarflare.com.

4.8 Build OpenOnload Source RPM

A source RPM can be built from the OpenOnload distribution tar file.

1 Download the required tar file from the following location:
  http://www.openonload.org/download.html
  Copy the file to a directory on the machine where the source RPM is to be created.

2 As root, execute the following command:

  rpmbuild -ts openonload-<version>.tgz
  Wrote: /root/rpmbuild/SRPMS/openonload-<version>.src.rpm

The output identifies the location of the source RPM. Use the -ta option to get a binary RPM.

4.9 OpenOnload - Installation

The following procedure demonstrates how to download, untar and install OpenOnload.

Download and untar OpenOnload

1 Download the required tar file from the following location:
  http://www.openonload.org/download.html
  The compressed tar file (.tgz) should be downloaded/copied to a directory on the machine on which it will be installed.

2 As root, unpack the tar file using the tar command.
tar -zxvf openonload-<version>.tgz

This will unpack the tar file and, within the current directory, create a sub-directory called openonload-<version>, which contains other sub-directories including the scripts directory from which subsequent install commands can be run.

Building and Installing OpenOnload

NOTE: Refer to Appendix C for details of build dependencies.

The following command will build and install OpenOnload and the required drivers in the system directories:

./onload_install

Successful installation will output the following 3 lines from the onload_install process:

onload_install: Build complete.
onload_install: Installing OpenOnload.
onload_install: Install complete.

NOTE: The onload_install script does not create RPMs.

Load Onload Drivers

Following installation it is necessary to load the Onload drivers:

onload_tool reload

When used with OpenOnload this command will replace any previously loaded network adapter driver with the driver from the OpenOnload distribution. Check that the Solarflare drivers are loaded using the following commands:

lsmod | grep sfc
lsmod | grep onload

An alternative to the reload command is to reboot the system to load the Onload drivers.
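The lsmod checks above can be wrapped in a small script. The following is a convenience sketch, not part of the Onload tools; it reports whether the sfc and onload modules are loaded and degrades gracefully on hosts where the drivers (or lsmod itself) are absent:

```shell
# Report whether the Solarflare net driver (sfc) and the Onload kernel
# module are currently loaded. Prints "not loaded" where lsmod finds
# nothing or where lsmod is unavailable.
for mod in sfc onload; do
    if lsmod 2>/dev/null | grep -q "^$mod "; then
        echo "$mod: loaded"
    else
        echo "$mod: not loaded"
    fi
done
```

If either module is reported as not loaded, run onload_tool reload (or reboot) before attempting to accelerate traffic.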
Confirm Onload Installation

When the Onload installation is complete, run the onload command from the openonload-<version>/scripts sub-directory to confirm installation of the Onload software and kernel module:

[root@server1 scripts]# onload

This will display the Onload product banner and usage:

OpenOnload 201310
Copyright 2006-2012 Solarflare Communications, 2002-2005 Level 5 Networks
Built: Oct 15 2013 09:19:23 (release)
Kernel module: 201310

usage: onload [options] <command> <command-args>

options:
  --profile=         comma sep list of config profile(s)
  --force-profiles   profile settings override environment
  --no-app-handler   do not use app-specific settings
  --app=             identify application to run under onload
  --version          print version information
  -v                 verbose
  -h --help          this help message

4.10 Onload Kernel Modules

To identify Solarflare drivers already installed on the server:

modprobe -l | grep sfc
modprobe -l | grep onload

Driver Name        Description
sfc.ko             The Linux net driver; provides the interface between the Linux network stack and the Solarflare network adapter.
sfc_char.ko        Provides low-level access to the Solarflare network adapter virtualized resources. Supports direct access to the network adapter for applications that use the ef_vi user-level interface for maximum performance.
sfc_tune.ko        Used to prevent the kernel from putting the CPUs into a sleep state during idle periods.
sfc_affinity.ko    Used to direct traffic flow managed by a thread to the core the thread is running on; inserts packet filters that override the RSS behaviour.
sfc_resource.ko    Manages the virtualization resources of the adapter and shares the resources between other drivers.
onload.ko          The kernel component of Onload.
To unload any loaded drivers:

onload_tool unload

To remove the installed files of a previous Onload:

onload_uninstall

To load the Solarflare and Onload drivers (if not already loaded):

modprobe sfc

To reload the drivers following an upgrade or changed settings:

onload_tool reload

4.11 Configuring the Network Interfaces

Network interfaces should be configured according to the Solarflare Server Adapter User's Guide. When the interface(s) have been configured, the dmesg command will display output similar to the following (one entry for each Solarflare interface):

sfc 0000:13:00.0: INFO: eth2 Solarflare Communications NIC PCI(1924:803)
sfc 0000:13:00.1: INFO: eth3 Solarflare Communications NIC PCI(1924:803)

NOTE: IP address configuration should be carried out using normal OS tools, e.g. system-config-network (Red Hat) or yast (SUSE).

4.12 Installing Netperf

Refer to the Low Latency Quickstart Guide on page 4 for instructions to install Netperf and the Solarflare sfnettest applications.

4.13 How to Run Onload

Once Onload has been installed there are different ways to accelerate applications. Exporting LD_PRELOAD means that all applications started in the same environment will be accelerated:

# export LD_PRELOAD=libonload.so

Prefixing the application command line with the onload command accelerates that application:

# onload <app_name> [app_options]

4.14 Testing the Onload Installation

The Low Latency Quickstart Guide on page 4 demonstrates testing of Onload with Netperf and the Solarflare sfnettest benchmark tools.

4.15 Apply an Onload Patch

Occasionally the Solarflare Support Group may issue a software 'patch' which is applied to Onload to resolve a specific bug or investigate a specific issue. The following procedure describes how a patch should be applied to the installed OpenOnload software.

1 Copy the patch to a directory on the server where Onload is already installed.
2 Go to the onload directory and apply the patch, e.g.

  cd openonload-<version>
  [openonload-<version>]$ patch -p1 < ~/<path>/<patchname>.patch

3 Uninstall the old onload drivers:

  [openonload-<version>]$ onload_uninstall

4 Build and re-install the onload drivers:

  [openonload-<version>]$ ./scripts/onload_install
  [openonload-<version>]$ onload_tool reload

The following procedure describes how a patch should be applied to the installed EnterpriseOnload RPM. (This example patches EnterpriseOnload version 2.1.0.3.)

1 Copy the patch to the directory on the server where the EnterpriseOnload RPM package exists and carry out the following commands:

  rpm2cpio enterpriseonload-2.1.0.3-1.src.rpm | cpio -id
  tar -xzf enterpriseonload-2.1.0.3.tgz
  cd enterpriseonload-2.1.0.3
  patch -p1 < $PATCHNAME

2 Onload can now be installed directly from this directory:

  ./scripts/onload_install

3 Or it can be repackaged as a new RPM:

  cd ..
  tar czf enterpriseonload-2.1.0.3.tgz enterpriseonload-2.1.0.3
  rpmbuild -ts enterpriseonload-2.1.0.3.tgz

4 The rpmbuild procedure will display a 'Wrote' line identifying the location of the built RPM, e.g.

  Wrote: /root/rpmbuild/SRPMS/enterpriseonload-2.1.0.3-1.el6.src.rpm

5 Install the RPM in the usual way:

  rpm -ivh /root/rpmbuild/SRPMS/enterpriseonload-2.1.0.3-1.el6.src.rpm

Chapter 5: Tuning Onload

5.1 Introduction

This chapter documents the available tuning options for Onload, and the expected results. The options can be split into the following categories:

• System Tuning
• Standard Latency Tuning
• Advanced Tuning driven from analysis of the Onload stack using onload_stackdump

Most of the Onload configuration parameters, including tuning parameters, are set by environment variables exported into the accelerated application's environment. Environment variables can be identified throughout this manual as they begin with EF_.
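As a minimal sketch of this pattern (bash/sh syntax; the variable value here is illustrative, and any application subsequently started from this shell, e.g. via the onload command, inherits the setting):

```shell
# Export an Onload tuning variable (EF_ prefix) into the environment.
export EF_POLL_USEC=100000

# Verify the variable is present in the environment that child
# processes will inherit:
env | grep '^EF_POLL_USEC='
```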
All environment variables are described in Appendices A and B of this manual. Examples throughout this guide assume the use of the bash or sh shells; other shells may use different methods to export variables into the application's environment.

Section 5.2 describes tools and commands which can be used to tune the server and OS. Section 5.3 describes how to perform standard heuristic tuning, which can help improve the application's performance; it also includes benchmark examples running specific tests to demonstrate the improvements Onload can bring to an application. Section 5.5 introduces advanced tuning options using onload_stackdump, with worked examples to demonstrate how to achieve the application tuning goals.

NOTE: Onload tuning and kernel driver tuning are subject to different requirements. This section describes the steps to tune Onload. For details on how to tune the Solarflare kernel driver, refer to the 'Performance Tuning on Linux' section of the Solarflare Server Adapter User Guide.

5.2 System Tuning

This section details steps to tune the server and operating system for lowest latency.

Sysjitter

The Solarflare sysjitter utility measures the extent to which the system introduces jitter and so impacts the user-level process. Sysjitter runs a thread on each processor core and, when the thread is de-scheduled from the core, measures for how long. Sysjitter produces summary statistics for each processor core. The sysjitter utility can be downloaded from www.openonload.org

Sysjitter should be run on a system that is otherwise idle. When running on a system with cpusets enabled, run sysjitter as root. Refer to the sysjitter README file for further information on building and running sysjitter.

The following is an example of the output from sysjitter on a single CPU socket server with 4 CPU cores.
./sysjitter --runtime 10 200 | column -t

core_i:          0           1           2           3
threshold(ns):   200         200         200         200
cpu_mhz:         3215        3215        3215        3215
runtime(ns):     9987653973  9987652245  9987652070  9987652027
runtime(s):      9.988       9.988       9.988       9.988
int_n:           10001       10130       10012       10001
int_n_per_sec:   1001.336    1014.252    1002.438    1001.336
int_min(ns):     1333        1247        1299        1446
int_median(ns):  1390        1330        1329        1470
int_mean(ns):    1424        1452        1452        1502
int_90(ns):      1437        1372        1357        1519
int_99(ns):      1619        5046        2392        1688
int_999(ns):     5065        22977       15604       3694
int_9999(ns):    31260       39017       184305      36419
int_99999(ns):   40613       45065       347097      49998
int_max(ns):     40613       45065       347097      49998
int_total(ns):   14244846    14719972    14541991    15031294
int_total(%):    0.143       0.147       0.146       0.150

The table below describes the output fields of the sysjitter utility.

Field             Description
threshold (ns)    ignore any interruptions shorter than this period
cpu_mhz           CPU speed
runtime (ns)      runtime of sysjitter - nanoseconds
runtime (s)       runtime of sysjitter - seconds
int_n             number of interruptions to the user thread
int_n_per_sec     number of interruptions to the user thread per second
int_min (ns)      minimum time taken away from the user thread due to an interruption
int_median (ns)   median time taken away from the user thread due to an interruption
int_mean (ns)     mean time taken away from the user thread due to an interruption
int_90 (ns)       90% percentile value
int_99 (ns)       99% percentile value
int_999 (ns)      99.9% percentile value
int_9999 (ns)     99.99% percentile value
int_99999 (ns)    99.999% percentile value
int_max (ns)      maximum time taken away from the user thread
int_total (ns)    total time spent not processing the user thread
int_total (%)     int_total (ns) as a percentage of total runtime

CPU Power Saving Mode

Modern processors utilize design features that enable a CPU core to drop into lower power states when instructed by the operating system that the CPU core is idle.
When the OS schedules work on the idle CPU core (or when other CPU cores or devices need to access data currently in the idle CPU core's data cache), the CPU core is signalled to return to the fully-on power state. These changes in CPU core power state create additional network latency and jitter.

Solarflare therefore recommend that customers wishing to achieve the lowest latency and lowest jitter disable the "C1E power state" or "CPU power saving mode" within the machine's BIOS. If this is not possible, Solarflare provide an alternative software mechanism to prevent each CPU core from entering these lower power states. This is achieved by using the following command (as root):

onload_tool disable_cstates [persist]

Use the persist option to retain the setting following system restarts.

This command prevents Linux from indicating to the CPU core that it is idle, but can result in the processor using additional power compared with disabling the lower power states in the BIOS. It can therefore cause the processor to operate at a higher temperature; this is why disabling processor power states in the BIOS is recommended where available. Disabling the CPU power saving modes is required if the application is to realize low latency with low jitter.

NOTE: The onload_tool disable_cstates command relies on the idle_enable=2 option in the onload.conf file. On Linux 3.x kernels this does not function, and the user should replace the 'options sfc_tune idle_enable=2' line in /etc/modprobe.d/onload.conf with intel_idle.max_cstate=0 idle=poll.

Customers should consult their system vendor and documentation for details concerning the disabling of C1E, C-states or CPU power saving states.

5.3 Standard Tuning

This section details standard tuning steps for Onload.
Spinning (busy-wait)

Conventionally, when an application attempts to read from a socket and no data is available, the application will enter the OS kernel and block. When data becomes available, the network adapter will interrupt the CPU, allowing the kernel to reschedule the application to continue.

Blocking and interrupts are relatively expensive operations, and can adversely affect bandwidth, latency and CPU efficiency. Onload can instead be configured to spin on the processor in user mode for up to a specified number of microseconds waiting for data from the network. If the spin period expires, the processor reverts to conventional blocking behaviour. Onload uses the EF_POLL_USEC environment variable to configure the length of the spin timeout:

export EF_POLL_USEC=100000

will set the busy-wait period to 100 milliseconds. See Appendix B: Meta Options on page 121 for more details.

Enabling spinning

To enable spinning in Onload, set EF_POLL_USEC. This causes Onload to spin on the processor for up to the specified number of microseconds before blocking. This setting is used for TCP and UDP, and also in recv(), select(), pselect(), poll(), ppoll(), epoll_wait() and epoll_pwait(). Use the following command:

export EF_POLL_USEC=100000

NOTE: If neither of the spinning options EF_POLL_USEC and EF_SPIN_USEC is set, Onload will resort to the default interrupt-driven behaviour because the EF_INT_DRIVEN environment variable is enabled by default.

Setting the EF_POLL_USEC variable also sets the following environment variables:

EF_SPIN_USEC=EF_POLL_USEC
EF_SELECT_SPIN=1
EF_EPOLL_SPIN=1
EF_POLL_SPIN=1
EF_PKT_WAIT_SPIN=1
EF_TCP_SEND_SPIN=1
EF_UDP_RECV_SPIN=1
EF_UDP_SEND_SPIN=1
EF_TCP_RECV_SPIN=1
EF_BUZZ_USEC=EF_POLL_USEC
EF_SOCK_LOCK_BUZZ=1
EF_STACK_LOCK_BUZZ=1

Turn off adaptive moderation and set interrupt moderation to 60 microseconds.
Use the following command:

/sbin/ethtool -C eth2 rx-usecs 60 adaptive-rx off

See Appendix B: Meta Options on page 121 for more details.

When to Use Spinning

The optimal setting is dependent on the nature of the application. If an application is likely to find data soon after blocking, or the system does not have any other major tasks to perform, spinning can improve latency and bandwidth significantly. In general, an application will benefit from spinning if the number of active threads is less than the number of available CPU cores. However, if the application has more active threads than available CPU cores, spinning can adversely affect application performance because a thread that is spinning (and therefore idle) takes CPU time away from another thread that could be doing work. If in doubt, it is advisable to try an application with a range of settings to discover the optimal value.

Polling vs. Interrupts

Interrupts are useful because they allow the CPU to do other useful work while simultaneously waiting for asynchronous events (such as the reception of packets from the network). The historical alternative to interrupts was for the CPU to periodically poll for asynchronous events, and on single-processor systems this could result in greater latency than would be observed with interrupts. Historically it was accepted that interrupts were "good for latency".

On modern multicore systems the tradeoffs are different. It is often possible to dedicate an entire CPU core to the processing of a single source of asynchronous events (such as network traffic). The CPU dedicated to processing network traffic can be spinning (busy waiting), continuously polling for the arrival of packets. When a packet arrives, the CPU can begin processing it almost immediately.

Contrast the polling model with an interrupt-driven model. Here the CPU is likely in its "idle loop" when an interrupt occurs.
The idle loop is interrupted, the interrupt handler executes, typically marking a worker task as runnable. The OS scheduler then runs and switches to the kernel thread that will process the incoming packet. There is typically a subsequent task switch to a user-mode thread where the real work of processing the event (e.g. acting on the packet payload) is performed. Depending on the system, it can take on the order of a microsecond to respond to an interrupt and switch to the appropriate thread context before beginning the real work of processing the event. A dedicated CPU spinning in a polling loop can begin processing the asynchronous event in a matter of nanoseconds.

It follows that spinning only becomes an option if a CPU core can be dedicated to the asynchronous event. If there are more threads awaiting events than CPU cores (i.e. if all CPU cores are oversubscribed to application worker threads), then spinning is not a viable option (at least, not for all events). One thread will be spinning, polling for the event, while another could be doing useful work. Spinning in such a scenario can lead to dramatically increased latencies. But if a CPU core can be dedicated to each thread that blocks waiting for network I/O, then spinning is the best method to achieve the lowest possible latency.

5.4 Performance Jitter

On any system, reducing or eliminating jitter is key to gaining optimum performance; however, the causes of jitter leading to poor performance can be difficult to identify and difficult to remedy. The following section identifies some key points that should be considered.

• A first step towards reducing jitter should be to consider the configuration settings specified in the Low Latency Quickstart Guide on page 4 - this includes the disabling of the irqbalance service, interrupt moderation settings and measures to prevent CPU cores switching to power saving modes.
• Use isolcpus to isolate CPU cores that the application - or at least the critical threads of the application - will use, and prevent OS housekeeping tasks and other non-critical tasks from running on these cores.

• Set an application thread running on one core and the interrupts for that thread on a separate core - but on the same physical CPU package. Even when spinning, interrupts may still occur; for example, if the application fails to call into the Onload stack for extended periods because it is busy doing other work.

• Ideally each spinning thread will be allocated a separate core so that, in the event that it blocks or is de-scheduled, it will not prevent other important threads from doing work. A common cause of jitter is more than one spinning thread sharing the same CPU core. Jitter spikes may indicate that one thread is being held off the CPU core by another thread.

• When EF_STACK_LOCK_BUZZ=1, threads will spin for the EF_BUZZ_USEC period while they wait to acquire the stack lock. Lock buzzing can lead to unfairness between threads competing for a lock, and so result in resource starvation for one. Occurrences of this are counted in the 'stack_lock_buzz' counter.

• If a multi-threaded application is doing lots of socket operations, stack lock contention will lead to send/receive performance jitter. In such cases improved performance can be achieved when each contending thread has its own stack. This can be managed with EF_STACK_PER_THREAD, which creates a separate Onload stack for the sockets created by each thread. If separate stacks are not an option then it may be beneficial to reduce the EF_BUZZ_USEC period or to disable stack lock buzzing altogether.

• It is always important that threads that need to communicate with each other are running on the same CPU package so that these threads can share a memory cache.

• Jitter may also be introduced when some sockets are accelerated and others are not.
Onload will ensure that accelerated sockets are given priority over non-accelerated sockets, although this delay will only be in the region of a few microseconds - not milliseconds - and the penalty will always be on the side of the non-accelerated sockets. The environment variables EF_POLL_FAST_USEC and EF_POLL_NONBLOCK_FAST_USEC can be configured to manage the extent of the priority of accelerated sockets over non-accelerated sockets.

• If traffic is sparse, spinning will still deliver the same latency benefits, but the user should ensure that the spin timeout period, configured using the EF_POLL_USEC variable, is sufficiently long to ensure the thread is still spinning when traffic is received.

• When applications only need to send and receive occasionally, it may be beneficial to implement a keepalive/heartbeat mechanism between peers. This has the effect of retaining the process data in the CPU memory cache. Calling send or receive after a delay can result in the call taking measurably longer, due to these cache effects, than if it is called in a tight loop.

• On some servers BIOS settings such as power and utilization monitoring can cause unnecessary jitter by performing monitoring tasks on all CPU cores. The user should check the BIOS and decide whether periodic tasks (and the related SMIs) can be disabled.

• The Solarflare sysjitter utility can be used to identify and measure jitter on all cores of an idle system - refer to Sysjitter on page 22 for details.

Using Onload Tuning Profiles

Environment variables set in the application user-space can be used to configure and control aspects of the accelerated application's performance. These variables can be exported using the Linux export command, e.g.

export EF_POLL_USEC=100000

Onload supports tuning profile script files, which are used to group environment variables within a single file to be called from the Onload command line.
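For illustration, a profile is simply a file of onload_set directives. The file name, location and value below are hypothetical, chosen only to show the format:

```
# /tmp/myprofile.opf -- hypothetical user-defined Onload profile.
# Spin for up to 50 ms before blocking (illustrative value only).
onload_set EF_POLL_USEC 50000
```

Such a file is then passed to the onload command via its --profile option.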
The latency profile sets EF_POLL_USEC=100000, setting the busy-wait spin timeout to 100 milliseconds. The profile also disables TCP faststart for new or idle connections, where the additional TCP ACKs would add latency to the receive path. To use the profile, include it on the onload command line e.g.

onload --profile=latency netperf -H onload2-sfc -t TCP_RR

Following Onload installation, profiles provided by Solarflare are located in the following directory (this directory will be deleted by the onload_uninstall command):

/usr/libexec/onload/profiles

User-defined environment variables can be written to a user-defined profile script file (having a .opf extension) and stored in any directory on the server. The full path to the file should then be specified on the onload command line e.g.

onload --profile=/tmp/myprofile.opf netperf -H onload2-sfc -t TCP_RR

As an example, the latency profile provided by the Onload distribution is shown below:

# Onload low latency profile.
# Enable polling / spinning. When the application makes a blocking call
# such as recv() or poll(), this causes Onload to busy wait for up to 100ms
# before blocking.
onload_set EF_POLL_USEC 100000
# Disable FASTSTART when connection is new or has been idle for a while.
# The additional acks it causes add latency on the receive path.
onload_set EF_TCP_FASTSTART_INIT 0
onload_set EF_TCP_FASTSTART_IDLE 0

For a complete list of environment variables refer to Appendix A: Parameter Reference on page 93.

Benchmark Testing

Benchmark procedures using Onload, netperf and sfnt_pingpong are described in the Low Latency Quickstart Guide on page 4.

5.5 Advanced Tuning

Advanced tuning requires closer examination of the application performance. The application should be tuned to achieve the following objectives:
• To have as much processing at user-level as possible.
• To have as few interrupts as possible.
• To eliminate drops.
• To minimize lock contention.

Onload includes a diagnostic application called onload_stackdump, which can be used to monitor Onload performance and to set tuning options. The following sections demonstrate the use of onload_stackdump to examine aspects of system performance and to set environment variables to achieve the tuning objectives. For further examples and use of onload_stackdump refer to Appendix E: onload_stackdump on page 137.

Monitoring Using onload_stackdump

To use onload_stackdump, enter the following command:

onload_stackdump

To list available commands and view documentation for onload_stackdump, enter the following commands:

onload_stackdump doc
onload_stackdump -h

A specific stack number can also be provided on the onload_stackdump command line.

Worked Examples

Processing at User-Level

Many applications can achieve better performance when most processing occurs at user-level rather than kernel-level. To identify how an application is performing, enter the following command:

onload_stackdump lots | grep poll

Output:

$ onload_stackdump lots | grep poll
  time: netif=52a6fc7 poll=52a6fc7 now=52a6fd8 (diff=0.017sec)
  k_polls: 673 u_polls: 2 periodic_polls: 655 evq_wakeup_polls: 12 deferred_polls: 0 evq_timeout_polls: 4
$

The output identifies many more k_polls than u_polls, indicating that the stack is operating mainly at kernel-level and not achieving optimal performance.

Solution

Terminate the application and set the EF_POLL_USEC parameter to 100000.
Re-start the application and re-run onload_stackdump:

export EF_POLL_USEC=100000
onload_stackdump lots | grep polls

$ onload_stackdump lots | grep polls
  time: netif=52debb1 poll=52debb1 now=52debb1 (diff=0.000sec)
  k_polls: 18 u_polls: 181272 periodic_polls: 35 evq_wakeup_polls: 3 deferred_polls: 0 evq_timeout_polls: 1
$

The output identifies that the number of u_polls is far greater than the number of k_polls, indicating that the stack is now operating mainly at user-level.

As Few Interrupts as Possible

All applications achieve better performance when subject to as few interrupts as possible. The level of interrupts is reported as the evq_interrupt_wakes value reported by onload_stackdump. For example:

onload_stackdump lots | grep _evs

Output:

$ onload_stackdump lots | grep _evs
  evq: cap=2048 current=70 is_32_evs=0 is_ev=0
  rx_evs: 77274 tx_evs: 6583 periodic_evs: 0 evq_interrupt_wakes: 52 evq_timeout_evs: 3
$

The output identifies the number of packets sent and received (tx_evs and rx_evs), together with the number of interrupts (evq_interrupt_wakes).

Solution

If an application is observed taking many interrupts, it may be beneficial to increase the spin time with the EF_POLL_USEC variable, or to set a higher interrupt moderation value for the net driver using ethtool. The number of interrupts on the system can also be identified from /proc/interrupts.

Eliminating Drops

The performance of networks is impacted by any packet loss. This is especially pronounced for reliable data transfer protocols that are built on top of unicast or multicast UDP sockets. First check whether packets have been dropped by the network adapter before reaching the Onload stack.
Use ethtool to collect statistics directly from the network adapter:

$ ethtool -S eth2 | egrep "(nodesc)|(bad)"
  tx_bad_bytes: 0
  tx_bad: 0
  rx_bad_bytes: 0
  rx_bad: 0
  rx_bad_lt64: 0
  rx_bad_64_to_15xx: 0
  rx_bad_15xx_to_jumbo: 0
  rx_bad_gtjumbo: 0
  rx_nodesc_drop_cnt: 0
$

An rx_nodesc_drop_cnt that increases over time is an indication that packets are being dropped by the adapter due to a lack of Onload-provided receive buffering. Packets can also be dropped with UDP due to datagrams arriving when the socket buffer is full, i.e. traffic is arriving faster than the application can consume it. To check for dropped packets at the socket level, enter:

onload_stackdump lots | grep drop

Output:

$ onload_stackdump lots | grep drop
  rcv: drop=0(0%) eagain=0 pktinfo=0
$

The output identifies that packets are not being dropped on the system.

Solution

If packet loss is observed at the network level due to a lack of receive buffering, try increasing the size of the receive descriptor queue via EF_RXQ_SIZE. If packet drops are observed at the socket level, consult the application documentation - it may also be worth experimenting with socket buffer sizes (see EF_UDP_RCVBUF). Setting the EF_EVS_PER_POLL variable to a higher value may also improve efficiency - refer to Appendix A for a description of this variable.

Minimizing Lock Contention

Lock contention can greatly affect performance. Use onload_stackdump to identify instances of lock contention:

onload_stackdump lots | egrep "(lock_)|(sleep)"

Output:

$ onload_stackdump lots | egrep "(lock_)|(sleep)"
  sleep_seq=1 wake_rq=TxRx flags=
  sock_sleeps: 1 unlock_slow: 0 lock_wakes: 0 stack_lock_buzz: 0
  sock_lock_sleeps: 0 sock_lock_buzz: 0
  tcp_send_ni_lock_contends: 0 udp_send_ni_lock_contends: 0 getsockopt_ni_lock_contends: 0 setsockopt_ni_lock_contends: 0
$

The output identifies that very little lock contention is occurring.
Solution

If high values are observed for any of the lock variables, try increasing the value of EF_BUZZ_USEC to reduce the 'sleeps' value. If stacks are being shared across processes, try using a different stack per process.

Chapter 6: Features & Functionality

6.1 Introduction

This chapter provides detailed information about specific aspects of Solarflare Onload operation and functionality.

6.2 Onload Transparency

Onload provides significantly improved performance without the need to rewrite or recompile the user application, whilst retaining complete interoperability with the standard TCP and UDP protocols. In the regular kernel TCP/IP architecture an application is dynamically linked to the libc library. This OS library provides support for the standard BSD sockets API via a set of ’wrapper’ functions, with the real processing occurring at kernel-level. Onload also supports the standard BSD sockets API. However, in contrast to the kernel TCP/IP, Onload moves protocol processing out of kernel-space and into the user-level Onload library itself. As a networking application invokes the standard socket API function calls, e.g. socket(), read(), write() etc., these are intercepted by the Onload library using the LD_PRELOAD mechanism on Linux. For each function call, Onload examines the file descriptor: sockets using a Solarflare interface are processed by the Onload stack, whilst those not using a Solarflare interface are transparently passed to the kernel stack.

6.3 Onload Stacks

An Onload 'stack' is an instance of a TCP/IP stack. The stack includes transmit and receive buffers, open connections and the associated port numbers and stack options. Each stack has associated with it one or more virtual NICs (typically one per physical port that the stack is using).
In normal usage, each accelerated process has its own Onload stack, shared by all connections created by the process. It is also possible for multiple processes to share a single Onload stack instance (refer to Stack Sharing on page 55), and for a single application to have more than one Onload stack. Refer to Appendix D: Onload Extensions API on page 112.

6.4 Virtual Network Interface (VNIC)

The Solarflare network adapter supports 1024 transmit queues, 1024 receive queues, 1024 event queues and 1024 timer resources per network port. A VNIC (virtual network interface) consists of one unique instance of each of these resources, which allows the VNIC client, i.e. the Onload stack, an isolated and safe mechanism for sending and receiving network traffic. Received packets are steered to the correct VNIC by means of IP/MAC filter tables on the network adapter and/or Receive Side Scaling (RSS). An Onload stack allocates one VNIC per Solarflare network port, so it has a dedicated send and receive channel from user mode.

Following a reset of the Solarflare network adapter driver, all virtual interface resources, including Onload stacks and sockets, will be re-instated. The reset operation will be transparent to the application, but traffic will be lost during the reset.

6.5 Functional Overview

When establishing its first socket, an application is allocated an Onload stack, which allocates the required VNICs. When a packet arrives, IP filtering in the adapter identifies the socket and the data is written to the next available receive buffer in the corresponding Onload stack. The adapter then writes an event to an “event queue” managed by Onload. If the application is regularly making socket calls, Onload is regularly polling this event queue, and processing events, rather than taking interrupts, is the normal means by which the application rendezvous with its data.
User-level processing significantly reduces kernel/user-level context switching, and interrupts are only required when the application blocks - since when the application is making socket calls, Onload is busy processing the event queue, picking up new network events.

6.6 Onload with Mixed Network Adapters

A server may be equipped with both Solarflare and non-Solarflare network interfaces. When an application is accelerated, Onload reads the Linux kernel routing table (Onload will only consider the kernel default routing table) to identify which network interface is required to make a connection. If a non-Solarflare interface is required to reach a destination, Onload will pass the connection to the kernel TCP/IP stack. No additional configuration is required to achieve this; Onload does it automatically by looking in the IP route table.

6.7 Maximum Number of Network Interfaces

Onload supports up to 6 Solarflare network interfaces by default. If an application requires more Solarflare interfaces, the following values can be altered in the src/include/ci/internal/transport_config_opt.h header file: CI_CFG_MAX_INTERFACES and CI_CFG_MAX_REGISTER_INTERFACES. Following changes to these values it is necessary to rebuild and reinstall Onload.

6.8 Onloaded PIDs

To identify processes accelerated by Onload use the onload_fuser command:

# onload_fuser -v
9886 ping

Only processes that have created an Onload stack are present. Processes which are loaded under Onload, but have not created any sockets, are not present. The onload_stackdump command can also list accelerated processes - see List Onloaded Processes on page 138 for details.

6.9 Onload and File Descriptors, Stacks and Sockets

For an Onloaded process it is possible to identify the file descriptors, Onload stacks and sockets being accelerated by Onload. Use the /proc/<PID>/fd file, supplying the PID of the accelerated process e.g.
# ls -l /proc/9886/fd
total 0
lrwx------ 1 root root 64 May 14 14:09 0 -> /dev/pts/0
lrwx------ 1 root root 64 May 14 14:09 1 -> /dev/pts/0
lrwx------ 1 root root 64 May 14 14:09 2 -> /dev/pts/0
lrwx------ 1 root root 64 May 14 14:09 3 -> onload:[tcp:6:3]
lrwx------ 1 root root 64 May 14 14:09 4 -> /dev/pts/0
lrwx------ 1 root root 64 May 14 14:09 5 -> /dev/onload
lrwx------ 1 root root 64 May 14 14:09 6 -> onload:[udp:6:2]

Accelerated file descriptors are listed as symbolic links to /dev/onload. Accelerated sockets are described in [protocol:stack:socket] format.

6.10 System calls intercepted by Onload

System calls intercepted by the Onload library are listed in the following file:

[onload]/src/include/onload/declare_syscalls.h.tmpl

6.11 Linux Sysctls

The Linux directory /proc/sys/net/ipv4 contains default settings which tune different parts of the IPv4 networking stack. In many cases Onload takes its default settings from the values in this directory. These defaults can be overridden, for a specified process or socket, using socket options or with Onload environment variables. The entries below identify the default Linux values and how Onload tuning parameters can override the Linux settings.

Kernel value: tcp_slow_start_after_idle
Description: controls congestion window validation as per RFC 2861. This is "off" by default in Onload, as it is not usually useful in modern switched networks.
Onload value: #define CI_CFG_CONGESTION_WINDOW_VALIDATION
Comments: in transport_config_opt.h - recompile after changing.

Kernel value: tcp_congestion_control
Description: determines what congestion control algorithm is used by TCP.
Valid settings include reno, bic and cubic.
Onload value: no direct equivalent - see the section on TCP Congestion Control.
Comments: see EF_CONG_AVOID_SCALE_BACK.

Kernel value: tcp_adv_win_scale
Description: defines how quickly the TCP window will advance.
Onload value: no direct equivalent - see the section on TCP Congestion Control.
Comments: see EF_TCP_ADV_WIN_SCALE_MAX.

Kernel value: tcp_rmem
Description: the default size of sockets' receive buffers (in bytes).
Onload value: defaults to the currently active Linux settings.
Comments: can be overridden with the SO_RCVBUF socket option; can be set with EF_TCP_RCVBUF.

Kernel value: tcp_wmem
Description: the default size of sockets' send buffers (in bytes).
Onload value: defaults to the currently active Linux settings.
Comments: EF_TCP_SNDBUF overrides SO_SNDBUF, which overrides tcp_wmem.

Kernel value: tcp_dsack
Description: allows TCP to send duplicate SACKs.
Onload value: uses the currently active Linux settings.

Kernel value: tcp_fack
Description: enables fast retransmissions.
Onload value: fast retransmissions are always enabled.

Kernel value: tcp_sack
Description: enables TCP selective acknowledgements, as per RFC 2018.
Onload value: enabled by default.
Comments: clear bit 2 of EF_TCP_SYN_OPTS to disable.

Kernel value: tcp_max_syn_backlog
Description: the maximum size of a listening socket's backlog.
Onload value: set with EF_TCP_BACKLOG_MAX.

Refer to Appendix A: Parameter Reference on page 93 for details of environment variables.

6.12 Changing Onload Control Plane Table Sizes

Onload supports the following runtime configurable options which determine the size of control plane tables:

Option: max_layer2_interfaces
Description: Sets the maximum number of network interfaces, including physical interfaces, VLANs and bonds, supported in Onload’s control plane.
Default: 50

Option: max_neighs
Description: Sets the maximum number of rows in the Onload ARP/neighbour table. The value is rounded up to a power of two.
Default: 1024

Option: max_routes
Description: Sets the maximum number of entries in the Onload route table.
Default: 256

The entries above identify the default values for the Onload control plane tables. The default values are normally sufficient for the majority of applications, and creating larger tables may impact application performance. If non-default values are needed, the user should create a file in the /etc/modprobe.d directory. The file must have a .conf extension, and Onload options can be added to the file, a single option per line, in the following format:

options onload max_neighs=512

Following changes, Onload should be restarted using the reload command:

onload_tool reload

6.13 TCP Operation

The table below identifies the Onload TCP implementation RFC compliance.

RFC 793 - Transmission Control Protocol - Yes
RFC 813 - Window and Acknowledgement Strategy in TCP - Yes
RFC 896 - Congestion Control in IP/TCP - Yes
RFC 1122 - Requirements for Hosts - Yes
RFC 1191 - Path MTU Discovery - Yes
RFC 1323 - TCP Extensions for High Performance - Yes
RFC 2018 - TCP Selective Acknowledgment Options - Yes
RFC 2581 - TCP Congestion Control - Yes
RFC 2582 - The NewReno Modification to TCP’s Fast Recovery Algorithm - Yes
RFC 2988 - Computing TCP’s Retransmission Timer - Yes
RFC 3128 - Protection Against a Variant of the Tiny Fragment Attack - Yes
RFC 3168 - The Addition of Explicit Congestion Notification (ECN) to IP - Yes
RFC 3465 - TCP Congestion Control with Appropriate Byte Counting (ABC) - Yes

TCP Handshake - SYN, SYNACK

During the TCP connection establishment 3-way handshake, Onload negotiates the MSS, Window Scale, SACK permitted, ECN, PAWS and RTTM timestamps. For TCP connections Onload will negotiate an appropriate MSS for the MTU configured on the interface.
However, when using jumbo frames, Onload will currently negotiate an MSS value up to a maximum of 2048 bytes minus the number of bytes required for packet headers. This is because the size of the buffers passed to the Solarflare network interface card is 2048 bytes, and the Onload stack cannot currently handle fragmented packets on its TCP receive path. TCP options advertised during the handshake can be selected using the EF_TCP_SYN_OPTS environment variable. Refer to Appendix A: Parameter Reference on page 93 for details of environment variables.

TCP Socket Options

Onload TCP supports the following socket options, which can be used in the setsockopt() and getsockopt() function calls.

SO_ACCEPTCONN - determines whether the socket can accept incoming connections - true for listening sockets. (Only valid as a getsockopt()).
SO_BINDTODEVICE - bind this socket to a particular network interface.
SO_CONNECT_TIME - number of seconds a connection has been established. (Only valid as a getsockopt()).
SO_DEBUG - enable protocol debugging.
SO_DONTROUTE - outgoing data should be sent on whatever interface the socket is bound to and not routed via another interface.
SO_ERROR - the errno value of the last error occurring on the socket. (Only valid as a getsockopt()).
SO_EXCLUSIVEADDRUSE - prevents other sockets using the SO_REUSEADDR option to bind to the same address and port.
SO_KEEPALIVE - enable sending of keep-alive messages on connection-oriented sockets.
SO_LINGER - when enabled, a close() or shutdown() will not return until all queued messages for the socket have been successfully sent or the linger timeout has been reached. Otherwise the close() or shutdown() returns immediately and sockets are closed in the background.
SO_OOBINLINE - indicates that out-of-band data should be returned in-line with regular data.
This option is only valid for connection-oriented protocols that support out-of-band data.
SO_PRIORITY - set the priority for all packets sent on this socket. Packets with a higher priority may be processed first depending on the selected device queueing discipline.
SO_RCVBUF - sets or gets the maximum socket receive buffer in bytes. The value set is doubled by the kernel and by Onload to allow for bookkeeping overhead when it is set by the setsockopt() function call. Note that EF_TCP_RCVBUF overrides this value.
SO_RCVLOWAT - sets the minimum number of bytes to process for socket input operations.
SO_RCVTIMEO - sets the timeout for blocking input operations to complete.
SO_REUSEADDR - can reuse local port numbers, i.e. another socket can bind to the same port, except when there is an active listening socket bound to the port.
SO_SNDBUF - sets or gets the maximum socket send buffer in bytes. The value set is doubled by the kernel and by Onload to allow for bookkeeping overhead when it is set by the setsockopt() function call. Note that EF_TCP_SNDBUF and EF_TCP_SNDBUF_MODE override this value.
SO_SNDLOWAT - sets the minimum number of bytes to process for socket output operations. Always set to 1 byte.
SO_SNDTIMEO - sets the timeout for a sending function to send before reporting an error.
SO_TIMESTAMP - enable/disable receiving the SO_TIMESTAMP control message.
SO_TIMESTAMPNS - enable/disable receiving the SO_TIMESTAMPNS control message.
SO_TIMESTAMPING - enable/disable hardware timestamps for received packets. See SO_TIMESTAMPING (hardware timestamps) on page 50.
SO_TYPE - returns the socket type (SOCK_STREAM or SOCK_DGRAM). (Only valid as a getsockopt()).
TCP Level Options

Onload TCP supports the following TCP options, which can be used in the setsockopt() and getsockopt() function calls.

TCP_CORK - stops sends on segments less than MSS size until the connection is uncorked.
TCP_DEFER_ACCEPT - a connection is ESTABLISHED after the handshake is complete instead of leaving it in SYN-RECV until the first real data packet arrives. The connection is placed in the accept queue when the first data packet arrives.
TCP_INFO - populates an internal data structure with TCP statistic values.
TCP_KEEPALIVE_ABORT_THRESHOLD - how long to try to produce a successful keepalive before giving up.
TCP_KEEPALIVE_THRESHOLD - specifies the idle time for keepalive timers.
TCP_KEEPCNT - number of keepalives before giving up.
TCP_KEEPIDLE - idle time for keepalives.
TCP_KEEPINTVL - time between keepalives.
TCP_MAXSEG - gets the MSS size for this connection.
TCP_NODELAY - disables Nagle’s Algorithm; small segments are sent without delay and without waiting for previous segments to be acknowledged.
TCP_QUICKACK - when enabled, ACK messages are sent immediately following reception of the next data packet. This flag is reset to zero after every use, i.e. it is a one-time option. Default value is 1 (enabled).

TCP File Descriptor Control

Onload supports the following options in socket() and accept() calls.

SOCK_CLOEXEC - supported in socket() and accept(). Sets the close-on-exec (FD_CLOEXEC) flag on the new file descriptor.
SOCK_NONBLOCK - supported in accept(). Sets the O_NONBLOCK file status flag on the new open file descriptor, saving extra calls to fcntl(2) to achieve the same result.

TCP Congestion Control

Onload TCP implements congestion control in accordance with RFC 3465 and employs the NewReno algorithm with extensions for Appropriate Byte Counting (ABC).
On new or idle connections, and those experiencing loss, Onload employs a Fast Start algorithm in which delayed acknowledgments are disabled, thereby creating more ACKs and consequently ’growing’ the congestion window rapidly. Two environment variables, EF_TCP_FASTSTART_INIT and EF_TCP_FASTSTART_LOSS, are associated with fast start - refer to Appendix A: Parameter Reference on page 93 for details.

During Slow Start, the congestion window is initially set to 2 x maximum segment size (MSS). As each ACK is received, the congestion window size is increased by the number of bytes acknowledged, up to a maximum of 2 x MSS bytes. Onload transmits the minimum of the congestion window and the receiver's advertised window size, i.e.

transmission window (bytes) = min(CWND, receiver advertised window size)

If loss is detected - either by retransmission timeout (RTO) or by the reception of duplicate ACKs - Onload adopts a congestion avoidance algorithm to slow the transmission rate. In congestion avoidance the transmission window is halved from its current size, but will not be less than 2 x MSS. If congestion avoidance was triggered by an RTO timeout, the Slow Start algorithm is again used to restore the transmit rate. If it was triggered by duplicate ACKs, Onload employs a Fast Retransmit and Fast Recovery algorithm. If Onload TCP receives 3 duplicate ACKs, this indicates that a segment has been lost (rather than just received out of order) and causes the immediate retransmission of the lost segment (Fast Retransmit). The continued reception of duplicate ACKs is an indication that traffic still flows within the network, so Onload follows Fast Retransmit with Fast Recovery. During Fast Recovery, Onload again resorts to the congestion avoidance algorithm (without Slow Start), with the congestion window size being halved from its present value.
Onload supports a number of environment variables that influence the behaviour of the congestion window and recovery algorithms (refer to Appendix A: Parameter Reference on page 93):

EF_TCP_INITIAL_CWND - sets the initial size (bytes) of the congestion window.
EF_TCP_LOSS_MIN_CWND - sets the minimum size of the congestion window following loss.
EF_CONG_AVOID_SCALE_BACK - slows down the rate at which the TCP congestion window is opened, to help reduce loss in environments already suffering congestion and loss.

The congestion variables should be used with caution so as to avoid violating TCP protocol requirements and degrading TCP performance.

TCP SACK

Onload will employ TCP Selective Acknowledgment (SACK) if the option has been negotiated and agreed by both ends of a connection during the connection establishment 3-way handshake. Refer to RFC 2018 for further information.

TCP QUICKACK

TCP will generally aim to defer the sending of ACKs in order to minimize the number of packets on the network. Onload supports the standard TCP_QUICKACK socket option, which allows some control over this behaviour. Enabling TCP_QUICKACK causes an ACK to be sent immediately in response to the reception of the following data packet. This is a one-shot operation, and TCP_QUICKACK self-clears to zero immediately after the ACK is sent.

TCP Delayed ACK

By default TCP stacks delay sending acknowledgments (ACKs) to improve efficiency and utilization of a network link. Delayed ACKs also improve receive latency by ensuring that ACKs are not sent on the critical path. However, if the sender of TCP packets is using Nagle’s algorithm, receive latency will be impaired by using delayed ACKs. Using the EF_DELACK_THRESH environment variable, the user can specify how many TCP segments can be received before Onload responds with a TCP ACK. Refer to the Parameter List on page 93 for details of the Onload delayed TCP ACK environment variables.
TCP Dynamic ACK

The sending of excessive TCP ACKs can impair performance and increase receive side latency. Although TCP generally aims to defer the sending of ACKs, Onload also supports a further mechanism. The EF_DYNAMIC_ACK_THRESH environment variable allows Onload to dynamically determine when it is non-detrimental to throughput and efficiency to send a TCP ACK. Onload will force a TCP ACK to be sent if the number of TCP ACKs pending reaches the threshold value. Refer to the Parameter List on page 93 for details of the Onload delayed TCP ACK environment variables.

NOTE: When used together with EF_DELACK_THRESH or EF_DYNAMIC_ACK_THRESH, the socket option TCP_QUICKACK will behave exactly as stated above. Both Onload environment variables identify the maximum number of segments that can be received before an ACK is returned; sending an ACK before the specified maximum is reached is allowed.

NOTE: TCP ACKs should be transmitted at a sufficient rate to ensure the remote end does not drop the TCP connection.

TCP Loopback Acceleration

Onload supports the acceleration of TCP loopback connections, providing an accelerated mechanism through which two processes on the same host can communicate. Accelerated TCP loopback connections do not invoke system calls, reduce the overheads for read/write operations, and offer improved latency over the kernel implementation. The server and client processes that want to communicate using an accelerated TCP loopback connection do not need to be configured to share an Onload stack. However, the server and client TCP loopback sockets can only be accelerated if they are in the same Onload stack; Onload has the ability to move a TCP loopback socket between Onload stacks to achieve this.
TCP loopback acceleration is configured via the environment variables EF_TCP_CLIENT_LOOPBACK and EF_TCP_SERVER_LOOPBACK. As well as enabling TCP loopback acceleration, these environment variables control Onload’s behavior when the server and client sockets do not originate in the same Onload stack. This gives the user greater flexibility and control when establishing loopback on TCP sockets, either from the listening (server) socket or from the connecting (client) socket. The connecting socket can use any local address or specify the loopback address. The following diagram illustrates the client and server loopback options. Refer to Appendix A: Parameter Reference on page 93 for a description of the loopback variables.

Figure 3: EF_TCP_CLIENT/SERVER_LOOPBACK

The client loopback option EF_TCP_CLIENT_LOOPBACK=4, when used with the server loopback option EF_TCP_SERVER_LOOPBACK=2, differs from the other loopback options: rather than moving sockets between existing stacks, it creates an additional stack and moves the sockets from both ends of the TCP connection into this new stack. This avoids the possibility of having many loopback sockets sharing and contending for the resources of a single stack.

TCP Striping

Onload supports a Solarflare proprietary TCP striping mechanism that allows a single TCP connection to use both physical ports of a network adapter. Using the combined bandwidth of both ports means increased throughput for TCP streaming applications. TCP striping can be particularly beneficial for Message Passing Interface (MPI) applications. If the TCP connection’s source IP address and destination IP address are on the same subnet, as defined by EF_STRIPE_NETMASK, then Onload will attempt to negotiate TCP striping for the connection. Onload TCP striping must be configured at both ends of the link.
TCP striping allows a single TCP connection to use the full bandwidth of both physical ports on the same adapter. This should not be confused with link aggregation/port bonding, in which any one TCP connection within the bond can only use a single physical port, so more than one TCP connection would be required to realize the full bandwidth of two physical ports.

NOTE: TCP striping is disabled by default. To enable this feature, set the parameter CI_CFG_PORT_STRIPING=1 in the Onload distribution source file src/include/ci/internal/transport_config_opt.h.

TCP Connection Reset on RTO

Under certain circumstances it may be preferable to avoid re-sending TCP data to a peer service when data delivery has been delayed. Once data has been sent, and no acknowledgment has been received for it, the TCP retransmission timeout period represents a considerable delay. When the retransmission timeout (RTO) eventually expires, it may be preferable not to retransmit the original data. Onload can be configured to reset a TCP connection rather than attempt to retransmit data for which no acknowledgment has been received. This feature is enabled with the EF_TCP_RST_DELAYED_CONN per-stack environment variable and applies to all TCP connections in the Onload stack. On any TCP connection in the Onload stack, if the RTO timer expires before an ACK is received, the TCP connection will be reset.

ONLOAD_MSG_WARM

Applications that send data infrequently may see increased send latency compared to an application that is making frequent sends. This is due to the send path and associated data structures not being cache and TLB resident (which can occur even if the CPU has been otherwise idle since the previous send call). Onload therefore supports applications repeatedly calling send to keep the TCP fast send path ’warm’ in the cache without actually sending data.
This is particularly useful for applications that send only infrequently and helps to maintain low latency performance for those TCP connections. These "fake" sends are performed by setting the ONLOAD_MSG_WARM flag when calling the TCP send calls. The message warm feature does not transmit any packets.

char buf[10];
send(fd, buf, 10, ONLOAD_MSG_WARM);

Onload stackdump supports new counters to indicate the level of message warm use:

tx_msg_warm_try - a count of the number of times a message warm send function was called but the send path was not exercised due to Onload locking constraints.
tx_msg_warm - a count of the number of times a message warm send function was called and the send path was exercised.

NOTE: If the ONLOAD_MSG_WARM flag is used on sockets which are not accelerated - including those handed off to the kernel by Onload - the message warm packets may actually be sent. This is due to a limitation in some Linux distributions which appear to ignore this flag. The Onload extensions API can be used to check whether a socket supports the message warm feature via the onload_fd_check_feature() API (onload_fd_check_feature() on page 114).

NOTE: Onload versions earlier than 201310 do not support the ONLOAD_MSG_WARM socket flag; setting the flag on these versions will cause message warm packets to be sent.

6.14 UDP Operation

The table below identifies the Onload UDP implementation RFC compliance.

RFC 768, User Datagram Protocol - Yes
RFC 1122, Requirements for Hosts - Yes
RFC 3678, Socket Interface Extensions for Multicast Source Filters - Partial; see Source Specific Socket Options on page 50

Socket Options

Onload UDP supports the following socket options, which can be used in the setsockopt() and getsockopt() function calls.

SO_BINDTODEVICE - bind this socket to a particular network interface. See SO_BINDTODEVICE below.
SO_BROADCAST - when enabled, datagram sockets can send and receive packets to/from a broadcast address.
SO_DEBUG - enable protocol debugging.
SO_DONTROUTE - outgoing data should be sent on whatever interface the socket is bound to and not routed via another interface.
SO_ERROR - the errno value of the last error occurring on the socket (only valid as a getsockopt()).
SO_EXCLUSIVEADDRUSE - prevents other sockets using the SO_REUSEADDR option to bind to the same address and port.
SO_LINGER - when enabled, a close() or shutdown() will not return until all queued messages for the socket have been successfully sent or the linger timeout has been reached. Otherwise the call returns immediately and sockets are closed in the background.
SO_PRIORITY - set the priority for all packets sent on this socket. Packets with a higher priority may be processed first depending on the selected device queueing discipline.
SO_RCVBUF - sets or gets the maximum socket receive buffer in bytes. The value set is doubled by the kernel and by Onload to allow for bookkeeping overhead when it is set by the setsockopt() function call. Note that EF_UDP_RCVBUF overrides this value.
SO_RCVLOWAT - sets the minimum number of bytes to process for socket input operations.
SO_RCVTIMEO - sets the timeout for an input function to complete.
SO_REUSEADDR - can reuse local ports, i.e. another socket can bind to the same port number, except when there is an active listening socket bound to the port.
SO_SNDBUF - sets or gets the maximum socket send buffer in bytes. The value set is doubled by the kernel and by Onload to allow for bookkeeping overhead when it is set by the setsockopt() function call. Note that EF_UDP_SNDBUF overrides this value.
SO_SNDLOWAT - sets the minimum number of bytes to process for socket output operations. Always set to 1 byte.
SO_SNDTIMEO - sets the timeout for a sending function to send before reporting an error.
SO_TIMESTAMP - enable or disable receiving the SO_TIMESTAMP control message (microsecond resolution). See SO_TIMESTAMP and SO_TIMESTAMPNS (software timestamps) below.
SO_TIMESTAMPNS - enable or disable receiving the SO_TIMESTAMPNS control message (nanosecond resolution). See SO_TIMESTAMP and SO_TIMESTAMPNS (software timestamps) below.
SO_TIMESTAMPING - enable/disable hardware timestamps for received packets. See SO_TIMESTAMPING (hardware timestamps) on page 50.
SO_TYPE - returns the socket type (SOCK_STREAM or SOCK_DGRAM) (only valid as a getsockopt()).

Source Specific Socket Options

The following table identifies source specific socket options supported from onload-201210-u1 onwards. Refer to the release notes for Onload specific behaviour regarding these options.

IP_ADD_SOURCE_MEMBERSHIP - join the supplied multicast group on the given interface and accept data from the supplied source address.
IP_DROP_SOURCE_MEMBERSHIP - drop membership of the given multicast group, interface and source address.
MCAST_JOIN_SOURCE_GROUP - join a source specific group.
MCAST_LEAVE_SOURCE_GROUP - leave a source specific group.

SO_TIMESTAMP and SO_TIMESTAMPNS (software timestamps)

The setsockopt() function call may be passed the option SO_TIMESTAMP to enable timestamping on the specified socket. Functions such as recvmsg() and recvmmsg() can then extract timestamp data, via the cmsg() macros, for packets received at the socket. Onload implements a software timestamping mechanism providing microsecond resolution. Onload software timestamps avoid the need for a per-packet system call and thereby reduce the normal timestamp overheads. The Solarflare adapter will always deliver received packets to the receive ring buffer in the order that they arrive from the network. Onload appends a software timestamp to the packet metadata when it retrieves a packet from the ring buffer - before the packet is transferred to a waiting socket buffer.
This implies that the timestamps will reflect the order in which the packets arrive from the network, but there may be a delay between the packet being received by the hardware and the software timestamp being applied. Applications preferring timestamps with nanosecond resolution can use SO_TIMESTAMPNS in place of the normal (microsecond resolution) SO_TIMESTAMP value.

SO_TIMESTAMPING (hardware timestamps)

The setsockopt() function call may be passed the option SO_TIMESTAMPING to enable hardware timestamps to be generated by the adapter for each received packet. Functions such as recvmsg() and recvmmsg() can then extract timestamp data, via the cmsg() macros, for packets received at the socket. This support is only available on SFN7000 series adapters. Before receiving hardware timestamps:
• A valid AppFlex license for hardware timestamps must be installed on the Solarflare Flareon Ultra adapter. The PTP/timestamping license is installed on the SFN7322F during manufacture; such a license can be installed on the SFN7122F adapter by the user.
• The Onload stack for the socket must have the environment variable EF_RX_TIMESTAMPING set - see Appendix A: Parameter Reference on page 92 for settings.

Received packets are timestamped when they enter the MAC on the SFN7000 series adapter. Interested users should read the kernel SO_TIMESTAMPING documentation for more details of how to use this socket API - kernel documentation can be found, for example, at: http://lxr.linux.no/linux/Documentation/networking/timestamping.

SO_BINDTODEVICE

In response to the setsockopt() function call with SO_BINDTODEVICE, sockets identifying non-Solarflare interfaces will be handled by the kernel and all sockets identifying Solarflare interfaces will be handled by Onload. All sends from a socket are sent via the bound interface, and all TCP, UDP and multicast packets received via the bound interface are delivered only to the socket bound to the interface.
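The software timestamp mechanism described above uses the standard Linux control-message interface, so the application-side pattern can be sketched with plain kernel sockets; Onload intercepts the same calls. This is a minimal sketch, not Onload-specific code:

```c
/* Receive a UDP datagram together with an SO_TIMESTAMP (struct timeval)
 * control message. The same code runs unchanged over Onload. */
#include <arpa/inet.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

#ifndef SCM_TIMESTAMP
#define SCM_TIMESTAMP SO_TIMESTAMP
#endif

int recv_with_sw_timestamp(struct timeval *ts_out)
{
    int rx = socket(AF_INET, SOCK_DGRAM, 0);
    int tx = socket(AF_INET, SOCK_DGRAM, 0);
    int one = 1;
    /* Ask for a timestamp control message with each received datagram. */
    setsockopt(rx, SOL_SOCKET, SO_TIMESTAMP, &one, sizeof(one));

    struct sockaddr_in a;
    socklen_t alen = sizeof(a);
    memset(&a, 0, sizeof(a));
    a.sin_family = AF_INET;
    a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    bind(rx, (struct sockaddr *)&a, sizeof(a));
    getsockname(rx, (struct sockaddr *)&a, &alen);
    sendto(tx, "x", 1, 0, (struct sockaddr *)&a, sizeof(a));

    char data[16];
    char ctrl[64];
    struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = ctrl, .msg_controllen = sizeof(ctrl) };
    recvmsg(rx, &msg, 0);

    int found = 0;
    for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c))
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_TIMESTAMP) {
            memcpy(ts_out, CMSG_DATA(c), sizeof(*ts_out));
            found = 1;   /* microsecond-resolution timestamp delivered */
        }
    close(rx);
    close(tx);
    return found;
}
```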
UDP Send and Receive Paths

For each UDP socket, Onload creates both an accelerated socket and a kernel socket. There is usually no file descriptor for the kernel socket visible in the user’s file descriptor table. When a UDP process is ready to transmit data, Onload will check a cached ARP table which maps IP addresses to MAC addresses. A cache ’hit’ results in sending via the Onload accelerated socket. A cache ’miss’ results in a syscall to populate the user mode cached ARP table. If no MAC address can be identified via this process, the packet is sent via the kernel stack to provoke ARP resolution. Therefore, it is possible that some UDP traffic will occasionally be sent via the kernel stack.

Figure 4: UDP Send and Receive Paths

Figure 4 illustrates the UDP send and receive paths. Lighter arrows indicate the accelerated ’kernel bypass’ path. Darker arrows identify fragmented UDP packets received by the Solarflare adapter and UDP packets received from a non-Solarflare adapter. UDP packets arriving at the Solarflare adapter are filtered on source and destination address and port number to identify the VNIC to which the packet will be delivered. Fragmented UDP packets are received by the application via the kernel UDP socket. UDP packets received by a non-Solarflare adapter are always received via the kernel UDP socket.

Fragmented UDP

When sending datagrams which exceed the MTU, the Onload stack will send multiple Ethernet packets. On hosts running Onload, fragmented datagrams are always received via the kernel stack.

6.15 User-Level recvmmsg for UDP

The recvmmsg() function is intercepted for UDP sockets which are accelerated by Onload. The Onload user-level recvmmsg() is available on systems that do not have kernel/libc support for this function. recvmmsg() is not supported for TCP sockets.

6.16 User-Level sendmmsg for UDP

The sendmmsg() function is intercepted for UDP sockets which are accelerated by Onload.
The Onload user-level sendmmsg() is available on systems that do not have kernel/libc support for this function. sendmmsg() is not supported for TCP sockets.

6.17 Multiplexed I/O

Linux supports various common methods for handling multiplexed I/O operation:
• poll(), ppoll()
• select(), pselect()
• the epoll set of functions.

The general behaviour of the poll(), ppoll(), select(), pselect(), epoll_wait() and epoll_pwait() functions with Onload is as follows:
• If there are operations ready on any Onload accelerated file descriptor, these functions will return immediately. Refer to the relevant subsections below for specific behaviour details.
• If there are no Onload accelerated file descriptors ready and spinning is not enabled, these functions will enter the kernel and block.
• If the set contains file descriptors that are not accelerated sockets, Onload must make a system call to determine the readiness of these sockets. If there are no file descriptors ready and spinning is enabled, Onload will spin to ensure that accelerated sockets are polled a specified number of times before unaccelerated sockets are examined. This reduces the overhead incurred when Onload has to call into the kernel and reduces latency on accelerated sockets.

The following subsections discuss the use of these I/O functions and the Onload environment variables that can be used to manipulate the behaviour of the I/O operation.

Poll, ppoll

The poll(), ppoll() file descriptor set can consist of both accelerated and non-accelerated file descriptors. The environment variable EF_UL_POLL enables/disables acceleration of the poll(), ppoll() function calls. Onload supports the following options for the EF_UL_POLL variable:

0 - Disable acceleration at user level. Calls to poll(), ppoll() are handled by the kernel. Spinning cannot be enabled.
1 - Enable acceleration at user level. Calls to poll(), ppoll() are processed at user level.
Spinning can be enabled and interrupts are avoided until an application blocks.

Additional environment variables can be employed to control the poll(), ppoll() functions and to give priority to accelerated sockets over non-accelerated sockets and other file descriptors. Refer to EF_POLL_FAST, EF_POLL_FAST_USEC and EF_POLL_SPIN in Appendix A: Parameter Reference on page 93.

Select, pselect

The select(), pselect() file descriptor set can consist of both accelerated and non-accelerated file descriptors. The environment variable EF_UL_SELECT enables/disables acceleration of the select(), pselect() function calls. Onload supports the following options for the EF_UL_SELECT variable:

0 - Disable acceleration at user level. Calls to select(), pselect() are handled by the kernel. Spinning cannot be enabled.
1 - Enable acceleration at user level. Calls to select(), pselect() are processed at user level. Spinning can be enabled and interrupts are avoided until an application blocks.

Additional environment variables can be employed to control the select(), pselect() functions and to give priority to accelerated sockets over non-accelerated sockets and other file descriptors. Refer to EF_SELECT_FAST and EF_SELECT_SPIN in Appendix A: Parameter Reference on page 93.

Epoll

The epoll set of functions, epoll_create(), epoll_ctl(), epoll_wait(), epoll_pwait(), are accelerated in the same way as poll and select. The environment variable EF_UL_EPOLL enables/disables epoll acceleration. Refer to the release change log for enhancements and changes to epoll behaviour. Using Onload, an epoll set can consist of both Onload file descriptors and kernel file descriptors. Onload supports the following options for the EF_UL_EPOLL environment variable:

0 - Accelerated epoll is disabled, and epoll_ctl(), epoll_wait() and epoll_pwait() function calls are processed in the kernel.
Other function calls such as send() and recv() are still accelerated. Interrupt avoidance does not function and spinning cannot be enabled. If a socket is handed over to the kernel stack after it has been added to an epoll set, it will be dropped from the epoll set.

1 - Function calls to epoll_ctl(), epoll_wait(), epoll_pwait() are processed at user level. Delivers best latency except when the number of accelerated file descriptors in the epoll set is very large. This option also gives the best acceleration of epoll_ctl() calls. Spinning can be enabled and interrupts are avoided until an application blocks. CPU overhead and latency increase with the number of file descriptors in the epoll set.

2 - Calls to epoll_ctl(), epoll_wait(), epoll_pwait() are processed in the kernel. Delivers best performance for large numbers of accelerated file descriptors. Spinning can be enabled and interrupts are avoided until an application blocks. CPU overhead and latency are independent of the number of file descriptors in the epoll set.

The relative performance of epoll options 1 and 2 depends on the details of application behaviour as well as on the number of accelerated file descriptors in the epoll set. Behaviour may also differ between earlier and later kernels, and between Linux realtime and non-realtime kernels. Generally the OS will allocate short time slices to a user-level CPU-intensive application, which may result in performance issues (latency spikes); a kernel-level CPU-intensive process is less likely to be de-scheduled, resulting in better performance. Solarflare recommend that users evaluate both options for applications that manage many file descriptors. Additional environment variables can be employed to control the epoll_ctl(), epoll_wait() and epoll_pwait() functions and to give priority to accelerated sockets over non-accelerated sockets and other file descriptors.
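The epoll pattern whose acceleration is discussed above is the standard Linux one. A minimal self-contained sketch, using a pipe in place of a socket as the monitored file descriptor:

```c
/* Minimal epoll usage of the kind Onload accelerates: create a set, add
 * a file descriptor, and wait for readiness. */
#include <sys/epoll.h>
#include <unistd.h>

int epoll_demo(void)
{
    int fds[2];
    pipe(fds);

    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fds[0] };
    epoll_ctl(ep, EPOLL_CTL_ADD, fds[0], &ev);

    write(fds[1], "x", 1);                  /* make the read end ready */

    struct epoll_event out;
    int n = epoll_wait(ep, &out, 1, 1000);  /* one ready descriptor */

    close(fds[0]);
    close(fds[1]);
    close(ep);
    return n == 1 && (out.events & EPOLLIN) && out.data.fd == fds[0];
}
```

Whether Onload services epoll_wait() at user level (option 1) or in the kernel (option 2) is transparent to this code; only the EF_UL_EPOLL environment variable changes.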
Refer to EF_EPOLL_CTL_FAST, EF_EPOLL_SPIN and EF_EPOLL_MT_SAFE in Appendix A: Parameter Reference on page 93. Refer to epoll - Known Issues on page 78.

6.18 Stack Sharing

By default each process using Onload has its own 'stack'; refer to Onload Stacks for a definition. Several processes can be made to share a single stack using the EF_NAME environment variable: processes with the same value for EF_NAME in their environment will share a stack. Stack sharing is necessary to enable multiple processes using Onload to be accelerated when receiving the same multicast stream, or to allow one application to receive a multicast stream generated locally by a second application. Stacks may also be shared by multiple processes in order to preserve and control resources within the system. Stack sharing can be employed by processes handling TCP as well as UDP sockets.

Stack sharing should only be requested if there is a trust relationship between the processes. If two processes share a stack then they are not completely isolated: a bug in one process may impact the other, or one process can gain access to the other's privileged information (i.e. breach security). Once the EF_NAME variable is set, any process on the local host can set the same value and gain access to the stack. By default Onload stacks can only be shared with processes having the same UID. The EF_SHARE_WITH environment variable provides additional security while allowing a different UID to share a stack. Refer to Appendix A: Parameter Reference on page 93 for a description of the EF_NAME and EF_SHARE_WITH variables.

Processes sharing an Onload stack should also not use huge pages. Onload will issue a warning at startup and prevent the allocation of huge pages if EF_SHARE_WITH identifies the UID of another process or is set to -1.
If a process P1 creates an Onload stack but is not using huge pages, and another process P2 attempts to share the Onload stack by setting EF_NAME, the stack options set by P1 will apply and the allocation of huge pages in P2 will be prevented.

An alternative method of implementing stack sharing is to use the Onload Extensions API and the onload_set_stackname() function which, through its scope parameter, can limit stack access to the processes created by a particular user. Refer to Appendix D: Onload Extensions API on page 112 for details.

6.19 Multicast Replication

The Solarflare SFN7000 series adapters support multicast replication, where received packets are replicated in hardware and delivered to multiple receive queues. This feature allows any number of Onload clients listening to the same multicast data stream to receive their own copy of the packets, without an additional software copy and without the need to share Onload stacks. The packets are delivered multiple times by the controller, once to each receive queue that has installed a hardware filter to receive the specified multicast stream. Multicast replication is performed in the adapter transparently and does not need to be explicitly enabled. This feature removes the need to share Onload stacks using the EF_NAME environment variable; users using EF_NAME exclusively for sharing multicast traffic can now remove EF_NAME from their configurations.

6.20 Multicast Operation and Stack Sharing

To illustrate shared stacks, the following examples describe Onload behaviour when two processes, on the same host, subscribe to the same multicast stream:
• Multicast Receive Using Different Onload Stacks...Page 57
• Multicast Transmit Using Different Onload Stacks...Page 57
• Multicast Receive Sharing an Onload Stack...Page 58
• Multicast Transmit Sharing an Onload Stack...Page 58
• Multicast Receive - Onload Stack and Kernel Stack...Page 58.
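The scenarios listed above require no code changes; stack sharing is selected purely through the environment. A hypothetical launch sketch (the stack name and application names are illustrative):

```shell
# Both subscribers join the stack named "mcast-feed", so the same
# multicast stream can be delivered to both without going via the kernel.
EF_NAME=mcast-feed onload ./subscriber_a &
EF_NAME=mcast-feed onload ./subscriber_b &
```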
NOTE: The following subsections use two processes to demonstrate Onload behaviour. In practice multiple processes can share the same Onload stack. Stack sharing is not limited to multicast subscribers and can be employed by any TCP and UDP applications.

Multicast Receive Using Different Onload Stacks

Onload will notice if two Onload stacks on the same host subscribe to the same multicast stream and will respond by redirecting the stream to go through the kernel. Handing the stream to the kernel, though still using Onload stacks, allows both subscribers to receive the datagrams, but user-space acceleration is lost and the receive rate is lower than it could otherwise be. Figure 5 below illustrates the configuration. Arrows indicate the receive path and fragmented UDP path.

Figure 5: Multicast Receive Using Different Onload Stacks

This behaviour occurs because the Solarflare NIC will not deliver a single received multicast packet multiple times to multiple stacks - the packet is delivered only once. If a received packet is delivered to kernel-space, the kernel TCP/IP stack will copy the received data to each socket listening on the corresponding multicast stream. If the received packet were delivered directly to Onload, where the stacks are mapped to user-space, it would only be delivered to a single subscriber of the multicast stream.

Multicast Transmit Using Different Onload Stacks

Referring to Figure 5, if one process were to transmit multicast datagrams, these would not be received by the second process. Onload is only able to accelerate transmitted multicast datagrams when they do not need to be delivered to other applications in the same host - or, more accurately, the multicast stream can only be delivered within the same Onload stack. Onload by default changes the default state of the IP_MULTICAST_LOOP socket option to 0 rather than 1.
This change allows Onload to accelerate multicast transmit for most applications, but means that multicast traffic is not delivered to other applications on the same host unless the subscriber sockets are in the same stack. The normal behaviour can be restored by setting EF_FORCE_SEND_MULTICAST=0, but this limits multicast acceleration on transmit to sockets that have manually set the IP_MULTICAST_LOOP socket option to zero.

Multicast Receive Sharing an Onload Stack

Setting the EF_NAME environment variable to the same string in both processes means they can share an Onload stack. The stream is no longer redirected through the kernel, resulting in a much higher receive rate than can be observed with the kernel TCP/IP stack (or with separate Onload stacks where the data path is via the kernel TCP/IP stack). This configuration is illustrated in Figure 6 below. Lighter arrows indicate the accelerated (kernel bypass) path. Darker arrows indicate the fragmented UDP path.

Figure 6: Sharing an Onload Stack

Multicast Transmit Sharing an Onload Stack

Referring to Figure 6, datagrams transmitted by one process would be received by the second process because both processes share the Onload stack.

Multicast Receive - Onload Stack and Kernel Stack

If a multicast stream is being accelerated by Onload, and another application that is not using Onload subscribes to the same stream, the second application will not receive the associated datagrams. Therefore if multiple applications subscribe to a particular multicast stream, either all or none should be run with Onload. To enable multiple applications accelerated with Onload to subscribe to the same multicast stream, the applications must share the same Onload stack. Stack sharing is achieved by using the EF_NAME environment variable.
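Because Onload and the kernel choose different defaults for IP_MULTICAST_LOOP, as described above, applications that depend on local delivery should set the option explicitly rather than rely on the default. A minimal sketch of setting and verifying it:

```c
/* Explicitly enable IP_MULTICAST_LOOP rather than relying on the
 * default, which differs between the kernel stack (1) and Onload (0). */
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int multicast_loop_enabled(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    unsigned char loop = 1;                 /* request local delivery */
    setsockopt(fd, IPPROTO_IP, IP_MULTICAST_LOOP, &loop, sizeof(loop));

    unsigned char got = 0;
    socklen_t len = sizeof(got);
    getsockopt(fd, IPPROTO_IP, IP_MULTICAST_LOOP, &got, &len);
    close(fd);
    return got;                             /* 1 when loopback is enabled */
}
```

Remember that under Onload, setting the option is necessary but not sufficient: sender and receiver must also be in the same Onload stack, as described in this section.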
Multicast Receive and Multiple Sockets

When multiple sockets join the same multicast group, received packets are delivered to these sockets in the order in which they joined the group. When multiple sockets are created by different threads and all threads are spinning on recv(), the thread which receives first will also deliver the packets to the other sockets. If a thread ’A’ is spinning on poll(), and another thread ’B’, listening to the same group, calls recv() but does not spin, ’A’ will notice a received packet first and deliver the packet to ’B’ without an interrupt occurring.

6.21 Multicast Loopback

The socket option IP_MULTICAST_LOOP controls whether multicast traffic sent on a socket can be received locally on the machine. With Onload, the default value of the IP_MULTICAST_LOOP socket option is 0 (the kernel stack defaults IP_MULTICAST_LOOP to 1); therefore, by default with Onload, multicast traffic sent on a socket will not be received locally. As well as setting IP_MULTICAST_LOOP to 1, receiving multicast traffic locally requires both the sender and receiver to be using the same Onload stack. Therefore, when a receiver is in the same application as the sender it will receive multicast traffic. If the sender and receiver are in different applications then both must be running Onload and must be configured to share the same Onload stack: for two processes to share an Onload stack, both must set the same value for the EF_NAME parameter. If one local process is to receive the data sent by a sending local process, EF_MULTICAST_LOOP_OFF must be disabled (set to zero) on the sender.

6.22 Bonding, Link Aggregation and Failover

Bonding (also known as teaming) allows for improved reliability and increased bandwidth by combining physical ports from one or more Solarflare adapters into a bond. A bond has a single IP address and single MAC address, and functions as a single port or single adapter to provide redundancy.
Onload monitors the OS configuration of the standard kernel bonding module and accelerates traffic over bonds that are detected as suitable (see limitations). As a result no special configuration is required to accelerate traffic over bonded interfaces. For example, to configure an 802.3ad bond of two SFC interfaces (eth2 and eth3):

modprobe bonding miimon=100 mode=4 xmit_hash_policy=layer3+4
ifconfig bond0 up

Interfaces must be down before being added to the bond:

echo +eth2 > /sys/class/net/bond0/bonding/slaves
echo +eth3 > /sys/class/net/bond0/bonding/slaves
ifconfig bond0 192.168.1.1/24

The file /var/log/messages should then contain a line similar to:

[onload] Accelerating bond0 using Onload

Traffic over this interface will then be accelerated by Onload. To disable Onload acceleration of bonds, set CI_CFG_TEAMING=0 in the file transport_config_opt.h at compile time. Refer to the Limitations section, Bonding, Link aggregation on page 75, for further information.

6.23 VLANS

The division of a physical network into multiple broadcast domains, or VLANs, offers improved scalability, security and network management. Onload will accelerate traffic over suitable VLAN interfaces by default, with no additional configuration required. For example, to add an interface for VLAN 5 over an SFC interface (eth2):

onload_loaddrivers
modprobe 8021q
vconfig add eth2 5
ifconfig eth2.5 192.168.1.1/24

Traffic over this interface will then be transparently accelerated by Onload. Refer to the Limitations section, VLANs on page 75, for further information.

6.24 Accelerated pipe()

Onload supports the acceleration of pipes, providing an accelerated IPC mechanism through which two processes on the same host can communicate using shared memory at user level. Accelerated pipes do not invoke system calls and therefore reduce the overheads for read/write operations, offering improved latency over the kernel implementation.
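Accelerated or not, a pipe is created and used through the standard POSIX calls; identical code runs over Onload when the process has an Onload stack. A minimal sketch:

```c
/* Standard anonymous pipe usage; under Onload, with an Onload stack
 * present, the same calls are serviced at user level. */
#include <string.h>
#include <unistd.h>

int pipe_roundtrip(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return 0;

    const char msg[] = "hello";
    write(fds[1], msg, sizeof(msg));             /* write end */

    char buf[16];
    ssize_t n = read(fds[0], buf, sizeof(buf));  /* read end */

    close(fds[0]);
    close(fds[1]);
    return n == (ssize_t)sizeof(msg) && strcmp(buf, "hello") == 0;
}
```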
To create a user-level pipe, and before the pipe() or pipe2() function is called, a process must be accelerated by Onload and must have created an Onload stack. By default, an accelerated process that has not created an Onload stack is granted only a non-accelerated pipe; see EF_PIPE for other options. The accelerated pipe is created from the pool of available 2 Kbyte socket buffers and is expanded as required up to a maximum size of 64 Kbytes. The following function calls, related to pipes, will be accelerated by Onload and will not enter the kernel unless they block:

pipe(), read(), write(), readv(), writev(), send(), recv(), recvmsg(), sendmsg(), poll(), select(), epoll_ctl(), epoll_wait()

As with TCP/UDP sockets, Onload tuning options such as EF_POLL_USEC and EF_SPIN_USEC will also influence the performance of the user-level pipe. Refer also to EF_PIPE, EF_PIPE_RECV_SPIN and EF_PIPE_SEND_SPIN in Appendix A: Parameter Reference on page 93.

NOTE: Only anonymous pipes created with the pipe() or pipe2() function calls will be accelerated.

6.25 Zero-Copy API

The Onload Extensions API includes support for zero-copy of TCP transmit packets and UDP receive packets. Refer to Zero-Copy API on page 123 for detailed descriptions and example source code of the API.

6.26 Receive Filtering

The Onload Extensions API supports a Receive Filtering API which allows user-defined filters to determine whether data received on a UDP socket should be discarded before it enters the socket receive buffer. Refer to the Receive Filtering API Overview on page 134 for a detailed description and example source code of the API.

6.27 Packet Buffers

Introduction

Packet buffers describe the memory used by the Onload stack (and Solarflare adapter) to receive, transmit and queue network data. Packet buffers provide a method for user-mode accessible memory to be directly accessed by the network adapter without compromising system integrity.
Onload will request huge pages, if these are available, when allocating memory for packet buffers. Using huge pages can lead to improved performance for some applications by reducing the number of Translation Lookaside Buffer (TLB) entries needed to describe packet buffers, thereby minimizing TLB ’thrashing’.

NOTE: Onload huge page support should not be enabled if the application uses IPC namespaces and the CLONE_NEWIPC flag.

Onload offers two configuration modes for network packet buffers:

Network Adapter Buffer Table Mode

Solarflare network adapters employ a proprietary hardware-based buffer address translation mechanism to provide memory protection and translation to Onload stacks accessing a VNIC on the adapter. This is the default packet buffer mode and is suitable for the majority of applications using Onload. This scheme employs a buffer table residing on the network adapter to control the memory an Onload stack can use to send and receive packets. While the adapter’s buffer table is sufficient for the majority of applications, on adapters prior to the SFN7000 series it is limited to approximately 120,000 x 2 Kbyte buffers, which have to be shared between all Onload stacks. If the total packet buffer requirements of all applications using Onload exceed the number of packet buffers supported by the adapter’s buffer table, the user should consider changing to the Scalable Packet Buffers configuration.

Large Buffer Table Support

The Solarflare SFN7000 series adapters alleviate the packet buffer limitations of previous generation Solarflare adapters and support many more than 120,000 packet buffers without the need to switch to Scalable Packet Buffer Mode. Each buffer table entry in the SFN7000 series adapter can describe a 4 Kbyte, 64 Kbyte, 1 Mbyte or 4 Mbyte block of memory, where each table entry is the page size as directed by the operating system.
Scalable Packet Buffer Mode

Scalable Packet Buffer Mode is an alternative packet buffer mode which allows a much higher number of packet buffers to be used by Onload. In Scalable Packet Buffer Mode, Onload stacks employ Single Root I/O Virtualization (SR-IOV) virtual functions (VFs) to provide memory protection and translation. This mechanism removes the 120K buffer limitation imposed by the Network Adapter Buffer Table Mode. For deployments where using SR-IOV and/or the IOMMU is not an option, Onload also supports an alternative Scalable Packet Buffer Mode scheme called Physical Addressing Mode. Physical addressing also removes the 120K packet buffer limitation; however, it does not provide the memory protection provided by SR-IOV and an IOMMU. For details see Physical Addressing Mode on page 70.

NOTE: Enabling SR-IOV, which is needed for Scalable Packet Buffer Mode, has a latency impact which depends on the adapter model. For the SFN5000 adapter series, the 1/2 RTT latency increases by approximately 50ns. The SFN6000 adapter series has equivalent latency to the SFN5000 adapter series when operating in this mode.

NOTE: SR-IOV, and therefore Scalable Packet Buffer Mode, is not supported on the Solarflare SFN7122F network adapter at this time, but will be made available in a future revision.

NOTE: SR-IOV, and therefore Scalable Packet Buffer Mode, is not supported on the Solarflare SFN4112F network adapter.

NOTE: MRG users should refer to Red Hat MRG 2 and SR-IOV on page 83.

For further details on SR-IOV configuration refer to Configuring Scalable Packet Buffers on page 66.

How Packet Buffers Are Used by Onload

Each packet buffer is allocated to exactly one Onload stack and is used to receive, transmit or queue network data. Packet buffers are used by Onload in the following ways:

1 Receive descriptor rings.
By default the RX descriptor ring will hold 512 packet buffers at all times. This value is configurable using the EF_RXQ_SIZE (per stack) variable. 2 Transmit descriptor rings. By default the TX descriptor ring will hold up to 512 packet buffers. This value is configurable using the EF_TXQ_SIZE (per stack) variable. 3 To queue data held in receive and transmit socket buffers. 4 TCP sockets can also hold packet buffers in the socket’s retransmit queue and in the reorder queue. NOTE: User-level pipes do not consume packet buffer resources. Identifying Packet Buffer Requirements When deciding the number of packet buffers required by an Onload stack consideration should be given to the resource needs of the stack to ensure that the available packet buffers can be shared efficiently between all Onload stacks. Example 1: If we consider a hypothetical case of a single host: - which employs multiple Onload stacks e.g 10 - each stack has multiple sockets e.g 6 - and each socket uses many packet buffers e.g 2000 This would require a total of 120000 packet buffers Example 2: If on a stack the TCP receive queue is 1 Mbyte and the MSS value is 1472 bytes, this would require at least 700 packet buffers - (and a greater number if segments smaller that the MSS were received). Example 3: A UDP receive queue of 200 Kbytes where received datagrams are each 200 bytes would hold 1000 packet buffers. The examples above use only approximate calculated values. The onload_stackdump command provides accurate measurements of packet buffer allocation and usage. Consideration should be given to packet buffer allocation to ensure that each stack is allocated the buffers it will require rather than a ’one size fits all’ approach. When using the Buffer Table Mode the system is limited to 120K packet buffers - these are allocated symmetrically across all Solarflare interfaces. 
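The figure in Example 2 above can be reproduced with simple shell arithmetic. This is only a sketch of the rounding involved (one packet buffer per MSS-sized segment), not an exact model of Onload's internal accounting:

```shell
# Estimate the packet buffers needed to hold a full TCP receive queue,
# assuming one packet buffer per MSS-sized segment (Example 2 above).
rcv_queue=$(( 1024 * 1024 ))              # 1 Mbyte receive queue
mss=1472                                  # MSS in bytes
bufs=$(( (rcv_queue + mss - 1) / mss ))   # round up
echo "at least ${bufs} packet buffers"
```

Smaller-than-MSS segments each still consume one buffer, which is why the real requirement can be higher than this estimate.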
NOTE: Packet buffers are accessible to all network interfaces and each packet buffer requires an entry in every network adapter’s buffer table. Adding more network adapters - and therefore more interfaces - does not increase the number of packet buffers available.

For large scale applications the Scalable Packet Buffer Mode removes the limitations imposed by the network adapter buffer table. See Configuring Scalable Packet Buffers on page 66 for details.

Running Out of Packet Buffers

When Onload detects that a stack is close to allocating all available packet buffers it will take action to try to avoid packet buffer exhaustion. Onload will automatically start dropping packets on receive and, where possible, will reduce the receive descriptor ring fill level in an attempt to alleviate the situation. A ’memory pressure’ condition can be identified using the onload_stackdump lots command, where the pkt_bufs field will display the CRITICAL indicator. See Identifying Memory Pressure below.

Complete packet buffer exhaustion can result in deadlock. In an Onload stack, if all available packet buffers are allocated (for example, currently queued in socket buffers) the stack is prevented from transmitting further data as there are no packet buffers available for the task. If all available packet buffers are allocated then Onload will also fail to keep its adapter’s receive queues replenished. If the queues fall empty, further data received by the adapter is dropped. On a TCP connection packet buffers are used to hold unacknowledged data in the retransmit queue, and dropping received packets containing ACKs delays the freeing of these packet buffers back to Onload.

Setting the value of EF_MIN_FREE_PACKETS=0 can result in a stack having no free packet buffers and this, in turn, can prevent the stack from shutting down cleanly.
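One practical way to reduce the risk of packet buffer exhaustion is to raise the per-stack limit before starting the application. The application name below is a stand-in, and the 75% derivation mirrors the default behaviour described in this guide:

```shell
# Give the stack a larger packet buffer budget. Unless set explicitly,
# Onload derives EF_MAX_RX_PACKETS / EF_MAX_TX_PACKETS as 75% of this.
export EF_MAX_PACKETS=65536
derived=$(( EF_MAX_PACKETS * 3 / 4 ))
echo "RX and TX limits default to ${derived}"
# onload ./my_app    # hypothetical application launch
```

Note the trade-off described below: larger packet buffer queues increase cache footprint, which can reduce throughput and increase latency.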
Identifying Memory Pressure

The following extracts from the onload_stackdump command identify an Onload stack under memory pressure.

The EF_MAX_PACKETS value identifies the maximum number of packet buffers that can be used by the stack. EF_MAX_RX_PACKETS is the maximum number of packet buffers that can be used to hold packets received. EF_MAX_TX_PACKETS is the maximum number of packet buffers that can be used to hold packets to send. These two values are always less than EF_MAX_PACKETS to ensure that neither the transmit nor the receive path can starve the other of packet buffers. Refer to Appendix A: Parameter Reference on page 93 for detailed descriptions of these per stack variables.

The example Onload stack has the following default environment variable values:

EF_MAX_PACKETS: 32768
EF_MAX_RX_PACKETS: 24576
EF_MAX_TX_PACKETS: 24576

The onload_stackdump lots command identifies packet buffer allocation and the onset of a memory pressure state:

pkt_bufs: size=2048 max=32768 alloc=24576 free=32 async=0 CRITICAL
pkt_bufs: rx=24544 rx_ring=9 rx_queued=24535

There are potentially 32768 packet buffers available and the stack has allocated (used) 24576 packet buffers. In the socket receive buffers there are 24544 packet buffers waiting to be processed by the application - this is approaching the EF_MAX_RX_PACKETS limit and is the reason the CRITICAL flag is present, i.e. the Onload stack is under memory pressure. Only 9 packet buffers are available to the receive descriptor ring.

Onload will aim to keep the RX descriptor ring full at all times. If there are not enough available packet buffers to refill the RX descriptor ring this is indicated by the LOW memory pressure flag.

The onload_stackdump lots command will also identify the number of memory pressure events and the number of packets dropped as a result of memory pressure.
memory_pressure: 1
memory_pressure_drops: 22096

Controlling Onload Packet Buffer Use

A number of environment variables control the packet buffer allocation on a per stack basis. Refer to Appendix A: Parameter Reference on page 93 for a description of EF_MAX_PACKETS. Unless explicitly configured by the user, EF_MAX_RX_PACKETS and EF_MAX_TX_PACKETS will be automatically set to 75% of the EF_MAX_PACKETS value. This ensures that sufficient buffers are available to both receive and transmit. EF_MAX_RX_PACKETS and EF_MAX_TX_PACKETS are not typically configured by the user.

If an application requires more packet buffers than the maximum configured, then EF_MAX_PACKETS may be increased to meet demand; however, it should be recognized that larger packet buffer queues increase cache footprint, which can lead to reduced throughput and increased latency. EF_MAX_PACKETS is the maximum number of packet buffers that can be used by the stack. Setting EF_MAX_RX_PACKETS to a value greater than EF_MAX_PACKETS effectively means that all packet buffers (EF_MAX_PACKETS) allocated to the stack will be used for RX - with nothing left for TX. The safest method is to only increase EF_MAX_PACKETS, which keeps the RX and TX packet buffer values at 75% of this value.

Configuring Scalable Packet Buffers

NOTE: SR-IOV, and therefore Scalable Packet Buffer Mode, is not currently supported on the SFN7000 series adapter but will be available in a future release.

In Scalable Packet Buffer Mode, Onload stacks are bound to virtual functions (VFs), which provide a PCI SR-IOV compliant means of memory protection and translation. VFs employ the kernel IOMMU. Refer to Chapter 7 and Scalable Packet Buffer Mode on page 83 for 32-bit kernel limitations.

Procedure:
Step 1. Platform Support on page 66
Step 2. BIOS and Linux Kernel Configuration on page 67
Step 3. Update adapter firmware and enable SR-IOV on page 67
Step 4. Enable VFs for Onload on page 68
Step 5. Check PCIe VF Configuration on page 68
Step 6. Check VFs in onload_stackdump on page 69

Step 1. Platform Support

Scalable Packet Buffer Mode is implemented using SR-IOV, support for which is a relatively recent addition to the Linux kernel. There were several kernel bugs in early incarnations of SR-IOV support, up to and including kernel.org 2.6.34. The fixes have been back-ported to recent Red Hat kernels. Users are advised to enable Scalable Packet Buffer Mode on Red Hat kernel 2.6.32-131.0.15 or later, or kernel.org 2.6.35 or later. On other distributions, it is recommended that the most recent patched kernel version is used.

• The system hardware must have an IOMMU and this must be enabled in the BIOS.
• The kernel must be compiled with support for the IOMMU, and kernel command line options are required to select the IOMMU mode.
• The kernel must be compiled with support for the SR-IOV APIs (CONFIG_PCI_IOV).
• SR-IOV must be enabled on the network adapter using the sfboot utility.
• When more than 6 VFs are needed, the system hardware and kernel must support PCIe Alternative Routing-ID Interpretation (ARI) - a PCIe Gen 2 feature.
• The Onload option EF_PACKET_BUFFER_MODE=1 must be set in the environment.

NOTE: The Scalable Packet Buffer feature can be susceptible to known kernel issues observed on RHEL6 and SLES 11 (see http://www.spinics.net/lists/linux-pci/msg10480.html for details). The condition can result in an unresponsive server if intel_iommu has been enabled in the grub.conf file, as per the procedure at Step 2. BIOS and Linux Kernel Configuration on page 67, and if the Solarflare sfc_resource driver is reloaded. This issue has been addressed in newer kernels.

Step 2. BIOS and Linux Kernel Configuration

To use SR-IOV, hardware virtualization must be enabled. Refer to RedHat Enabling Intel VT-x and AMD-V Virtualization in BIOS for more information.
Take care to enable VT-d as well as VT on an Intel platform. To verify that the extensions have been correctly enabled refer to RedHat Verifying virtualization extensions.

For best performance, and to avoid kernel bugs exhibited when the IOMMU is enabled for all devices, Solarflare recommend the kernel is configured to use the IOMMU in pass-through mode. Append the following options to the kernel line in the /boot/grub/grub.conf file:

On an Intel system: intel_iommu=on iommu=on,pt
On an AMD system: amd_iommu=on iommu=on,pt

In pass-through mode the IOMMU is bypassed for regular devices. Refer to Red Hat: PCI passthrough for more information.

NOTE: On Linux Red Hat 5 servers (2.6.18) it is necessary to also use the iommu_type=2 option.

NOTE: EnterpriseOnload v2.1.0.0 users and OpenOnload v201109-u2 (onwards) users: Recent kernels are compiled with support for IOMMUs by default, but unfortunately the realtime (-rt) kernel patches are not currently compatible with IOMMUs (Red Hat MRG kernels are compiled with CONFIG_PCI_IOV disabled). It is possible to use Scalable Packet Buffer Mode on some systems without IOMMU support, but in an insecure mode. In this configuration the IOMMU is bypassed, and there is no checking of the DMA addresses provided by Onload in user-space. Bugs or misbehaviour of user-space code can compromise the system. To enable this insecure mode, set the option unsafe_sriov_without_iommu=1 for the sfc_resource kernel module. Linux MRG users are urged to use MRGu2 and kernel 3.2.33-rt50.66.el6rt.x86_64 or later to avoid known issues and limitations of earlier versions.

The unsafe_sriov_without_iommu option is obsoleted in OpenOnload 201210. It is replaced by physical addressing mode - see Physical Addressing Mode on page 70 for details.

Step 3.
Update adapter firmware and enable SR-IOV

1. Download the Solarflare Linux Utilities package from support.solarflare.com and unzip the utilities file to reveal the RPM.
2. Install the RPM:
   # rpm -Uvh sfutils-<version>.rpm
3. Identify the current firmware version on the adapter:
   # sfupdate
4. Upgrade the adapter firmware with sfupdate:
   # sfupdate --write
   Full instructions on using sfupdate can be found in the Solarflare Network Server Adapter User Guide.
5. Use sfboot to enable SR-IOV and enable the VFs. You can enable up to 127 VFs per port, but the host BIOS may only be able to support a smaller number. The following example will configure 16 VFs on each Solarflare port:
   # sfboot sriov=enabled vf-count=16 vf-msix-limit=1

   Option          Default   Description
   sriov=          Disabled  Enable/disable hardware SR-IOV support
   vf-count=       127       Number of virtual functions advertised per port. See the note below.
   vf-msix-limit=  1         Number of MSI-X interrupts per VF

6. Reboot the server. It is necessary to reboot following changes made using sfboot and sfupdate.

NOTE: Enabling all 127 VFs per port with more than one MSI-X interrupt per VF may not be supported by the host BIOS. If the BIOS doesn't support this then you may get 127 VFs on one port and no VFs on the other port. You should contact your BIOS vendor for an upgrade or reduce the VF count.

NOTE: On Red Hat 5 servers the vf-count should not exceed 32.

NOTE: VF allocation must be symmetric across all Solarflare interfaces.

Step 4. Enable VFs for Onload

# export EF_PACKET_BUFFER_MODE=1

Refer to Appendix A: Parameter Reference on page 93 for other values.

Step 5.
Check PCIe VF Configuration

The network adapter sfc driver will initialize the VFs, which can be displayed by the lspci command:

# lspci -d 1924:
05:00.0 Ethernet controller: Solarflare Communications SFC9020 [Solarflare]
05:00.1 Ethernet controller: Solarflare Communications SFC9020 [Solarflare]
05:00.2 Ethernet controller: Solarflare Communications SFC9020 Virtual Function [Solarflare]
05:00.3 Ethernet controller: Solarflare Communications SFC9020 Virtual Function [Solarflare]
05:00.4 Ethernet controller: Solarflare Communications SFC9020 Virtual Function [Solarflare]
05:00.5 Ethernet controller: Solarflare Communications SFC9020 Virtual Function [Solarflare]
05:00.6 Ethernet controller: Solarflare Communications SFC9020 Virtual Function [Solarflare]
05:00.7 Ethernet controller: Solarflare Communications SFC9020 Virtual Function [Solarflare]
05:01.0 Ethernet controller: Solarflare Communications SFC9020 Virtual Function [Solarflare]
05:01.1 Ethernet controller: Solarflare Communications SFC9020 Virtual Function [Solarflare]

The lspci example output above identifies one physical function per physical port and the virtual functions (four for each port) of a single Solarflare dual-port network adapter.

Step 6.
Check VFs in onload_stackdump

The onload_stackdump netif command will identify the VFs being used by Onload stacks, as in the following example:

# onload_stackdump netif
ci_netif_dump: stack=0 name= ver=201109 uid=0 pid=3354
  lock=10000000 UNLOCKED nics=3 primed=3
  sock_bufs: max=1024 n_allocated=4
  pkt_bufs: size=2048 max=32768 alloc=1152 free=128 async=0
  pkt_bufs: rx=1024 rx_ring=1024 rx_queued=0
  pkt_bufs: tx=0 tx_ring=0 tx_oflow=0 tx_other=0
  time: netif=3df7d2 poll=3df7d2 now=3df7d2 (diff=0.000sec)
ci_netif_dump_vi: stack=0 intf=0 vi=67 dev=0000:05:01.0 hw=0C0
  evq: cap=2048 current=8 is_32_evs=0 is_ev=0
  rxq: cap=511 lim=511 spc=15 level=496 total_desc=0
  txq: cap=511 lim=511 spc=511 level=0 pkts=0 oflow_pkts=0
  txq: tot_pkts=0 bytes=0
ci_netif_dump_vi: stack=0 intf=1 vi=67 dev=0000:05:01.1 hw=0C0
  evq: cap=2048 current=8 is_32_evs=0 is_ev=0
  rxq: cap=511 lim=511 spc=15 level=496 total_desc=0
  txq: cap=511 lim=511 spc=511 level=0 pkts=0 oflow_pkts=0
  txq: tot_pkts=0 bytes=0

The output above corresponds to the VFs advertised on the Solarflare network adapter interfaces identified using the lspci command - refer to Step 5 above.

Physical Addressing Mode

Physical addressing mode is a Scalable Packet Buffer Mode that also allows Onload stacks to use large amounts of packet buffer memory (avoiding the limitations of the address translation table on the adapter), but without the requirement to configure and use SR-IOV virtual functions. Physical addressing mode does, however, remove memory protection from the network adapter’s access to packet buffers. Unprivileged user-level code is given, and directly handles, the raw physical memory addresses of packet buffers. User-level code provides physical memory addresses directly to the adapter and therefore has the ability to direct the adapter to read or write arbitrary memory locations.
A result of this is that a malicious or buggy application can compromise system integrity and security.

OpenOnload versions earlier than onload-201210 and EnterpriseOnload-2.1.0.0 are limited to 1 million packet buffers. This limit was raised to 2 million packet buffers in 201210-u1 and EnterpriseOnload-2.1.0.1.

To enable physical addressing mode:

1. Ignore configuration steps 1-4 above.
2. Put the following option into a user-created .conf file in the /etc/modprobe.d directory:
   options onload phys_mode_gid=<gid>
   where setting <gid> to -1 allows all users to use physical addressing mode, and setting <gid> to an integer x restricts use of physical addressing mode to the specific user group x.
3. Reload the Onload drivers:
   onload_tool reload
4. Enable the Onload environment using EF_PACKET_BUFFER_MODE=2 or 3. EF_PACKET_BUFFER_MODE=2 is equivalent to mode 0, but uses physical addresses. Mode 3 uses SR-IOV VFs with physical addresses, but does not use the IOMMU for memory translation and protection.

Refer to Appendix A: Parameter Reference on page 93 for a complete description of all EF_PACKET_BUFFER_MODE options.

6.28 Programmed I/O

PIO (programmed input/output) describes the process whereby data is directly transferred by the CPU to or from an I/O device. It is an alternative to bus master DMA techniques, where data is transferred without CPU involvement. Solarflare 7000 series adapters support TX PIO, where packets on the transmit path can be “pushed” to the adapter directly by the CPU. This improves the latency of transmitted packets but can cause a very small increase in CPU utilisation. TX PIO is therefore especially useful for smaller packets.

The Onload TX PIO feature is enabled by default but can be disabled via the environment variable EF_PIO. An additional environment variable, EF_PIO_THRESHOLD, specifies the largest packet size that can use TX PIO.
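A hedged sketch of reserving TX PIO for a latency-critical process; the threshold value and application names below are illustrative assumptions, not defaults from this guide:

```shell
# Critical process: leave TX PIO enabled (the default) and cap the
# packet size eligible for PIO (512 is an example value, not a default).
export EF_PIO=1
export EF_PIO_THRESHOLD=512
echo "EF_PIO=${EF_PIO} EF_PIO_THRESHOLD=${EF_PIO_THRESHOLD}"
# onload ./feed_handler          # hypothetical critical application
# Non-critical processes should leave PIO buffers free:
#   EF_PIO=0 onload ./bulk_app   # hypothetical
```

Because adapter PIO resources are limited, disabling PIO for non-critical processes keeps the buffers available where the latency benefit matters.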
PIO buffers on the adapter are limited, supporting a maximum of 8 Onload stacks. For optimum performance, PIO buffers should be reserved for critical processes, and other processes should set EF_PIO to 0 (zero). The Onload stackdump utility provides additional counters to indicate the level of PIO use - see TX PIO Counters on page 138 for details.

The Solarflare net driver will also use PIO buffers for non-accelerated sockets, and this will reduce the number of PIO buffers available to Onload stacks. To prevent this, set the driver module option piobuf_size=0. When both accelerated and non-accelerated sockets are using PIO, the number of PIO buffers available to Onload stacks can be calculated from the total of 16 available PIO regions:

  description                              example value
  piobuf_size (driver module parameter)    256
  rss_cpus (driver module parameter)       4
  region (a chunk of memory, 2048 bytes)   2048 bytes

Using the above example values, each port on the adapter requires:

  piobuf_size * rss_cpus / region size = 0.5 regions

Rounding up, each port needs 1 region. This leaves 16 - 2 = 14 regions for Onload stacks, which also require one region per port, per stack. Therefore from our example we can have 7 Onload stacks using PIO buffers.

6.29 Templated Sends

“Templated sends” is another SFN7000 series adapter feature that builds on top of TX PIO to provide further transmit latency improvements. It can be used in applications that know the majority of the content of a packet in advance of when the packet is to be sent. For example, a market feed handler may publish packets that vary only in the values of certain fields, possibly different symbols and price information, but are otherwise identical. Templated sends involve creating a template of a packet on the adapter, containing the bulk of the data, prior to the time of sending the packet.
Then, when the packet is to be sent, the remaining data is pushed to the adapter to complete and send the packet. The Onload templated sends feature uses the Onload Extensions API to generate the packet template, which is then instantiated on the adapter ready to receive the “missing” data before each transmission. The API details are available in the Onload 201310 distribution at /src/include/onload/extensions_zc.h. Refer to Appendix D: Onload Extensions API for further information on the use of packet templates, including code examples of using this feature.

6.30 Debug and Logging

Onload supports various debug and logging options. Logging and debug information will be displayed on an attached console or will be sent to the syslog. To force all debug to the syslog set the Onload environment variable EF_LOG_VIA_IOCTL=1.

Log Levels:
• EF_UNIX_LOG - refer to Appendix A: Parameter Reference on page 93 for details.
• TP_LOG (bitmask) - useful for stack debugging. See Onload source code /src/include/ci/internal/ip_log.h for bit values.
• Onload module options:
  - oo_debug_bits=[bitmask] - useful for kernel logging and events not involving an Onload stack. See src/include/onload/debug.h for bit values.
  - ci_tp_log=[bitmask] - useful for kernel logging and events involving an Onload stack. See Onload source code /src/include/ci/internal/ip_log.h for details.

Chapter 7: Limitations

Users are advised to read the latest release notes distributed with the Onload release for a comprehensive list of known issues.

7.1 Introduction

This chapter outlines configurations that Onload does not accelerate and ways in which Onload may change the behaviour of the system and applications. It is a key goal of Onload to be fully compatible with the behaviour of the regular kernel stack, but there are some cases where behaviour deviates.
7.2 Changes to Behavior

Multithreaded Applications Termination

As Onload handles networking in the context of the calling application's thread, it is recommended that applications ensure all threads exit cleanly when the process terminates. In particular, the exit() function causes all threads to exit immediately - even those in critical sections. This can cause threads currently within the Onload stack holding the per stack lock to terminate without releasing this shared lock. This is particularly important for shared stacks, where a process sharing the stack could ’hang’ when Onload locks are not released.

An unclean exit can prevent the Onload kernel components from cleanly closing the application's TCP connections. A message similar to the following will be observed:

[onload] Stack [0] released with lock stuck

and any pending TCP connections will be reset. To prevent this, applications should always ensure that all threads exit cleanly.

Packet Capture

Packets delivered to an application via the accelerated path are not visible to the OS kernel. As a result, diagnostic tools such as tcpdump and wireshark do not capture accelerated packets. The Solarflare supplied onload_tcpdump does support capture of UDP and TCP packets from Onload stacks - refer to Appendix G: onload_tcpdump on page 164 for details.

Firewalls

Packets delivered to an application via the accelerated path are not visible to the OS kernel. As a result, these packets are not visible to the kernel firewall (iptables) and therefore firewall rules will not be applied to accelerated traffic. The onload_iptables feature can be used to enforce Linux iptables rules as hardware filters on the Solarflare adapter; refer to Appendix I: onload_iptables on page 174.

NOTE: Hardware filtering on the network adapter will ensure that accelerated applications receive traffic only on ports to which they are bound.
System Tools

With the exception of ’listening’ sockets, TCP sockets accelerated by Onload are not visible to the netstat tool. UDP sockets are visible to netstat. Accelerated sockets appear in the /proc directory as symbolic links to /dev/onload. Tools that rely on /proc will probably not identify the associated file descriptors as being sockets. Refer to Onload and File Descriptors, Stacks and Sockets on page 36 for more details.

Accelerated sockets can be inspected in detail with the Onload onload_stackdump tool, which exposes considerably more information than the regular system tools. For details of onload_stackdump refer to Appendix E: onload_stackdump on page 137.

Signals

If an application receives a SIGSTOP signal, it is possible for the processing of network events to be stalled in an Onload stack used by the application. This happens if the application is holding a lock inside the stack when the application is stopped; if the application remains stopped for a long time, this may cause TCP connections to time out.

A signal which terminates an application can prevent threads from exiting cleanly. Refer to Multithreaded Applications Termination on page 73 for more information.

If a signal handler uses the third argument (ucontext) and the signal is postponed by Onload, the ucontext content may be undefined. To avoid this, use the Onload module option safe_signals_and_exit=0, or use EF_SIGNALS_NOPOSTPONE to prevent specific signals being postponed by Onload.

7.3 Limits to Acceleration

IP Fragmentation

Fragmented IP traffic is not accelerated by Onload on the receive side, and is instead received transparently via the kernel stack. IP fragmentation is rarely seen with TCP, because TCP/IP stacks segment messages into MTU-sized IP datagrams. With UDP, datagrams are fragmented by IP if they are too large for the configured MTU. Refer to Fragmented UDP on page 52 for a description of Onload behaviour.
Broadcast Traffic

Broadcast sends and receives function as normal but will not be accelerated. Multicast traffic can be accelerated.

IPv6 Traffic

IPv6 traffic functions as normal but will not be accelerated.

Raw Sockets

Raw socket sends and receives function as normal but will not be accelerated.

Statically Linked Applications

Onload will not accelerate statically linked applications. This is due to the method by which Onload intercepts libc function calls (using LD_PRELOAD).

Local Port Address

Onload is limited to OOF_LOCAL_ADDR_MAX local interface addresses. A local address can identify a physical port or a VLAN, and multiple addresses can be assigned to a single interface, where each address contributes to the maximum value. Users can allocate additional local interface addresses by increasing the compile time constant OOF_LOCAL_ADDR_MAX in the /src/lib/efthrm/oof_impl.h file and rebuilding Onload. In onload-201205 OOF_LOCAL_ADDR_MAX was replaced by the Onload module option max_layer2_interfaces.

Bonding, Link Aggregation

• Onload will only accelerate traffic over 802.3ad and active-backup bonds.
• Onload will not accelerate traffic if a bond contains any slave interfaces that are not Solarflare network devices. Adding a non-Solarflare network device to a bond that is currently accelerated by Onload may produce unexpected results, such as connections being reset.
• Acceleration of bonded interfaces in Onload requires a kernel configured with sysfs support and a bonding module version of 3.0.0 or later.

In cases where Onload will not accelerate the traffic it will continue to work via the OS network stack. For more information and details of configuration options refer to the Solarflare Server Adapter User Guide section ’Setting Up Teams’.
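Because acceleration depends on LD_PRELOAD interception, as noted under Statically Linked Applications above, a quick pre-flight check of whether a binary is dynamically linked can save debugging time. This is only a sketch; the path is a stand-in and ldd behaviour varies between distributions:

```shell
# Sketch: check a binary is dynamically linked before expecting Onload
# acceleration (statically linked binaries cannot be LD_PRELOAD-ed).
app=/bin/sh    # substitute the path to your application
if ldd "$app" >/dev/null 2>&1; then
    echo "$app is dynamically linked - LD_PRELOAD interception possible"
else
    echo "$app is statically linked - Onload cannot accelerate it"
fi
```

A statically linked binary would need to be relinked dynamically against libc before Onload can intercept its socket calls.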
VLANs

• Onload will only accelerate traffic over VLANs where the master device is either a Solarflare network device, or a bonded interface that is accelerated, i.e. if the VLAN's master is accelerated, then so is the VLAN interface itself.
• Nested VLAN tags are not accelerated, but will function as normal.
• The ifconfig command will return inconsistent statistics on VLAN interfaces (but not on the master interface).
• When a Solarflare interface is part of a bond (team) and also on a VLAN, network traffic will not be accelerated on this interface or any other interface on the same adapter and same VLAN.
• Hardware filters installed by Onload on the Solarflare adapter will only consider the IP address and port, not the VLAN identifier. Therefore, if the same IP address:port combination exists on different VLAN interfaces, only the first interface to install the filter will receive the traffic.

In cases where Onload will not accelerate the traffic it will continue to work via the OS network stack. For more information and details of configuration options refer to the Solarflare Server Adapter User Guide section ’Setting Up VLANs’.

NOTE: The onload_tool reload command will unload then reload the adapter driver, removing all physical devices and associated VLAN devices.

TCP RTO During Overload Conditions

Under very high load conditions an increased frequency of TCP retransmission timeouts (RTOs) might be observed. This can occur when a thread servicing the stack is descheduled by the CPU whilst still holding the stack lock, preventing another thread from accessing/polling the stack. A stack not being serviced means that ACKs are not received in a timely manner for packets sent, and this results in RTOs for the unacknowledged packets. Enabling the per stack environment variable EF_INT_DRIVEN can reduce the likelihood of this behaviour by ensuring the stack is serviced promptly.
TCP with Jumbo Frames

When using jumbo frames with TCP, Onload will limit the MSS to 2048 bytes to ensure that segments do not exceed the size of internal packet buffers. This should present no problems unless the remote end of a connection is unable to negotiate this lower MSS value.

Transmission Path - Packet Loss

Occasionally Onload needs to send a packet, which would normally be accelerated, via the kernel. This occurs when there is no destination address entry in the ARP table, or to prevent an ARP table entry from becoming stale. By default the Linux sysctl unres_qlen will enqueue 3 packets per unresolved address while waiting for an ARP reply, and on a server subject to a very high UDP or TCP traffic load this can result in packet loss on the transmit path and packets being discarded. The unres_qlen value can be identified using the following command:

sysctl -a | grep unres_qlen
net.ipv4.neigh.eth2.unres_qlen = 3
net.ipv4.neigh.eth0.unres_qlen = 3
net.ipv4.neigh.lo.unres_qlen = 3
net.ipv4.neigh.default.unres_qlen = 3

Changes to the queue lengths can be made permanent in the /etc/sysctl.conf file. Solarflare recommend setting the unres_qlen value to at least 50. If packet discards are suspected, this extremely rare condition can be indicated by the cp_defer counter produced by the onload_stackdump lots command on UDP sockets, or by the unresolved_discards counter in the Linux /proc/net/stat/arp_cache file.

7.4 epoll - Known Issues

Onload supports different implementations of epoll, controlled by the EF_UL_EPOLL environment variable - see Multiplexed I/O on page 52 for configuration details.

• When using EF_UL_EPOLL=1, it has been identified that the behavior of epoll_wait() differs from the kernel when the EPOLLONESHOT event is requested, resulting in two ’wakeups’ being observed: one from the kernel and one from Onload.
This behavior is apparent on SOCK_DGRAM and SOCK_STREAM sockets for all combinations of EPOLLONESHOT, EPOLLIN and EPOLLOUT events. This applies for TCP listening sockets and UDP sockets, but not for TCP connected sockets.

• EF_EPOLL_CTL_FAST is enabled by default and this modifies the semantics of epoll. In particular, it buffers up calls to epoll_ctl() and only applies them when epoll_wait() is called. This can break applications that do epoll_wait() in one thread and epoll_ctl() in another thread. The issue only affects EF_UL_EPOLL=2, and the solution is to set EF_EPOLL_CTL_FAST=0 if this is a problem. The described condition does not occur with EF_UL_EPOLL=1.

• When EF_EPOLL_CTL_FAST is enabled and an application tests the readiness of an epoll file descriptor without actually calling epoll_wait() - for example by doing epoll within epoll, or epoll within select(), where one thread calls select() or epoll_wait() and another thread calls epoll_ctl() - then EF_EPOLL_CTL_FAST should be disabled. This applies when using EF_UL_EPOLL 1 or 2. If the application monitors the state of the epoll file descriptor indirectly, e.g. by monitoring the epoll fd with poll(), then EF_EPOLL_CTL_FAST can cause issues and should be set to zero.

• A socket should be removed from an epoll set only when all references to the socket are closed. With EF_UL_EPOLL=1 (the default), a socket is removed from the epoll set if the file descriptor is closed, even if other references to the socket exist. This can cause problems if file descriptors are duplicated using dup(). For example:

s = socket();
s2 = dup(s);
epoll_ctl(epoll_fd, EPOLL_CTL_ADD, s, ...);
close(s);  /* socket referenced by s is removed from the epoll set when using Onload */

The workaround is to set EF_UL_EPOLL=2.

• When Onload is unable to accelerate a connected socket, e.g. because no route to the destination exists which uses a Solarflare interface, the socket will be handed off to the kernel and is removed from the epoll set.
Because the socket is no longer in the epoll set, subsequent attempts to modify the socket with epoll_ctl() will fail with the ENOENT (descriptor not present) error. The described condition does not occur if EF_UL_EPOLL=1.
• If an epoll file descriptor is passed to the read() or write() functions, these will return a different error code from that reported by the kernel stack. This issue exists for all implementations of epoll.
• When EPOLLET is used and the event is ready, epoll_wait() is triggered by ANY event on the socket instead of the requested event. This issue should not affect application correctness. The problem exists for both implementations of epoll.
• When using the receive filtering API (see Receive Filtering API Overview on page 134), I/O multiplexing functions such as poll() and select() may report that a socket is readable, but a subsequent call to read() can fail because the filter has rejected the packets. Affected system calls are poll(), epoll(), select() and pselect().

7.5 Configuration Issues
Mixed Adapters Sharing a Broadcast Domain
Onload should not be used when Solarflare and non-Solarflare interfaces in the same network server are configured in the same broadcast domain1, as depicted by the following diagram. When an originating server (S1) sends an ARP request to a remote server (S2) having more than one interface within the same broadcast domain, ARP responses from S2 will be generated from all interfaces, and it is non-deterministic which response the originator uses. When Onload detects this situation, it causes a 'duplicate claim of ip address' warning to appear in the (S1) host syslog.

Problem 1
Traffic from S1 to S2 may be delivered through either of the interfaces on S2, irrespective of the IP address used.
This means that if one interface is accelerated by Onload and the other is not, you may or may not get acceleration. To resolve the situation for the current session, issue the following command:

echo 1 >/proc/sys/net/ipv4/conf/all/arp_ignore

or to resolve it permanently, add the following line to the /etc/sysctl.conf file:

net.ipv4.conf.all.arp_ignore = 1

and run the sysctl command for this to be effective:

sysctl -p

These commands ensure that an interface will only respond to an ARP request when the IP address matches its own. Refer to the Linux documentation Linux/Documentation/networking/ip-sysctl.txt for further details.

Problem 2
A more serious problem arises if one interface on S2 carries Onload accelerated TCP connections and another interface on the same host and same broadcast domain is non-Solarflare: a TCP packet received on the non-Solarflare interface can result in accelerated TCP connections being reset by the kernel stack, so that to the application TCP connections appear to be dropped or terminated at random. To prevent this situation the Solarflare and non-Solarflare interfaces should not be configured in the same broadcast domain. The solution described for Problem 1 above can reduce the frequency of Problem 2, but does not eliminate it. TCP packets can be directed to the wrong interface because either (i) the originator S1 needs to refresh its ARP table entry for the destination IP address, so sends an ARP request and subsequently directs TCP packets to the non-Solarflare interface, or (ii) a switch within the broadcast domain broadcasts the TCP packets to all interfaces.

1. A broadcast domain can be a local network segment or VLAN.

Virtual Memory on 32 Bit Systems
On 32 bit Linux systems the amount of allocated virtual address space typically defaults to 128MB, which limits the number of Solarflare interfaces that can be configured.
Virtual memory allocation can be identified in the /proc/meminfo file, e.g.

grep Vmalloc /proc/meminfo
VmallocTotal: 122880 kB
VmallocUsed: 76380 kB
VmallocChunk: 15600 kB

The Onload driver will attempt to map all PCI Base Address Registers for each Solarflare interface into virtual memory, where each interface requires 16MB. Examination of the kernel logs in /var/log/messages at the point the Onload driver is loading would reveal a memory allocation failure, as in the following extract:

allocation failed: out of vmalloc space - use vmalloc= to increase size.
[sfc efrm] Failed (-12) to map bar (16777216 bytes)
[sfc efrm] efrm_nic_add: ERROR: linux_efrm_nic_ctor failed (-12)

One solution is to use a 64 bit kernel. Another is to increase the virtual memory allocation on the 32 bit system by setting the vmalloc size on the 'kernel' line in the /boot/grub/grub.conf file, for example:

kernel /vmlinuz-2.6.18-238.el5 ro root=/dev/sda7 vmalloc=256M

The system must be rebooted for this change to take effect.

Hardware Resources
Onload uses certain physical resources on the network adapter. If these resources are exhausted, it is not possible to create new Onload stacks or to accelerate new sockets. These physical resources include:
1 Virtual NICs. Virtual NICs provide the interface by which a user level application sends and receives network traffic. When these are exhausted it is not possible to create new Onload stacks, meaning new applications cannot be accelerated. However, Solarflare network adapters support large numbers of Virtual NICs, and this resource is not typically the first to run out.
2 Filters. Filters are used to demultiplex packets received from the wire to the appropriate application. When these are exhausted it is not possible to create new accelerated sockets. Solarflare recommend that applications do not allocate more than 4096 filters.
3 Buffer table entries.
The buffer table provides address protection and translation for DMA buffers. When these are exhausted it is not possible to create new Onload stacks, and existing stacks are not able to allocate more DMA buffers.
When any of these resources are exhausted, normal operation of the system should continue, but it will not be possible to accelerate new sockets or applications. Under severe conditions, after resources are exhausted, it may not be possible to send or receive traffic, resulting in applications getting 'stuck'. The onload_stackdump utility should be used to monitor hardware resources.

IGMP Operation and Multicast Process Priority
It is important that processes using UDP multicast do not have a higher priority than the kernel thread handling the management of multicast group membership. Failure to observe this could lead to the following situations:
1 Incorrect kernel IGMP operation.
2 The higher priority user process effectively blocks the kernel thread and prevents it from identifying the multicast group to Onload, which reacts by dropping packets received for the multicast group. A combination of indicators may identify this:
• ethtool reports good packets being received while the multicast mismatch counter does not increase.
• ifconfig identifies that data is being received.
• onload_stackdump shows the rx_discard_mcast_mismatch counter increasing.
Lowering the priority of the user process will remedy the situation and allow the multicast packets through Onload to the user process.

Dynamic Loading
If the onload library libonload is opened with dlopen() and closed with dlclose(), it can leave the application in an unpredictable state. Users are advised to use the RTLD_NODELETE flag to prevent the library from being unloaded when dlclose() is called.
Scalable Packet Buffer Mode
Support for SR-IOV is disabled on 32-bit kernels, therefore the following features are not available on 32-bit kernels:
• Scalable Packet Buffer Mode (EF_PACKET_BUFFER_MODE=1)
• ef_vi with VFs
On some recent kernel versions, configuring the adapter to have a large number of VFs (via sfboot) can cause kernel panics. This problem affects kernel versions in the range 3.0 to 3.3 inclusive, and is due to the netlink messages that include information about network interfaces growing too large. The problem can be avoided by ensuring that the total number of physical network interfaces, including VFs, is no more than 30.

Scalable Packet Buffer Mode on SFN7122F and SFN7322F
SR-IOV, and therefore Scalable Packet Buffer Mode, is not currently supported on the SFN7000 series adapters. This feature will be supported in a future release.

Huge Pages with IPC namespace
Huge page support should not be enabled if the application uses IPC namespaces and the CLONE_NEWIPC flag. Failure to observe this may result in a segfault.

Huge Pages with Shared Stacks
Processes which share an Onload stack should not attempt to use huge pages. Refer to Stack Sharing on page 55 for limitation details.

Huge Pages - Size
When using huge pages, it is recommended to avoid setting the page size greater than 2MB. Failure to observe this could leave Onload unable to allocate further buffer table space for packet buffers.

Red Hat MRG 2 and SR-IOV
EnterpriseOnload from version 2.1.0.1 includes support for Red Hat MRG2 update 3 and the 3.6.11rt kernel. Solarflare do not recommend the use of SR-IOV or the IOMMU when using Onload on these systems, due to a number of known kernel issues. The following Onload features should not be used on MRG2u3:
• Scalable packet buffer mode (EF_PACKET_BUFFER_MODE=1)
• ef_vi with VFs

PowerPC Architecture
• 32 bit applications are known not to work correctly with onload-201310.
This has been corrected in onload-201310-u1.
• SR-IOV is not supported by onload-201310 on PowerPC systems.
• PowerPC architectures do not currently support PIO for reduced latency. EF_PIO should be set to zero.

Java 7 Applications - use of vfork()
Onload accelerated Java 7 applications that call vfork() should set the environment variable EF_VFORK_MODE=2, and thereafter the application should not create sockets or accelerated pipes in the vfork() child before exec.

Chapter 8: Change History
This chapter provides a brief history of changes, additions and removals to Onload releases affecting Onload behaviour and Onload environment variables.
• Features...Page 85
• Environment Variables...Page 87
• Module Options...Page 91
The OOL column identifies the OpenOnload release supporting the feature. The EOL column identifies the EnterpriseOnload release supporting the feature. (NS = not supported)

8.1 Features
Feature | OOL | EOL | Description/Notes
SO_TIMESTAMPING | 201310-u1 | NS | Socket option to receive hardware timestamps for received packets.
onload_fd_check_feature() | 201310-u1 | NS | onload_fd_check_feature() on page 114
4.0.2.6628 net driver | 201310-u1 | NS | Net driver supporting SFN5xxx, 6xxx and 7xxx series adapters, introducing hardware packet timestamps and PTP on 7xxx series adapters.
4.0.0.6585 net driver | 201310 | NS | Net driver supporting SFN5xxx, 6xxx and 7xxx series adapters and Solarflare PTP and hardware packet timestamps.
Multicast Replication | 201310 | NS | Multicast Replication on page 56
TX PIO | 201310 | NS | Programmed I/O on page 70
Large Buffer Table Support | 201310 | NS | Large Buffer Table Support on page 62
Templated Sends | 201310 | NS | Templated Sends on page 71
ONLOAD_MSG_WARM | 201310 | NS | ONLOAD_MSG_WARM on page 46
SO_TIMESTAMP | 201310 | NS | Supported for TCP sockets
dup3() | 201310 | NS | Onload will intercept calls to create a copy of a file descriptor using dup3().
3.3.0.6262 net driver | NS | 2.1.0.1 | Support Solarflare Enhanced PTP (sfptpd).
IP_ADD_SOURCE_MEMBERSHIP | 201210-u1 | NS | Join the supplied multicast group on the given interface and accept data from the supplied source address.
SO_TIMESTAMPNS
IP_DROP_SOURCE_MEMBERSHIP | 201210-u1 | NS | Drops membership to the given multicast group, interface and source address.
MCAST_JOIN_SOURCE_GROUP | 201210-u1 | NS | Join a source specific group.
MCAST_LEAVE_SOURCE_GROUP | 201210-u1 | NS | Leave a source specific group.
3.3.0.6246 net driver | 201210-u1 | NS | Support Solarflare Enhanced PTP (sfptpd).
Huge pages support | 201210 | NS | Packet buffers use huge pages. Controlled by EF_USE_HUGE_PAGES. Default is 1 - use huge pages if available. See Limitations on page 73.
onload_iptables | 201210 | NS | Apply Linux iptables firewall rules or user-defined firewall rules to Solarflare interfaces
onload_stackdump processes | 201210 | NS | Show all accelerated processes by PID
onload_stackdump affinities | | | Show CPU core an accelerated process is running on
onload_stackdump env | | | Show environment variables - EF_VALIDATE_ENV
Physical addressing mode | 201210 | NS | Allows a process to use physical addresses rather than controlled I/O addresses. Enabled by EF_PACKET_BUFFER_MODE 2 or 3.
UDP sendmmsg() | 201210 | NS | Send multiple msgs in a single function call
I/O Multiplexing | 201210 | NS | Support for ppoll(), pselect() and epoll_pwait()
DKMS | 201210 | NS | OpenOnload available in DKMS RPM binary format
3.2.1.6222B net driver | 201210 | NS | OpenOnload only
3.2.1.6110 net driver | NS | 2.1.0.0 | EnterpriseOnload only
3.2.1.6099 net driver | 201205-u1 | NS |
Removing zombie stacks | 201205-u1 | 2.1.0.0 | onload_stackdump -z kill will terminate stacks lingering after exit
Compatibility | 201205-u1 | 2.1.0.0 | Compatibility with RHEL6.3 and Linux 3.4.0
TCP striping | 201205 | 2.1.0.0 | Single TCP connection can use the full bandwidth of both ports on a Solarflare adapter
TCP loopback acceleration | 201205 | 2.1.0.0 | EF_TCP_CLIENT_LOOPBACK & EF_TCP_SERVER_LOOPBACK
TCP delayed acknowledgments | 201205 | 2.1.0.0 | EF_DYNAMIC_ACK_THRESH
TCP reset following RTO | 201205 | 2.1.0.0 | EF_TCP_RST_DELAYED_CONN
Configure control plane tables | 201205 | 2.1.0.0 | max_layer_2_interface, max_neighs, max_routes
Onload adapter support | 201109-u2 | 2.0.0.0 | Onload support for SFN5322F & SFN6x22F
Accelerate pipe2() | 201109-u2 | 2.0.0.0 | Accelerate pipe2() function call
SOCK_NONBLOCK SOCK_CLOEXEC | 201109-u2 | 2.0.0.0 | TCP socket types
Extensions API | 201109-u2 | 2.0.0.0 | Support for onload_thread_set_spin()
3.2 net driver | 201109-u1 | 2.0.0.0 |
Onload_tcpdump | 201109 | 2.0.0.0 |
Scalable Packet Buffer | 201109 | 2.0.0.0 |
Zero-Copy UDP RX | 201109 | 2.0.0.0 |
Zero-Copy TCP TX | 201109 | 2.0.0.0 |
Receive filtering | 201109 | 2.0.0.0 |
TCP_QUICKACK | 201109 | 2.0.0.0 | setsockopt() option
Benchmark tool sfnettest | 201109 | 2.0.0.0 | Support for sfnt-stream
3.1 net driver | 201104 | |
Extensions API | 201104 | 2.0.0.0 | Initial publication
SO_BINDTODEVICE SO_TIMESTAMP SO_TIMESTAMPNS | 201104 | 2.0.0.0 | setsockopt() and getsockopt() options
Accelerated pipe() | 201104 | 2.0.0.0 | Accelerate pipe() function call
UDP recvmmsg() | 201104 | 2.0.0.0 | Deliver multiple msgs in a single function call
Benchmark tool sfnettest | 201104 | 2.0.0.0 | Supports only sfnt-pingpong
EF_PACKET_BUFFER_MODE=1 | 201104 | |

8.2 Environment Variables
Variable | OOL | EOL | Changed | Notes
EF_TX_PUSH_THRESHOLD | 201310_u1 | NS | | Improve EF_TX_PUSH low latency transmit feature.
EF_RX_TIMESTAMPING | 201310_u1 | NS | | Control of receive packet hardware timestamps.
EF_RETRANSMIT_THRESHOLD_SYNACK | 201104 | 1.0.0.0 | 201310-u1 | Default changed from 4 to 5.
EF_PIO | 201310 | NS | | Enable/disable PIO. Default value 1.
EF_PIO_THRESHOLD | 201310 | NS | | Identifies the largest packet size that can use PIO. Default value is 1514.
EF_VFORK_MODE | 201310 | NS | | Dictates how the vfork() intercept should work.
EF_FREE_PACKETS_LOW_WATERMARK | 201310 | NS | | Level of free packets to be retained during runtime.
EF_TCP_SNDBUF_MODE | 201310 | 2.0.0.6 | | Limit TCP packet buffers used on the send queue and retransmit queue.
EF_TXQ_SIZE | | | 201310 | Limited to 2048 for SFN7000 series.
EF_MAX_ENDPOINTS | 201104 | 1.1.0.3 | 201310 | Default changed to 1024 from 10.
EF_SO_TIMESTAMP_RESYNC_TIME | 201104 | 2.1.0.1 | 201310 | Removed from OOL.
EF_SIGNALS_NOPOSTPONE | 201210-u1 | 2.1.0.1 | | Prevent the specified list of signals from being postponed by Onload.
EF_FORCE_TCP_NODELAY | 201210 | NS | | Force use of TCP_NODELAY.
EF_USE_HUGE_PAGES | 201210 | NS | | Enables huge pages for packet buffers.
EF_VALIDATE_ENV | 201210 | NS | | Will warn about obsolete or misspelled options in the environment. Default value 1.
EF_PD_VF | 201205-u1 | 2.1.0.0 | 201210 | Allocate VIs within SR-IOV VFs to allocate unlimited memory. Replaced with new options on EF_PACKET_BUFFER_MODE.
EF_PD_PHYS_MODE | 201205-u1 | 2.1.0.0 | 201210 | Allows a VI to use physical addressing rather than protected I/O addresses. Replaced with new options on EF_PACKET_BUFFER_MODE.
EF_MAX_PACKETS | 20101111 | 1.0.0.0 | 201210 | Onload will round the specified value up to the nearest multiple of 1024.
EF_EPCACHE_MAX | 20101111 | 1.0.0.0 | 201210 | Removed from OOL
EF_TCP_MAX_SEQERR_MSGS | | NS | 201210 | Removed
EF_STACK_LOCK_BUZZ | 20101111 | 1.0.0.0 | 201210 | OOL change to per-process, from per-stack. EOL is per-stack.
EF_RFC_RTO_INITIAL | 20101111 | 1.0.0.0 | 201210 / 2.1.0.0 | Change default to 1000 from 3000
EF_DYNAMIC_ACK_THRESH | 201205 | 2.1.0.0 | 201210 | Default value changed to 16 from 32 in 201210
EF_TCP_SERVER_LOOPBACK EF_TCP_CLIENT_LOOPBACK | 201205 | 2.1.0.0 | 201210 | TCP loopback acceleration. Added option 4 for client loopback to cause both ends of a TCP connection to share a newly created stack. Option 4 is not supported in EnterpriseOnload.
EF_TCP_RST_DELAYED | 201205 | 2.1.0.0 | | Reset TCP connection following RTO expiry
EF_SA_ONSTACK_INTERCEPT | 201205 | 2.1.0.0 | | Default value 0
EF_SHARE_WITH | 201109-u2 | 2.0.0.0 | |
EF_EPOLL_CTL_HANDOFF | 201109-u2 | 2.0.0.0 | | Default value 1
EF_CHECK_STACK_USER | NS | 2.0.0.1 | 201109-u2 | Renamed EF_SHARE_WITH
EF_POLL_USEC | 201109-u1 | 1.0.0.0 | |
EF_DEFER_WORK_LIMIT | 201109-u1 | 2.0.0.0 | | Default value 32
EF_POLL_FAST_LOOPS | 20101111 | 1.0.0.0 | 201109-u1 | Renamed EF_POLL_FAST_USEC
EF_POLL_NONBLOCK_FAST_LOOPS | 201104 | 2.0.0.0 | 201109-u1 | Renamed EF_POLL_NONBLOCK_FAST_USEC
EF_PIPE_RECV_SPIN | 201104 | 2.0.0.0 | 201109-u1 | Becomes per-process, was previously per-stack
EF_PKT_WAIT_SPIN | 20101111 | 1.0.0.0 | 201109-u1 | Becomes per-process, was previously per-stack
EF_PIPE_SEND_SPIN | 201104 | 2.0.0.0 | 201109-u1 | Becomes per-process, was previously per-stack
EF_TCP_ACCEPT_SPIN | 20101111 | 1.0.0.0 | 201109-u1 | Becomes per-process, was previously per-stack
EF_TCP_RECV_SPIN | 20101111 | 1.0.0.0 | 201109-u1 | Becomes per-process, was previously per-stack
EF_TCP_SEND_SPIN | 20101111 | 1.0.0.0 | 201109-u1 | Becomes per-process, was previously per-stack
EF_UDP_RECV_SPIN | 20101111 | 1.0.0.0 | 201109-u1 | Becomes per-process, was previously per-stack
EF_UDP_SEND_SPIN | 20101111 | 1.0.0.0 | 201109-u1 | Becomes per-process, was previously per-stack
EF_EPOLL_NONBLOCK_FAST_LOOPS | 201104-u2 | 2.0.0.0 | 201109-u1 | Default value 32. Removed in 201109-u1
EF_POLL_AVOID_INT | 20101111 | 1.0.0.0 | 201109-u1 | Removed
EF_SELECT_AVOID_INT | 20101111 | 1.0.0.0 | 201109-u1 | Removed
EF_SIG_DEFER | 20101111 | 1.0.0.0 | 201109-u1 | Removed
EF_IRQ_CORE | 201109 | 2.0.0.0 | 201109-u2 | Non-root user can now set it when using scalable packet buffer mode
EF_IRQ_CHANNEL | 201109 | 2.0.0.0 | |
EF_IRQ_MODERATION | 201109 | 2.0.0.0 | |
EF_PACKET_BUFFER_MODE | 201109 | 2.0.0.0 | 201210 | Default value 0 (disabled). In 201210 options 2 and 3 enable physical addressing mode. EOL only supports option 1.
EF_SIG_REINIT | 201109 | NS | 201109-u1 | Default value 0. Removed in 201109-u1
EF_POLL_TCP_LISTEN_UL_ONLY | 201104 | 2.0.0.0 | 201109 | Removed
EF_POLL_UDP | 20101111 | 1.0.0.0 | 201109 | Removed
EF_POLL_UDP_TX_FAST | 20101111 | 1.0.0.0 | 201109 | Removed
EF_POLL_UDP_UL_ONLY | 201104 | 2.0.0.0 | 201109 | Removed
EF_SELECT_UDP | 20101111 | 1.0.0.0 | 201109 | Removed
EF_SELECT_UDP_TX_FAST | 20101111 | 1.0.0.0 | 201109 | Removed
EF_UDP_CHECK_ERRORS | 20101111 | 1.0.0.0 | 201109 | Removed
EF_UDP_RECV_FAST_LOOPS | 20101111 | 1.0.0.0 | 201109 | Removed
EF_UDP_RECV_MCAST_UL_ONLY | 20101111 | 1.0.0.0 | 201109 | Removed
EF_UDP_RECV_UL_ONLY | 20101111 | 1.0.0.0 | 201109 | Removed
EF_TX_QOS_CLASS | 201104-u2 | 2.0.0.0 | | Default value 0
EF_TX_MIN_IPG_CNTL | 201104-u2 | 2.0.0.0 | | Default value 0
EF_TCP_LISTEN_HANDOVER | 201104-u2 | 2.0.0.0 | | Default value 0
EF_TCP_CONNECT_HANDOVER | 201104-u2 | 2.0.0.0 | | Default value 0
EF_TCP_SNDBUF_MODE | | 2.0.0.6 | | Default value 0
EF_UDP_PORT_HANDOVER2_MAX | 201104-u1 | 2.0.0.0 | | Default value 1
EF_UDP_PORT_HANDOVER2_MIN | 201104-u1 | 2.0.0.0 | | Default value 2
EF_UDP_PORT_HANDOVER3_MAX | 201104-u1 | 2.0.0.0 | | Default value 1
EF_UDP_PORT_HANDOVER3_MIN | 201104-u1 | 2.0.0.0 | | Default value 2
EF_STACK_PER_THREAD | 201104-u1 | 2.0.0.0 | | Default value 0
EF_PREFAULT_PACKETS | 20101111 | 1.0.0.0 | 201104-u1 | Enabled by default, was previously disabled
EF_MCAST_RECV | 201104-u1 | 2.0.0.0 | | Default value 1
EF_MCAST_JOIN_BINDTODEVICE | 201104-u1 | 2.0.0.0 | | Default value 0
EF_MCAST_JOIN_HANDOVER | 201104-u1 | 2.0.0.0 | | Default value 0
EF_DONT_ACCELERATE | 201104-u1 | 2.0.0.0 | | Default value 0
EF_MULTICAST | 20101111 | 1.0.0.0 | 201104-u1 | Removed
EF_TX_PUSH | 20101111u1 | 1.0.0.0 | 201104 | Enabled by default, was previously disabled. 201109: no longer set by the latency profile script.

8.3 Module Options
Option | OOL | EOL | Changed | Notes
epoll2_max_stacks | 201210 | NS | | Identifies the maximum number of stacks that an epoll file descriptor can handle when EF_UL_EPOLL=2
phys_mode_gid | 201210 | NS | | Enable physical addressing mode and restrict which users can use it
shared_buffer_table | 201210 | NS | | This option should be set to enable ef_vi applications that use the ef_iobufset API. Setting shared_buffer_table=10000 will make 10000 buffer table entries available for use with ef_iobufset.
safe_signals_and_exit | 201205 | 2.1.0.0 | | When Onload intercepts a termination signal it will attempt a clean exit by releasing resources including stack locks etc. The default is (1) enabled, and it is recommended that this remains enabled unless signal handling problems occur, when it can be disabled (0).
max_layer2_interfaces | 201205 | 2.1.0.0 | | Maximum number of network interfaces (includes physical, VLAN and bonds) supported in the control plane.
max_routes | 201205 | 2.1.0.0 | | Maximum number of entries in the Onload route table. Default is 256.
max_neighs | 201205 | 2.1.0.0 | | Maximum number of entries in the Onload ARP/neighbour table. Rounded up to a power of two. Default is 1024.
unsafe_sriov_without_iommu | 201209-u2 | 2.0.0.0 | 201210 | Removed, obsoleted by physical addressing modes and phys_mode_gid.
buffer_table_min buffer_table_max | | 2.0.0.0 | 201210 | Obsolete - Removed

NOTE: The user should always refer to the Onload distribution release notes and change log.
These are available from http://www.openonload.org/download.html.

Appendix A: Parameter Reference
Parameter List
The parameter list details the following:
• The environment variable used to set the parameter.
• Parameter name: the name used by onload_stackdump.
• The default, min and max values.
• Whether the variable scope applies per-stack or per-process.
• Description.

EF_ACCEPTQ_MIN_BACKLOG
Name: acceptq_min_backlog  default: 1  per-stack
Sets a minimum value to use for the 'backlog' argument to the listen() call. If the application requests a smaller value, use this value instead.

EF_ACCEPT_INHERIT_NODELAY
Name: accept_force_inherit_nodelay  default: 1  min: 0  max: 1  per-process
If set to 1, TCP sockets accepted from a listening socket inherit the TCP_NODELAY socket option from the listening socket.

EF_ACCEPT_INHERIT_NONBLOCK
Name: accept_force_inherit_nonblock  default: 0  min: 0  max: 1  per-process
If set to 1, TCP sockets accepted from a listening socket inherit the O_NONBLOCK flag from the listening socket.

EF_BINDTODEVICE_HANDOVER
Name: bindtodevice_handover  default: 0  min: 0  max: 1  per-stack
Hand over to the kernel stack sockets that have the SO_BINDTODEVICE socket option enabled.

EF_BURST_CONTROL_LIMIT
Name: burst_control_limit  default: 0  per-stack
If non-zero, limits how many bytes of data are transmitted in a single burst. This can be useful to avoid drops on low-end switches which contain limited buffering or limited internal bandwidth. It is not usually needed with most modern, high-performance switches.

EF_BUZZ_USEC
Name: buzz_usec  default: 0  per-stack
Sets the timeout in microseconds for lock buzzing options. Set to zero to disable lock buzzing (spinning). Will buzz forever if set to -1. Also set by the EF_POLL_USEC option.
EF_CONG_AVOID_SCALE_BACK
Name: cong_avoid_scale_back  default: 0  per-stack
When >0, this option slows down the rate at which the TCP congestion window is opened. This can help to reduce loss in environments where there is a lot of congestion and loss.

EF_DEFER_WORK_LIMIT
Name: defer_work_limit  default: 32  per-stack
The maximum number of times that work can be deferred to the lock holder before the unlocked thread is forced to block and wait for the lock.

EF_DELACK_THRESH
Name: delack_thresh  default: 1  min: 0  max: 65535  per-stack
This option controls the delayed acknowledgement algorithm. A socket may receive up to the specified number of TCP segments without generating an ACK. Setting this option to 0 disables delayed acknowledgements. NB: this option is overridden by EF_DYNAMIC_ACK_THRESH, so both options need to be set to 0 to disable delayed acknowledgements.

EF_DONT_ACCELERATE
Name: dont_accelerate  default: 0  min: 0  max: 1  per-process
Do not accelerate by default. This option is usually used in conjunction with onload_set_stackname() to allow individual sockets to be accelerated selectively.

EF_DYNAMIC_ACK_THRESH
Name: dynack_thresh  default: 16  min: 0  max: 65535  per-stack
If set to >0 this turns on dynamic adaptation of the ACK rate to increase efficiency by avoiding ACKs when they would reduce throughput. The value is used as the threshold for the number of pending ACKs before an ACK is forced. If set to zero then the standard delayed-ack algorithm is used.

EF_EPOLL_CTL_FAST
Name: ul_epoll_ctl_fast  default: 1  min: 0  max: 1  per-process
Avoid system calls in epoll_ctl() when using an accelerated epoll implementation. System calls are deferred until epoll_wait() blocks, and in some cases removed completely. This option improves performance for applications that call epoll_ctl() frequently. CAVEATS: This option has no effect when EF_UL_EPOLL=0. Following dup(), dup2(), fork() or exec(), some changes to epoll sets may be lost. If you monitor the epoll fd in another poll, select or epoll set, and the effects of epoll_ctl() are latency critical, then this option can cause latency spikes or even deadlock.

EF_EPOLL_CTL_HANDOFF
Name: ul_epoll_ctl_handoff  default: 1  min: 0  max: 1  per-process
Allow epoll_ctl() calls to be passed from one thread to another in order to avoid lock contention. This optimisation is particularly important when epoll_ctl() calls are made concurrently with epoll_wait() and spinning is enabled. This option is enabled by default. CAVEAT: This option may cause an error code returned by epoll_ctl() to be hidden from the application when a call is deferred. In such cases an error message is emitted to stderr or the system log.

EF_EPOLL_MT_SAFE
Name: ul_epoll_mt_safe  default: 0  min: 0  max: 1  per-process
This option disables concurrency control inside the accelerated epoll implementations, reducing CPU overhead. It is safe to enable this option if, for each epoll set, all calls on the epoll set are concurrency safe. This option improves performance with EF_UL_EPOLL=1, and also with EF_UL_EPOLL=2 and EF_EPOLL_CTL_FAST=1.

EF_EPOLL_SPIN
Name: ul_epoll_spin  default: 0  min: 0  max: 1  per-process
Spin in epoll_wait() calls until an event is satisfied or the spin timeout expires (whichever is sooner). If the spin timeout expires, enter the kernel and block. The spin timeout is set by EF_SPIN_USEC or EF_POLL_USEC.

EF_EVS_PER_POLL
Name: evs_per_poll  default: 64  min: 0  max: 0x7fffffff  per-stack
Sets the number of hardware network events to handle before performing other work. The value chosen represents a trade-off: larger values increase batching (which typically improves efficiency) but may also increase the working set size (which harms cache efficiency).
EF_FDS_MT_SAFE Name: fds_mt_safe default: 1 min: 0 max: 1 per-process This option allows less strict concurrency control when accessing the user-level file descriptor table, resulting in increased performance, particularly for multi-threaded applications. Single-threaded applications get a small latency benefit, but multi-threaded applications benefit most due to decreased cache-line bouncing between CPU cores.This option is unsafe for applications that make changes to file descriptors in one thread while accessing the same file descriptors in other threads. For example, closing a file descriptor in one thread while invoking another system call on that file descriptor in a second thread. Concurrent calls that do not change the object underlying the file descriptor remain safe.Calls to bind(), connect(), listen() may change underlying object. If you call such functions in one thread while accessing the same file descriptor from the other thread, this option is also unsafe.Also concurrent calls may happen from signal handlers, so set this to 0 if your signal handlers may close sockets EF_FDTABLE_SIZE Name: fdtable_size default: 0 per-process Limit the number of opened file descriptors by this value. If zero, the initial hard limit of open files (`ulimit -n -H`) is used. Hard and soft resource limits for opened file descriptors (help ulimit, man 2 setrlimit) are bound by this value. Issue 15 © Solarflare Communications 2013 96 Onload User Guide EF_FDTABLE_STRICT Name: fdtable_strict default: 0 min: 0 max: 1 per-process Enables more strict concurrency control for the user-level file descriptor table. Enabling this option can reduce performance for applications that create and destroy many connections per second. EF_FORCE_SEND_MULTICAST Name: force_send_multicast default: 1 min: 0 max: 1 per-stack This option causes all multicast sends to be accelerated. 
When disabled, multicast sends are only accelerated for sockets that have cleared the IP_MULTICAST_LOOP flag.This option disables loopback of multicast traffic to receivers on the same host, unless those receivers are sharing an OpenOnload stack with the sender (see EF_NAME) and EF_MULTICAST_LOOP_OFF=0. See the OpenOnload manual for further details on multicast operation. EF_FORCE_TCP_NODELAY Name: tcp_force_nodelay default: 0 min: 0 max: 2 per-stack This option allows the user to override the use of TCP_NODELAY. This may be useful in cases where 3rd-party software is (not) setting this value and the user would like to control its behaviour: 0 - do not override 1 always set TCP_NODELAY 2 - never set TCP_NODELAY EF_FORK_NETIF Name: fork_netif default: 3 min: CI_UNIX_FORK_NETIF_NONE CI_UNIX_FORK_NETIF_BOTH per-process max: This option controls behaviour after an application calls fork(). 0 - Neither fork parent nor child creates a new OpenOnload stack 1 - Child creates a new stack for new sockets 2 - Parent creates a new stack for new sockets 3 - Parent and child each create a new stack for new sockets EF_FREE_PACKETS_LOW_WATERMARK NAME: free_packets_low default: 100 per-stack Keep free packets number to be at least this value. EF_MIN_FREE_PACKETS defines initialisation behaviour; this value is about normal application runtime. This value is used if we can not allocate more packets at any time, i.e. in case of AMD IOMMU only. Issue 15 © Solarflare Communications 2013 97 Onload User Guide EF_HELPER_PRIME_USEC Name: timer_prime_usec default: 250 per-stack Sets the frequency with which software should reset the count-down timer. Usually set to a value that is significantly smaller than EF_HELPER_USEC to prevent the count-down timer from firing unless needed. Defaults to (EF_HELPER_USEC / 2). EF_HELPER_USEC Name: timer_usec default: 500 per-stack Timeout in microseconds for the count-down interrupt timer. 
This timer generates an interrupt if network events are not handled by the application within the given time. It ensures that network events are handled promptly when the application is not invoking the network, or is descheduled. Set this to 0 to disable the count-down interrupt timer. It is disabled by default for stacks that are interrupt driven.

EF_INT_DRIVEN
Name: int_driven  default: 1  min: 0  max: 1  per-stack
Puts the stack into an 'interrupt driven' mode of operation. When this option is not enabled Onload uses heuristics to decide when to enable interrupts, which can cause latency jitter in some applications, so enabling this option can help avoid latency outliers. This option is enabled by default except when spinning is enabled. It can be used in conjunction with spinning to prevent outliers caused when the spin timeout is exceeded and the application blocks, or when the application is descheduled. In this case we recommend that interrupt moderation be set to a reasonably high value (e.g. 100us) to prevent too high a rate of interrupts.

EF_INT_REPRIME
Name: int_reprime  default: 0  min: 0  max: 1  per-stack
Enables interrupts more aggressively than the default.

EF_IRQ_CHANNEL
Name: irq_channel  default: 4294967295  min: -1  max: SMAX  per-stack
Sets the net-driver receive channel that will be used to handle interrupts for this stack. The core that receives interrupts for this stack will be whichever core is configured to handle interrupts for the specified net driver receive channel. This option only takes effect with EF_PACKET_BUFFER_MODE=0 (default) or 2.
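The relationship between the two timer options above can be sketched as follows. This is an illustrative helper, not part of Onload: it applies the documented rules that EF_HELPER_USEC defaults to 500 and EF_HELPER_PRIME_USEC defaults to half of EF_HELPER_USEC.

```python
# Illustrative sketch (not part of Onload): compute the effective count-down
# timer settings from an environment, applying the documented defaults:
# EF_HELPER_USEC defaults to 500 and EF_HELPER_PRIME_USEC defaults to
# (EF_HELPER_USEC / 2).
def effective_timer_usec(env):
    helper = int(env.get("EF_HELPER_USEC", 500))
    prime = int(env.get("EF_HELPER_PRIME_USEC", helper // 2))
    return helper, prime

helper, prime = effective_timer_usec({})
# defaults: helper == 500, prime == 250
h2, p2 = effective_timer_usec({"EF_HELPER_USEC": "1000"})
# the prime interval follows the helper timeout: h2 == 1000, p2 == 500
```

Setting EF_HELPER_PRIME_USEC explicitly overrides the derived value.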
EF_IRQ_CORE
Name: irq_core  default: 4294967295  min: -1  max: SMAX  per-stack
Specifies which CPU core interrupts for this stack should be handled on. With EF_PACKET_BUFFER_MODE=1 or 3, Onload creates dedicated interrupts for each stack, and the interrupt is assigned to the requested core. With EF_PACKET_BUFFER_MODE=0 (default) or 2, Onload interrupts are handled via net driver receive channel interrupts. The sfc_affinity driver is used to choose which net-driver receive channel is used. It is only possible for interrupts to be handled on the requested core if a net driver interrupt is assigned to the selected core; otherwise a nearby core will be selected. Note that if the IRQ balancer service is enabled it may redirect interrupts to other cores.

EF_IRQ_MODERATION
Name: irq_usec  default: 0  min: 0  max: 1000000  per-stack
Interrupt moderation interval, in microseconds. This option only takes effect with EF_PACKET_BUFFER_MODE=1 or 3. Otherwise the interrupt moderation settings of the kernel net driver take effect.

EF_KEEPALIVE_INTVL
Name: keepalive_intvl  default: 75000  per-stack
Default interval between keepalives, in milliseconds.

EF_KEEPALIVE_PROBES
Name: keepalive_probes  default: 9  per-stack
Default number of keepalive probes to try before aborting the connection.

EF_KEEPALIVE_TIME
Name: keepalive_time  default: 7200000  per-stack
Default idle time before keepalive probes are sent, in milliseconds.

EF_LOAD_ENV
Name: load_env  default: 1  min: 0  max: 1  per-process
OpenOnload will only consult other environment variables if this option is set, i.e. clearing this option causes all other EF_ environment variables to be ignored.

EF_LOG_VIA_IOCTL
Name: log_via_ioctl  default: 0  min: 0  max: 1  per-process
Causes error and log messages emitted by OpenOnload to be written to the system log rather than to standard error.
This includes the copyright banner emitted when an application creates a new OpenOnload stack. By default, OpenOnload logs are written to the application's standard error if and only if it is a TTY. Enable this option when it is important not to change what the application writes to standard error. Disable it to guarantee that the log goes to standard error even if it is not a TTY.

EF_MAX_ENDPOINTS
Name: max_ep_bufs  default: 1024  min: 0  max: CI_CFG_NETIF_MAX_ENDPOINTS_MAX  per-stack
This option places an upper limit on the number of accelerated endpoints (sockets, pipes etc.) in an Onload stack. It should be set to a power of two between 1 and 32,768. When this limit is reached, listening sockets are not able to accept new connections over accelerated interfaces, and new sockets and pipes created via socket() and pipe() etc. are handed over to the kernel stack and so are not accelerated. Note: multiple endpoint buffers are consumed by each accelerated pipe.

EF_MAX_EP_PINNED_PAGES
Name: max_ep_pinned_pages  default: 512  per-stack
Not currently used.

EF_MAX_PACKETS
Name: max_packets  default: 32768  min: 1024  per-stack
Upper limit on the number of packet buffers in each OpenOnload stack. Packet buffers require hardware resources which may become a limiting factor if many stacks are each using many packet buffers. This option can be used to limit how much hardware resource and memory a stack uses. It has an upper limit determined by the max_packets_per_stack onload module option. Note: when 'scalable packet buffer mode' is not enabled (see EF_PACKET_BUFFER_MODE) the total number of packet buffers possible in aggregate is limited by a hardware resource. The SFN5x series adapters support approximately 120,000 packet buffers.

EF_MAX_RX_PACKETS
Name: max_rx_packets  default: 24576  min: 0  max: 1000000000  per-stack
The maximum number of packet buffers in a stack that can be used by the receive data path.
This should be set to a value smaller than EF_MAX_PACKETS to ensure that some packet buffers are reserved for the transmit path.

EF_MAX_TX_PACKETS
Name: max_tx_packets  default: 24576  min: 0  max: 1000000000  per-stack
The maximum number of packet buffers in a stack that can be used by the transmit data path. This should be set to a value smaller than EF_MAX_PACKETS to ensure that some packet buffers are reserved for the receive path.

EF_MCAST_JOIN_BINDTODEVICE
Name: mcast_join_bindtodevice  default: 0  min: 0  max: 1  per-stack
When a UDP socket joins a multicast group (using IP_ADD_MEMBERSHIP or similar), this option causes the socket to be bound to the interface that the join was made on. The benefit is that the socket will not accidentally receive packets from other interfaces that happen to match the same group and port. This can sometimes happen if another socket joins the same multicast group on a different interface, or if the switch is not filtering multicast traffic effectively. If the socket joins multicast groups on more than one interface, the binding is automatically removed.

EF_MCAST_JOIN_HANDOVER
Name: mcast_join_handover  default: 0  min: 0  max: 2  per-stack
When this option is set to 1 and a UDP socket joins a multicast group on an interface that is not accelerated, the UDP socket is handed over to the kernel stack. This can be a good idea because it prevents that socket from consuming Onload resources, and may also help avoid spinning when it is not wanted. When set to 2, UDP sockets that join multicast groups are always handed over to the kernel stack.

EF_MCAST_RECV
Name: mcast_recv  default: 1  min: 0  max: 1  per-stack
Controls whether or not to accelerate multicast receives.
When set to zero, multicast receives are not accelerated, but the socket continues to be managed by Onload. See also EF_MCAST_JOIN_HANDOVER, and see the OpenOnload manual for further details on multicast operation.

EF_MIN_FREE_PACKETS
Name: min_free_packets  default: 100  per-stack
Minimum number of free packets to reserve for each stack at initialisation. If Onload is not able to allocate sufficient packet buffers to fill the RX rings and fill the free pool with the given number of buffers, then creation of the stack will fail.

EF_MULTICAST_LOOP_OFF
Name: multicast_loop_off  default: 1  min: 0  max: 1  per-stack
When set, disables loopback of multicast traffic to receivers in the same OpenOnload stack. See the OpenOnload manual for further details on multicast operation.

EF_NAME
Name: ef_name  per-process
Processes setting the same value for EF_NAME in their environment can share an OpenOnload stack.

EF_NETIF_DTOR
Name: netif_dtor  default: 1  min: 0  max: 2  per-process
This option controls the lifetime of OpenOnload stacks when the last socket in a stack is closed.

EF_NONAGLE_INFLIGHT_MAX
Name: nonagle_inflight_max  default: 50  min: 1  per-stack
This option affects the behaviour of TCP sockets with the TCP_NODELAY socket option. Nagle's algorithm is enabled when the number of packets in flight (sent but not acknowledged) exceeds the value of this option. This improves efficiency when sending many small messages, while preserving low latency. Set this option to -1 to ensure that Nagle's algorithm never delays sending of TCP messages on sockets with TCP_NODELAY enabled.

EF_NO_FAIL
Name: no_fail  default: 1  min: 0  max: 1  per-process
This option controls whether failure to create an accelerated socket (due to resource limitations) is hidden by creating a conventional unaccelerated socket.
Set this option to 0 to cause out-of-resources errors to be propagated as errors to the application, or to 1 to have Onload use the kernel stack instead when out of resources. Disabling this option can be useful to ensure that sockets are being accelerated as expected (i.e. to find out when they are not).

EF_PACKET_BUFFER_MODE
Name: packet_buffer_mode  default: 0  min: 0  max: 3  per-stack
This option affects how DMA buffers are managed. The default packet buffer mode uses a limited hardware resource, and so restricts the total amount of memory that can be used by Onload for DMA. Setting EF_PACKET_BUFFER_MODE!=0 enables 'scalable packet buffer mode', which removes that limit. The modes are: 1 - SR-IOV with IOMMU: each stack allocates a separate PCI Virtual Function, and the IOMMU guarantees that different stacks do not have any access to each other's data; 2 - physical address mode: inherently unsafe, with no address space separation between different stacks or net driver packets; 3 - SR-IOV with physical address mode: each stack allocates a separate PCI Virtual Function, but the IOMMU is not used, so this mode is unsafe in the same way as mode 2. To use the odd modes (1 and 3), SR-IOV must be enabled in the BIOS, OS kernel and on the network adapter; in these modes you also get a faster interrupt handler, which can improve latency for some workloads. For mode 1 you must also enable the IOMMU (also known as VT-d) in the BIOS and in your kernel. For the unsafe physical address modes (2 and 3), you should tune the phys_mode_gid module parameter of the onload module.

EF_PIO
Name: pio  default: 1  min: 0  max: 2  per-stack
Controls whether Programmed I/O is used instead of DMA for small packets: 0 - no (use DMA); 1 - use PIO for small packets if available (default); 2 - use PIO for small packets and fail if PIO is not available. Mode 1 will fall back to DMA if PIO is not currently available. Mode 2 will fail to create the stack if the hardware supports PIO but PIO is not currently available.
On hardware that does not support PIO there is no difference between mode 1 and mode 2. In all cases, PIO will only be used for small packets (see EF_PIO_THRESHOLD) and if the VI's transmit queue is currently empty. If these conditions are not met, DMA will be used, even in mode 2.

EF_PIO_THRESHOLD
Name: pio_thresh  default: 1514  min: 0  per-stack
Sets a threshold for the size of packet that will use PIO, if enabled using EF_PIO. Packets up to the threshold will use PIO; larger packets will not.

EF_PIPE
Name: ul_pipe  default: 2  min: CI_UNIX_PIPE_DONT_ACCELERATE  max: CI_UNIX_PIPE_ACCELERATE_IF_NETIF  per-process
0 - disable pipe acceleration; 1 - enable pipe acceleration; 2 - accelerate pipes only if an Onload stack already exists in the process.

EF_PIPE_RECV_SPIN
Name: pipe_recv_spin  default: 0  min: 0  max: 1  per-process
Spin in pipe receive calls until data arrives or the spin timeout expires (whichever is the sooner). If the spin timeout expires, enter the kernel and block. The spin timeout is set by EF_SPIN_USEC or EF_POLL_USEC.

EF_PIPE_SEND_SPIN
Name: pipe_send_spin  default: 0  min: 0  max: 1  per-process
Spin in pipe send calls until space becomes available in the socket buffer or the spin timeout expires (whichever is the sooner). If the spin timeout expires, enter the kernel and block. The spin timeout is set by EF_SPIN_USEC or EF_POLL_USEC.

EF_PKT_WAIT_SPIN
Name: pkt_wait_spin  default: 0  min: 0  max: 1  per-process
Spin while waiting for DMA buffers. If the spin timeout expires, enter the kernel and block. The spin timeout is set by EF_SPIN_USEC or EF_POLL_USEC.

EF_POLL_FAST
Name: ul_poll_fast  default: 1  min: 0  max: 1  per-process
Allows a poll() call to return without inspecting the state of all polled file descriptors when at least one event is satisfied.
This allows the accelerated poll() call to avoid a system call when accelerated sockets are 'ready', and can increase performance substantially. This option changes the semantics of poll(), and as such could cause applications to misbehave: it effectively gives priority to accelerated sockets over non-accelerated sockets and other file descriptors. In practice the vast majority of applications work fine with this option.

EF_POLL_FAST_USEC
Name: ul_poll_fast_usec  default: 32  per-process
When spinning in a poll() call, causes accelerated sockets to be polled for N usecs before unaccelerated sockets are polled. This reduces latency for accelerated sockets, possibly at the expense of latency on unaccelerated sockets. Since accelerated sockets are typically the parts of the application which are most performance-sensitive, this is typically a good tradeoff.

EF_POLL_NONBLOCK_FAST_USEC
Name: ul_poll_nonblock_fast_usec  default: 200  per-process
When invoking poll() with timeout==0 (non-blocking), this option causes non-accelerated sockets to be polled only every N usecs. This reduces latency for accelerated sockets, possibly at the expense of latency on unaccelerated sockets. Since accelerated sockets are typically the parts of the application which are most performance-sensitive, this is often a good tradeoff. Set this option to zero to disable it, or to a higher value to further improve latency for accelerated sockets. This option changes the behaviour of poll() calls, so could potentially cause an application to misbehave.

EF_POLL_ON_DEMAND
Name: poll_on_demand  default: 1  min: 0  max: 1  per-stack
Polls for network events in the context of application calls into the network stack. This option is enabled by default. It can improve performance in multi-threaded applications where the Onload stack is interrupt-driven (EF_INT_DRIVEN=1), because it can reduce lock contention.
Setting EF_POLL_ON_DEMAND=0 ensures that network events are (mostly) processed in response to interrupts.

EF_POLL_SPIN
Name: ul_poll_spin  default: 0  min: 0  max: 1  per-process
Spin in poll() calls until an event is satisfied or the spin timeout expires (whichever is the sooner). If the spin timeout expires, enter the kernel and block. The spin timeout is set by EF_SPIN_USEC or EF_POLL_USEC.

EF_POLL_USEC
Name: ef_poll_usec_meta_option  default: 0  per-process
This option enables spinning and sets the spin timeout in microseconds. Setting this option is equivalent to setting EF_SPIN_USEC and EF_BUZZ_USEC, enabling spinning for UDP sends and receives, TCP sends and receives, select, poll and epoll_wait(), and enabling lock buzzing. Spinning typically reduces latency and jitter substantially, and can also improve throughput. However, in some applications spinning can harm performance, particularly applications that have many threads. When spinning is enabled you should normally dedicate a CPU core to each thread that spins. You can use the EF_*_SPIN options to selectively enable or disable spinning for each API and transport. You can also use the onload_thread_set_spin() extension API to control spinning on a per-thread and per-API basis.

EF_PREFAULT_PACKETS
Name: prefault_packets  default: 1  min: 0  max: 1000000000  per-stack
When set, this option causes the process to 'touch' the specified number of packet buffers when the Onload stack is created. This causes memory for the packet buffers to be pre-allocated, and also causes them to be memory-mapped into the process address space. This can prevent latency jitter caused by allocation and memory-mapping overheads. The number of packets requested is in addition to the packet buffers that are allocated to fill the RX rings.
There is no guarantee that it will be possible to allocate the number of packet buffers requested. The default setting causes all packet buffers to be mapped into the user-level address space, but does not cause any extra buffers to be reserved. Set to 0 to prevent prefaulting.

EF_PROBE
Name: probe  default: 1  min: 0  max: 1  per-process
When set, file descriptors accessed following exec() will be 'probed', and OpenOnload sockets will be mapped to user-land so that they can be accelerated. Otherwise OpenOnload sockets are not accelerated following exec().

EF_RETRANSMIT_THRESHOLD
Name: retransmit_threshold  default: 15  min: 0  max: SMAX  per-stack
Number of retransmit timeouts before a TCP connection is aborted.

EF_RETRANSMIT_THRESHOLD_SYN
Name: retransmit_threshold_syn  default: 4  min: 0  max: SMAX  per-stack
Number of times a SYN will be retransmitted before a connect() attempt is aborted.

EF_RETRANSMIT_THRESHOLD_SYNACK
Name: retransmit_threshold_synack  default: 5  min: 0  max: SMAX  per-stack
Number of times a SYN-ACK will be retransmitted before an embryonic connection is aborted.

EF_RFC_RTO_INITIAL
Name: rto_initial  default: 1000  per-stack
Initial retransmit timeout in milliseconds, i.e. the number of milliseconds to wait for an ACK before retransmitting packets.

EF_RFC_RTO_MAX
Name: rto_max  default: 120000  per-stack
Maximum retransmit timeout in milliseconds.

EF_RFC_RTO_MIN
Name: rto_min  default: 200  per-stack
Minimum retransmit timeout in milliseconds.

EF_RXQ_LIMIT
Name: rxq_limit  default: 65535  min: CI_CFG_RX_DESC_BATCH  max: 65535  per-stack
Maximum fill level for the receive descriptor ring. This has no effect when it has a value larger than the ring size (EF_RXQ_SIZE).

EF_RXQ_MIN
Name: rxq_min  default: 256  min: 2 * CI_CFG_RX_DESC_BATCH + 1  per-stack
Minimum initial fill level for each RX ring.
If Onload is not able to allocate sufficient packet buffers to fill each RX ring to this level, then creation of the stack will fail.

EF_RXQ_SIZE
Name: rxq_size  default: 512  min: 512  max: 4096  per-stack
Sets the size of the receive descriptor ring. Valid values: 512, 1024, 2048 or 4096. A larger ring size can absorb larger packet bursts without drops, but may reduce efficiency because the working set size is increased.

EF_RX_TIMESTAMPING
Name: rx_timestamping  default: 0  min: 0  max: 3  per-stack
Controls hardware timestamping of received packets. Possible values: 0 - do not timestamp (default); 1 - request timestamping but continue if the hardware is not capable or the request does not succeed; 2 - request timestamping and fail if the hardware is capable and the request does not succeed; 3 - request timestamping and fail if the hardware is not capable or the request does not succeed.

EF_SA_ONSTACK_INTERCEPT
Name: sa_onstack_intercept  default: 0  min: 0  max: 1  per-process
Intercept signals when a signal handler is installed with the SA_ONSTACK flag. 0 - do not intercept (default); if you call socket-related functions such as send, or file-related functions such as close or dup, from your signal handler, then your application may deadlock. 1 - intercept; there is no guarantee that the SA_ONSTACK flag will really work, but the OpenOnload library will do its best.

EF_SELECT_FAST
Name: ul_select_fast  default: 1  min: 0  max: 1  per-process
Allows a select() call to return without inspecting the state of all selected file descriptors when at least one selected event is satisfied. This allows the accelerated select() call to avoid a system call when accelerated sockets are 'ready', and can increase performance substantially. This option changes the semantics of select(), and as such could cause applications to misbehave: it effectively gives priority to accelerated sockets over non-accelerated sockets and other file descriptors.
In practice the vast majority of applications work fine with this option.

EF_SELECT_SPIN
Name: ul_select_spin  default: 0  min: 0  max: 1  per-process
Spin in blocking select() calls until the select set is satisfied or the spin timeout expires (whichever is the sooner). If the spin timeout expires, enter the kernel and block. The spin timeout is set by EF_SPIN_USEC or EF_POLL_USEC.

EF_SEND_POLL_MAX_EVS
Name: send_poll_max_events  default: 96  min: 1  max: 65535  per-stack
When polling for network events after sending, this places a limit on the number of events handled.

EF_SEND_POLL_THRESH
Name: send_poll_thresh  default: 64  min: 0  max: 65535  per-stack
Poll for network events after sending this many packets. Setting this to a larger value may improve transmit throughput for small messages by allowing batching. However, such batching may cause sends to be delayed, leading to increased jitter.

EF_SHARE_WITH
Name: share_with  default: 0  min: -1  max: SMAX  per-stack
Set this option to allow a stack to be accessed by processes owned by another user. Set it to the UID of a user that should be permitted to share this stack, or set it to -1 to allow any user to share the stack. By default stacks are not accessible by users other than root. Processes invoked by root can access any stack. Setuid processes can only access stacks created by the effective user, not the real user; this restriction can be relaxed by setting the onload kernel module option allow_insecure_setuid_sharing=1. WARNING: a user that is permitted to access a stack is able to: snoop on any data transmitted or received via the stack; inject or modify data transmitted or received via the stack; damage the stack and any sockets or connections in it; and cause misbehaviour and crashes in any application using the stack.
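Stack sharing as described above is configured purely through the environment of the launched process. The sketch below is illustrative, not part of Onload: it builds such an environment, combining EF_NAME (cooperating processes that set the same name share a stack) with EF_SHARE_WITH. The stack name and UID are example values, not taken from this manual.

```python
import os

# Illustrative sketch (not part of Onload): construct the environment for
# launching an application whose Onload stack should be shared with another
# user. "md_feed" and UID 1001 are hypothetical example values.
def shared_stack_env(stack_name, peer_uid, base=None):
    env = dict(base if base is not None else os.environ)
    env["EF_NAME"] = stack_name           # processes with the same EF_NAME share a stack
    env["EF_SHARE_WITH"] = str(peer_uid)  # UID permitted to access this stack (-1 = any user)
    return env

env = shared_stack_env("md_feed", 1001, base={})
# The application would then be launched with this environment, e.g. via
# the onload wrapper script.
```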
EF_SIGNALS_NOPOSTPONE
Name: signals_no_postpone  default: 67109952  min: 0  max: (ci_uint64)(-1)  per-process
Comma-separated list of signal numbers whose handlers should not be postponed. Your application will deadlock if one of these handlers uses a socket function. By default the list includes SIGBUS, SIGSEGV and SIGPROF (signals 7, 11 and 27 on Linux; the default value 67109952 is the corresponding bitmask). Specify numbers, not string aliases: EF_SIGNALS_NOPOSTPONE=7,11,27 instead of EF_SIGNALS_NOPOSTPONE=SIGBUS,SIGSEGV,SIGPROF. You can set EF_SIGNALS_NOPOSTPONE to an empty value to postpone all signal handlers in the same way, if you suspect these signals may call network functions.

EF_SOCK_LOCK_BUZZ
Name: sock_lock_buzz  default: 0  min: 0  max: 1  per-process
Spin while waiting to obtain a per-socket lock. If the spin timeout expires, enter the kernel and block. The spin timeout is set by EF_BUZZ_USEC. The per-socket lock is taken in recv() calls and similar. This option can reduce jitter when multiple threads invoke recv() on the same socket, but can reduce fairness between threads competing for the lock.

EF_SPIN_USEC
Name: ul_spin_usec  default: 0  per-process
Sets the timeout in microseconds for spinning options. Set this to -1 to spin forever. The spin timeout may also be set by the EF_POLL_USEC option. Spinning typically reduces latency and jitter substantially, and can also improve throughput. However, in some applications spinning can harm performance, particularly applications that have many threads. When spinning is enabled you should normally dedicate a CPU core to each thread that spins. You can use the EF_*_SPIN options to selectively enable or disable spinning for each API and transport. You can also use the onload_thread_set_spin() extension API to control spinning on a per-thread and per-API basis.

EF_STACK_LOCK_BUZZ
Name: stack_lock_buzz  default: 0  min: 0  max: 1  per-process
Spin while waiting to obtain a per-stack lock. If the spin timeout expires, enter the kernel and block.
The spin timeout is set by EF_BUZZ_USEC. This option reduces jitter caused by lock contention, but can reduce fairness between threads competing for the lock.

EF_STACK_PER_THREAD
Name: stack_per_thread  default: 0  min: 0  max: 1  per-process
Creates a separate Onload stack for the sockets created by each thread.

EF_TCP
Name: ul_tcp  default: 1  min: 0  max: 1  per-process
Clear to disable acceleration of new TCP sockets.

EF_TCP_ACCEPT_SPIN
Name: tcp_accept_spin  default: 0  min: 0  max: 1  per-process
Spin in blocking TCP accept() calls until data arrives, the spin timeout expires or the socket timeout expires (whichever is the sooner). If the spin timeout expires, enter the kernel and block. The spin timeout is set by EF_SPIN_USEC or EF_POLL_USEC.

EF_TCP_ADV_WIN_SCALE_MAX
Name: tcp_adv_win_scale_max  default: 14  min: 0  max: 14  per-stack
Maximum value for TCP window scaling that will be advertised.

EF_TCP_BACKLOG_MAX
Name: tcp_backlog_max  default: 256  per-stack
Places an upper limit on the number of embryonic (half-open) connections in an OpenOnload stack.

EF_TCP_CLIENT_LOOPBACK
Name: tcp_client_loopback  default: 0  min: 0  max: CITP_TCP_LOOPBACK_TO_NEWSTACK  per-stack
Enables acceleration of TCP loopback connections on the connecting (client) side: 0 - not accelerated (default); 1 - accelerate if the listening socket is in the same stack (you should also set EF_TCP_SERVER_LOOPBACK!=0); 2 - accelerate and move the accepted socket to the stack of the connecting socket (the server should allow this via EF_TCP_SERVER_LOOPBACK=2); 3 - accelerate and move the connecting socket to the stack of the listening socket (the server should allow this via EF_TCP_SERVER_LOOPBACK!=0); 4 - accelerate and move both connecting and accepted sockets to a new stack (the server should allow this via EF_TCP_SERVER_LOOPBACK=2). NOTE: option 3 breaks some use cases involving epoll, fork and dup calls.
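The client/server loopback requirements listed above can be expressed as a small compatibility check. This is an illustrative helper, not part of Onload, encoding only the pairing rules stated in the EF_TCP_CLIENT_LOOPBACK description.

```python
# Illustrative sketch (not part of Onload): check whether a client-side
# EF_TCP_CLIENT_LOOPBACK mode is compatible with a server-side
# EF_TCP_SERVER_LOOPBACK mode, per the rules stated above.
def loopback_compatible(client_mode, server_mode):
    if client_mode == 0:
        return True                 # client does not accelerate loopback
    if client_mode in (1, 3):
        return server_mode != 0     # server must enable loopback acceleration
    if client_mode in (2, 4):
        return server_mode == 2     # server must allow the accepted socket in another stack
    raise ValueError("unknown EF_TCP_CLIENT_LOOPBACK mode")
```

For example, a client configured with EF_TCP_CLIENT_LOOPBACK=2 against a server with EF_TCP_SERVER_LOOPBACK=1 would not be accelerated end to end.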
EF_TCP_CONNECT_HANDOVER
Name: tcp_connect_handover  default: 0  min: 0  max: 1  per-stack
When an accelerated TCP socket calls connect(), hand it over to the kernel stack. This option disables acceleration of active-open TCP connections.

EF_TCP_FASTSTART_IDLE
Name: tcp_faststart_idle  default: 65536  min: 0  per-stack
The FASTSTART feature prevents Onload from delaying ACKs during times when doing so may reduce performance. FASTSTART is enabled when a connection is new, following loss, and after the connection has been idle for a while. This option sets the number of bytes that must be ACKed by the receiver before the connection exits FASTSTART. Set to zero to prevent a connection entering FASTSTART after an idle period.

EF_TCP_FASTSTART_INIT
Name: tcp_faststart_init  default: 65536  min: 0  per-stack
The FASTSTART feature prevents Onload from delaying ACKs during times when doing so may reduce performance. FASTSTART is enabled when a connection is new, following loss, and after the connection has been idle for a while. This option sets the number of bytes that must be ACKed by the receiver before the connection exits FASTSTART. Set to zero to disable FASTSTART on new connections.

EF_TCP_FASTSTART_LOSS
Name: tcp_faststart_loss  default: 65536  min: 0  per-stack
The FASTSTART feature prevents Onload from delaying ACKs during times when doing so may reduce performance. FASTSTART is enabled when a connection is new, following loss, and after the connection has been idle for a while. This option sets the number of bytes that must be ACKed by the receiver before the connection exits FASTSTART following loss. Set to zero to disable FASTSTART after loss.

EF_TCP_FIN_TIMEOUT
Name: fin_timeout  default: 60  per-stack
Time in seconds to wait for a FIN in the TCP FIN_WAIT2 state.

EF_TCP_INITIAL_CWND
Name: initial_cwnd  default: 0  min: 0  max: SMAX  per-stack
Sets the initial size of the congestion window (in bytes) for TCP connections.
Some care is needed: for example, setting a value smaller than the segment size may result in Onload being unable to send traffic. WARNING: modifying this option may violate the TCP protocol.

EF_TCP_LISTEN_HANDOVER
Name: tcp_listen_handover  default: 0  min: 0  max: 1  per-stack
When an accelerated TCP socket calls listen(), hand it over to the kernel stack. This option disables acceleration of TCP listening sockets and passively opened TCP connections.

EF_TCP_LOSS_MIN_CWND
Name: loss_min_cwnd  default: 0  min: 0  max: SMAX  per-stack
Sets the minimum size of the congestion window for TCP connections following loss. WARNING: modifying this option may violate the TCP protocol.

EF_TCP_RCVBUF
Name: tcp_rcvbuf_user  default: 0  per-stack
Overrides SO_RCVBUF for TCP sockets. (Note: the actual size of the buffer is double the amount requested, mimicking the behavior of the Linux kernel.)

EF_TCP_RECV_SPIN
Name: tcp_recv_spin  default: 0  min: 0  max: 1  per-process
Spin in blocking TCP receive calls until data arrives, the spin timeout expires or the socket timeout expires (whichever is the sooner). If the spin timeout expires, enter the kernel and block. The spin timeout is set by EF_SPIN_USEC or EF_POLL_USEC.

EF_TCP_RST_DELAYED_CONN
Name: rst_delayed_conn  default: 0  min: 0  max: 1  per-stack
This option tells Onload to reset TCP connections rather than allow data to be transmitted late. Specifically, TCP connections are reset if the retransmit timeout fires. (This usually happens when data is lost, and normally triggers a retransmit, which results in data being delivered hundreds of milliseconds late.) WARNING: this option is likely to cause connections to be reset spuriously if ACK packets are dropped in the network.
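The caution above about EF_TCP_INITIAL_CWND can be captured as a simple sanity check. This is an illustrative helper, not part of Onload; the MSS value used in the examples is a typical Ethernet value, chosen for illustration.

```python
# Illustrative sketch (not part of Onload): sanity-check a proposed
# EF_TCP_INITIAL_CWND value against the connection's segment size, per the
# caution above that a congestion window smaller than one segment may
# prevent Onload from sending traffic.
def initial_cwnd_ok(cwnd_bytes, mss_bytes):
    if cwnd_bytes == 0:
        return True                # 0 = default behaviour, no override
    return cwnd_bytes >= mss_bytes

ok_default = initial_cwnd_ok(0, 1460)        # default: no override
ok_four = initial_cwnd_ok(4 * 1460, 1460)    # four segments: safe
ok_small = initial_cwnd_ok(1000, 1460)       # below one segment: risky
```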
EF_TCP_RX_CHECKS
Name: tcp_rx_checks  default: 0  min: 0  max: 1  per-stack
Internal/debugging use only: perform extra debugging/consistency checks on received packets.

EF_TCP_RX_LOG_FLAGS
Name: tcp_rx_log_flags  default: 0  per-stack
Log received packets that have any of these flags set in the TCP header. Only active when EF_TCP_RX_CHECKS is set.

EF_TCP_SEND_SPIN
Name: tcp_send_spin  default: 0  min: 0  max: 1  per-process
Spin in blocking TCP send calls until the window is updated by the peer, the spin timeout expires or the socket timeout expires (whichever is the sooner). If the spin timeout expires, enter the kernel and block. The spin timeout is set by EF_SPIN_USEC or EF_POLL_USEC.

EF_TCP_SERVER_LOOPBACK
Name: tcp_server_loopback  default: 0  min: 0  max: CITP_TCP_LOOPBACK_ALLOW_ALIEN_IN_ACCEPTQ  per-stack
Enables acceleration of TCP loopback connections on the listening (server) side: 0 - not accelerated (default); 1 - accelerate if the connecting socket is in the same stack (you should also set EF_TCP_CLIENT_LOOPBACK!=0); 2 - accelerate and allow the accepted socket to be in another stack (this is necessary for clients with EF_TCP_CLIENT_LOOPBACK=2 or 4).

EF_TCP_SNDBUF
Name: tcp_sndbuf_user  default: 0  per-stack
Overrides SO_SNDBUF for TCP sockets. (Note: the actual size of the buffer is double the amount requested, mimicking the behavior of the Linux kernel.)

EF_TCP_SNDBUF_MODE
Name: tcp_sndbuf_mode  default: 0  min: 0  max: 1  per-stack
This option controls how the SO_SNDBUF limit is applied to TCP sockets. In the default mode the limit applies only to the send queue. When this option is set to 1, the limit applies to the size of the send queue and retransmit queue combined.
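The buffer-doubling behaviour that EF_TCP_SNDBUF and EF_TCP_RCVBUF mimic can be observed directly on a Linux socket. This demonstration uses the kernel stack, not Onload; the result assumes a Linux system with the request below wmem_max.

```python
import socket

# Demonstrate the Linux kernel behaviour mimicked by EF_TCP_SNDBUF /
# EF_TCP_RCVBUF: the kernel doubles the value passed to SO_SNDBUF (the
# extra space covers bookkeeping overhead), so getsockopt() reports twice
# the requested amount, subject to the wmem_max limit.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 32768)
effective = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
s.close()
# On Linux with default limits, 'effective' is 65536 (double the request).
```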
EF_TCP_SYN_OPTS
Name: syn_opts  default: 7  per-stack
A bitmask specifying the TCP options to advertise in SYN segments:
bit 0 (0x1) - enable PAWS and RTTM timestamps (RFC1323);
bit 1 (0x2) - enable window scaling (RFC1323);
bit 2 (0x4) - enable SACK (RFC2018);
bit 3 (0x8) - enable ECN (RFC3168).

EF_TCP_TCONST_MSL
Name: msl_seconds  default: 25  per-stack
The Maximum Segment Lifetime (as defined by the TCP RFC). A smaller value causes connections to spend less time in the TIME_WAIT state.

EF_TXQ_LIMIT
Name: txq_limit  default: 268435455  min: 16 * 1024  max: 0xfffffff  per-stack
Maximum number of bytes to enqueue on the transmit descriptor ring.

EF_TXQ_RESTART
Name: txq_restart  default: 268435455  min: 1  max: 0xfffffff  per-stack
Level (in bytes) to which the transmit descriptor ring must fall before it will be filled again.

EF_TXQ_SIZE
Name: txq_size  default: 512  min: 512  max: 4096  per-stack
Set the size of the transmit descriptor ring. Valid values: 512, 1024, 2048 or 4096. The maximum transmit queue size supported by the SFN7000 series adapter is 2048.

EF_TX_MIN_IPG_CNTL
Name: tx_min_ipg_cntl  default: 0  min: -1  max: 20  per-stack
Rate pacing value.

EF_TX_PUSH
Name: tx_push  default: 1  min: 0  max: 1  per-stack
Enable low-latency transmit.

EF_TX_PUSH_THRESHOLD
Name: tx_push_thresh  default: 100  min: 1  per-stack
Sets a threshold for the number of outstanding sends before we stop using TX descriptor push. This has no effect if EF_TX_PUSH=0. This threshold is ignored, and assumed to be 1, on pre-SFN7000 series hardware. It makes sense to set this value similar to EF_SEND_POLL_THRESH.

EF_TX_QOS_CLASS
Name: tx_qos_class  default: 0  min: 0  max: 1  per-stack
Set the QOS class for transmitted packets on this Onload stack. Two QOS classes are supported: 0 and 1. By default both Onload accelerated traffic and kernel traffic are in class 0.
You can minimise latency by placing latency-sensitive traffic into a separate QOS class from bulk traffic.

EF_UDP
Name: ul_udp  default: 1  min: 0  max: 1  per-process
Clear to disable acceleration of new UDP sockets.

EF_UDP_CONNECT_HANDOVER
Name: udp_connect_handover  default: 1  min: 0  max: 1  per-stack
When a UDP socket is connected to an IP address that cannot be accelerated by OpenOnload, hand the socket over to the kernel stack. When this option is disabled the socket remains under the control of OpenOnload. This may be worthwhile because the socket may subsequently be re-connected to an IP address that can be accelerated.

EF_UDP_PORT_HANDOVER2_MAX
Name: udp_port_handover2_max  default: 1  per-stack
When set (together with EF_UDP_PORT_HANDOVER2_MIN), this causes UDP sockets explicitly bound to a port in the given range to be handed over to the kernel stack. The range is inclusive.

EF_UDP_PORT_HANDOVER2_MIN
Name: udp_port_handover2_min  default: 2  per-stack
When set (together with EF_UDP_PORT_HANDOVER2_MAX), this causes UDP sockets explicitly bound to a port in the given range to be handed over to the kernel stack. The range is inclusive.

EF_UDP_PORT_HANDOVER3_MAX
Name: udp_port_handover3_max  default: 1  per-stack
When set (together with EF_UDP_PORT_HANDOVER3_MIN), this causes UDP sockets explicitly bound to a port in the given range to be handed over to the kernel stack. The range is inclusive.

EF_UDP_PORT_HANDOVER3_MIN
Name: udp_port_handover3_min  default: 2  per-stack
When set (together with EF_UDP_PORT_HANDOVER3_MAX), this causes UDP sockets explicitly bound to a port in the given range to be handed over to the kernel stack. The range is inclusive.

EF_UDP_PORT_HANDOVER_MAX
Name: udp_port_handover_max  default: 1  per-stack
When set (together with EF_UDP_PORT_HANDOVER_MIN), this causes UDP sockets explicitly bound to a port in the given range to be handed over to the kernel stack. The range is inclusive.
EF_UDP_PORT_HANDOVER_MIN
Name: udp_port_handover_min  default: 2  per-stack
When set (together with EF_UDP_PORT_HANDOVER_MAX), this causes UDP sockets explicitly bound to a port in the given range to be handed over to the kernel stack. The range is inclusive.

EF_UDP_RCVBUF
Name: udp_rcvbuf_user  default: 0  per-stack
Override SO_RCVBUF for UDP sockets. (Note: the actual size of the buffer is double the amount requested, mimicking the behavior of the Linux kernel.)

EF_UDP_RECV_SPIN
Name: udp_recv_spin  default: 0  min: 0  max: 1  per-process
Spin in blocking UDP receive calls until data arrives, the spin timeout expires or the socket timeout expires (whichever is the sooner). If the spin timeout expires, enter the kernel and block. The spin timeout is set by EF_SPIN_USEC or EF_POLL_USEC.

EF_UDP_SEND_SPIN
Name: udp_send_spin  default: 0  min: 0  max: 1  per-process
Spin in blocking UDP send calls until space becomes available in the socket buffer, the spin timeout expires or the socket timeout expires (whichever is the sooner). If the spin timeout expires, enter the kernel and block. The spin timeout is set by EF_SPIN_USEC or EF_POLL_USEC. Note: UDP sends usually complete very quickly, but can block if the application does a large burst of sends at a high rate. This option reduces jitter when such blocking is needed.

EF_UDP_SEND_UNLOCKED
Name: udp_send_unlocked  default: 1  min: 0  max: 1  per-stack
Enables the 'unlocked' UDP send path. When enabled this option improves concurrency when multiple threads are performing UDP sends.

EF_UDP_SEND_UNLOCK_THRESH
Name: udp_send_unlock_thresh  default: 1500  per-stack
UDP message size below which we attempt to take the stack lock early. Taking the lock early reduces overhead and latency slightly, but may increase lock contention in multi-threaded applications.
EF_UDP_SNDBUF
Name: udp_sndbuf_user  default: 0  per-stack
Override SO_SNDBUF for UDP sockets. (Note: the actual size of the buffer is double the amount requested, mimicking the behavior of the Linux kernel.)

EF_UL_EPOLL
Name: ul_epoll  default: 1  min: 0  max: 2  per-process
Choose the epoll implementation. The choices are:
0 - kernel (unaccelerated);
1 - user-level (accelerated, lowest latency);
2 - kernel-accelerated (best when there are lots of sockets in the set).
The default is the user-level implementation (1).

EF_UL_POLL
Name: ul_poll  default: 1  min: 0  max: 1  per-process
Clear to disable acceleration of poll() calls at user-level.

EF_UL_SELECT
Name: ul_select  default: 1  min: 0  max: 1  per-process
Clear to disable acceleration of select() calls at user-level.

EF_UNCONFINE_SYN
Name: unconfine_syn  default: 1  min: 0  max: 1  per-stack
Accept TCP connections that cross into or out of a private network.

EF_UNIX_LOG
Name: log_level  default: 3  per-process
A bitmask determining which kinds of diagnostic messages will be logged:
0x1 errors; 0x2 unexpected; 0x4 setup; 0x8 verbose; 0x10 select(); 0x20 poll(); 0x100 socket set-up; 0x200 socket control; 0x400 socket caching; 0x1000 signal interception; 0x2000 library enter/exit; 0x4000 log call arguments; 0x8000 context lookup; 0x10000 pass-through; 0x20000 very verbose; 0x40000 verbose returned errors; 0x80000 very verbose errors (show 'ok' too); 0x20000000 verbose transport control; 0x40000000 very verbose transport control; 0x80000000 verbose pass-through.

EF_URG_RFC
Name: urg_rfc  default: 0  min: 0  max: 1  per-stack
Choose between compliance with RFC1122 (1) or BSD behaviour (0) regarding the location of the urgent point in TCP packet headers.

EF_USE_DSACK
Name: use_dsack  default: 1  min: 0  max: 1  per-stack
Whether or not to use DSACK (duplicate SACK).
EF_USE_HUGE_PAGES
Name: huge_pages  default: 1  min: 0  max: 2  per-stack
Controls whether huge pages are used for packet buffers:
0 - no;
1 - use huge pages if available (default);
2 - always use huge pages and fail if huge pages are not available.
Mode 1 prints a syslog message if there are not enough huge pages in the system. Mode 2 guarantees only that initially-allocated packets are in huge pages. It is recommended to use this mode together with EF_MIN_FREE_PACKETS, to control the number of such guaranteed huge pages. All non-initial packets are allocated in huge pages when possible, and a syslog message is printed if the system is out of huge pages. In both modes 1 and 2, non-initial packets may be allocated in non-huge pages without any warning in syslog, even if the system has free huge pages.

EF_VALIDATE_ENV
Name: validate_env  default: 1  min: 0  max: 1  per-stack
When set this option validates Onload-related environment variables (starting with EF_).

EF_VFORK_MODE
Name: vfork_mode  default: 1  min: 0  max: 2  per-process
This option dictates how the vfork() intercept should work. After a vfork(), parent and child still share address space but not file descriptors. We have to be careful about making changes in the child that can be seen in the parent. We offer three options here. Different applications may require different options depending on their use of vfork(). If using EF_VFORK_MODE=2, it is not safe to create sockets or pipes in the child before calling exec().
0 - Old behaviour: replace vfork() with fork();
1 - Replace vfork() with fork() and block the parent until the child exits/execs;
2 - Replace vfork() with vfork().

Appendix B: Meta Options

There are several environment variables which act as meta-options and set several of the options detailed in Appendix A.
These are:

EF_POLL_USEC
Setting EF_POLL_USEC causes the following options to be set:
EF_SPIN_USEC=EF_POLL_USEC
EF_SELECT_SPIN=1
EF_EPOLL_SPIN=1
EF_POLL_SPIN=1
EF_PKT_WAIT_SPIN=1
EF_TCP_SEND_SPIN=1
EF_UDP_RECV_SPIN=1
EF_UDP_SEND_SPIN=1
EF_TCP_RECV_SPIN=1
EF_BUZZ_USEC=EF_POLL_USEC
EF_SOCK_LOCK_BUZZ=1
EF_STACK_LOCK_BUZZ=1
NOTE: If neither of the spinning options EF_POLL_USEC and EF_SPIN_USEC is set, Onload will resort to the default interrupt-driven behaviour because the EF_INT_DRIVEN environment variable is enabled by default.

EF_BUZZ_USEC
Setting EF_BUZZ_USEC sets the following options:
EF_SOCK_LOCK_BUZZ=1
EF_STACK_LOCK_BUZZ=1
NOTE: If EF_POLL_USEC is set to value N, then EF_BUZZ_USEC is also set to N only if N <= 100. If N > 100 then EF_BUZZ_USEC will be set to 100. This is deliberate, as spinning for too long on internal locks may adversely affect performance. However, the user can explicitly set the EF_BUZZ_USEC value, e.g.
export EF_POLL_USEC=10000
export EF_BUZZ_USEC=1000

Appendix C: Build Dependencies

General
Before Onload network and kernel drivers can be built and installed, the target platform must support the following capabilities:
• Support a general C build environment - i.e. has gcc, make, libc and libc-devel.
• Can compile kernel modules - i.e. has the correct kernel-devel package for the installed kernel version.
• If 32 bit applications are to be accelerated on 64 bit architectures, the machine must be able to build 32 bit applications.

Building Kernel Modules
The kernel must be built with CONFIG_NETFILTER enabled. Standard distributions will already have this enabled, but it must also be enabled when building a custom kernel. This option does not affect performance. The following commands can be used to install kernel development headers.
• Debian based distributions - including Ubuntu (any kernel):
apt-get install linux-headers-$(uname -r)
• For RedHat/Fedora (not for 32 bit kernels): if the system supports a 32 bit kernel and the kernel is PAE, then install kernel-PAE-devel; otherwise install the following package:
yum -y install kernel-devel
• For SuSE:
yast -i kernel-source

Building 32 bit applications on 64 bit architecture platforms
The following commands can be used to install 32 bit libc development headers.
• Debian based distributions - including Ubuntu:
apt-get install gcc-multilib libc6-dev-i386
• For RedHat/Fedora:
yum -y install glibc-devel.i586
• For SuSE:
yast -i glibc-devel-32bit
yast -i gcc-32bit

Appendix D: Onload Extensions API

The Onload Extensions API allows the user to customise an application using advanced features to improve performance. The Extensions API does not create any runtime dependency on Onload, and an application using the API can run without Onload. The license for the API and associated libraries is a BSD 2-Clause License. This section covers the following topics:
Common Components...Page 112
Stacks API...Page 117
Zero-Copy API...Page 123
Receive Filtering API Overview...Page 134
Templated Sends...Page 136

Source Code
The Onload source code is provided with the Onload distribution. Entry points for the source code are:
src/lib/transport/unix/onload_ext_intercept.c
src/lib/transport/unix/zc_intercept.c

Common Components
For all applications employing the Extensions API the following components are provided:
• #include <onload/extensions.h>
An application should include the header file containing function prototypes and constant values required when using the API.
• libonload_ext.a, libonload_ext.so
This library provides stub implementations of the extended API. An application that wishes to use the Extensions API should link against this library.
When Onload is not present, the application will continue to function, but calls to the Extensions API will have no effect (unless documented otherwise). To link to this library include the '-l' linker option on the compiler command line, i.e. -lonload_ext.

• onload_is_present()
Description: If the application is linked with libonload_ext, but not running with Onload, this will return 0. If the application is running with Onload this will return 1.
Definition: int onload_is_present(void)
Formal Parameters: none
Return Value: Returns 1 from the libonload.so library or 0 from the libonload_ext.a library.

• onload_fd_stat()
Description: Retrieves internal details about an accelerated socket.
Definition: see above
Formal Parameters: see above
Return Value:
0 - socket is not accelerated
1 - socket is accelerated
-ENOMEM when memory cannot be allocated
Notes: When calling free() on stack_name use the (char *), because memory is allocated using malloc.

• onload_fd_check_feature()
Description: Used to check whether the Onload file descriptor supports a feature or not.
Definition: see above
Formal Parameters: see above
Return Value:
0 if the feature is supported
>0 if the fd is supported
-ENOSYS if onload_fd_check_feature() is not supported.
Notes: none.

• onload_thread_set_spin()
Description: For each thread, specify which operations should spin.
Definition: int onload_thread_set_spin(enum onload_spin_type type, unsigned spin)
Formal Parameters: type: which operation to change the spin status of.
Type must be one of the following:
enum onload_spin_type {
  ONLOAD_SPIN_ALL,
  ONLOAD_SPIN_UDP_RECV,
  ONLOAD_SPIN_UDP_SEND,
  ONLOAD_SPIN_TCP_RECV,
  ONLOAD_SPIN_TCP_SEND,
  ONLOAD_SPIN_TCP_ACCEPT,
  ONLOAD_SPIN_PIPE_RECV,
  ONLOAD_SPIN_PIPE_SEND,
  ONLOAD_SPIN_SELECT,
  ONLOAD_SPIN_POLL,
  ONLOAD_SPIN_PKT_WAIT,
  ONLOAD_SPIN_EPOLL_WAIT
};
spin: a boolean which indicates whether the operation should spin or not.
Return Value: Returns 0 on success, or -EINVAL if an unsupported type is specified.
Notes: Spin time (for all threads) is set using the EF_SPIN_USEC parameter.
Disable all sorts of spinning: onload_thread_set_spin(ONLOAD_SPIN_ALL, 0);
Enable all sorts of spinning: onload_thread_set_spin(ONLOAD_SPIN_ALL, 1);
The onload_thread_set_spin API can be used to control spinning on a per-thread or per-API basis. The existing spin-related configuration options set the default behaviour for threads, and the onload_thread_set_spin API overrides the default.
To enable spinning only for certain threads:
1. Set the spin timeout by setting EF_SPIN_USEC, and disable spinning by default by setting EF_POLL_USEC=0.
2. In each thread that should spin, invoke onload_thread_set_spin().
To disable spinning only in certain threads:
1. Enable spinning by setting EF_POLL_USEC.
2. In each thread that should not spin, invoke onload_thread_set_spin().
NOTE: If a thread is set to NOT spin and then blocks, this may invoke an interrupt for the whole stack. Interrupts occurring on moderately busy threads may cause unintended and undesirable consequences.

Stacks API

Using the Onload Extensions API an application can bind selected sockets to specific Onload stacks, and in this way ensure that time-critical sockets are not starved of resources by other non-critical sockets.
The API allows an application to select sockets which are to be accelerated, thus reserving Onload resources for performance-critical paths. This also prevents non-critical paths from creating jitter for critical paths.

• onload_set_stackname()
Description: Select the Onload stack that new sockets are placed in.
Definition: int onload_set_stackname(int who, int scope, const char *name)
Formal Parameters:
who: Must be one of the following:
ONLOAD_THIS_THREAD - modify the stack name in which all subsequent sockets are created by this thread.
ONLOAD_ALL_THREADS - modify the stack name in which all subsequent sockets are created by all threads in the current process.
ONLOAD_THIS_THREAD takes precedence over ONLOAD_ALL_THREADS.
scope: Must be one of the following:
ONLOAD_SCOPE_THREAD - name is scoped with the current thread
ONLOAD_SCOPE_PROCESS - name is scoped with the current process
ONLOAD_SCOPE_USER - name is scoped with the current user
ONLOAD_SCOPE_GLOBAL - name is global across all threads, users and processes
ONLOAD_SCOPE_NOCHANGE - undo the effect of a previous call to onload_set_stackname(ONLOAD_THIS_THREAD, ...), see notes.
name: the stack name, up to 8 characters; or an empty string to set no stackname; or the special value ONLOAD_DONT_ACCELERATE to prevent sockets created in this thread, user or process from being accelerated.
Sockets identified by the options above will belong to the Onload stack until a subsequent call using onload_set_stackname identifies a different stack or the ONLOAD_SCOPE_NOCHANGE option is used.
Return Value:
0 on success
-1 with errno set to ENAMETOOLONG if the name exceeds the permitted length
-1 with errno set to EINVAL if other parameters are invalid.
Notes:
1. This applies to stacks selected for sockets created by socket() and for pipe(); it has no effect on accept().
Passively opened sockets created via accept() will always be in the same stack as the listening socket that they are linked to. This means that the following are functionally identical, i.e.
onload_set_stackname(foo); socket; listen; onload_set_stackname(bar); accept
and
onload_set_stackname(foo); socket; listen; accept; onload_set_stackname(bar)
In both cases the listening socket and the accepted socket will be in stack foo.
2. Scope defines the namespace in which a stack belongs. A stackname of foo in scope user is not the same as a stackname of foo in scope thread. Scope restricts the visibility of a stack to either the current thread, current process, current user, or is unrestricted (global). This has the property that with, for example, process-based scoping, two processes can have the same stackname without sharing a stack - as the stack for each process has a different namespace.
3. Scoping can be thought of as adding a suffix to the supplied name, e.g.
ONLOAD_SCOPE_THREAD: -t
ONLOAD_SCOPE_PROCESS: -p
ONLOAD_SCOPE_USER: -u
ONLOAD_SCOPE_GLOBAL:
This is an example only and the implementation is free to do something different, such as maintaining different lists for different scopes.
4. ONLOAD_SCOPE_NOCHANGE will undo the effect of a previous call to onload_set_stackname(ONLOAD_THIS_THREAD, ...). If you have previously used onload_set_stackname(ONLOAD_THIS_THREAD, ...) and want to revert to the behaviour of threads that are using the ONLOAD_ALL_THREADS configuration, without changing that configuration, you can do the following:
onload_set_stackname(ONLOAD_ALL_THREADS, ONLOAD_SCOPE_NOCHANGE, "");
Related environment variables are:
• EF_DONT_ACCELERATE  Default: 0  Min: 0  Max: 1  Per-process
If this environment variable is set then acceleration for ALL sockets is disabled and handed off to the kernel stack until the application overrides this state with a call to onload_set_stackname().
• EF_STACK_PER_THREAD  Default: 0  Min: 0  Max: 1  Per-process
If this environment variable is set, each socket created by the application will be placed in a stack depending on the thread in which it is created. Stacks could, for example, be named using the thread ID of the thread that creates the stack, but this should not be relied upon. A call to onload_set_stackname overrides this variable. EF_DONT_ACCELERATE takes precedence over this variable.
• EF_NAME
The environment variable EF_NAME will be honoured to control Onload stack sharing. However, a call to onload_set_stackname overrides this variable, and EF_DONT_ACCELERATE and EF_STACK_PER_THREAD both take precedence over EF_NAME.

• onload_stackname_save()
Description: Save the state of the current Onload stack identified by the previous call to onload_set_stackname().
Definition: int onload_stackname_save(void)
Formal Parameters: none
Return Value:
0 on success
-ENOMEM when memory cannot be allocated.

• onload_stackname_restore()
Description: Restore stack state saved with a previous call to onload_stackname_save(). All updates/changes to the state of the current stack will be deleted and all state previously saved will be restored.
Definition: int onload_stackname_restore(void)
Formal Parameters: none
Return Value:
0 on success
non-zero if an error occurs.

The API stackname save and restore functions provide flexibility when binding sockets to an Onload stack. Using a combination of onload_set_stackname(), onload_stackname_save() and onload_stackname_restore(), the user is able to create default stack settings which apply to one or more sockets, save this state, and then create changed stack settings which are applied to other sockets. The original default settings can then be restored to apply to subsequent sockets.
Stacks API Usage
Using a combination of the EF_DONT_ACCELERATE environment variable and the function onload_set_stackname(), the user is able to control/select sockets which are to be accelerated, and isolate these performance-critical sockets and threads from the rest of the system.

• onload_stack_opt_set_int()
Description: Set/modify per-stack options that all subsequently created stacks will use instead of the existing global stack options.
Definition: int onload_stack_opt_set_int(const char* name, int64_t value)
Formal Parameters:
name: stack option to modify
value: new value for the stack option.
Example: onload_stack_opt_set_int("dont_accelerate", 1);
Return Value:
0 on success
-1 with errno set to EINVAL if the requested option is not found.
Notes: Cannot be used to modify options on existing stacks - only for new stacks. Cannot be used to modify process options - only stack options. Modified options will be used for all newly created stacks until onload_stack_opt_reset() is called.

• onload_stack_opt_reset()
Description: Revert to using global stack options for newly created stacks.
Definition: int onload_stack_opt_reset(void)
Formal Parameters: none.
Return Value: 0 always
Notes: Should be called following a call to onload_stack_opt_set_int() to revert to using global stack options for all newly created stacks.

Stacks API - Examples
• This thread will use stack foo; other threads will continue as before.
onload_set_stackname(ONLOAD_THIS_THREAD, ONLOAD_SCOPE_GLOBAL, "foo")
• All threads in this process will get their own stack called foo. This is equivalent to the EF_STACK_PER_THREAD environment variable.
onload_set_stackname(ONLOAD_ALL_THREADS, ONLOAD_SCOPE_THREAD, "foo")
• All threads in this process will share a stack called foo. If another process did the same function call it would get its own stack.
onload_set_stackname(ONLOAD_ALL_THREADS, ONLOAD_SCOPE_PROCESS, "foo")
• All threads in this process will share a stack called foo. If another process run by the same user did the same, it would share the same stack as the first process. If another process run by a different user did the same, it would get its own stack.
onload_set_stackname(ONLOAD_ALL_THREADS, ONLOAD_SCOPE_USER, "foo")
• Equivalent to EF_NAME. All threads will use a stack called foo which is shared by any other process which does the same.
onload_set_stackname(ONLOAD_ALL_THREADS, ONLOAD_SCOPE_GLOBAL, "foo")
• Equivalent to EF_DONT_ACCELERATE. New sockets/pipes will not be accelerated until another call to onload_set_stackname().
onload_set_stackname(ONLOAD_ALL_THREADS, ONLOAD_SCOPE_GLOBAL, ONLOAD_DONT_ACCELERATE)

Zero-Copy API

Zero-copy can improve the performance of networking applications by eliminating intermediate buffers when transferring data between application and network adapter. The Onload Extensions Zero-Copy API supports zero-copy of UDP received packet data and TCP transmit packet data. The API provides the following components:
• #include <onload/extensions_zc.h>
In addition to the common components, an application should include this header file which contains all function prototypes and constant values required when using the API. This file includes comprehensive documentation, required data structures and function definitions.

Zero-Copy Data Buffers
To avoid the copy, data is passed to and from the application in special buffers described by a struct onload_zc_iovec. A message or datagram can consist of multiple iovecs, using a struct onload_zc_msg. A single call to send may involve multiple messages, using an array of struct onload_zc_mmsg.
Figure 7: Zero-Copy Data Buffers

Zero-Copy UDP Receive Overview
Figure 8 illustrates the difference between the normal UDP receive mode and the zero-copy method. When using the standard POSIX socket calls, the adapter delivers packets to an Onload packet buffer which is described by a descriptor previously placed in the RX descriptor ring. When the application calls recv(), Onload copies the data from the packet buffer to an application-supplied buffer. Using the zero-copy UDP receive API, the application calls the onload_zc_recv() function, supplying a callback function which will be called when data is ready. The callback can directly access the data inside the Onload packet buffer, avoiding a copy.

Figure 8: Traditional vs. Zero-Copy UDP Receive

A single call to the onload_zc_recv() function can result in multiple datagrams being delivered to the callback function. Each time the callback returns to Onload the next datagram is delivered. Processing stops when the callback instructs Onload to cease delivery or there are no further received datagrams. If the receiving application does not need to look at all data received (i.e. is filtering), this can result in a considerable performance advantage because this data is not pulled into the processor's cache, thereby reducing the application cache footprint. As a general rule, the callback function should avoid calling other system calls which attempt to modify or close the current socket. Zero-copy UDP receive is implemented within the Onload Extensions API.

Zero-Copy UDP Receive
The onload_zc_recv() function specifies a callback to invoke for each received UDP datagram. The callback is invoked in the context of the call to onload_zc_recv() (i.e. it blocks/spins waiting for data).
Before calling, the application must set the following in the struct onload_zc_recv_args:
cb - set to the callback function pointer
user_ptr - set to point to application state; this is not touched by Onload
msg.msghdr.msg_control, msg_controllen, msg_name, msg_namelen - the application should set these to appropriate buffers and lengths (if required) as you would for recvmsg (or NULL and 0 if not used)
flags - set to indicate behaviour (e.g. ONLOAD_MSG_DONTWAIT)

Figure 9: Zero-Copy recv_args

The callback gets to examine the data, and can control what happens next: (i) whether or not the buffer(s) are kept by the callback or are immediately freed by Onload; and (ii) whether or not onload_zc_recv() will internally loop and invoke the callback with the next datagram, or immediately return to the application. The next action is determined by setting flags in the return code as follows:
ONLOAD_ZC_KEEP - the callback function can elect to retain ownership of received buffer(s) by returning ONLOAD_ZC_KEEP. Following this, the correct way to release retained buffers is to call onload_zc_release_buffers() to explicitly release the first buffer from each received datagram. Subsequent buffers pertaining to the same datagram will then be automatically released.
ONLOAD_ZC_CONTINUE - to suggest that Onload should loop and process more datagrams
ONLOAD_ZC_TERMINATE - to insist that Onload immediately return from the onload_zc_recv()
Flags can also be set by Onload:
ONLOAD_ZC_END_OF_BURST - Onload sets this flag to indicate that this is the last packet
ONLOAD_ZC_MSG_SHARED - packet buffers are read-only
If there is unaccelerated data on the socket from the kernel's receive path, this cannot be handled without copying. The application has two choices as follows:
ONLOAD_MSG_RECV_OS_INLINE - set this flag when calling onload_zc_recv().
Onload will deal with the kernel data internally and pass it to the callback.
check return code - check the return code from onload_zc_recv(). If it returns ENOTEMPTY then the application must call onload_recvmsg_kernel() to retrieve the kernel data.

Zero-Copy Receive Example #1
Figure 10: Zero-Copy Receive - example #1

Zero-Copy Receive Example #2
Figure 11: Zero-Copy Receive - example #2

NOTE: onload_zc_recv() should not be used together with onload_set_recv_filter(), and only supports accelerated (Onloaded) sockets. For example, when bound to a broadcast address the socket fd is handed off to the kernel and this function will return ESOCKTNOSUPPORT.

Zero-Copy TCP Send Overview
Figure 12 illustrates the difference between the normal TCP transmit method and the zero-copy method. When using standard POSIX socket calls, the application first creates the payload data in an application-allocated buffer before calling the send() function. Onload will copy the data to an Onload packet buffer in memory and post a descriptor to this buffer in the network adapter TX descriptor ring. Using the zero-copy TCP transmit API, the application calls the onload_zc_alloc_buffers() function to request buffers from Onload. A pointer to a packet buffer is returned in response. The application places the data to send directly into this buffer and then calls onload_zc_send() to indicate to Onload that data is available to send. Onload will post a descriptor for the packet buffer in the network adapter TX descriptor ring and ring the TX doorbell. The network adapter fetches the data for transmission.

Figure 12: Traditional vs. Zero-Copy TCP Transmit

NOTE: The socket used to allocate zero-copy buffers must be in the same stack as the socket used to send the buffers.
When using TCP loopback, Onload can move a socket from one stack to another. Users must ensure that they ALWAYS USE BUFFERS FROM THE CORRECT STACK.

NOTE: The onload_zc_send() function does not currently support the ONLOAD_MSG_MORE or TCP_CORK flags.

Zero-copy TCP transmit is implemented within the Onload Extensions API.

Zero-Copy TCP Send

The zero-copy send API supports the sending of multiple messages to different sockets in a single call. Data buffers must be allocated in advance; for best efficiency these should be allocated in blocks and off the critical path. The user should avoid simply moving the copy from Onload into the application, but where this is unavoidable, the copy should also be done off the critical path.

Figure 13: Zero-Copy send

Figure 14: Zero-Copy allocate buffers

The onload_zc_send() function return value identifies how many of the onload_zc_mmsg array's rc fields are set. Each onload_zc_mmsg.rc returns how many bytes were sent (or an error) for that message. Refer to the table below.

rc = onload_zc_send()
    rc < 0            Application error calling onload_zc_send(); rc is set to the error code.
    rc == 0           Should not happen.
    0 < rc <= n_msgs  rc is set to the number of messages whose status has been set in mmsgs[i].rc. rc == n_msgs is the normal case.

rc = mmsg[i].rc
    rc < 0            Error sending this message; rc is set to the error code.
    rc >= 0           rc is set to the number of bytes sent for this message. Compare with the message length to establish which buffers were sent.

Sent buffers are owned by Onload. Unsent buffers are owned by the application, and must be freed or reused to avoid leaking.

Zero-Copy Send - Single Message, Single Buffer

Figure 15: Zero-Copy - Single Message, Single Buffer Example

The example above demonstrates error code handling.
Note that it contains an example of bad practice, where buffers are allocated and populated on the critical path.

Zero-Copy Send - Multiple Messages, Multiple Buffers

Figure 16: Zero-Copy - Multiple Messages, Multiple Buffers Example

The example above demonstrates error code handling, and also contains some examples of bad practice where buffers are allocated and populated on the critical path.

Zero-Copy Send - Full Example

Figure 17: Zero-Copy Send

Receive Filtering API

Overview

The Onload Extensions Receive Filtering API allows user-defined filters to determine whether data received on a UDP socket should be discarded before it enters the socket receive buffer. Receive filtering provides an alternative to the onload_zc_recv() function described in the previous sections. It allows the application to examine the received data in place and direct Onload to discard a subset of the data if it is not required, thereby saving the overhead of copying unwanted data. Only the subset of datagrams that the application is interested in is copied into the application buffers.

Figure 18: UDP Receive Filtering

Receive filtering is implemented within the Onload Extensions API.

NOTE: An application using the Receive Filtering API can continue to use any POSIX function on the socket, e.g. select(), poll(), epoll_wait() or recv(), but must not use the onload_zc_recv() function.

Receive Filtering API

The Onload Extensions Receive Filtering API installs a user-defined filter that determines whether received data is passed to the receive function or discarded by Onload.
The API provides the following components:

• #include <onload/extensions_zc.h>
  In addition to the common components, an application should include this header file, which contains all function prototypes and constant values required when using the API. The file includes comprehensive documentation, required data structures and function definitions.

The Receive Filtering API is a variation on zero-copy receive whereby the normal socket methods are used for accessing the data, but the application can specify a filter that is run on each datagram before it is received. The filter can elect to discard, or even modify, the received message before it is copied to the application.

Figure 19: Receive Filter

The onload_set_recv_filter() function returns immediately, and the filter callback is executed in the context of subsequent calls to recv(), recvmsg() etc.

NOTE: When using receive filtering, calls to poll(), select() etc. may report that a socket is readable, but if the filter then discards the datagram, a subsequent call to recv() will not find the data and will block.

NOTE: The application should respect the ONLOAD_ZC_MSG_SHARED flag - this indicates that the datagram may be delivered to more than one socket and so should not be modified.

Receive Filter - Example

Figure 20: Receive Filter - Example

NOTE: The onload_set_recv_filter() function should NOT be used together with the onload_zc_recv() function.

Templated Sends

For a description of the templated sends feature, refer to Templated Sends on page 71. For a description of the packet template used by the templated sends feature, refer to the usage notes and references to onload_msg_template in the [onload]/src/include/onload/extensions_zc.h file included with the Onload 201310 distribution.
Appendix E: onload_stackdump

Introduction

The Solarflare onload_stackdump diagnostic utility is a component of the Onload distribution which can be used to monitor Onload performance, set tuning options and examine aspects of the system performance.

NOTE: To view data for all stacks, created by all users, the user must be root when running onload_stackdump. Non-root users can only view data for stacks created by themselves and accessible to them via the EF_SHARE_WITH variable.

The following examples of onload_stackdump are demonstrated elsewhere in this user guide:

• Monitoring Using onload_stackdump on page 30
• Processing at User-Level on page 30
• As Few Interrupts as Possible on page 31
• Eliminating Drops on page 32
• Minimizing Lock Contention on page 33

General Use

The onload_stackdump tool can produce an extensive range of data, and it is often more useful to limit output to specific stacks, or to specific aspects of system performance, for analysis purposes.

• For help, and to list all onload_stackdump commands and options:
  onload_stackdump -?
• To list and read environment variable descriptions:
  onload_stackdump doc
• For descriptions of statistics variables (describes all statistics listed by the onload_stackdump lots command):
  onload_stackdump describe_stats
• To identify all stacks, by identifier and name, and all processes accelerated by Onload:
  onload_stackdump
  #stack-id stack-name
  6         teststack
  pids: 28570
• To limit the command/option to a specific stack, e.g. stack 4:
  onload_stackdump 4 lots

List Onloaded Processes

The 'onload_stackdump processes' command will show the PID and name of processes being accelerated by Onload, and the Onload stack being used by each process, e.g.
# onload_stackdump processes
#pid   stack-id  cmdline
25587  3         ./sfnt-pingpong

Onloaded processes which have not created a socket are not displayed, but can be identified using the lsof command.

Identify Onloaded Process Affinities

The 'onload_stackdump affinities' command will identify the task affinity for an accelerated process, e.g.

# onload_stackdump affinities
pid=25587 cmdline=./sfnt-pingpong
task25587: 80

The task affinity is identified from an 8-bit mask, i.e. 01 is CPU core 0, 02 is CPU core 1, 80 is CPU core 7, etc.

List Onload Environment Variables

The 'onload_stackdump env' command will identify Onloaded processes running in the current environment and list all Onload variables set in that environment, e.g.

# EF_POLL_USEC=100000 EF_TXQ_SIZE=4096 EF_INT_DRIVEN=1 onload
# onload_stackdump env
pid: 25587
cmdline: ./sfnt-pingpong
env: EF_POLL_USEC=100000
env: EF_TXQ_SIZE=4096
env: EF_INT_DRIVEN=1

TX PIO Counters

The onload_stackdump utility exposes counters to indicate how often TX PIO is being used - see Programmed I/O on page 70. To view the PIO counters run the following command:

$ onload_stackdump stats | grep pio
pio_pkts: 2485971
no_pio_err: 0

The values returned identify the number of packets sent via PIO, and the number of times PIO was not used due to an error condition.

Removing Zombie and Orphan Stacks

Onload stacks and sockets can remain active even after all processes using them have been terminated or have exited, for example to ensure sent data is successfully received by the TCP peer, or to honour TCP TIME_WAIT semantics. Such stacks should always eventually self-destruct and disappear with no user intervention. However, these stacks can in some instances cause problems for re-starting applications; for example, the application may be unable to use the same port numbers while they are still being used by the persistent stack's sockets.
Persistent stacks also retain resources, such as packet buffers, which are then denied to other stacks. Such stacks are termed 'zombie' or 'orphan' stacks, and depending on circumstances their continued existence may or may not be desirable.

• To list all persistent stacks:
  # onload_stackdump -z all
  No output to the console or syslog means that no such stacks exist.
• To list a specific persistent stack:
  # onload_stackdump -z <stack-id>
• To display the state of persistent stacks:
  # onload_stackdump -z dump
• To terminate persistent stacks:
  # onload_stackdump -z kill
• To display all options available for zombie/orphan stacks:
  # onload_stackdump --help

Snapshot vs. Dynamic Views

The onload_stackdump tool presents a snapshot view of the system when invoked. To monitor state and variable changes while an application is running, use onload_stackdump with the Linux watch command, e.g.

snapshot: onload_stackdump netif
dynamic:  watch -d -n1 onload_stackdump netif

Some onload_stackdump commands also update periodically whilst monitoring a process. These commands usually have the watch_ prefix, e.g. watch_stats, watch_more_stats, watch_tcp_stats, watch_ip_stats etc. Use the onload_stackdump -h option to list all commands.

Monitoring Receive and Transmit Packet Buffers

$ onload_stackdump packets
ci_netif_pkt_dump_all: id=6
  pkt_bufs: size=2048 max=32768 alloc=576 free=50 async=0
  pkt_bufs: rx=525 rx_ring=522 rx_queued=3
  pkt_bufs: tx=1 tx_ring=0 tx_oflow=0 tx_other=1
  509: 0x8000 Rx
  1: 0x4000 Nonb
  n_zero_refs=66 n_freepkts=50 estimated_free_nonb=16
  free_nonb=0 nonb_pkt_pool=a39ffff
$

The onload_stackdump packets command can be useful to review packet buffer allocation, use and reuse within a monitored process. The example above identifies that the process has a maximum of 32768 buffers (each of 2048 bytes) available.
From this pool, 576 buffers have been allocated, and 50 of those are currently free for reuse - that is, they can be pushed onto the receive or transmit ring buffers ready to accept new incoming or outgoing data.

On the receive side of the stack, 525 packet buffers have been allocated: 522 have been pushed to the receive ring and are available for incoming packets, and 3 are currently in the receive queue for the application to process.

On the transmit side of the stack, only 1 packet buffer is currently allocated; because it is not currently in the transmit ring and is not in an overflow buffer, it is counted as tx_other. The remaining values are calculations based on the packet buffer values.

TCP Application Stats

The following onload_stackdump commands can be used to monitor accelerated TCP connections:

onload_stackdump tcp_stats

Field                   Description
tcp_active_opens        Number of socket connections initiated by the local end
tcp_passive_opens       Number of socket connections accepted by the local end
tcp_attempt_fails       Number of failed connection attempts
tcp_estab_resets        Number of established connections which were subsequently reset
tcp_curr_estab          Number of socket connections in the established or close_wait states
tcp_in_segs             Total number of received segments - includes errored segments
tcp_out_segs            Total number of transmitted segments - excluding segments containing only retransmitted octets
tcp_retran_segs         Total number of retransmitted segments
tcp_in_errs             Total number of segments received with errors
tcp_out_rsts            Number of reset segments sent

onload_stackdump more_stats | grep tcp

Field                   Description
tcp_has_recvq           Non-zero if the receive queue has data ready
tcp_recvq_bytes         Total bytes in the receive queue
tcp_recvq_pkts          Total packets in the receive queue
tcp_has_recv_reorder    Non-zero if the socket has out-of-sequence bytes
tcp_recv_reorder_pkts   Number of out-of-sequence packets received
tcp_has_sendq           Non-zero if the send queues have data ready
tcp_sendq_bytes         Number of bytes currently in all send queues for this connection
tcp_sendq_pkts          Number of packets currently in all send queues for this connection
tcp_has_inflight        Non-zero if some data remains unacknowledged
tcp_inflight_bytes      Total number of unacknowledged bytes
tcp_inflight_pkts       Total number of unacknowledged packets
tcp_n_in_listenq        Number of sockets (summed across all listening sockets) where the local end has responded to a SYN with a SYN-ACK, but this has not yet been acknowledged by the remote end
tcp_n_in_acceptq        Number of sockets (summed across all listening sockets) that are currently queued waiting for the local application to call accept()

Use the onload_stackdump -h command to list all TCP connection, stack and socket commands.

The onload_stackdump lots Command

The onload_stackdump lots command will produce extensive data for all accelerated stacks and sockets. The command can also be restricted to a specific stack and its associated connections when the stack number is entered on the command line, e.g.

onload_stackdump lots
onload_stackdump 2 lots

For descriptions of the statistics, refer to the output from the following command:

onload_stackdump describe_stats

The following tables describe the output from the onload_stackdump lots command for:

• TCP stack
• TCP established connection socket
• TCP listening socket
• UDP socket

Within the tables the following abbreviations are used:

• rx = receive (or receiver), tx = transmit (or send)
• pkts = packets, skts = sockets
• max = maximum, num = number, seq = sequence number

Stackdump Output: TCP Stack

onload_stackdump lots
    Command entered.
ci_netif_dump: stack=7 name=
    Stack id and stack name as set by EF_NAME.
ver=201310 uid=0 pid=21098
    Onload version, and the user id and process id of the creator process.
lock=20000000 LOCKED
    Internal stack lock status.
nics=3 primed=1
    nics is a bitfield identifying the adapters used by this stack, e.g. 3 = binary 11, so the stack is using NICs 1 and 2. primed=1 means the event queue will generate an interrupt when the next event arrives.
sock_bufs: max=1024 n_allocated=4
    Maximum number of socket buffers which can be allocated, and the number currently in use. Socket buffers are also used by pipes.
pkt_bufs: size=2048 max=32768 alloc=576 free=57 async=0
    Packet buffers: a total of 32768 packet buffers (each of 2048 bytes) are available to this stack. 576 have been allocated, of which 57 are free and can be reused by either the receive or transmit rings. async counts buffers that are not free, not being used and not being reaped - i.e. in a state waiting to be returned for reuse.
pkt_bufs: rx=517 rx_ring=514 rx_queued=3
    Receive packet buffers: a total of 517 packet buffers are currently in use; 514 have been pushed to the receive ring and are available for incoming packets; 3 are in the application's receive queue. If the CRITICAL flag is displayed it indicates a memory pressure condition in which the number of packets in the receive socket buffers (rx=517) is approaching the EF_MAX_RX_PACKETS value. If the LOW flag is displayed it indicates a memory pressure condition where there are not enough packet buffers available to refill the RX descriptor ring.
pkt_bufs: tx=2 tx_ring=1 tx_oflow=0 tx_other=1
    Transmit packet buffers: a total of 2 packet buffers are currently in use, 1 remains in the transmit ring, and 0 buffers have overflowed. tx_other counts packet buffers not in use by transmit and not in the tx_ring or tx_oflow queues.
time: netif=5eb5c61 poll=5eb5c61 now=5eb5c61 (diff=0.000sec)
    Internal timer values.
ci_netif_dump_vi: stack=7 intf=0 vi_instance=87 hw=0C0
    Data describing the stack's virtual interface to the NIC.
evq: cap=2048 current=16de30 is_32_evs=0 is_ev=0
    Event queue data: cap = maximum number of events the queue can hold; current = current event queue location; is_32_evs = 1 if there are 32 or more events pending; is_ev = 1 if there are any events pending.
rxq: cap=511 lim=511 spc=1 level=510 total_desc=93666
    Receive queue data: cap = total capacity; lim = maximum fill level for the receive descriptor ring, specified by EF_RXQ_LIMIT; spc = amount of free space in the receive queue, i.e. how many descriptors could be added before the receive queue becomes full; level = how full the receive queue currently is; total_desc = total number of descriptors that have been pushed to the receive queue.
txq: cap=511 lim=511 spc=511 level=0 pkts=0 oflow_pkts=0
    Transmit queue data: cap = total capacity; lim = maximum fill level for the transmit descriptor ring, specified by EF_TXQ_LIMIT; spc = amount of free space in the transmit queue, i.e. how many descriptors could be added before the transmit queue becomes full; level = how full the transmit queue currently is; pkts = how many packets are represented by the descriptors in the transmit queue; oflow_pkts = how many packets are in the overflow transmit queue (i.e. waiting for space in the NIC's transmit queue).
txq: tot_pkts=93669 bytes=0
    Total number of packets sent, and the number of packet bytes currently in the queue.
ci_netif_dump_extra: stack=7
    Additional data follows.
in_poll=0 post_poll_list_empty=1 poll_did_wake=0
    Stack polling status (1 = true, 0 = false): in_poll = the process is currently polling; post_poll_list_empty = there are no tasks to be done once polling is complete; poll_did_wake = while polling, the process identified a socket which needs to be woken following the poll.
rx_defrag_head=-1 rx_defrag_tail=-1
    Reassembly sequence numbers.
    A value of -1 means no reassembly has occurred.
tx_tcp_may_alloc=1 nonb_pool=1 send_may_poll=0 is_spinner=0
    TCP buffer data: tx_tcp_may_alloc = number of packet buffers TCP could use; nonb_pool = number of packet buffers available to the TCP process without holding the lock; send_may_poll=0; is_spinner = TRUE if a thread is spinning.
hwport_to_intf_i=0,-1,-1,-1,-1,-1 intf_i_to_hwport=0,0,0,0,0,0
    Internal mapping of internal interfaces to hardware ports.
uk_intf_ver=03e89aa26d20b98fd08793e771f2cdd9
    md5 user/kernel interface checksum, computed by both the kernel and the user application to verify internal data structures.
ci_netif_dump_reap_list: stack=7
7:2 7:1
    Identifies sockets that have buffers which can be freed, e.g. 7:2 = stack 7, socket 2.

Stackdump Output: TCP Established Connection Socket

TCP 7:1 lcl=192.168.1.2:50773 rmt=192.168.1.1:34875 ESTABLISHED
    Socket configuration: stack:socket id, local and remote ip:port addresses; the TCP connection is ESTABLISHED.
lock: 10000000 UNLOCKED
    Internal stack lock status.
rx_wake=0000b6f4(RQ) tx_wake=00000002 flags:
    Internal sequence values that are incremented each time a queue is 'woken'.
addr_spc_id=fffffffffffffffe s_flags: REUSE BOUND
    Address space identifier in which this socket exists, and flags set on the socket. REUSE allows bind to reuse local addresses.
rcvbuf=129940 sndbuf=131072 rx_errno=0 tx_errno=0 so_error=0
    Socket receive buffer size and send buffer size. rx_errno = zero whilst data can still arrive, otherwise contains an error code. tx_errno = zero if transmit can still happen, otherwise contains an error code. so_error = current socket error (0 = no error).
tcpflags: TSO WSCL SACK ESTAB
    TCP flags currently set for this socket.
TCP state: ESTABLISHED
    State of the TCP connection.
snd: up=b554bb86 una-nxt-max=b554bb86-b554bb87-b556b6a6 enq=b554bb87
    TCP sequence numbers.
    up = (urgent pointer) sequence number of the byte following the OOB byte; una-nxt-max = sequence number of the first unacknowledged byte, sequence number of the next byte we expect to be acknowledged, and sequence number of the last byte in the current send window; enq = sequence number of the last byte currently queued for transmit.
send=0(0) pre=0 inflight=1(1) wnd=129824 unused=129823
    Send data: send = number of pkts(bytes) sent; pre = number of pkts in the pre-send queue - a process can add data to the prequeue when it is prevented from sending the data immediately, and the data will be sent when the current sending operation is complete; inflight = number of pkts(bytes) sent but not yet acknowledged; wnd = receiver's advertised window size (bytes); unused = free space (bytes) in that window.
snd: cwnd=49733+0 used=0 ssthresh=65535 bytes_acked=0 OPEN
    Congestion window: cwnd = congestion window size (bytes); used = portion of the cwnd currently in use; ssthresh (slow start threshold) = number of bytes that have to be sent before the process can exit slow start; bytes_acked = number of bytes acknowledged - this value is used to calculate the rate at which the congestion window is opened. Current cwnd status = OPEN.
snd: Onloaded(Valid) if=6 mtu=1500 intf_i=0 vlan=0 encap=4
    Onloaded = the destination is reachable via an accelerated interface. (Valid) = cached control plane information is up-to-date; sends can happen immediately using this information. (Old) = cached control plane information may be out-of-date; on the next send Onload will do a control plane lookup, which will add some latency.
rcv: nxt-max=0e9251fe-0e944d1d current=0e944d92 FASTSTART FAST
    Receiver data: nxt-max = next byte we expect to receive, and last byte we expect to receive (because of the window size); current = byte currently being processed.
rob_n=0 recv1_n=2 recv2_n=0 wnd adv=129823 cur=129940 usr=0
    Reorder buffer.
    Bytes received out of sequence are put into a reorder buffer to await further bytes before reordering can occur. rob_n = num of bytes in the reorder buffer; recv1_n = num of bytes in the general reorder buffer; recv2_n = num of bytes in the urgent data reorder buffer; wnd adv = receiver advertised window size; cur = current receive window size; usr = current TCP stack user.
async: rx_put=-1 rx_get=-1 tx_head=-1
    Asynchronous queue data.
eff_mss=1448 smss=1460 amss=1460 used_bufs=2 uid=0 wscl s=1 r=1
    Maximum segment size: eff_mss = effective MSS; smss = sender MSS; amss = advertised MSS; used_bufs = number of transmit buffers used; uid = user id that created this socket (0 = root); wscl s/r = parameters to the window scaling algorithm.
srtt=01 rttvar=000 rto=189 zwins=0,0
    Round trip time (RTT) - all values are timer ticks: srtt = smoothed RTT value; rttvar = RTT variation; rto = current RTO timeout value; zwins = zero windows, i.e. times when the advertised window has gone to zero size.
retrans=0 dupacks=0 rtos=0 frecs=0 seqerr=0 ooo_pkts=0 ooo=0
    Retransmissions: retrans = number of retransmissions which have occurred; dupacks = number of duplicate ACKs received; rtos = number of retransmission timeouts; frecs = number of fast recoveries; seqerr = number of sequence errors; ooo_pkts = number of out-of-sequence pkts; ooo = number of out-of-order events.
timers:
    Currently active timers.
tx_nomac
    Number of TCP packets sent via the OS using raw sockets when up-to-date ARP data is not available.

Stackdump Output: TCP Listen Socket

TCP 7:3 lcl=0.0.0.0:50773 rmt=0.0.0.0:0 LISTEN
    Socket configuration:
    stack:socket id; a LISTENING socket on port 50773. The local and remote addresses are not set - the socket is not bound to any IP address.
lock: 10000000 UNLOCKED
    Internal stack lock status.
rx_wake=00000000 tx_wake=00000000 flags:
    Internal sequence values that are incremented each time a queue is 'woken'.
addr_spc_id=fffffffffffffffe s_flags: REUSE BOUND PBOUND
    Address space identifier in which this socket exists, and flags set on the socket. REUSE allows bind to reuse the local port.
rcvbuf=129940 sndbuf=131072 rx_errno=6b tx_errno=20 so_error=0
    Socket receive buffer size and send buffer size. rx_errno = zero whilst data can still arrive, otherwise contains an error code. tx_errno = zero if transmit can still happen, otherwise contains an error code. so_error = current socket error (0 = no error).
tcpflags: WSCL SACK
    Flags advertised during the handshake.
listenq: max=1024 n=0
    Listen queue: the queue of half-open connections (SYN received and SYN-ACK sent - waiting for the final ACK). n = number of connections in the queue.
acceptq: max=5 n=0 get=-1 put=-1 total=0
    Accept queue: the queue of open connections waiting for the application to call accept(). max = maximum number of connections that can exist in the queue; n = current number of connections; get/put = indexes for queue access; total = number of connections that have traversed this queue.
epcache: n=0 cache=EMPTY pending=EMPTY
    Endpoint cache:
    n = number of endpoints currently known to this socket; cache = EMPTY, or yes if endpoints are still cached; pending = EMPTY, or yes if endpoints still have to be cached.
defer_accept=0
    Number of times TCP_DEFER_ACCEPT kicked in - see TCP socket options.
l_overflow=0 l_no_synrecv=0 a_overflow=0 a_no_sock=0 ack_rsts=0 os=2
    l_overflow = number of times the listen queue was full and a SYN request had to be rejected; l_no_synrecv = number of times it was not possible to allocate an internal resource for a SYN request; a_overflow = number of times a connection could not be promoted to the accept queue because it was full; a_no_sock = number of times a socket could not be created; ack_rsts = number of times an ACK was received before a SYN, so the connection was reset; os=2 means there are 2 sockets being processed in the kernel.

Stackdump Output: UDP Socket

UDP 4:1 lcl=192.168.1.2:38142 rmt=192.168.1.1:42638
    UDP socket configuration: stack:socket id; a UDP socket on port 38142; local and remote addresses and ports.
lock: 20000000 LOCKED
    Stack internal lock status.
rx_wake=000e69b0 tx_wake=000e69b1 flags:
    Internal sequence values that are incremented each time a queue is 'woken'.
addr_spc_id=fffffffffffffffe s_flags: REUSE
    Address space identifier in which this socket exists, and flags set on the socket. REUSE allows bind to reuse local addresses.
rcvbuf=129024 sndbuf=129024 rx_errno=0 tx_errno=0 so_error=0
    Buffers: socket receive buffer size and send buffer size. rx_errno = zero whilst data can still arrive, otherwise contains an error code. tx_errno = zero if transmit can still happen, otherwise contains an error code. so_error = current socket error (0 = no error).
udpflags: FILT MCAST_LOOP RXOS
    Flags set on the UDP socket.
mcast_snd: intf=-1 ifindex=0 saddr=0.0.0.0 ttl=1 mtu=1500
    Multicast: intf = multicast hardware port id (-1 means the port was not set); ifindex = interface (port) identifier; saddr = IP address; ttl = time to live (the default for multicast is 1); mtu = maximum transmission unit size.
rcv: q_bytes=0 q_pkts=0 reap=2 tot_bytes=30225920 tot_pkts=944560
    Receive queue: q_bytes = num bytes currently in the rx queue; q_pkts = num pkts currently in the rx queue; tot_bytes = total bytes received; tot_pkts = total pkts received.
rcv: oflow=0(0%) drop=0 eagain=0 pktinfo=0 q_max=0
    Overflow buffer: oflow = number of datagrams in the overflow queue when the socket buffer is full; drop = number of datagrams dropped due to running out of packet buffer memory; eagain = number of times the application tried to read from the socket when no data was ready - this value can be ignored on the rcv side; pktinfo = number of times an IP_PKTINFO control message was received; q_max = maximum depth reached by the receive queue (bytes).
rcv: os=0(0%) os_slow=0 os_error=0
    Number of datagrams received via: os = the operating system; os_slow = the operating system slow path; os_error = a recv() function call via the OS that returned an error.
snd: q=0+0 ul=944561 os=0(0%) os_slow=0(0%)
    Send values: q = number of bytes sent to the interface but not yet transmitted; ul = number of datagrams sent via Onload; os = number of datagrams sent via the OS; os_slow = number of datagrams sent via the OS slow path.
snd: cp_match=0(0%)
    Unconnected UDP send: cp_match = number of datagrams sent via the accelerated path, and the percentage this is of all unconnected send datagrams.
snd: lk_poll=0(0%) lk_pkt=944561(100%) lk_snd=0(0%)
    Stack internal lock:
    lk_poll = number of times the lock was held while polling the stack; lk_pkt = number of pkts sent while holding the lock; lk_snd = number of times the lock was held while sending data.
snd: lk_defer=0(0%) cached_daddr=0.0.0.0
    Sending deferred to the process/thread currently holding the lock.
snd: eagain=0 spin=0 block=0
    eagain = count of the number of times the application tried to send data but the transmit queue was already full - a high value on the send side may indicate transmit issues; spin = number of times the process had to spin when the send queue was full; block = number of times the process had to block when the send queue was full.
snd: poll_avoids_full=0 fragments=0 confirm=0
    poll_avoids_full = number of times polling created space in the send queue; fragments = number of (non-first) fragments sent; confirm = number of datagrams sent with the MSG_CONFIRM flag.
snd: os_late=0 unconnect_late=0
    os_late = number of pkts sent via the OS after copying; unconnect_late = number of pkts silently dropped when the process/thread becomes disconnected during a send procedure.

Following the stack and socket data, onload_stackdump lots will display a list of statistical data. For descriptions of the fields refer to the output from the following command:

onload_stackdump describe_stats

The final list produced by onload_stackdump lots shows the current values of all environment variables in the monitored process environment. For descriptions of the environment variables refer to Appendix A: Parameter Reference on page 93, or use the onload_stackdump doc command.

Appendix F: Solarflare sfnettest

Introduction

Solarflare sfnettest is a set of benchmark tools and test utilities supplied by Solarflare for benchmark and performance testing of network servers and network adapters.
sfnettest is available in binary and source forms from:

http://www.openonload.org/

Download the sfnettest-<version>.tgz source file and unpack it using the tar command:

tar -zxvf sfnettest-<version>.tgz

Run the make utility from the sfnettest-<version>/src subdirectory to build the benchmark applications. Refer to the README.sfnt-pingpong or README.sfnt-stream files in the distribution directory once sfnettest is installed.

sfnt-pingpong

Description: The sfnt-pingpong application measures TCP and UDP latency by creating a single socket between two servers and running a simple message pattern between them. The output identifies latency and statistics for increasing TCP/UDP packet sizes.

Usage:

sfnt-pingpong [options] [<protocol> [<host>]]

Options:

Option          Description
--port          server port
--sizes         single message size (bytes)
--connect       connect() UDP socket
--spin          spin on non-blocking recv()
--muxer         select, poll or epoll
--serv-muxer    none, select, poll or epoll (same as client by default)
--rtt           report round-trip time
--raw           dump raw results to files
--percentile    percentile
--minmsg        minimum message size
--maxmsg        maximum message size
--minms         min time per msg size (ms)
--maxms         max time per msg size (ms)
--miniter       minimum iterations for result
--maxiter       maximum iterations for result
--mcast         use multicast addressing
--mcastintf     set the multicast interface. The client sends this parameter to the server.
               --mcastintf=eth2  both client and server use eth2
               --mcastintf=’eth2;eth3’  client uses eth2 and server uses eth3 (quotes are required for this format)
--mcastloop    IP_MULTICAST_LOOP
--bindtodev    SO_BINDTODEVICE
--forkboth     fork client and server
--n-pipe       include pipes in file descriptor set
--n-unix-d     include unix datagrams in the file descriptor set
--n-unix-s     include unix streams in the file descriptor set
--n-udp        include UDP sockets in file descriptor set
--n-tcpc       include TCP sockets in file descriptor set
--n-tcpl       include TCP listening sockets in file descriptor set
--tcp-serv     host:port for TCP connections
--timeout      socket SND/RECV timeout
--affinity     ' ; ' Enclose values in quotes. This option should be set on the client side only. The client sends the value to the server. The user must ensure that the identified server core is available on the server machine. This option will override any value set by taskset on the same command line.
--n-pings      number of ping messages
--n-pongs      number of pong messages
--nodelay      enable TCP_NODELAY

Standard options:

-? --help      this message
-q --quiet     quiet
-v --verbose   display more information

Example TCP latency command lines:

[root@server]# onload --profile=latency taskset -c 1 ./sfnt-pingpong
[root@client]# onload --profile=latency taskset -c 1 ./sfnt-pingpong --maxms=10000 --affinity "1;1" tcp

Example UDP latency command lines:

[root@server]# onload --profile=latency taskset -c 9 ./sfnt-pingpong
[root@client]# onload --profile=latency taskset -c 9 ./sfnt-pingpong --maxms=10000 --affinity "9;9" udp

Example output:

# version: 1.4.0-modified
# src: 13b27e6b86132da11b727fbe552e2293
# date: Sat Apr 21 11:56:22 BST 2012
# uname: Linux server4.uk.level5networks.com 2.6.32-220.el6.x86_64 #1 SMP Wed Nov 9 08:03:13 EST 2011 x86_64 x86_64 x86_64 GNU/Linux
# cpu: model name : Intel(R) Xeon(R) CPU E5-2687W 0 @ 3.10GHz
# lspci: 05:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
# lspci: 05:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
# lspci: 83:00.0 Ethernet controller: Solarflare Communications SFC9020 [Solarstorm]
# lspci: 83:00.1 Ethernet controller: Solarflare Communications SFC9020 [Solarstorm]
# lspci: 85:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
# eth0: driver: igb
# eth0: version: 3.0.6-k
# eth0: bus-info: 0000:05:00.0
# eth1: driver: igb
# eth1: version: 3.0.6-k
# eth1: bus-info: 0000:05:00.1
# eth2: driver: sfc
# eth2: version: 3.2.1.6083
# eth2: bus-info: 0000:83:00.0
# eth3: driver: sfc
# eth3: version: 3.2.1.6083
# eth3: bus-info: 0000:83:00.1
# eth4: driver: e1000e
# eth4: version: 1.4.4-k
# eth4: bus-info: 0000:85:00.0
# virbr0: driver: bridge
# virbr0: version: 2.3
# virbr0: bus-info: N/A
# virbr0-nic: driver: tun
# virbr0-nic: version: 1.6
# virbr0-nic: bus-info: tap
# ram: MemTotal: 32959748 kB
# tsc_hz: 3099966880
# LD_PRELOAD=libonload.so
# server LD_PRELOAD=libonload.so
# onload_version=201205
# EF_TCP_FASTSTART_INIT=0
# EF_POLL_USEC=100000
# EF_TCP_FASTSTART_IDLE=0

size  mean  min   median  max    %ile  stddev  iter
1     2453  2380  2434    18288  2669  77      1000000
2     2453  2379  2435    45109  2616  90      1000000
4     2467  2380  2436    10502  2730  82      1000000
8     2465  2383  2446    8798   2642  70      1000000
16    2460  2380  2441    7494   2632  68      1000000
32    2474  2399  2454    8758   2677  71      1000000
64    2495  2419  2474    12174  2716  77      1000000

The output identifies mean, minimum, median and maximum (nanosecond) RTT/2 latency for increasing packet sizes, including the 99th percentile and standard deviation of these results. A message size of 32 bytes has a mean latency of 2.4 microseconds, with a 99%ile latency of less than 2.7 microseconds.

sfnt-stream

The sfnt-stream application measures RTT latency (not 1/2 RTT) for a fixed size message at increasing message rates. Latency is calculated from a sample of all messages sent. Message rates can be set with the --rates option, and the number of messages to sample with the --samples option. Solarflare sfnt-stream only functions on UDP sockets. This limitation will be removed to support other protocols in the future. Refer to the README.sfnt-stream file, which is part of the Onload distribution, for further information.

Usage: sfnt-stream [options] [tcp|udp|pipe|unix_stream|unix_datagram [host[:port]]]

Options:

--msgsize      message size (bytes)
--rates        msg rates - [+ ]
--millisec     time per test (milliseconds)
--samples      number of samples per test
--stop         stop when TX rate achieved is below given percentage of target rate
--maxburst     maximum burst length
--port         server port number
--connect      connect() UDP socket
--spin         spin on non-blocking recv()
--muxer        select, poll, epoll or none
--rtt          report round-trip-time
--raw          dump raw results to file
--percentile   percentile
--mcast        set the multicast address
--mcastintf    set multicast interface. The client sends this parameter to the server.
               --mcastintf=eth2  both client and server use eth2
               --mcastintf=’eth2;eth3’  client uses eth2 and server uses eth3 (quotes are required for this format)
--mcastloop    IP_MULTICAST_LOOP
--ttl          IP_TTL and IP_MULTICAST_TTL
--bindtodevice SO_BINDTODEVICE
--n-pipe       include pipes in file descriptor set
--n-unix-d     include unix datagrams in file descriptor set
--n-unix-s     include unix streams in file descriptor set
--n-udp        include UDP sockets in file descriptor set
--n-tcpc       include TCP sockets in file descriptor set
--n-tcpl       include TCP listening sockets in file descriptor set
--tcpc-serv    host:port for TCP connections
--nodelay      enable TCP_NODELAY
--affinity     " , ; " Enclose the values in double quotes e.g. "4,5;3". This option should be set on the client side only. The client sends the value to the server. The user must ensure that the identified server core is available on the server machine. This option will override any value set by taskset on the same command line.
--rtt-iter     iterations for RTT measurement

Standard options:

-? --help      this message
-q --quiet     quiet
-v --verbose   display more information
--version      display version information

Example command lines (client/server):

# ./sfnt-stream (server)
# ./sfnt-stream --affinity 1,1 udp (client)
# taskset -c 1 ./sfnt-stream --affinity="3,5;3" --mcastintf=eth4 udp (client)

Bonded Interfaces: sfnt-stream

The following example configures a single bond, with two slave interfaces, on each machine. Both client and server machines use eth4 and eth5.
Client Configuration

[root@client src]# ifconfig eth4 0.0.0.0 down
[root@client src]# ifconfig eth5 0.0.0.0 down
[root@client src]# modprobe bonding miimon=100 mode=1 xmit_hash_policy=layer2 primary=eth5
[root@client src]# ifconfig bond0 up
[root@client src]# echo +eth4 > /sys/class/net/bond0/bonding/slaves
[root@client src]# echo +eth5 > /sys/class/net/bond0/bonding/slaves
[root@client src]# ifconfig bond0 172.16.136.27/21
[root@client src]# onload --profile=latency taskset -c 3 ./sfnt-stream
sfnt-stream: server: waiting for client to connect...
sfnt-stream: server: client connected
sfnt-stream: server: client 0 at 172.16.136.28:45037

Server Configuration

[root@server src]# ifconfig eth4 0.0.0.0 down
[root@server src]# ifconfig eth5 0.0.0.0 down
[root@server src]# modprobe bonding miimon=100 mode=1 xmit_hash_policy=layer2 primary=eth5
[root@server src]# ifconfig bond0 up
[root@server src]# echo +eth4 > /sys/class/net/bond0/bonding/slaves
[root@server src]# echo +eth5 > /sys/class/net/bond0/bonding/slaves
[root@server src]# ifconfig bond0 172.16.136.28/21

NOTE: the server sends to the IP address of the client bond.

[root@server src]# onload --profile=latency taskset -c 1 ./sfnt-stream --mcastintf=bond0 --affinity "1,1;3" udp 172.16.136.27

Output Fields:

All time measurements are nanoseconds unless otherwise stated.
Field            Description
mps target       Msg per sec target rate
mps send         Msg per sec actual rate
mps recv         Msg receive rate
latency mean     RTT mean latency
latency min      RTT minimum latency
latency median   RTT median latency
latency max      RTT maximum latency
latency %ile     RTT 99%ile latency
latency stddev   Standard deviation of sample
latency samples  Number of messages used to calculate latency measurement
sendjit mean     Mean variance when sending messages
sendjit min      Minimum variance when sending messages
sendjit max      Maximum variance when sending messages
sendjit behind   Number of times the sender falls behind and is unable to keep up with the transmit rate
gaps n_gaps      Count of the number of gaps appearing in the stream
gaps n_drops     Count of the number of drops from the stream
gaps n_ooo       Count of the number of sequence numbers received out of order

Appendix G: onload_tcpdump

Introduction

By definition, Onload is a kernel bypass technology, and this prevents packets from being captured by packet sniffing applications such as tcpdump, netstat and wireshark. Onload therefore provides the onload_tcpdump application, which supports packet capture from Onload stacks to a file or to standard out (stdout). Packet capture files produced by onload_tcpdump can then be imported into the regular tcpdump, wireshark or other third-party applications, where users can take advantage of dedicated search and analysis features.

onload_tcpdump allows for the capture of all TCP and UDP unicast and multicast data sent or received via Onload stacks - including shared stacks.

Building onload_tcpdump

The onload_tcpdump script is supplied with the Onload distribution and is located in the Onload /scripts sub-directory.

NOTE: libpcap and libpcap-devel must be built and installed BEFORE Onload is installed.
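onload_tcpdump combines -o stack selectors with ordinary tcpdump options passed straight through (see Using onload_tcpdump below). A small sketch of assembling such a command line, useful when scripting captures; the helper function is ours, not part of the Onload distribution:

```python
def tcpdump_cmd(stacks=(), *tcpdump_args):
    """Build an onload_tcpdump argv: each stack id or shell-like name
    pattern gets its own -o option; remaining arguments are passed
    through unchanged to tcpdump (e.g. -i, -w)."""
    cmd = ["onload_tcpdump"]
    for stack in stacks:              # e.g. 1, "stack2", "ab*"
        cmd += ["-o", str(stack)]
    cmd += list(tcpdump_args)
    return cmd

# Capture stacks 1 and "ab*" on eth2 into mycaps.pcap:
print(tcpdump_cmd([1, "ab*"], "-ieth2", "-wmycaps.pcap"))
```

The resulting list can be handed to a process launcher such as subprocess.run on a host where onload_tcpdump is installed.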
Using onload_tcpdump

For help use the ./onload_tcpdump -h command:

Usage: onload_tcpdump [-o stack-(id|name) [-o stack ...]] tcpdump_options_and_parameters
"man tcpdump" for details on tcpdump parameters.
You may use a stack id number or a shell-like pattern for the stack name to specify the Onload stacks to listen on. If you do not specify stacks, onload_tcpdump will monitor all Onload stacks. If you do not specify an interface via the -i option, onload_tcpdump listens on ALL interfaces instead of the first one.

For further information refer to the Linux man tcpdump pages.

Examples:

• Capture all accelerated traffic from eth2 to a file called mycaps.pcap:
  # onload_tcpdump -ieth2 -wmycaps.pcap
• If no file is specified onload_tcpdump will direct output to stdout:
  # onload_tcpdump -ieth2
• To capture accelerated traffic for a specific Onload stack (by name):
  # onload_tcpdump -ieth4 -o stackname
• To capture accelerated traffic for a specific Onload stack (by ID):
  # onload_tcpdump -o 7
• To capture accelerated traffic for Onload stacks whose name begins with "abc":
  # onload_tcpdump -o ’abc*’
• To capture accelerated traffic for Onload stack 1, the stack named "stack2" and all Onload stacks with a name beginning with "ab":
  # onload_tcpdump -o 1 -o ’stack2’ -o ’ab*’

Dependencies

The onload_tcpdump application requires libpcap and libpcap-devel to be installed on the server. If libpcap is not installed, the following message is reported when onload_tcpdump is invoked:

./onload_tcpdump ci
Onload was compiled without libpcap development package installed. You need to install libpcap-devel or libpcap-dev package to run onload_tcpdump.
tcpdump: truncated dump file; tried to read 24 file header bytes, only got 0
Hangup

If libpcap is missing it can be downloaded from http://www.tcpdump.org/ Untar the compressed file on the target server and follow the build instructions in the INSTALL.txt file.
The libpcap package must be installed before Onload is built and installed.

Limitations

• Currently onload_tcpdump captures only packets from Onload stacks and not from kernel stacks.
• The onload_tcpdump application monitors stack creation events and will attach to newly created stacks. However, there is a short period (normally only a few milliseconds) between stack creation and the attachment, during which packets sent/received will not be captured.

Known Issues

Users may notice that packets sent when the destination address is not in the host ARP table appear in both onload_tcpdump and (Linux) tcpdump.

SolarCapture

Solarflare’s SolarCapture is a packet capture application for Solarflare network adapters. It is able to capture received packets from the wire at line rate, assigning accurate timestamps to each packet. Packets are captured to a PCAP file or forwarded to user-supplied logic for processing. For details see the SolarCapture User Guide (SF-108469-CD) available from https://support.solarflare.com/.

Appendix H: ef_vi

The Solarflare ef_vi API is a layer 2 API that grants an application direct access to the Solarflare network adapter datapath to deliver lower latency and reduced per-message processing overheads. ef_vi is the internal API used by Onload for sending and receiving packets. It can be used directly by applications that want the very lowest latency send and receive API and that do not require a POSIX socket interface.

Characteristics

• ef_vi is packaged with the Onload distribution.
• ef_vi is an OSI level 2 interface which sends and receives raw Ethernet frames.
• ef_vi supports a zero-copy interface because the user process has direct access to the memory buffers used by the hardware to receive and transmit data.
• An application can use both ef_vi and Onload at the same time.
For example, an application can use ef_vi to receive UDP market data and Onload sockets for TCP trading connections.

• The ef_vi API can deliver lower latency than Onload and incurs reduced per-message overheads.
• ef_vi is free software distributed under the LGPL license.
• A user application wishing to use the layer 2 ef_vi API must implement the higher layer protocols itself.

Components

All components required to build and link a user application with the Solarflare ef_vi API are distributed with Onload. When Onload is installed all required directories/files are located under the Onload distribution directory.

Compiling and Linking

Refer to the README.ef_vi file in the Onload directory for compile and link instructions.

Using ef_vi

Users of ef_vi must first allocate a virtual interface (VI), encapsulated by the type "ef_vi". A VI includes:

• A receive descriptor ring (for receiving packets).
• A transmit descriptor ring (for sending packets).
• An event queue (to receive notifications from the hardware).

Figure 21: Virtual Interface Components

To transmit a packet, the application writes the packet contents (including all headers) into one or more packet buffers, and calls ef_vi_transmit(). One or more descriptors that describe the packet are queued in the transmit ring, and a doorbell is "rung" to inform the adapter that the transmit ring is non-empty.

To receive packets, descriptors, each identifying a buffer, are queued in the receive ring using ef_vi_receive_init() and ef_vi_receive_post() (refer to Handling Events on page 171). When packets arrive at the VI, they are written into the buffers in FIFO order.

The event queue is a channel from the adapter to software which notifies software when packets arrive from the network, and when transmits complete (so that the buffers can be freed or reused). The application retrieves these events by calling ef_eventq_poll().
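The descriptor-ring and event-queue flow just described can be modelled conceptually. The sketch below is a toy Python model of that flow, intended only to illustrate how posted receive buffers, transmits and the event queue relate; the real ef_vi API is a C API and works on hardware rings:

```python
from collections import deque

class ToyVI:
    """Toy model of a VI: a ring of posted receive buffers, a transmit
    ring, and an event queue. Illustration only, not the ef_vi C API."""
    def __init__(self):
        self.rx_ring = deque()   # buffers posted, waiting for packets
        self.tx_ring = deque()
        self.events = deque()

    def receive_post(self, buf_id):
        # cf. ef_vi_receive_init() / ef_vi_receive_post()
        self.rx_ring.append(buf_id)

    def transmit(self, frame):
        # cf. ef_vi_transmit(); the adapter later reports completion
        self.tx_ring.append(frame)
        self.events.append(("TX_COMPLETE", frame))

    def packet_arrives(self, payload):
        # the adapter fills posted buffers in FIFO order
        buf = self.rx_ring.popleft()
        self.events.append(("RX", buf, payload))

    def eventq_poll(self):
        # cf. ef_eventq_poll(): drain and return pending events
        evs = list(self.events)
        self.events.clear()
        return evs

vi = ToyVI()
vi.receive_post(0)
vi.packet_arrives(b"hello")
print(vi.eventq_poll())   # [('RX', 0, b'hello')]
```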
The buffers used for packet data must be pinned so that they cannot be paged, and they must be registered for DMA with the network adapter. The type "ef_iobufset" encapsulates a set of buffers. The adapter uses a special address space to identify locations in these buffers, and such addresses are designated by the type "ef_addr".

Filters are the means by which the adapter decides where to deliver packets it receives from the network. By default all packets are delivered to the kernel network stack. Filters are added by the application to direct received packets to a given VI.

Protection Domain

A protection domain is a collection of VIs and memory regions tied to a single user interface. A memory region can be registered with different protection domains; this is useful for zero-copy forwarding. Refer also to ef_vi - Physical Addressing Mode on page 169 and ef_vi - Scalable Packet Buffer Mode on page 169. Any memory region in a protection domain can be used with any VI in the protection domain.

Figure 22: Create a Protection Domain

Virtual Interface

Figure 23: Allocate a Virtual Interface

A virtual interface consists of a transmit queue, a receive queue and an event queue. A VI can be allocated with just one of these three components, but if the event queue is omitted, another VI that has an event queue must be specified.

Memory Region

Figure 24: Allocate a Memory Region

The ef_memreg_alloc() function registers a block of memory to be used with VIs within a protection domain. Performance is improved when the memory is aligned to at least 4KB, or when the memory region is aligned on a 2MB boundary so that huge pages can be used.

Receive Packet Buffers

The receive path prepends a meta-data prefix to each received packet. A packet buffer should be at least as large as the value returned from ef_vi_receive_buffer_len.
Address Space

Three modes are possible for setting up the adapter address space: buffer table, SR-IOV, or no translation.

The buffer table is a block of memory on the adapter that performs the translation from buffer ID to physical address. On SFN5000 and SFN6000 series adapters there are 120,000 entries in the buffer table. Each buffer is mapped in each adapter so, regardless of the number of NICs installed, there are a total of 120,000 packet buffers in the system. The SFN7000 series adapters can employ more than 120K packet buffers without the need to use Scalable Packet Buffer Mode; refer to Large Buffer Table Support on page 62 for details.

SR-IOV employs IOMMU virtual addresses. The IOMMU removes the 120K buffer limitation of the buffer table. See Scalable Packet Buffer Mode on page 62 for details of how to configure SR-IOV and enable the system IOMMU.

The no translation mode requires the application to supply actual physical addresses to the adapter, which means the application can direct the adapter to read/write any piece of memory. It is important to ensure that packet buffers are page aligned.

ef_vi - Physical Addressing Mode

An ef_vi application can use physical addressing mode; see Physical Addressing Mode on page 70. To enable physical addressing mode set the environment variable EF_VI_PD_FLAGS=phys.

ef_vi - Scalable Packet Buffer Mode

Using Scalable Packet Buffer Mode, packet buffers are allocated via the kernel IOMMU instead of from the buffer table. An ef_vi application can enable this mode by setting EF_VI_PD_FLAGS=vf. The caveats applicable to an Onload application also apply, i.e. SR-IOV must be enabled and the kernel must support an IOMMU. Refer to Scalable Packet Buffer Mode on page 62 for configuration details and caveats.

Filters

• The application is able to set multiple types of filters on the same VI.
• If a filter already exists, an error is returned.
• Cookies are used to remove filters.
• De-allocating a VI removes the filters set for the VI.

NOTE: IP filters do not match IP fragments, which are therefore received by the kernel stack. If this is an issue, layer 2 filters should be installed by the user.

Figure 25: Creating Filters

An issue has been identified in Onload-201310 such that if two applications attempt to install the same unicast|multicast all filters, an error will be returned and either application could receive the packets (but not both). If either application then exits, the packets are returned through the kernel and the remaining application will not receive the packets. This will be addressed in a future Onload release.

Transmitting Packets

The packet buffer memory must have been previously registered with the protection domain. If the transmit queue is empty when the doorbell is rung, 'TX PUSH' is used. In TX PUSH, the doorbell is rung and the address of the packet buffer is written in one shot, improving latency. TX PUSH can cause ef_vi to poll for events, to check if the transmit queue is empty, before sending, which can lead to a latency versus throughput trade-off in some scenarios.

Figure 26: Transmit Packets

Handling Events

• Receive descriptors should be posted to the adapter receive ring in multiples of 8. When an application pushes 10 descriptors, ef_vi will push 8, and ef_vi will ignore descriptor batch sizes < 8. Users should be aware that if the rx ring is empty and the application pushes < 8 descriptors before blocking on the event queue, the application will remain blocked: there are no descriptors available to receive packets, so nothing gets posted to the event queue.
• The batch size for polling should be greater than the batch size for refilling, to detect when the receive queue is going empty.
• Packets of a size smaller than the interface MTU but larger than the packet buffer size are delivered in multiple buffers, as jumbos.
• Since the adapter is cut-through, errors in received packets, such as multicast mismatch, CRC errors, etc., are delivered along with the packet. The software must detect these errors and recycle the associated packet buffers.

Using ef_vi Example

Example Applications

Solarflare ef_vi comes with a range of example applications, including source code and make files.

efforward    Forward packets between two interfaces without modification.
efpingpong   Ping a simple message format between two interfaces.
efpio        Pingpong application that uses PIO.
efrss        Forward packets between two interfaces without modification, spreading the load over multiple VIs and threads.
efsink       Receives streams of packets on a single interface.

Building Example Applications

The ef_vi example applications are built along with the Onload installation and will be present in the /Onload- /build/gnu_x86_64/tests/ef_vi subdirectory. In the build directory there will be gnu, gnu_x86_64, x86_64_linux- directories. Files under the gnu directory are 32-bit (if these are built); files under gnu_x86_64 are 64-bit. Source code files for the example applications exist in the /Onload- /src/tests/ef_vi subdirectory.
To rebuild the example applications you must have the Onload- /scripts subdirectory in your path, and use the following procedure:

[root@server01 Onload-201109]# cd scripts/
[root@server01 scripts]# export PATH="$PWD:$PATH"
[root@server01 scripts]# cd ../build/gnu_x86_64/tests/ef_vi/
[root@server01 ef_vi]# make clean
[root@server01 ef_vi]# make

Appendix I: onload_iptables

Description

The Linux netfilter iptables feature provides filtering based on user-configurable rules, with the aim of managing access to network devices and preventing unauthorized or malicious passage of network traffic. Packets delivered to an application via the Onload accelerated path are not visible to the OS kernel and, as a result, these packets are not visible to the kernel firewall (iptables).

The onload_iptables feature allows the user to configure rules which determine which hardware filters Onload is permitted to insert on the adapter, and therefore which connections and sockets can bypass the kernel and, as a consequence, bypass iptables. The onload_iptables command can convert a snapshot copy of the kernel iptables rules into Onload firewall rules, used to determine whether sockets created by an Onloaded process are retained by Onload or handed off to the kernel network stack. (Subsequent changes to kernel iptables will not be reflected in the Onload firewall.) Additionally, user-defined filter rules can be added to the Onload firewall on a per interface basis. The Onload firewall applies to the receive filter path only.

How it works

Before Onload accelerates a socket it first checks the Onload firewall module. If the firewall module indicates that acceleration of the socket would violate a firewall rule, the acceleration request is denied and the socket is handed off to the kernel. Network traffic sent or received on the socket is not accelerated. Onload firewall rules are parsed in ascending numerical order.
The first rule to match the newly created socket, which may indicate to accelerate or decelerate the socket, is selected and no further rules are parsed. If the Onload firewall rules are an exact copy of the kernel iptables, i.e. with no additional rules added by the Onload user, then a socket handed off to the kernel because of an iptables rule violation will be unable to receive data through either path. Changing rules using onload_iptables will not interrupt existing network connections.

NOTE: Onload firewall rules will not persist over network driver restarts.

NOTE: The onload_iptables "IP rules" will only block hardware IP filters from being inserted, and the onload_iptables "MAC rules" will only block hardware MAC filters from being inserted. Therefore it is possible that if a rule is inserted to block a MAC address, the user is still able to accept traffic from the specified host via Onload inserting an appropriate IP hardware filter.

Files

When the Onload drivers are loaded, firewall rules exist in the Linux proc pseudo file system at:

/proc/driver/sfc_resource

Within this directory the firewall_add, firewall_del and resources files will be present. These files are writeable only by a root user. No attempt should be made to remove these files.
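The first-match semantics described above (rules parsed in ascending numerical order; the first match decides; unmatched sockets are accelerated by default) can be sketched as follows. The rule representation and helper are invented for illustration, not Onload internals:

```python
def decide(rules, sock):
    """Return the action of the first matching rule in ascending rule-number
    order; accelerate by default when nothing matches. Each rule is a dict
    with a numeric 'rule', a 'match' predicate and an 'action' that is
    ACCELERATE or DECELERATE."""
    for rule in sorted(rules, key=lambda r: r["rule"]):
        if rule["match"](sock):
            return rule["action"]
    return "ACCELERATE"

rules = [
    {"rule": 1, "match": lambda s: s["local_port"] == 5201, "action": "DECELERATE"},
    {"rule": 0, "match": lambda s: s["protocol"] == "udp", "action": "ACCELERATE"},
]
# Rule 0 matches first even though rule 1 would also match:
print(decide(rules, {"protocol": "udp", "local_port": 5201}))  # ACCELERATE
```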
Once rules have been created for a particular interface, and only while these rules exist, a separate directory exists which contains the current firewall rules for the interface:

/proc/driver/sfc_resource/ethN/firewall_rules

Features

To get help:

# onload_iptables -h

Rules

The general format of a rule is:

rule=n if=ethN protocol=(ip|tcp|udp) [local_ip=a.b.c.d[/mask]] [remote_ip=a.b.c.d[/mask]] [local_port=a[-b]] [remote_port=a[-b]] action=(ACCELERATE|DECELERATE)

rule=n if=ethN protocol=eth mac=xx:xx:xx:xx:xx:xx[/FF:FF:FF:FF:FF:FF] action=(ACCELERATE|DECELERATE)

Preview firewall rules

Before creating the Onload firewall, run the onload_iptables -v option to identify which rules will be adopted by the firewall and which will be rejected (a reason is given for rejection):

# onload_iptables -v
DROP tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:5201
=> if=None protocol=tcp local_ip=0.0.0.0/0 local_port=5201-5201 remote_ip=0.0.0.0/0 remote_port=0-65535 action=DECELERATE
DROP tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:5201
=> if=None protocol=tcp local_ip=0.0.0.0/0 local_port=5201-5201 remote_ip=0.0.0.0/0 remote_port=0-65535 action=DECELERATE
DROP tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpts:80:88
=> if=None protocol=tcp local_ip=0.0.0.0/0 local_port=80-88 remote_ip=0.0.0.0/0 remote_port=0-65535 action=
tcp -- 0.0.0.0/0 0.0.0.0/0 tcp spt:800
=> Error parsing: Insuffcient arguments in rule.

The last rule is rejected because the action is missing.

NOTE: The -v option does not create firewall rules for any Solarflare interface, but allows the user to preview which Linux iptables rules will be accepted and which will be rejected by Onload.

To convert Linux iptables to Onload firewall rules

The Linux iptables can be applied to all or individual Solarflare interfaces. Onload iptables are only applied to the receive filter path. The user can select the INPUT CHAIN or a user-defined CHAIN to parse from the iptables.
The default CHAIN is INPUT. To adopt the rules from iptables, even though some rules will be rejected, enter one of the following commands, identifying the Solarflare interface the rules should be applied to:

# onload_iptables -i ethN -c
# onload_iptables -a -c

Running the onload_iptables command will overwrite existing rules in the Onload firewall when used with the -i (interface) or -a (all interfaces) options.

NOTE: Applying the Linux iptables to a Solarflare interface is optional. The alternatives are to create user-defined firewall rules per interface, or not to apply any firewall rules per interface (default behaviour).

NOTE: onload_iptables will import all rules to the identified interface - even rules specified on another interface. To avoid importing rules specified on ’other’ interfaces, use the --use-extended option.

To view rules for a specific interface:

When firewall rules exist for a Solarflare interface, and only while they exist, a directory for the interface will be created in:

/proc/driver/sfc_resource

Rules for a specific interface will be found in the firewall_rules file, e.g.
cat /proc/driver/sfc_resource/eth3/firewall_rules
if=eth3 rule=0 protocol=tcp local_ip=0.0.0.0/0.0.0.0 remote_ip=0.0.0.0/0.0.0.0 local_port=5201-5201 remote_port=0-65535 action=DECELERATE
if=eth3 rule=1 protocol=tcp local_ip=0.0.0.0/0.0.0.0 remote_ip=0.0.0.0/0.0.0.0 local_port=5201-5201 remote_port=0-65535 action=DECELERATE
if=eth3 rule=2 protocol=tcp local_ip=0.0.0.0/0.0.0.0 remote_ip=0.0.0.0/0.0.0.0 local_port=5201-5201 remote_port=72-72 action=DECELERATE
if=eth3 rule=3 protocol=tcp local_ip=0.0.0.0/0.0.0.0 remote_ip=0.0.0.0/0.0.0.0 local_port=80-88 remote_port=0-65535 action=DECELERATE

To add a rule for a selected interface:

# echo "rule=4 if=eth3 action=ACCEPT protocol=udp local_port=7330-7340" > /proc/driver/sfc_resource/firewall_add

Rules can be inserted into any position in the table, and existing rule numbers will be adjusted to accommodate new rules. If a rule number is not specified the rule will be appended to the existing rule list.

NOTE: Errors resulting from the add/delete commands will be displayed in dmesg.

To delete a rule from a selected interface:

To delete a single rule:

# echo "if=eth3 rule=2" > /proc/driver/sfc_resource/firewall_del

To delete all rules:

# echo "eth2 all" > /proc/driver/sfc_resource/firewall_del

When the last rule for an interface has been deleted, the interface firewall_rules file is removed from /proc/driver/sfc_resource. The interface directory will be removed only when completely empty.

Error Checking

The onload_iptables command does not log errors to stdout. Errors arising from add or delete commands will be logged in dmesg.

Interface & Port

Onload firewall rules are bound to an interface and not to a physical adapter port. It is possible to create rules for an interface in a configured/down state.

Virtual/Bonded Interface

On virtual or bonded interfaces, firewall rules are only applied and enforced on the ’real’ interface.
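The key=value format of the firewall_rules lines shown above lends itself to simple tooling. A minimal parser sketch (parse_rule is our own helper, not part of Onload):

```python
def parse_rule(line):
    """Parse a key=value firewall rule line, e.g. one read from
    /proc/driver/sfc_resource/ethN/firewall_rules, into a dict.
    Port ranges such as 80-88 become (low, high) integer tuples."""
    fields = dict(tok.split("=", 1) for tok in line.split())
    for key in ("local_port", "remote_port"):
        if key in fields:
            lo, hi = fields[key].split("-")
            fields[key] = (int(lo), int(hi))
    return fields

r = parse_rule("if=eth3 rule=0 protocol=tcp local_port=5201-5201 action=DECELERATE")
print(r["protocol"], r["local_port"])  # tcp (5201, 5201)
```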
Error Messages

Error messages relating to onload_iptables operations will appear in dmesg.

Table 3: onload_iptables error messages

Internal error: Internal condition - should not happen.
Unsupported rule: Internal condition - should not happen.
Out of memory allocating new rule: Memory allocation error.
Seen multiple rule numbers: Only a single rule number can be specified when adding/deleting rules.
Seen multiple interfaces: Only a single interface can be specified when adding/deleting rules.
Unable to understand action: The action specified when adding a rule is not supported. Note that there should be no spaces, i.e. action=ACCELERATE.
Unable to understand protocol: Non-supported protocol.
Unable to understand remainder of the rule: Non-supported parameters/syntax.
Failed to understand interface: The interface does not exist. Rules can be added to an interface that does not yet exist, but cannot be deleted from a non-existent interface.
Failed to remove rule: The rule does not exist.
Error removing table: Internal condition - should not happen.
Invalid local_ip rule: Invalid address/mask format. Supported formats: a.b.c.d, a.b.c.d/n, a.b.c.d/e.f.g.h, where a to h are decimal in the range 0-255 and n is decimal in the range 0-32.
Invalid remote_ip rule: Invalid address/mask format.
Invalid rule: A rule must identify at least an interface, a protocol, an action and at least one match criterion.
Invalid mac: Invalid MAC address/mask format. Supported formats: xx:xx:xx:xx:xx:xx, xx:xx:xx:xx:xx:xx/xx:xx:xx:xx:xx:xx, where x is a hex digit.

NOTE: A Linux limitation applicable to the /proc/ filesystem restricts a write operation to 1024 bytes. When writing to the /proc/driver/sfc_resource/firewall_[add|del] files the user is advised to flush the write between lines which exceed the 1024 byte limit.
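The "Invalid rule" condition above (a rule must identify at least an interface, a protocol, an action and one match criterion) can be sketched as a client-side pre-check before writing to firewall_add. The validator and its field dictionary are illustrative only; the authoritative checks are the kernel-side ones reported in dmesg:

```python
MATCH_KEYS = {"local_ip", "remote_ip", "local_port", "remote_port", "mac"}

def validate_rule(fields):
    """Return an error string in the style of Table 3, or None if the
    rule looks valid. `fields` is a dict of key=value pairs."""
    if "if" not in fields:
        return "Failed to understand interface"
    if fields.get("action") not in ("ACCELERATE", "DECELERATE"):
        return "Unable to understand action"
    if "protocol" not in fields:
        return "Unable to understand protocol"
    if not (MATCH_KEYS & fields.keys()):
        return "Invalid rule"          # no match criterion present
    return None

print(validate_rule({"if": "eth3", "protocol": "tcp",
                     "action": "DECELERATE", "local_port": "80-88"}))  # None
```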
Appendix J: Solarflare efpio Test Application

The openonload-201310 distribution includes the command line efpio test application, which measures the latency of the Solarflare ef_vi layer 2 API with PIO. The efpio application is a single-threaded ping/pong test. When all iterations are complete, the client side displays the round-trip time. By default efpio downloads a packet to the adapter at start of day and transmits this same packet on every iteration of the test. The -c option can be used to test the latency of ef_vi using PIO to transfer a new transmit packet to the adapter on every iteration.

With the onload distribution installed, efpio will be present in the following directory:

~/openonload-201310/build/gnu_x86_64/tests/ef_vi

efpio Options

./efpio --help
usage: efpio [options]

Table 4: efpio Options

Parameter        Description
interface        the local interface to use, e.g. eth2
local-ip-intf    local interface IP address/host name
local-port       local interface IP port number to use
remote-mac       MAC address of the remote interface
remote-ip-intf   remote server IP address/host name
remote-port      remote server port number

options:
  -n    set number of iterations
  -s    set udp payload size
  -w    sleep instead of busy wait
  -v    use a VF
  -p    physical address mode
  -t    disable TX push
  -c    copy on critical path

To run efpio

efpio must be started on the server (pong side) before the client (ping side) is run. Command line examples are shown below, where M = cpu core and N = Solarflare adapter interface.

1 On the server side (server1):

taskset -c M ./efpio pong ethN 8001 8001
# ef_vi_version_str: 201306-7122preview2
# udp payload len: 28
# iterations: 100000
# frame len: 70

2 On the client side (server2):

taskset -c M ./efpio ping ethN 8001 8001
# ef_vi_version_str: 201306-7122preview2
# udp payload len: 28
# iterations: 100000
# frame len: 70
round-trip time: 2.848 μs
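When running the test repeatedly it can help to build the two command lines once from variables. The sketch below only constructs and prints the commands; the core number, interface name, and option values are hypothetical, and the placement of options before the positional arguments is an assumption based on the usage string above:

```shell
# Hypothetical values - set these for your hosts.
CORE=1          # M: cpu core to pin efpio to with taskset
IFACE=eth2      # ethN: the Solarflare adapter interface
EFPIO=~/openonload-201310/build/gnu_x86_64/tests/ef_vi/efpio

# -n sets the iteration count, -s the UDP payload size (values hypothetical).
OPTS="-n 100000 -s 28"

PONG="taskset -c $CORE $EFPIO pong $OPTS $IFACE 8001 8001"
PING="taskset -c $CORE $EFPIO ping $OPTS $IFACE 8001 8001"

# Print the commands; run the pong side on the server before the ping side.
if [ -x "$EFPIO" ]; then
    echo "server: $PONG"
    echo "client: $PING"
else
    echo "efpio not built; server would run: $PONG"
fi
```

Pinning each efpio process to a dedicated core with taskset avoids scheduler migration, which would otherwise add jitter to the reported round-trip time.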