Altera SDK For OpenCL Programming Guide Aocl
aocl_programming_guide
User Manual:
Open the PDF directly: View PDF
.
Page Count: 153
| Download | |
| Open PDF In Browser | View PDF |
Altera SDK for OpenCL Programming Guide Subscribe Send Feedback Last updated for Quartus Prime Design Suite: 16.0 UG-OCL002 2016.05.02 101 Innovation Drive San Jose, CA 95134 www.altera.com TOC-2 Contents Altera SDK for OpenCL Programming Guide................................................... 1-1 Altera SDK for OpenCL Programming Guide Prerequisites.................................................................1-1 Altera SDK for OpenCL FPGA Programming Flow...............................................................................1-2 Altera Offline Compiler Kernel Compilation Flows...............................................................................1-3 One-Step Compilation for Simple Kernels...................................................................................1-4 Multistep Altera SDK for OpenCL Design Flow......................................................................... 1-5 Obtaining General Information on Software, Compiler, and Custom Platform................................1-7 Displaying the Software Version (version)...................................................................................1-8 Displaying the Compiler Version (--version).............................................................................. 1-8 Listing the Altera SDK for OpenCL Utility Command Options (help)................................... 1-8 Listing the Altera Offline Compiler Command Options (no argument, --help, or -h)......... 1-9 Listing the Available FPGA Boards in Your Custom Platform (--list-boards)....................... 1-9 Managing an FPGA Board..........................................................................................................................1-9 Installing an FPGA Board (install).............................................................................................. 1-10 Uninstalling the FPGA Board (uninstall)...................................................................................1-11 Querying the Device Name of Your FPGA Board (diagnose).................................................1-11 Running a Board Diagnostic Test (diagnose)............................................... 1-12 Programming the FPGA Offline or without a Host (program )................. 1-12 Programming the Flash Memory (flash )....................................................... 1-13 Structuring Your OpenCL Kernel........................................................................................................... 1-13 Guidelines for Naming the Kernel.............................................................................................. 1-14 Programming Strategies for Optimizing Data Processing Efficiency.................................... 1-15 Programming Strategies for Optimizing Memory Access Efficiency.....................................1-18 Implementing the Altera SDK for OpenCL Channels Extension........................................... 1-19 Implementing OpenCL Pipes.......................................................................................................1-36 Using Predefined Preprocessor Macros in Conditional Compilation................................... 1-50 Declaring __constant Address Space Qualifiers........................................................................1-51 Including Structure Data Types as Arguments in OpenCL Kernels...................................... 1-52 Inferring a Register........................................................................................................................ 1-55 Enabling Double Precision Floating-Point Operations............................................................1-57 Single-Cycle Floating-Point Accumulator for Single Work-Item Kernels............................ 1-57 Designing Your Host Application........................................................................................................... 1-59 Host Programming Requirements.............................................................................................. 1-60 Allocating OpenCL Buffer for Manual Partitioning of Global Memory............................... 1-61 Collecting Profile Data During Kernel Execution.....................................................................1-63 Accessing Custom Platform-Specific Functions....................................................................... 1-65 Modifying Host Program for Structure Parameter Conversion............................................. 1-65 Allocating Shared Memory for OpenCL Kernels Targeting SoCs.......................................... 1-66 Managing Host Application......................................................................................................... 1-68 Compiling Your OpenCL Kernel............................................................................................................ 1-78 Compiling Your Kernel to Create Hardware Configuration File........................................... 1-79 Compiling a Kernel for a Big-Endian System (--big-endian)..................................................1-79 Altera Corporation TOC-3 Compiling Your Kernel without Building Hardware (-c)........................................................1-80 Specifying the Location of Header Files (-I )........................................................ 1-80 Specifying the Name of an AOC Output File (-o )................................................1-81 Compiling a Kernel for a Specific FPGA Board (--board )...........................1-81 Resolving Hardware Generation Fitting Errors during Kernel Compilation (--higheffort)..........................................................................................................................................1-83 Defining Preprocessor Macros to Specify Kernel Parameters (-D )........... 1-83 Generating Compilation Progress Report (-v).......................................................................... 1-85 Displaying the Estimated Resource Usage Summary On-Screen (--report)......................... 1-85 Suppressing AOC Warning Messages (-W)...............................................................................1-86 Converting AOC Warning Messages into Error Messages (-Werror)...................................1-86 Adding Source References to Optimization Reports (-g)........................................................ 1-86 Disabling Burst-Interleaving of Global Memory (--no-interleaving ).........................................................................................................1-86 Configuring Constant Memory Cache Size (--const-cache-bytes )................................1-87 Relaxing the Order of Floating-Point Operations (--fp-relaxed)............................................1-87 Reducing Floating-Point Rounding Operations (--fpc)...........................................................1-88 Emulating and Debugging Your OpenCL Kernel................................................................................. 1-88 Modifying Channels Kernel Code for Emulation..................................................................... 1-88 Compiling a Kernel for Emulation (-march=emulator).......................................................... 1-90 Emulating Your OpenCL Kernel................................................................................................. 1-91 Debugging Your OpenCL Kernel on Linux............................................................................... 1-92 Limitations of the AOCL Emulator.............................................................................................1-93 Reviewing Your Kernel's Resource Usage Information in the Area Report......................................1-94 Accessing the Area Report............................................................................................................1-94 Layout of the Area Report.............................................................................................................1-95 Profiling Your OpenCL Kernel................................................................................................................1-97 Instrumenting the Kernel Pipeline with Performance Counters (--profile)......................... 1-97 Launching the AOCL Profiler GUI (report).............................................................................. 1-98 Conclusion.................................................................................................................................................. 1-98 Document Revision History.....................................................................................................................1-99 Altera SDK for OpenCL Advanced Features...................................................... 2-1 OpenCL Library........................................................................................................................................... 2-1 Understanding RTL Modules and the OpenCL Pipeline........................................................... 2-3 Packaging an OpenCL Helper Function File for an OpenCL Library....................................2-13 Packaging an RTL Component for an OpenCL Library ......................................................... 2-14 Verifying the RTL Modules..........................................................................................................2-16 Packaging Multiple Object Files into a Library File..................................................................2-17 Specifying an OpenCL Library when Compiling an OpenCL Kernel....................................2-17 Using an OpenCL Library that Works with Simple Functions (Example 1).........................2-18 Using an OpenCL Library that Works with External Memory (Example 2)........................ 2-18 OpenCL Library Command-Line Options.................................................................................2-19 Kernel Attributes for Configuring Local Memory System...................................................................2-21 Restrictions on the Usage of Local Variable-Specific Kernel Attributes................................2-22 Kernel Attributes for Reducing the Overhead on Hardware Usage................................................... 2-23 Hardware for Kernel Interface..................................................................................................... 2-23 Kernel Replication Using the num_compute_units(X,Y,Z) Attribute...............................................2-26 Altera Corporation TOC-4 Customization of Replicated Kernels Using the get_compute_id() Function...................... 2-26 Using Channels with Kernel Copies............................................................................................2-28 Document Revision History.....................................................................................................................2-29 Support Statuses of OpenCL Features .............................................................. A-1 Support Statuses of OpenCL 1.0 Features............................................................................................... A-1 OpenCL1.0 C Programming Language Implementation.......................................................... A-1 OpenCL C Programming Language Restrictions.......................................................................A-4 Argument Types for Built-in Geometric Functions...................................................................A-5 Numerical Compliance Implementation.....................................................................................A-6 Image Addressing and Filtering Implementation...................................................................... A-7 Atomic Functions............................................................................................................................A-7 Embedded Profile Implementation.............................................................................................. A-7 Support Statuses of OpenCL 1.2 Features............................................................................................... A-8 OpenCL 1.2 Runtime Implementation........................................................................................ A-8 OpenCL 1.2 C Programming Language Implementation......................................................... A-8 Support Statuses of OpenCL 2.0 Features............................................................................................. A-10 OpenCL 2.0 Runtime Implementation...................................................................................... A-10 OpenCL 2.0 C Programming Language Restrictions for Pipes.............................................. A-10 Altera SDK for OpenCL Allocation Limits........................................................................................... A-11 Document Revision History....................................................................................................................A-12 Altera Corporation 1 Altera SDK for OpenCL Programming Guide 2016.05.02 UG-OCL002 Subscribe Send Feedback The Altera SDK for OpenCL Programming Guide provides descriptions, recommendations and usage information on the Altera® Software Development Kit (SDK) for OpenCL™ (AOCL) compiler and tools. The AOCL(1) is an OpenCL(2)-based heterogeneous parallel programming environment for Altera FPGAs. Altera SDK for OpenCL Programming Guide Prerequisites The Altera SDK for OpenCL Programming Guide assumes that you are knowledgeable in OpenCL concepts and application programming interfaces (APIs). It also assumes that you have experience creating OpenCL applications and are familiar with the OpenCL Specification version 1.0. Before using the Altera SDK for OpenCL or the Altera Runtime Environment (RTE) for OpenCL to program your device, familiarize yourself with the respective getting started guides. This document assumes that you have performed the following tasks: • For developing and deploying OpenCL kernels, download the tar file and run the installers to install the AOCL, the Quartus® Prime software, and device support. • For deployment of OpenCL kernels, download and install the RTE. • If you want to use the AOCL or the RTE to program a Cyclone V SoC Development Kit, you also have to download and install the SoC Embedded Design Suite (EDS). • Install and set up your FPGA board. • Program your device with the device-compatible version of the hello_world example OpenCL applica‐ tion If you have not performed the tasks described above, refer to the AOCL getting starting guides for more information. Prior to creating an OpenCL design and programming your FPGA board, review the AOCL allocation limits. Related Information • Altera SDK for OpenCL Allocation Limits on page 3-11 (1) (2) The Altera SDK for OpenCL is based on a published Khronos Specification, and has passed the Khronos Conformance Testing Process. Current conformance status can be found at www.khronos.org/ conformance. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission of the Khronos Group™. © 2016 Altera Corporation. All rights reserved. ALTERA, ARRIA, CYCLONE, ENPIRION, MAX, MEGACORE, NIOS, QUARTUS and STRATIX words and logos are trademarks of Altera Corporation and registered in the U.S. Patent and Trademark Office and in other countries. All other words and logos identified as trademarks or service marks are the property of their respective holders as described at www.altera.com/common/legal.html. Altera warrants performance of its semiconductor products to current specifications in accordance with Altera's standard warranty, but reserves the right to make changes to any products and services at any time without notice. Altera assumes no responsibility or liability arising out of the application or use of any information, product, or service described herein except as expressly agreed to in writing by Altera. Altera customers are advised to obtain the latest version of device specifications before relying on any published information and before placing orders for products or services. www.altera.com 101 Innovation Drive, San Jose, CA 95134 ISO 9001:2008 Registered 1-2 UG-OCL002 2016.05.02 Altera SDK for OpenCL FPGA Programming Flow • • • • • OpenCL References Pages OpenCL Specification version 1.0 Altera SDK for OpenCL Getting Started Guide Altera RTE for OpenCL Getting Started Guide Altera SDK for OpenCL Cyclone V SoC Getting Started Guide Altera SDK for OpenCL FPGA Programming Flow The Altera SDK for OpenCL programs an FPGA with an OpenCL application in a two-step process. The Altera Offline Compiler (AOC) first compiles your OpenCL kernels. The host-side C compiler compiles your host application and then links the compiled OpenCL kernels to it. Figure 1-1: Schematic Diagram of the AOCL Programming Model Host Code Path Altera OpenCL Runtime Environment Kernel Code Path Custom Platform Path Host source code (.c or .cpp) Kernel source code (.cl) Board-specific Custom Platform Design Host compiler AOC Quartus Prime Design Suite Host binary FPGA image (.aocx) AOCL board directory for version-compatible target platform Port and/or customize to target platform Altera Reference Platform Design Execute host application on host Final computation results Runtime Execution Board developer-created item AOCL user-created item Third-party-supplied or open source tool Altera Corporation Tool-generated item Altera-supplied tool or design Board developer-supplied item Process or action Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Altera Offline Compiler Kernel Compilation Flows 1-3 Three main parts in the AOCL programming model: • The host application and the host compiler • The OpenCL kernel and the AOC • The Custom Platform The Custom Platform provides the board design. The AOC targets the board design when compiling the OpenCL kernel to generate the hardware image. The host then runs the host application to execute the hardware image onto the FPGA. Altera Offline Compiler Kernel Compilation Flows The Altera Offline Compiler can create your FPGA hardware configuration file in a one-step or a multistep process. The complexity of your kernel dictates the AOC compilation option you implement. Figure 1-2: The AOCL FPGA Programming Flow Kernel Source Kernel Source Kernel Source Code #1 (.cl) Code #2 (.cl) Code #3 (.cl) Host Code Kernel Source Kernel Source Kernel Source Code #4 (.cl) Code #5 (.cl) Code #6 (.cl) Altera Offline Compiler for OpenCL Kernels Standard C Compiler Altera Offline Compiler for OpenCL Kernels Consolidated Kernel Binary A (.aoco, .aocx) Host Binary Consolidated Kernel Binary B (.aoco, .aocx) Load .aocx into memory Kernel Binary B (.aocx) Load runtime PCIe Kernel Binary A (.aocx) Load runtime PCIe Altera SDK for OpenCL Programming Guide Send Feedback Altera Corporation 1-4 UG-OCL002 2016.05.02 One-Step Compilation for Simple Kernels An OpenCL kernel source file (.cl) contains your OpenCL source code. The AOC groups one or more kernels into a temporary file and then compiles this file to generate the following files and folders: • An Altera Offline Compiler Object file (.aoco) is an intermediate object file that contains information for later stages of the compilation. • An Altera Offline Compiler Executable file (.aocx) is the hardware configuration file and contains information necessary at runtime. • The folder or subdirectory, which contains data necessary to create the .aocx file. The AOC creates the .aocx file from the contents of the folder or subdirectory. It also incorporates information from the .aoco file into the .aocx file during hardware compilation. The .aocx file contains data that the host application uses to create program objects for the target FPGA. The host application loads these program objects into memory. The host runtime then calls these program objects from memory and programs the target FPGA as required. One-Step Compilation for Simple Kernels By default, the Altera Offline Compiler compiles your OpenCL kernel and creates the hardware configuration file in a single step. Choose this compilation option only if your OpenCL application requires minimal optimizations. The following figure illustrates the OpenCL kernel design flow that has a single compilation step. Altera Corporation Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Multistep Altera SDK for OpenCL Design Flow 1-5 Figure 1-3: One-Step OpenCL Kernel Compilation Flow .cl aoc .cl [--report] Duration of compilation: hours Syntactic Errors? YES NO .aoco Estimated resource usage summary in .log (and on-screen with --report) NO Resource usage acceptable? YES Legend File abc Command .aocx Optimization report in .log Execute on FPGA Single work-item kernel performance satisfactory? Kernel Execution YES NO abc For single work-item kernel A successful compilation results in the following files and reports: • A .aoco file • A .aocx file • In the / .log file, the estimated resource usage summary provides a preliminary assessment of area usage. If you have a single work-item kernel, the optimization report identifies performance bottlenecks. Attention: It is very time consuming to iterate on your design using the one-step compilation flow. For each iteration, you must perform a full compilation, which takes hours. Then you must execute the kernel on the FPGA before you can assess its performance. Related Information Compiling Your Kernel to Create Hardware Configuration File on page 1-79 Multistep Altera SDK for OpenCL Design Flow Choose the multistep Altera SDK for OpenCL design flow if you want to iterate on your OpenCL kernel design to implement performance-improving optimizations . Altera SDK for OpenCL Programming Guide Send Feedback Altera Corporation 1-6 UG-OCL002 2016.05.02 Multistep Altera SDK for OpenCL Design Flow The figure below outlines the stages in the AOCL design flow. The steps in the design flow serve as checkpoints for identifying functional errors and performance bottlenecks. They allow you to modify your OpenCL kernel code without performing a full compilation after each iteration. Figure 1-4: The Multistep AOCL Design Flow .cl Intermediate Compilation aoc -c [-g] .cl [--report] Duration of compilation: minutes YES Resource Usage Acceptable? YES Optimization Repot in .log Emulation aoc -march=emulator [-g] .cl Duration of compilation: seconds Execute on emulation device NO .aoco Estimated resource usage summary in .log (and on-screen with --report) .aoco NO Syntactic errors? NO Emulation successful? .aocx Single work-item kernel? YES YES Kernel performance satisfactory? NO Review Area Report NO YES aocl analyze-area .aoco or aocl analyze-area .aocx Resource Usage Optimized? .aoco-area-report.html or .aocx-area-report.html NO YES Profiling aoc --profile .cl Duration of compilation: hours .aocx profile.mon Execute kernel on FPGA aocl report .aocx profile.mon Profiler GUI NO Legend File abc Command Kernel Execution GUI abc Single work-item-step Kernel performance satisfactory? YES Full Deployment aoc .cl Duration of compilation: hours .aocx Execute kernel on FPGA The AOCL design flow includes the following steps: 1. Intermediate compilation The intermediate compilation step checks for syntatic errors. It then generates a .aoco file without building the hardware configuration file. The estimated resource usage summary in the / .log file can provide insight into the type of kernel Altera Corporation Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Obtaining General Information on Software, Compiler, and Custom Platform 1-7 optimizations you can perform. For a single work-item kernel, include the -g option to insert source information in the optimization report in the .log file. 2. Emulation Assess the functionality of your OpenCL kernel by executing it on one or multiple emulation devices on an x86-64 host. For Linux systems, include the -g option to enable symbolic debug support. Symbolic debug allows you to locate the origins of functional errors in your kernel code. 3. Review Area Report Review the HTML area report of your OpenCL application to determine whether the estimated resource usage is acceptable. The area report also provides suggestions on how you can modify your kernel to reduce hardware consumption. 4. Profiling Instruct the Altera Offline Compiler to instrument performance counters in the Verilog code in the .aocx file. During execution, the performance counters collect performance information which you can then review in the Profiler GUI. 5. Full deployment If you are satisfied with the performance of your OpenCL kernel throughout the design flow, perform a full compilation. You can then execute the .aocx file on the FPGA. Related Information • Compiling Your OpenCL Kernel on page 1-78 • Emulating and Debugging Your OpenCL Kernel on page 1-88 • Profiling Your OpenCL Kernel on page 1-97 Obtaining General Information on Software, Compiler, and Custom Platform The Altera SDK for OpenCL includes two sets of command options: the AOCL utility commands (aocl ) and the Altera Offline Compiler commands (aoc ). Each set of commands includes options you can invoke to obtain general information on the software, the compiler, and the Custom Platform. Displaying the Software Version (version) on page 1-8 To display the version of the Altera SDK for OpenCL, invoke the version utility command. Displaying the Compiler Version (--version) on page 1-8 To display the version of the Altera Offline Compiler, invoke the --version compiler command. Listing the Altera SDK for OpenCL Utility Command Options (help) on page 1-8 To display information on the Altera SDK for OpenCL utility command options, invoke the help utility command. Listing the Altera Offline Compiler Command Options (no argument, --help, or -h) on page 1-9 To display information on the Altera Offline Compiler command options, invoke the compiler command without an argument, or invoke the compiler command with the --help or -h command option. Altera SDK for OpenCL Programming Guide Send Feedback Altera Corporation 1-8 UG-OCL002 2016.05.02 Displaying the Software Version (version) Listing the Available FPGA Boards in Your Custom Platform (--list-boards) on page 1-9 To list the FPGA boards available in your Custom Platform, include the --list-boards option in the aoc command. Displaying the Software Version (version) To display the version of the Altera SDK for OpenCL, invoke the version utility command. • At the command prompt, invoke the aocl version command. Example output: aocl . (Altera SDK for OpenCL, Version Build , Copyright (C) Altera Corporation) Displaying the Compiler Version (--version) To display the version of the Altera Offline Compiler, invoke the --version compiler command. • At a command prompt, invoke the aoc --version command. Example output: Altera SDK for OpenCL, 64-Bit Offline Compiler Version Build Copyright (C) Altera Corporation Listing the Altera SDK for OpenCL Utility Command Options (help) To display information on the Altera SDK for OpenCL utility command options, invoke the help utility command. • At a command prompt, invoke the aocl help command. The AOCL categorizes the utility command options based on their functions. It also provides a description for each option. Displaying Information on an Altera SDK for OpenCL Utility Command Option (help ) To display information on a specific Altera SDK for OpenCL utility command option, include the command option as an argument of the help utility command. • At a command prompt, invoke the aocl help command. For example, to obtain more information on the install utility command option, invoke the aocl help install command. Example output: aocl install - Installs a board onto your host system. Usage: aocl install Description: This command installs a board's drivers and other necessary software for the host operating system to communicate with the board. For example this might install PCIe drivers. Altera Corporation Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Listing the Altera Offline Compiler Command Options (no argument,... 1-9 Listing the Altera Offline Compiler Command Options (no argument, --help, or -h) To display information on the Altera Offline Compiler command options, invoke the compiler command without an argument, or invoke the compiler command with the --help or -h command option. • At a command prompt, invoke one of the following commands: • aoc • aoc --help • aoc -h The Altera SDK for OpenCL categorizes the AOC command options based on their functions. It also provides a description for each option. Listing the Available FPGA Boards in Your Custom Platform (--list-boards) To list the FPGA boards available in your Custom Platform, include the --list-boards option in the aoc command. Before you begin To view the list of available boards in your Custom Platform, you must first set the environment variable AOCL_BOARD_PACKAGE_ROOT to point to the location of your Custom Platform. • At a command prompt, invoke the aoc --list-boards command. The Altera Offline Compiler generates an output that resembles the following: Board list: ... Where is the board name you use in your aoc command to target a specific FPGA board. Managing an FPGA Board The Altera SDK for OpenCL includes utility commands you can invoke to install, uninstall, diagnose, and program your FPGA board. Installing an FPGA Board (install) on page 1-10 To install your board into the host system, invoke the install utility command. Uninstalling the FPGA Board (uninstall) on page 1-11 To uninstall an FPGA board, invoke the uninstall utility command, uninstall the Custom Platform, and unset the relevant environment variables. Querying the Device Name of Your FPGA Board (diagnose) on page 1-11 When you query a list of accelerator boards, the AOCL produces a list of installed devices on your machine in the order of their device names. Running a Board Diagnostic Test (diagnose ) on page 1-12 To perform a detailed diagnosis on a specific FPGA board, include as an argument of the diagnose utility command. Altera SDK for OpenCL Programming Guide Send Feedback Altera Corporation 1-10 UG-OCL002 2016.05.02 Installing an FPGA Board (install) Programming the FPGA Offline or without a Host (program ) on page 1-12 To program an FPGA device offline or without a host, invoke the program utility command. Programming the Flash Memory (flash ) on page 1-13 If supported, invoke the flash utility command to initialize the FPGA with a specified startup configuration. Installing an FPGA Board (install) Before creating an OpenCL application for an FPGA board, you must first download and install the Custom Platform from your board vendor. Most Custom Platform installers require administrator privileges. To install your board into the host system, invoke the install utility command. The steps below outline the board installation procedure. Some Custom Platforms require additional installation tasks. Consult your board vendor's documentation for further information on board installation. Attention: If you are installing the Cyclone® V SoC Development Kit for use with the Cyclone V SoC Development Kit Reference Platform (c5soc), refer to Installing the Cyclone V SoC Develop‐ ment Kit in the Altera SDK for OpenCL Cyclone V SoC Getting Started Guide for more information. 1. Follow your board vendor's instructions to connect the FPGA board to your system. 2. Download the Custom Platform for your FPGA board from your board vendor's website. To download an Altera SDK for OpenCL Reference Platform (for example, the Stratix® V Network Reference Platform (s5_net)), refer to the Altera SDK for OpenCL FPGA Platforms page on the Altera website. 3. Install the Custom Platform in a directory that you own (that is, not a system directory). 4. Set the user environment variable AOCL_BOARD_PACKAGE_ROOT to point to the location of the Custom Platform subdirectory containing the board_env.xml file. For example, for s5_net, set AOCL_BOARD_PACKAGE_ROOT to point to the /s5_ net directory. 5. Set the QUARTUS_ROOTDIR_OVERRIDE user environment variable to point to the correct Quartus Prime software installation directory. If you have an Arria® 10 device, set QUARTUS_ROOTDIR_OVERRIDE to point to the installation directory of the Quartus Prime Pro Edition software. Otherwise, set QUARTUS_ROOTDIR_OVERRIDE to point to the installation directory of the Quartus Prime Standard Edition software. 6. Add the paths to the Custom Platform libraries (for example, the memory-mapped (MMD) library) to the PATH (Windows) or LD_LIBRARY_PATH (Linux) environment variable setting. For example, if you use s5_net, the Windows PATH environment variable setting is %AOCL_BOARD_PACKAGE_ROOT%\windows64\bin. The Linux LD_LIBRARY_PATH setting is $AOCL_BOARD_PACKAGE_ROOT/linux64/lib. The Altera SDK for OpenCL Getting Started Guide contains more information on the init_opencl script. For information on setting user environment variables and running the init_opencl script, refer to the Setting the Altera SDK for OpenCL User Environment Variables section. Altera Corporation Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Uninstalling the FPGA Board (uninstall) 1-11 7. Remember: You need administrative rights to install a board. To run a Windows command prompt as an administrator, click Start > All Programs > Accessories. Under Accessories, right click Command Prompt, In the right-click menu, click Run as Administrator. Invoke the command aocl install at a command prompt. Invoking aocl install also installs a board driver that allows communication between host applications and hardware kernel programs. 8. To query a list of FPGA devices installed in your machine, invoke the aocl diagnose command. The software generates an output that includes the , which is an acl number that ranges from acl0 to acl31. For more information on querying the of your accelerator board, refer to the Querying the Device Name of Your FPGA Board section. 9. To verify the successful installation of the FPGA board, invoke the command aocl diagnose to run any board vendor-recommended diagnostic test. Related Information • • • • • Installing the Cyclone V SoC Development Kit Querying the Device Name of Your FPGA Board (diagnose) on page 1-11 Setting the Altera SDK for OpenCL User Environment Variables (Windows) Setting the Altera SDK for OpenCL User Environment Variables (Linux) Altera SDK for OpenCL FPGA Platforms page Uninstalling the FPGA Board (uninstall) To uninstall an FPGA board, invoke the uninstall utility command, uninstall the Custom Platform, and unset the relevant environment variables. You must uninstall the existing FPGA board if you migrate your OpenCL application to another FPGA board that belongs to a different Custom Platform. To uninstall your FPGA board, perform the following tasks: 1. Following your board vendor's instructions to disconnect the board from your machine. 2. Invoke the aocl uninstall utility command to remove the current host computer drivers (for example, PCIe® drivers). The Altera SDK for OpenCL uses these drivers to communicate with the FPGA board. 3. Uninstall the Custom Platform. 4. Unset the LD_LIBRARY_PATH (for Linux) or PATH (for Windows) environment variable. 5. Unset the AOCL_BOARD_PACKAGE_ROOT environment variable. Querying the Device Name of Your FPGA Board (diagnose) Some Altera SDK for OpenCL utility commands require you to specify the device name ( ). The refers to the acl number (e.g. acl0 to acl31) that corresponds to the FPGA device. When you query a list of accelerator boards, the AOCL produces a list of installed devices on your machine in the order of their device names. Altera SDK for OpenCL Programming Guide Send Feedback Altera Corporation 1-12 UG-OCL002 2016.05.02 Running a Board Diagnostic Test (diagnose ) • To query a list of installed devices on your machine, type aocl diagnose at a command prompt. The software generates an output that resembles the example shown below: aocl diagnose: Running diagnostic from ALTERAOCLSDKROOT/board/ / /libexec Verified that the kernel mode driver is installed on the host machine. Using board package from vendor: Querying information for all supported devices that are installed on the host machine ... device_name Status Information acl0 Passed PCIe dev_id = , bus:slot.func = 02:00.00, at Gen 2 with 8 lanes. FPGA temperature=43.0 degrees C. acl1 Passed PCIe dev_id = , bus:slot.func = 03:00.00, at Gen 2 with 8 lanes. FPGA temperature = 35.0 degrees C. Found 2 active device(s) installed on the host machine, to perform a full diagnostic on a specific device, please run aocl diagnose DIAGNOSTIC_PASSED Related Information Probing the OpenCL FPGA Devices on page 1-76 Running a Board Diagnostic Test (diagnose ) To perform a detailed diagnosis on a specific FPGA board, include as an argument of the diagnose utility command. • At a command prompt, invoke the aocl diagnose command, where is the acl number (for example, acl0 to acl31) that corresponds to your FPGA device. You can identify the when you query the list of installed boards in your system. Consult your board vendor's documentation for more board-specific information on using the diagnose utility command to run diagnostic tests on multiple FPGA boards. Programming the FPGA Offline or without a Host (program ) To program an FPGA device offline or without a host, invoke the program utility command. • At a command prompt, invoke the aocl program .aocx command where: refers to the acl number (for example, acl0 to acl31) that corresponds to your FPGA device, and .aocx is the Altera Offline Compiler Executable file you use to program the hardware. Altera Corporation Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Programming the Flash Memory (flash ) 1-13 Note: To program an SoC such as the Cyclone V SoC, you must specify the full path of the device when invoking the program utility command. For example, aocl program /dev/ .aocx. Programming the Flash Memory (flash ) If supported, invoke the flash utility command to initialize the FPGA with a specified startup configuration. Note: For instructions on programming the micro SD flash card of the Cyclone V SoC Development Kit, refer to the Writing an SD Card Image onto the Micro SD Flash Card section of the Altera SDK for OpenCL Cyclone V SoC Getting Started Guide. • At a command prompt, invoke the aocl flash .aocx command where: refers to the acl number (for example, acl0 to acl31) that corresponds to your FPGA device, and .aocx is the Altera Offline Compiler Executable file you use to program the hardware. Related Information • Writing an SD Card Image onto the Micro SD Flash Card on Windows • Writing an SD Card Image onto the Micro SD Flash Card on Linux Structuring Your OpenCL Kernel Altera offers recommendations on how to structure your OpenCL kernel code. Consider implementing these programming recommendations when you create a kernel or modify a kernel written originally to target another architecture. Guidelines for Naming the Kernel on page 1-14 Altera recommends that you include only alphanumeric characters in your file names. Programming Strategies for Optimizing Data Processing Efficiency on page 1-15 Optimize the data processing efficiency of your kernel by implementing strategies such as unrolling loops, setting work-group sizes, and specifying compute units and work-items. Programming Strategies for Optimizing Memory Access Efficiency on page 1-18 Optimize the memory access efficiency of your kernel by implementing strategies such as specifying local memory pointer size and specifying global memory buffer location. Implementing the Altera SDK for OpenCL Channels Extension on page 1-19 The Altera SDK for OpenCL channels extension provides a mechanism for passing data to kernels and synchronizing kernels with high efficiency and low latency. Altera SDK for OpenCL Programming Guide Send Feedback Altera Corporation 1-14 UG-OCL002 2016.05.02 Guidelines for Naming the Kernel Implementing OpenCL Pipes on page 1-36 The Altera SDK for OpenCL provides preliminary support for OpenCL pipe functions. Using Predefined Preprocessor Macros in Conditional Compilation on page 1-50 You may take advantage of predefined preprocessor macros that allow you to conditionally compile portions of your kernel code. Declaring __constant Address Space Qualifiers on page 1-51 There are several limitations and workarounds you must consider when you include __constant address space qualifiers in your kernel. Including Structure Data Types as Arguments in OpenCL Kernels on page 1-52 Convert each structure parameter (struct) to a pointer that points to a structure. Inferring a Register on page 1-55 In general, the AOC chooses registers if the access to a variable is fixed and does not require any dynamic indexes. Enabling Double Precision Floating-Point Operations on page 1-57 The Altera SDK for OpenCL offers preliminary support for all double precision floating-point functions. Single-Cycle Floating-Point Accumulator for Single Work-Item Kernels on page 1-57 Single work-item kernels that perform accumulation in a loop can leverage the Altera Offline Compiler's single-cycle floating-point accumulator feature. Guidelines for Naming the Kernel Altera recommends that you include only alphanumeric characters in your file names. Altera Corporation Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Programming Strategies for Optimizing Data Processing Efficiency 1-15 • Begin a file name with an alphanumeric character. If the file name of your OpenCL application begins with a nonalphanumeric character, compilation fails with the following error message: Error: Quartus compilation FAILED See quartus_sh_compile.log for the output log. • Do not differentiate file names using nonalphanumeric characters. The Altera Offline Compiler translates any nonalphanumeric character into an underscore ("_"). If you differentiate two file names by ending them with different nonalphanumeric characters only (for example, myKernel#.cl and myKernel&.cl), the AOC translates both file names to _.cl (for example, myKernel_.cl). • For Windows system, ensure that the combined length of the kernel file name and its file path does not exceed 260 characters. 64-bit Windows 7 and Windows 8.1 has a 260-character limit on the length of a file path. If the combined length of the kernel file name and its file path exceeds 260 characters, the AOC generates the following error message: The filename or extension is too long. The system cannot find the path specified. In addition to the AOC error message, the following error message appears in the /quartus_sh_compile.log file: Error: Can’t copy files: Can’t open for write: No such file or directory • Do not name your .cl OpenCL kernel source file "kernel". Naming the source file kernel.cl causes the AOC to generate intermediate design files that have the same names as certain internal files, which leads to an compilation error. Programming Strategies for Optimizing Data Processing Efficiency Optimize the data processing efficiency of your kernel by implementing strategies such as unrolling loops, setting work-group sizes, and specifying compute units and work-items. Unrolling a Loop The Altera Offline Compiler might unroll simple loops even if they are not annotated by a pragma. To direct the AOC to unroll a loop, insert an unroll kernel pragma in the kernel code preceding a loop you wish to unroll. Attention: Altera SDK for OpenCL Programming Guide Send Feedback Altera Corporation 1-16 Specifying Work-Group Sizes UG-OCL002 2016.05.02 • Provide an unroll factor whenever possible. To specify an unroll factor N, insert the #pragma unroll directive before a loop in your kernel code. The AOC attempts to unroll the loop at most times. Consider the code fragment below. By assigning a value of 2 as an argument to #pragma unroll, you direct the AOC to unroll the loop twice. #pragma unroll 2 for(size_t k = 0; k < 4; k++) { mac += data_in[(gid * 4) + k] * coeff[k]; } • To unroll a loop fully, you may omit the unroll factor by simply inserting the #pragma unroll directive before a loop in your kernel code. The AOC attempts to unroll the loop fully if it understands the trip count. The AOC issues a warning if it cannot execute the unroll request. Specifying Work-Group Sizes Specify a maximum or required work-group size whenever possible. The Altera Offline Compiler relies on this specification to optimize hardware usage of the OpenCL kernel without involving excess logic. If you do not specify a max_work_group_size or a reqd_work_group_size attribute in your kernel, the work-group size assumes a default value depending on compilation time and runtime constraints. • If your kernel contains a barrier, the AOC sets a default maximum work-group size of 256 work-items. • If your kernel contains a barrier or refers to the local work-item ID, or if you query the work-group size in your host code, the runtime defaults the work-group size to one work-item. • If your kernel does not contain a barrier or refer to the local work-item ID, or if your host code does not query the work-group size, the runtime defaults the work-group size to the global NDRange size. Altera Corporation Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Specifying Number of Compute Units 1-17 To specify the work-group size, modify your kernel code in the following manner: • To specify the maximum number of work-items that the AOC may allocate to a work-group in a kernel, insert the max_work_group_size(N) attribute in your kernel source code. For example: __attribute__((max_work_group_size(512))) __kernel void sum (__global const float * restrict a, __global const float * restrict b, __global float * restrict answer) { size_t gid = get_global_id(0); answer[gid] = a[gid] + b[gid]; } • To specify the required number of work-items that the AOC allocates to a work-group in a kernel, insert the reqd_work_group_size(X, Y, Z) attribute to your kernel source code. For example: __attribute__((reqd_work_group_size(64,1,1))) __kernel void sum (__global const float * restrict a, __global const float * restrict b, __global float * restrict answer) { size_t gid = get_global_id(0); answer[gid] = a[gid] + b[gid]; } The AOC allocates the exact amount of hardware resources to manage the work-items in a workgroup. Specifying Number of Compute Units To increase the data-processing efficiency of an OpenCL kernel, you can instruct the Altera Offline Compiler to generate multiple kernel compute units. Each compute unit is capable of executing multiple work-groups simultaneously. Caution: Multiplying the number of kernel compute units increases data throughput at the expense of global memory bandwidth contention among compute units. • To specify the number of compute units for a kernel, insert the num_compute_units(N) attribute in the kernel source code. For example, the code fragment below directs the AOC to instantiate two compute units in a kernel: __attribute__((num_compute_units(2))) __kernel void test(__global const float * restrict a, __global const float * restrict b, __global float * restrict answer) { size_t gid = get_global_id(0); answer[gid] = a[gid] + b[gid]; } The AOC distributes work-groups across the specified number of compute units. Specifying Number of SIMD Work-Items To increase the data-processing efficiency of an OpenCL kernel, specify the number of work-items within a work-group that the Altera Offline Compiler executes in a single instruction multiple data (SIMD) manner. Altera SDK for OpenCL Programming Guide Send Feedback Altera Corporation 1-18 UG-OCL002 2016.05.02 Programming Strategies for Optimizing Memory Access Efficiency Important: Introduce the num_simd_work_items attribute in conjunction with the reqd_work_group_size attribute. The num_simd_work_items attribute you specify must evenly divide the work-group size you specify for the reqd_work_group_size attribute. • To specify the number of SIMD work-items in a work-group, insert the num_simd_work_item(N) attribute in the kernel source code. For example, the code fragment below assigns a fixed work-group size of 64 work-items to a kernel. It then consolidates the work-items within each work-group into four SIMD vector lanes: __attribute__((num_simd_work_items(4))) __attribute__((reqd_work_group_size(64,1,1))) __kernel void test(__global const float * restrict a, __global const float * restrict b, __global float * restrict answer) { size_t gid = get_global_id(0); answer[gid] = a[gid] + b[gid]; } The AOC replicates the kernel datapath according to the value you specify for num_simd_work_items whenever possible. Programming Strategies for Optimizing Memory Access Efficiency Optimize the memory access efficiency of your kernel by implementing strategies such as specifying local memory pointer size and specifying global memory buffer location. Specifying Pointer Size in Local Memory Optimize local memory hardware footprint (that is, size) by specifying a pointer size in bytes. • To specify a pointer size other than the default size of 16 kilobytes (kB), include the local_mem_size(N) attribute in the pointer declaration within your kernel source code. For example: __kernel void myLocalMemoryPointer( __local float * A, __attribute__((local_mem_size(1024))) __local float * B, __attribute__((local_mem_size(32768))) __local float * C) { //statements } In the myLocalMemoryPointer kernel, 16 kB of local memory (default) is allocated to pointer A, 1 kB is allocated to pointer B, and 32 kB is allocated to pointer C. Specifying Buffer Location in Global Memory Specify the global memory type to which the host allocates a buffer. 1. Determine the names of the global memory types available on your FPGA board in the following manners: Altera Corporation Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Implementing the Altera SDK for OpenCL Channels Extension 1-19 • Refer to the board vendor's documentation for more information. • Find the names in the board_spec.xml file of your board Custom Platform. For each global memory type, the name is the unique string assigned to the name attribute of the global_mem element. 2. To instruct the host to allocate a buffer to a specific global memory type, insert the buffer_location(" ") attribute, where is the name of the global memory type provided by your board vendor. For example: __kernel void foo(__global __attribute__((buffer_location("DDR"))) int *x, __global __attribute__((buffer_location("QDR"))) int *y) If you do not specify the buffer_location attribute, the host allocates the buffer to the default memory type automatically. To determine the default memory type, consult the documentation provided by your board vendor. Alternatively, in the board_spec.xml file of your Custom Platform, search for the memory type that is defined first or has the attribute default=1 assigned to it. Altera recommends that you define the buffer_location attribute in a preprocessor macro for ease of reuse, as shown below: #define QDR\ __global\ __attribute__((buffer_location("QDR"))) #define DDR\ __global\ __attribute__((buffer_location("DDR"))) __kernel void foo (QDR uint * data, DDR uint * lup) { //statements } Attention: If you assign a kernel argument to a non-default memory (for example, QDR uint * data and DDR uint * lup from the code above), you cannot declare that argument using the const keyword. In addition, you cannot perform atomic operations with pointers derived from that argument. Implementing the Altera SDK for OpenCL Channels Extension The Altera SDK for OpenCL channels extension provides a mechanism for passing data to kernels and synchronizing kernels with high efficiency and low latency. Attention: If you want to leverage the capabilities of channels but have the ability to run your kernel program using other SDKs, implement OpenCL pipes instead. Related Information Implementing OpenCL Pipes on page 1-36 Overview of the AOCL Channels Extension The Altera SDK for OpenCL channels extension allows kernels to communicate directly with each other via FIFO buffers. Implementation of channels decouples kernel execution from the host processor. Unlike the typical OpenCL execution model, the host does not need to coordinate data movement across kernels. Altera SDK for OpenCL Programming Guide Send Feedback Altera Corporation 1-20 UG-OCL002 2016.05.02 Channel Data Behavior Figure 1-5: Overview of Channels Implementation Host Processor Initialize ( ) FIFO Kernel 0 FIFO Kernel 1 FIFO FIFO Kernel 2 FIFO Kernel N I/O Channel FIFO I/O Channel RAM Channel Data Behavior Data written to a channel remains in a channel as long as the kernel program remains loaded on the FPGA device. In other words, data written to a channel persists across multiple work-groups and NDRange invocations. However, data is not persistent across multiple or different invocations of kernel programs. Consider the following code example: #pragma OPENCL EXTENSION cl_altera_channels : enable channel int c0; __kernel void producer() { for(int i=0; i < 10; i++) { write_channel_altera(c0, i); } } __kernel void consumer( __global uint * restrict dst ) { for(int i=0; i < 5; i++) { dst[i] = read_channel_altera(c0); } } Figure 1-6: Channel Data FIFO Ordering Producer Altera Corporation 9 8 7 6 5 4 3 2 1 0 Consumer Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Multiple Work-Item Ordering for Channels 1-21 The kernel producer writes ten elements ([0, 9]) to the channel. The kernel consumer reads five elements from the channel per NDRange invocation. During the first invocation, the kernel consumer reads values 0 to 4 from the channel. Because the data persists across NDRange invocations, the second time you execute the kernel consumer, it reads values 5 to 9. For this example, to avoid a deadlock from occurring, you need to invoke the kernel consumer twice for every invocation of the kernel producer. If you call consumer less than twice, producer stalls because the channel becomes full. If you call consumer more than twice, consumer stalls because there is insufficient data in the channel. Multiple Work-Item Ordering for Channels The OpenCL specification does not define a work-item ordering. The Altera SDK for OpenCL enforces a work-item order to maintain the consistency in channel read and write operations. Multiple work-item accesses to a channel can be useful in some scenarios. For example, they are useful when data words in the channel are independent, or when the channel is implemented for control logic. The main concern regarding multiple work-item accesses to a channel is the order in which the kernel writes data to and reads data from the channel. If possible, the AOCL channels extension processes workitems read and write operations to the channel in a deterministic order. As such, the read and write operations remain consistent across kernel invocations. Requirements for Deterministic Multiple Work-Item Ordering To guarantee deterministic ordering, the AOCL checks that the channel call is work-item invariant based on the following characteristics: • All paths through the kernel must execute the channel call. • If the first requirement is not satisfied, none of the branch conditions that reach the channel call should execute in a work-item-dependent manner. If the AOCL cannot guarantee deterministic ordering of multiple work-item accesses to a channel, it warns you that the channels might not have well-defined ordering with nondeterministic execution. Primarily, the AOCL fails to provide deterministic ordering if you have work-item-variant code on loop executions with channel calls, as illustrated below: __kernel void ordering( __global int * restrict check, __global int * restrict data ) { int condition = check[get_global_id(0)]; if(condition) { for(int i=0; i < N, i++) { process(data); write_channel_altera(req, data[i]); } } else { process(data); } } Altera SDK for OpenCL Programming Guide Send Feedback Altera Corporation 1-22 UG-OCL002 2016.05.02 Work-Item Serial Execution of Channels Work-Item Serial Execution of Channels Work-item serial execution refers to an ordered execution behavior where work-item sequential IDs determine their execution order in the compute unit. When you implement channels in a kernel, the Altera Offline Compiler enforces that kernel behavior is equivalent to having at most one work-group in flight. The AOC also ensures that the kernel executes channels in work-item serial execution, where the kernel executes work-items with smaller IDs first. A work-item has the identifier (x, y, z, group), where x, y, z are the local 3D identifiers, and group is the work-group identifier. The work-item ID (x0, y0, z0, group0) is considered to be smaller than the ID (x1, y1, z1, group1) if one of the following conditions is true: • • • • group0 < group1 group0 = group1 and z0 < z1 group0 = group1 and z0 = z1 and y0 < y1 group0 = group1 and z0 = z1 and y0 = y1 and x0 < x1 Work-items with incremental IDs execute in a sequential order. For example, the work-item with an ID (x0, y0, z0, group0) executes the write channel call first. Then, the work-item with an ID (x1, y0, z0, group0) executes the call, and so on. Defining this order ensures that the system is verifiable with external models. Channel Execution in Loop with Multiple Work-Items When channels exist in the body of a loop with multiple work-items, as shown below, each loop iteration executes prior to subsequent iterations. This implies that loop iteration 0 of each work-item in a workgroup executes before iteration 1 of each work-item in a work-group, and so on. __kernel void ordering( __global int * data ) { write_channel_altera(req, data[get_global_id(0)]); } Restrictions in the Implementation of AOCL Channels Extension There are certain design restrictions to the implementation of channels in your OpenCL application. Single Call Site Because the channel read and write operations do not function deterministically, for a given kernel, you can only assign one call site per channel ID. For example, the Altera Offline Compiler cannot compile the following code example: in_data1 = read_channel_altera(channel1); in_data2 = read_channel_altera(channel2); in_data3 = read_channel_altera(channel1); The second read_channel_altera call to channel1 causes compilation failure because it creates a second call site to channel1. To gather multiple data from a given channel, divide the channel into multiple channels, as shown below: in_data1 = read_channel_altera(channel1); in_data2 = read_channel_altera(channel2); in_data3 = read_channel_altera(channel3); Altera Corporation Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Restrictions in the Implementation of AOCL Channels Extension 1-23 Because you can only assign a single call site per channel ID, you cannot unroll loops containing channels. Consider the following code: #pragma unroll 4 for (int i=0; i < 4; i++) { in_data = read_channel_altera(channel1); } The AOC issues the following warning message during compilation: Compiler Warning: Unroll is required but the loop cannot be unrolled. Feedback and Feed-Forward Channels Channels within a kernel can be either read_only or write_only. Performance of a kernel that reads and writes to the same channel is poor. Static Indexing The Altera SDK for OpenCL channels extension does not support dynamic indexing into arrays of channel IDs. Consider the following example: #pragma OPENCL EXTENSION cl_altera_channels : enable channel int ch[WORKGROUP_SIZE]; __kernel void consumer() { int gid = get_global_id(0); int value = read_channel_altera(ch[gid]); //statements } Compilation of this example kernel fails with the following error message: Compiler Error: Indexing into channel array ch could not be resolved to all constant To avoid this compilation error, index into arrays of channel IDs statically, as shown below: #pragma OPENCL EXTENSION cl_altera_channels : enable channel int ch[WORKGROUP_SIZE]; __kernel void consumer() { int gid = get_global_id(0); int value; switch(gid) { case0: value case1: value case2: value case3: value //statements Altera SDK for OpenCL Programming Guide Send Feedback = = = = read_channel_altera(ch[0]); read_channel_altera(ch[1]); read_channel_altera(ch[2]); read_channel_altera(ch[3]); break; break; break; break; Altera Corporation 1-24 UG-OCL002 2016.05.02 Enabling the AOCL Channels for OpenCL Kernel case WORKGROUP_SIZE-1:read_channel_altera(ch[WORKGROUP_SIZE-1]); break; } //statements } Kernel Vectorization Support You cannot vectorize kernels that use channels; that is, do not include the num_simd_work_items kernel attribute in your kernel code. Vectorizing a kernel that uses channels creates multiple channel masters and requires arbitration, which the AOCL channels extension does not support. Instruction-Level Parallelism on read_channel_altera and write_channel_altera Calls If no data dependencies exist between read_channel_altera and write_channel_altera calls, the AOC attempts to execute these instructions in parallel. As a result, the AOC might execute these read_channel_altera and write_channel_altera calls in an order that does not follow the sequence expressed in the OpenCL kernel code. Consider the following code sequence: in_data1 = read_channel_altera(channel1); in_data2 = read_channel_altera(channel2); in_data3 = read_channel_altera(channel3); Because there are no data dependencies between the read_channel_altera calls, the AOC can execute them in any order. Enabling the AOCL Channels for OpenCL Kernel To implement the Altera SDK for OpenCL channels extension, modify your OpenCL kernels to include channels-specific pragma and API calls. Channel declarations are unique within a given OpenCL kernel program. Also, channel instances are unique for every OpenCL kernel program device pair. If the runtime loads a single OpenCL kernel program onto multiple devices, each device will have a single copy of the channel. However, these channel copies are independent and do not share data across the devices. Declaring the Channels OPENCL EXTENSION pragma To enable the Altera SDK for OpenCL channels extension, declare the OPENCL EXTENSION pragma for channels at the beginning of your kernel source code. • To enable the AOCL channels extension, include the following line in your kernel source code to declare the OPENCL EXTENSION pragma: #pragma OPENCL EXTENSION cl_altera_channels : enable Declaring the Channel Handle Use the channel variable to define the connectivity between kernels or between kernels and I/O. Altera Corporation Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Implementing Blocking Channel Write Extensions 1-25 To read from and write to a channel, the kernel must pass the channel variable to each of the corresponding API call. • Declare the channel handle as a file scope variable in the kernel source code in the following convention: channel For example: channel int c; • The Altera SDK for OpenCL channel extension supports simultaneous channel accesses by multiple variables declared in a data structure. Declare a struct data structure for a channel in the following manner: typedef struct type_ { int a; int b; } type_t; channel type_t foo; Implementing Blocking Channel Write Extensions The write_channel_altera API call allows you to send data across a channel. Note: The write channel calls support single-call sites only. For a given channel, only one write channel call to it can exist in the entire kernel program. • To implement a blocking channel write, include the following write_channel_altera function signature: void write_channel_altera (channel channel_id, const data); Where: channel_id identifies the buffer to which the channel connects, and it must match the channel_id of the corresponding read channel (read_channel_altera). data is the data that the channel write operation writes to the channel. Data must match the of the channel_id. defines a channel data width, which cannot be a constant. Follow the OpenCL conversion rules to ensure that data the kernel writes to a channel is convertible to . The following code snippet demonstrates the implementation of the write_channel_altera API call: //Enables the channels extension. #pragma OPENCL EXTENSION cl_altera_channels : enable //Defines chan, the kernel file-scope channel variable. channel long chan; /*Defines the kernel which reads eight bytes (size of long) from global memory, and passes this data to the channel.*/ __kernel void kernel_write_channel( __global const long * src ) { for(int i=0; i < N; i++) { //Writes the eight bytes to the channel. write_channel_altera(chan, src[i]); } } Altera SDK for OpenCL Programming Guide Send Feedback Altera Corporation 1-26 UG-OCL002 2016.05.02 Implementing Nonblocking Channel Write Extensions Caution: When you send data across a write channel using the write_channel_altera API call, keep in mind that if the channel is full (that is, if the FIFO buffer is full of data), your kernel will stall. Use the Altera SDK for OpenCL Profiler to check for channel stalls. Related Information Profiling Your OpenCL Kernel on page 1-97 Implementing Nonblocking Channel Write Extensions Perform nonblocking channel writes to facilitate applications where data write operations might not occur. A nonblocking channel write extension returns a Boolean value that indicates whether data is written to the channel. Consider a scenario where your application has one data producer with two identical workers. Assume the time each worker takes to process a message varies depending on the contents of the data. In this case, there might be situations where one worker is busy while the other is free. A nonblocking write can facilitate work distribution such that both workers are busy. • To implement a nonblocking channel write, include the following write_channel_nb_altera function signature: bool write_channel_nb_altera(channel channel_id, const data); The following code snippet of the kernel producer facilitates work distribution using the nonblocking channel write extension: #pragma OPENCL EXTENSION cl_altera_channels : enable channel long worker0, worker1; __kernel void producer( __global const long * src ) { for(int i=0; i < N; i++) { bool success = false; do { success = write_channel_nb_altera(worker0, src[i]); if(!success) { success = write_channel_nb_altera(worker1, src[i]); } } while(!success); } } Implementing Blocking Channel Read Extensions The read_channel_altera API call allows you to receive data across a channel. Altera Corporation Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Implementing Nonblocking Channel Read Extensions 1-27 Note: The read channel calls support single-call sites only. For a given channel, only one read channel call to it can exist in the entire kernel program. • To implement a blocking channel read, include the following read_channel_altera function signature: read_channel_altera(channel channel_id); Where: channel_id identifies the buffer to which the channel connects, and it must match the channel_id of the corresponding write channel (write_channel_altera). defines a channel data width, which cannot be a constant. Ensure that the variable the kernel assigns to read the channel data is convertible from . The following code snippet demonstrates the implementation of the read_channel_altera API call: //Enables the channel extension. #pragma OPENCL EXTENSION cl_altera_channels : enable; //Defines chan, the kernel file-scope channel variable. channel long chan; /*Defines the kernel, which reads eight bytes (size of long) from the channel and writes it back to global memory.*/ __kernel void kernel_read_channel( __global long * dst ); { for(int i=0; i < N; i++) { //Reads the eight bytes from the channel. dst[i] = read_channel_altera(chan); } } Caution: If the channel is empty (that is, if the FIFO buffer is empty), you cannot receive data across a read channel using the read_channel_altera API call. Doing so causes your kernel to stall. Implementing Nonblocking Channel Read Extensions Perform nonblocking reads to facilitate applications where data is not always available. The nonblocking reads signature is similar to blocking reads. However, it returns an integer value that indicates whether a read operation takes place successfully. • To implement a blocking channel write, include the following read_channel_nb_altera function signature: read_channel_nb_altera(channel channel_id, bool * valid); The following code snippet demonstrates the use of the nonblocking channel read extension: #pragma OPENCL EXTENSION cl_altera_channels : enable channel long chan; __kernel void kernel_read_channel( __global long * dst ) { int i=0; while(i < N) { bool valid0, valid1; Altera SDK for OpenCL Programming Guide Send Feedback Altera Corporation 1-28 UG-OCL002 2016.05.02 Implementing I/O Channels Using the io Channels Attribute long data0 = read_channel_nb_altera(chan, &valid0); long data1 = read_channel_nb_altera(chan, &valid1); if (valid0) { process(data0); } if (valid1) process(data1); { process(data1); } } } Implementing I/O Channels Using the io Channels Attribute Include an io attribute in your channel declaration to declare a special I/O channel to interface with input or output features of an FPGA board. These features might include network interfaces, PCIe, cameras, or other data capture or processing devices or protocols. The io("chan_id") attribute specifies the I/O feature of an accelerator board with which a channel interfaces, where chan_id is the name of the I/O interface listed in the board_spec.xml file of your Custom Platform. Because peripheral interface usage might differ for each device type, consult your board vendor's documentation when you implement I/O channels in your kernel program. Your OpenCL kernel code must be compatible with the type of data generated by the peripheral interfaces. Caution: • Implicit data dependencies might exist for channels that connect to the board directly and communicate with peripheral devices via I/O channels. These implicit data dependencies might lead to compilation issues because the Altera Offline Compiler cannot identify these dependencies. • External I/O channels communicating with the same peripherals do not obey any sequential ordering. Ensure that the external device does not require sequential ordering because unexpected behavior might occur. 1. Consult the board_spec.xml file in your Custom Platform to identify the input and output features available on your FPGA board. For example, a board_spec.xml file might include the following information on I/O features: Altera Corporation port="udp0_out" port="udp0_in" port="udp1_out" port="udp1_in" type="streamsource" width="256" type="streamsink" width="256" type="streamsource" width="256" type="streamsink" width="256" Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Implementing I/O Channels Using the io Channels Attribute 1-29 The width attribute of an interface element specifies the width, in bits, of the data type used by that channel. For the example above, both the uint and float data types are 32 bits wide. Other bigger or vectorized data types must match the appropriate bit width specified in the board_spec.xml file. 2. Implement the io channel attribute as demonstrated in the following code example. The io channel attribute names must match those of the I/O channels (chan_id) specified in the board_spec.xml file. channel QUDPWord udp_in_IO __attribute__((depth(0))) __attribute__((io("eth0_in"))); channel QUDPWord udp_out_IO __attribute__((depth(0))) __attribute__((io("eth0_out"))); __kernel void io_in_kernel( __global ulong4 *mem_read, uchar read_from, int size ) { int index = 0; ulong4 data; int half_size = size >> 1; while (index < half_size) { if (read_from & 0x01) { data = read_channel_altera(udp_in_IO); } else { data = mem_read[index]; } write_channel_altera(udp_in, data); index++; } } __kernel void io_out_kernel( __global ulong2 *mem_write, uchar write_to, int size ) { int index = 0; ulong4 data; int half_size = size >> 1; while (index < half_size) { ulong4 data = read_channel_altera(udp_out); if (write_to & 0x01) { write_channel_altera(udp_out_IO, data); } else { //only write data portion ulong2 udp_data; udp_data.s0 = data.s0; udp_data.s1 = data.s1; mem_write[index] = udp_data; } index++; } } Attention: Declare a unique io("chan_id") handle for each I/O channel specified in the channels eXtensible Markup Language (XML) element within the board_spec.xml file. Altera SDK for OpenCL Programming Guide Send Feedback Altera Corporation 1-30 UG-OCL002 2016.05.02 Implementing Buffered Channels Using the depth Channels Attribute Implementing Buffered Channels Using the depth Channels Attribute You may have buffered or unbuffered channels in your kernel program. If there are imbalances in channel read and write operations, create buffered channels to prevent kernel stalls by including the depth attribute in your channel declaration. Buffered channels decouple the operation of concurrent work-items executing in different kernels. You may use a buffered channel to control data traffic, such as limiting throughput or synchronizing accesses to shared memory. In an unbuffered channel, a write operation cannot proceed until the read operation reads a data value. In a buffered channel, a write operation cannot proceed until the data value is copied to the buffer. If the buffer is full, the operation cannot proceed until the read operation reads a piece of data and removes it from the channel. • If you expect any temporary mismatch between the consumption rate and the production rate to the channel, set the buffer size using the depth channel attribute. The following example demonstrates the use of the depth channel attribute in kernel code that implements the Altera SDK for OpenCL channels extension. The depth(N) attribute specifies the minimum depth of a buffered channel, where N is the number of data values. #pragma OPENCL EXTENSION cl_altera_channels : enable channel int c __attribute__((depth(10))); __kernel void producer( __global int * in_data ) { for(int i=0; i < N; i++) { if(in_data[i]) { write_channel_altera(c, in_data[i]); } } } __kernel void consumer( __global int * restrict check_data, __global int * restrict out_data ) { int last_val = 0; for(int i=0; i< N, i++) { if(check_data[i]) { last_val = read_channel_altera(c); } out_data[i] = last_val; } } In this example, the write operation can write ten data values to the channel without blocking. Once the channel is full, the write operation cannot proceed until an associated read operation to the channel occurs. Because the channel read and write calls are conditional statements, the channel might experience an imbalance between read and write calls. You may add a buffer capacity to the channel to ensure that the producer and consumer kernels are decoupled. This step is particularly important if the producer kernel is writing data to the channel when the consumer kernel is not reading from it. Altera Corporation Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Enforcing the Order of Channel Calls 1-31 Enforcing the Order of Channel Calls To enforce the order of channel calls, introduce memory fence or barrier functions in your kernel program to control memory accesses. A memory fence function is necessary to create a control flow dependence between the channel synchronization calls before and after the fence. When the Altera Offline Compiler generates a compute unit, it does not create instruction-level parallelism on all instructions that are independent of each other. As a result, channel read and write operations might not execute independently of each other even if there is no control or data dependence between them. When channel calls interact with each other, or when channels write data to external devices, deadlocks might occur. For example, the code snippet below consists of a producer kernel and a consumer kernel. Channels c0 and c1 are unbuffered channels. The schedule of the channel read operations from c0 and c1 might occur in the reversed order as the channel write operations to c0 and c1. That is, the producer kernel writes to c0 but the consumer kernel might read from c1 first. This rescheduling of channel calls might cause a deadlock because the consumer kernel is reading from an empty channel. __kernel void producer( __global const uint * src, const uint iterations ) { for(int i=0; i < iterations; i++) { write_channel_altera(c0, src[2*i]); write_channel_altera(c1, src[2*i+1]); } } __kernel void consumer( __global uint * dst, const uint iterations ) { for(int i=0; i < iterations; i++) { /*During compilation, the AOC might reorder the way the consumer kernel writes to memory to optimize memory access. Therefore, c1 might be read before c0, which is the reverse of what appears in code.*/ dst[2*i+1] = read_channel_altera(c0); dst[2*i] = read_channel_altera(c1); Altera SDK for OpenCL Programming Guide Send Feedback Altera Corporation 1-32 UG-OCL002 2016.05.02 Defining Memory Consistency Across Kernels When Using Channels } } • To prevent deadlocks from occurring by enforcing the order of channel calls, include memory fence functions (mem_fence) in your kernel. Inserting the mem_fence call with each kernel's channel flag forces the sequential ordering of the write and read channel calls. The code snippet below shows the modified producer and consumer kernels: #pragma OPENCL EXTENSION cl_altera_channels : enable channel uint c0 __attribute__((depth(0))); channel uint c1 __attribute__((depth(0))); __kernel void producer( __global const uint * src, const uint iterations ) { for(int i=0; i < iterations; i++) { write_channel_altera(c0, src[2*i]); mem_fence(CLK_CHANNEL_MEM_FENCE); write_channel_altera(c1, src[2*i+1]); } } __kernel void consumer( __global uint * dst; const uint iterations ) { for(int i=0; i < iterations; i++) { dst[2*i+1] = read_channel_altera(c0); mem_fence(CLK_CHANNEL_MEM_FENCE); dst[2*i] = read_channel_altera(c1); } } In this example, mem_fence in the producer kernel ensures that the channel write operation to c0 occurs before that to c1. Similarly, mem_fence in the consumer kernel ensures that the channel read operation from c0 occurs before that from c1. Defining Memory Consistency Across Kernels When Using Channels According to the OpenCL Specification version 1.0, memory behavior is undefined unless a kernel completes execution. A kernel must finish executing before other kernels can visualize any changes in memory behavior. However, kernels that use channels can share data through common global memory buffers and synchronized memory accesses. To ensure that data written to a channel is visible to the read channel after execution passes a memory fence, define memory consistency across kernels with respect to memory fences. Altera Corporation Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Use Models of AOCL Channels Implementation 1-33 • To create a control flow dependency between the channel synchronization calls and the memory operations, add the CLK_GLOBAL_MEM_FENCE flag to the mem_fence call. For example: __kernel void producer( __global const uint * src, const uint iterations ) { for(int i=0; i < iterations; i++) { write_channel_altera(c0, src[2*i]); mem_fence(CLK_CHANNEL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE); write_channel_altera(c1, src[2*i+1]); } } In this kernel, the mem_fence function ensures that the write operation to c0 and memory access to src[2*i] occur before the write operation to c1 and memory access to src[2*i+1]. This allows data written to c0 to be visible to the read channel before data is written to c1. Use Models of AOCL Channels Implementation Concurrent execution can improve the effectiveness of channels implementation in your OpenCL kernels. During concurrent execution, the host launches the kernels in parallel. The kernels share memory and can communicate with each other through channels where applicable. The use models provide an overview on how to exploit concurrent execution safely and efficiently. Feed-Forward Design Model Implement the feed-forward design model to send data from one kernel to the next without creating any cycles between them. Consider the following code example: __kernel void producer( __global const uint * src, const uint iterations ) { for(int i=0; i < iterations; i++) { write_channel_altera(c0, src[2*i]); mem_fence(CLK_CHANNEL_MEM_FENCE); write_channel_altera(c1, src[2*i+1]); } } __kernel void consumer( __global uint * dst, const uint iterations ) { for (int i=0;i #include #include #include "CL/opencl.h" #define SIZE 1000 const char *kernel_source = "__kernel void pipe_writer(__global int *in," " write_only pipe int p_in)\n" "{\n" " int gid = get_global_id(0);\n" Altera SDK for OpenCL Programming Guide Send Feedback Altera Corporation 1-42 UG-OCL002 2016.05.02 Ensuring Compatibility with Other OpenCL SDKs " write_pipe(p_in, &in[gid]);\n" "}\n" "__kernel void pipe_reader(__global int *out," " read_only pipe int p_out)\n" "{\n" " int gid = get_global_id(0);\n" " read_pipe(p_out, &out[gid]);\n" "}\n"; int main() { int *input = (int *)malloc(sizeof(int) * SIZE); int *output = (int *)malloc(sizeof(int) * SIZE); memset(output, 0, sizeof(int) * SIZE); for (int i = 0; i != SIZE; ++i) { input[i] = rand(); } cl_int status; cl_platform_id platform; cl_uint num_platforms; status = clGetPlatformIDs(1, &platform, &num_platforms); cl_device_id device; cl_uint num_devices; status = clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, &num_devices); cl_context context = clCreateContext(0, 1, &device, NULL, NULL, &status); cl_command_queue queue = clCreateCommandQueue(context, device, 0, &status); size_t len = strlen(kernel_source); cl_program program = clCreateProgramWithSource(context, 1, (const char **)&kernel_source, &len, &status); status = clBuildProgram(program, num_devices, &device, "", NULL, NULL); cl_kernel pipe_writer = clCreateKernel(program, "pipe_writer", &status); cl_kernel pipe_reader = clCreateKernel(program, "pipe_reader", &status); cl_mem in_buffer = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(int) * SIZE, input, &status); cl_mem out_buffer = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(int) * SIZE, NULL, &status); cl_mem pipe = clCreatePipe(context, 0, sizeof(cl_int), SIZE, NULL, &status); status status status status = = = = clSetKernelArg(pipe_writer, clSetKernelArg(pipe_writer, clSetKernelArg(pipe_reader, clSetKernelArg(pipe_reader, 0, 1, 0, 1, sizeof(cl_mem), sizeof(cl_mem), sizeof(cl_mem), sizeof(cl_mem), &in_buffer); &pipe); &out_buffer); &pipe); size_t size = SIZE; Altera Corporation Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Ensuring Compatibility with Other OpenCL SDKs 1-43 cl_event sync; status = clEnqueueNDRangeKernel(queue, pipe_writer, 1, NULL, &size, &size, 0, NULL, &sync); status = clEnqueueNDRangeKernel(queue, pipe_reader, 1, NULL, &size, &size, 1, &sync, NULL); status = clFinish(queue); status = clEnqueueReadBuffer(queue, out_buffer, CL_TRUE, 0, sizeof(int) * SIZE, output, 0, NULL, NULL); int golden = 0, result = 0; for (int i = 0; i != SIZE; ++i) { golden += input[i]; result += output[i]; } int ret = 0; if (golden != result) { printf("FAILED!"); ret = 1; } else { printf("PASSED!"); } printf("\n"); return ret; } Kernel Code Modification If your kernel code runs on OpenCL SDKs that conforms to the OpenCL Specification version 2.0, you must modify it before running it on the AOCL. To modify the kernel code, perform the following modifications: • Rename the pipe arguments so that they are the same in both kernels. For example, rename p_in and p_out to p. • Specify the depth attribute for the pipe arguments. Assign a depth attribute value that equals to the maximum number of packets that the pipe creates to hold in the host. • Execute the kernel program in the offline compilation mode because the AOCL has an offline compiler. Altera SDK for OpenCL Programming Guide Send Feedback Altera Corporation 1-44 UG-OCL002 2016.05.02 Declaring the Pipe Handle The modified kernel code appears as follows: #define SIZE 1000 __kernel void pipe_writer(__global int *in, write_only pipe int __attribute__((depth(SIZE))) p) { int gid = get_global_id(0); write_pipe(p, &in[gid]); } __kernel void pipe_reader(__global int *out, read_only pipe int __attribute__((depth(SIZE))) p) { int gid = get_global_id(0); read_pipe(p, &out[gid]); } Declaring the Pipe Handle Use the pipe variable to define the static pipe connectivity between kernels or between kernels and I/O. To read from and write to a pipe, the kernel must pass the pipe variable to each of the corresponding API call. • Declare the pipe handle as a file scope variable in the kernel source code in the following convention: pipe The of the pipe may be any OpenCL built-in scalar or vector data type with a scalar size of 1024 bits or less. It may also be any user-defined type that is comprised of scalar or vector data type with a scalar size of 1024 bits or less. Consider the following pipe handle declarations: __kernel void first (pipe int c) __kernel void second (write_only pipe int c) The first example declares a read-only pipe handle of type int in the kernel first. The second example declares a write-only pipe in the kernel second. The kernel first may only read from pipe c, and the kernel second may only write to pipe c. Important: The Altera Offline Compiler statically infers the connectivity of pipes in your system by matching the names of the pipe arguments. In the example above, the kernel first is connected to the kernel second by the pipe c. In an Altera OpenCL system, only one kernel may read to a pipe. Similarly, only one kernel may write to a pipe. If a non-I/O pipe does not have at least one corresponding reading operation and one writing operation, the AOC issues an error. For more information in the Altera SDK for OpenCL I/O pipe implementation, refer to Implementing I/O Pipes Using the io Attribute. Related Information Implementing I/O Pipes Using the io Attribute on page 1-47 Implementing Pipe Writes The write_pipe API call allows you to send data across a pipe. Altera Corporation Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Implementing Pipe Reads 1-45 Altera only supports the convenience version of the write_pipe function. By default, write_pipe calls are nonblocking. Pipe write operations are successful only if there is capacity in the pipe to hold the incoming packet. Attention: The write pipe calls support single-call sites only. For a given pipe, only one write pipe call to it can exist in the entire kernel program. • To implement a pipe write, include the following write_pipe function signature: int write_pipe (write_only pipe pipe_id, const *data); Where: pipe_id identifies the buffer to which the pipe connects, and it must match the pipe_id of the corresponding read pipe (read_pipe). data is the data that the pipe write operation writes to the pipe. It is a pointer to the packet type of the pipe. Note that writing to the pipe might lead to a global or local memory load, depending on the source address space of the data pointer. defines a pipe data width. The return value indicates whether the pipe write operation is successful. If successful, the return value is 0. If pipe write is unsuccessful, the return value is -1. The following code snippet demonstrates the implementation of the write_pipe API call: /*Declares the writable nonblocking pipe, p, which contains packets of type int*/ __kernel void kernel_write_pipe (__global const long *src, write_only pipe int p) { for (int i=0; i < N; i++) { //Performs the actual writing //Emulates blocking behavior via the use of a while loop while (write_pipe(p, &src[i]) < 0) { } } } The while loop is unnecessary if you specify a blocking attribute. To facilitate better hardware implementations, Altera provides facility for blocking write_pipe calls by specifying the blocking attribute (that is, __attribute__((blocking))) on the pipe arugment declaration for the kernel. Blocking write_pipe calls always return success. Caution: When you send data across a blocking write pipe using the write_pipe API call, keep in mind that if the pipe is full (that is, if the FIFO buffer is full of data), your kernel will stall. Use the Altera SDK for OpenCL Profiler to check for pipe stalls. Related Information Profiling Your OpenCL Kernel on page 1-97 Implementing Pipe Reads The read_pipe API call allows you to receive data across a pipe. Altera only supports the convenience version of the read_pipe function. By default, read_pipe calls are nonblocking. Altera SDK for OpenCL Programming Guide Send Feedback Altera Corporation 1-46 UG-OCL002 2016.05.02 Implementing Buffered Pipes Using the depth Attribute Note: The read pipe calls support single-call sites only. For a given pipe, only one read pipe call to it can exist in the entire kernel program. • To implement a pipe read, include the following read_pipe function signature: int read_pipe (read_only_pipe pipe_id, *data); Where: pipe_id identifies the buffer to which the pipe connects, and it must match the pipe_id of the corresponding pipe write operation (write_pipe). data is the data that the pipe read operation reads from the pipe. It is a pointer to the location of the data. Note that write_pipe call might lead to a global or local memory load, depending on the source address space of the data pointer. defines the packet size of the data. The following code snippet demonstrates the implementation of the read_pipe API call: /*Declares the read_only_pipe that contains packets of type long.*/ /*Declares that read_pipe calls within the kernel will exhibit blocking behavior*/ __kernel void kernel_read_pipe (__global long *dst, read_only pipe long __attribute__((blocking)) p) { for (int i=0; i < N; i++) { /*Reads from a long from the pipe and stores it into global memory at the specified location*/ read_pipe(p, &dst[i]); } } To facilitate better hardware implementations, Altera provides facility for blocking write_pipe calls by specifying the blocking attribute (that is, __attribute__((blocking))) on the pipe arugment declaration for the kernel. Blocking write_pipe calls always return success. Caution: If the pipe is empty (that is, if the FIFO buffer is empty), you cannot receive data across a blocking read pipe using the read_pipe API call. Doing so causes your kernel to stall. Implementing Buffered Pipes Using the depth Attribute You may have buffered or unbuffered pipes in your kernel program. If there are imbalances in pipe read and write operations, create buffered pipes to prevent kernel stalls by including the depth attribute in your pipe declaration. Buffered pipes decouple the operation of concurrent work-items executing in different kernels. You may use a buffered pipe to control data traffic, such as limiting throughput or synchronizing accesses to shared memory. In an unbuffered pipe, a write operation can only proceed when the read operation is expecting to read data. Use unbuffered pipes in conjunction with blocking read and write behaviors in kernels that execute concurrently. The unbuffered pipes provide self-synchronizing data transfers efficiently. In a buffered pipe, a write operation can only proceed if there is capacity in the pipe to hold the incoming packet. A read operation can only proceed if there is at least one packet in the pipe. Altera Corporation Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Implementing I/O Pipes Using the io Attribute 1-47 Use buffered pipes if pipe calls are predicated differently in the writer and reader kernels, and the kernels do not execute concurrently. • If you expect any temporary mismatch between the consumption rate and the production rate to the pipe, set the buffer size using the depth attribute. The following example demonstrates the use of the depth attribute in kernel code that implements the OpenCL pipes. The depth(N) attribute specifies the minimum depth of a buffered pipe, where N is the number of data values. If the read and write kernels specify different depths for a given buffered pipe, the Altera Offline Compiler will use the larger depth of the two. __kernel void producer (__global int *in_data, write_only pipe int __attribute__((blocking)) __attribute__((depth(10))) c) { for (i=0; i < N; i++) { if (in_data[i]) { write_pipe( c, &in_data[i] ); } } } __kernel void consumer (__global int *check_data, __global int *out_data, read_only pipe int __attribute__((blocking)) c ) { int last_val = 0; for (i=0; i < N; i++) { if (check_data[i]) { read_pipe( c, &last_val ); } out_data[i] = last_val; } } In this example, the write operation can write ten data values to the pipe successfully. After the pipe is full, the write kernel returns failure until a read kernel consumes some of the data in the pipe. Because the pipe read and write calls are conditional statements, the pipe might experience an imbalance between read and write calls. You may add a buffer capacity to the pipe to ensure that the producer and consumer kernels are decoupled. This step is particularly important if the producer kernel is writing data to the pipe when the consumer kernel is not reading from it. Implementing I/O Pipes Using the io Attribute Include an io attribute in your OpenCL pipe declaration to declare a special I/O pipe to interface with input or output features of an FPGA board. These features might include network interfaces, PCIe, cameras, or other data capture or processing devices or protocols. In the Altera SDK for OpenCL channels extension, the io("chan_id") attribute specifies the I/O feature of an accelerator board with which a channel interfaces. The chan_id argument is the name of the I/O Altera SDK for OpenCL Programming Guide Send Feedback Altera Corporation 1-48 UG-OCL002 2016.05.02 Enforcing the Order of Pipe Calls interface listed in the board_spec.xml file of your Custom Platform. The same I/O features can be used to identify I/O pipes. Because peripheral interface usage might differ for each device type, consult your board vendor's documentation when you implement I/O pipes in your kernel program. Your OpenCL kernel code must be compatible with the type of data generated by the peripheral interfaces. If there is a difference in the byte ordering between the external I/O pipes and the kernel, the Altera Offline Compiler converts the byte ordering seamlessly upon entry and exit. Caution: • Implicit data dependencies might exist for pipes that connect to the board directly and communicate with peripheral devices via I/O pipes. These implicit data dependencies might lead to compilation issues because the AOC cannot identify these dependencies. • External I/O pipes communicating with the same peripherals do not obey any sequential ordering. Ensure that the external device does not require sequential ordering because unexpected behavior might occur. 1. Consult the board_spec.xml file in your Custom Platform to identify the input and output features available on your FPGA board. For example, a board_spec.xml file might include the following information on I/O features: port="udp0_out" port="udp0_in" port="udp1_out" port="udp1_in" type="streamsource" width="256" type="streamsink" width="256" type="streamsource" width="256" type="streamsink" width="256" The width attribute of an interface element specifies the width, in bits, of the data type used by that pipe. For the example above, both the uint and float data types are 32 bits wide. Other bigger or vectorized data types must match the appropriate bit width specified in the board_spec.xml file. 2. Implement the io attribute as demonstrated in the following code example. The io attribute names must match those of the I/O channels (chan_id) specified in the board_spec.xml file. __kernel void test (pipe uint pkt __attribute__((io(“enet”))),; pipe float data __attribute__((io(“pcie”)))); Attention: Declare a unique io("chan_id") handle for each I/O pipe specified in the channels XML element within the board_spec.xml file. Enforcing the Order of Pipe Calls To enforce the order of pipe calls, introduce memory fence or barrier functions in your kernel program to control memory accesses. A memory fence function is necessary to create a control flow dependence between the pipe synchronization calls before and after the fence. When the Altera Offline Compiler generates a compute unit, it does not create instruction-level parallelism on all instructions that are independent of each other. As a result, pipe read and write operations might not execute independently of each other even if there is no control or data dependence between them. When pipe calls interact with each other, or when pipes write data to external devices, deadlocks might occur. For example, the code snippet below consists of a producer kernel and a consumer kernel. Pipes c0 and c1 are unbuffered pipes. The schedule of the pipe read operations from c0 and c1 might occur in the Altera Corporation Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Enforcing the Order of Pipe Calls 1-49 reversed order as the pipe write operations to c0 and c1. That is, the producer kernel writes to c0 but the consumer kernel might read from c1 first. This rescheduling of pipe calls might cause a deadlock because the consumer kernel is reading from an empty pipe. __kernel void producer (__global const uint * restrict src, const uint iterations, write_only pipe uint __attribute__((blocking)) c0, write_only pipe uint __attribute__((blocking)) c1) { for (int i=0; i < iterations; i++) { write_pipe( c0, &src[2*i ] ); write_pipe( c1, &src[2*i+1] ); } } __kernel void consumer (__global uint * restrict dst, const uint iterations, read_only pipe uint __attribute__((blocking)) c0, read_only pipe uint __attribute__((blocking)) c1) { for (int i=0; i < iterations; i++) { read_pipe( c0, &dst[2*i+1] ); read_pipe( c1, &dst[2*i] ); } } • To prevent deadlocks from occurring by enforcing the order of pipe calls, include memory fence functions (mem_fence) in your kernel. Inserting the mem_fence call with each kernel's pipe flag forces the sequential ordering of the write and read pipe calls. The code snippet below shows the modified producer and consumer kernels: __kernel void producer (__global const uint * src, const uint iterations, write_only_pipe uint __attribute__((blocking)) c0, write_only_pipe uint __attribute__((blocking)) c1) { for(int i=0; i < iterations; i++) { write_pipe(c0, &src[2*i ]); mem_fence(CLK_CHANNEL_MEM_FENCE); write_pipe(c1, &src[2*i+1]); } } __kernel void consumer (__global uint * dst; const uint iterations, read_only_pipe uint __attribute__((blocking)) c0, read_only_pipe uint __attribute__((blocking)) c1) { for(int i=0; i < iterations; i++) { read_pipe(c0, &dst[2*i ]); mem_fence(CLK_CHANNEL_MEM_FENCE); read_pipe(c1, &dst[2*i+1]); } } In this example, mem_fence in the producer kernel ensures that the pipe write operation to c0 occurs before that to c1. Similarly, mem_fence in the consumer kernel ensures that the pipe read operation from c0 occurs before that from c1. Altera SDK for OpenCL Programming Guide Send Feedback Altera Corporation 1-50 UG-OCL002 2016.05.02 Defining Memory Consistency Across Kernels When Using Pipes Defining Memory Consistency Across Kernels When Using Pipes According to the OpenCL Specification version 2.0, memory behavior is undefined unless a kernel completes execution. A kernel must finish executing before other kernels can visualize any changes in memory behavior. However, kernels that use pipes can share data through common global memory buffers and synchronized memory accesses. To ensure that data written to a pipe is visible to the read pipe after execution passes a memory fence, define memory consistency across kernels with respect to memory fences. • To create a control flow dependency between the pipe synchronization calls and the memory operations, add the CLK_GLOBAL_MEM_FENCE flag to the mem_fence call. For example: __kernel void producer (__global const uint * restrict src, const uint iterations, write_only pipe uint __attribute__((blocking)) c0, write_only pipe uint __attribute__((blocking)) c1) { for (int i=0;i to 1. If is FPGA_board_1, the Altera Offline Compiler will compile the FPGA_board_1-specific parameters and features. • To introduce AOC-specific compiler features and optimizations, structure your kernel program in the following manner: #if defined(ALTERA_CL) //statements #else //statements #endif Where ALTERA_CL is the Altera predefined preprocessor macro for the AOC. Altera Corporation Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Declaring __constant Address Space Qualifiers 1-51 Related Information Defining Preprocessor Macros to Specify Kernel Parameters (-D ) on page 1-83 Declaring __constant Address Space Qualifiers There are several limitations and workarounds you must consider when you include __constant address space qualifiers in your kernel. Function Scope __constant Variables The Altera Offline Compiler does not support function scope __constant variables. Replace function scope __constant variables with file scope constant variables. You can also replace function scope __constant variables with __constant buffers that the host passes to the kernel. File Scope __constant Variables If the host always passes the same constant data to your kernel, consider declaring that data as a constant preinitialized file scope array within the kernel file. Declaration of a constant preinitialized file scope array creates a ROM directly in the hardware to store the data. This ROM is available to all work-items in the NDRange. The AOC supports only scalar file scope constant data. For example, you may set the __constant address space qualifier as follows: __constant int my_array[8] = {0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7}; __kernel void my_kernel (__global int * my_buffer) { size_t gid = get_global_id(0); my_buffer[gid] += my_array[gid % 8]; } In this case, the AOC sets the values for my_array in a ROM because the file scope constant data does not change between kernel invocations. Warning: Do not set your file scope __constant variables in the following manner because the AOC does not support vector type __constant arrays declared at the file scope: __constant int2 my_array[4] = {(0x0, 0x1), (0x2, 0x3); (0x4, 0x5), (0x6, 0x7)}; Pointers to __constant Parameters from the Host You can replace file scope constant data with a pointer to a __constant parameter in your kernel code. You must then modify your host application in the following manner: 1. Create cl_mem memory objects associated with the pointers in global memory. 2. Load constant data into cl_mem objects with clEnqueueWriteBuffer prior to kernel execution. 3. Pass the cl_mem objects to the kernel as arguments with the clSetKernelArg function. Altera SDK for OpenCL Programming Guide Send Feedback Altera Corporation 1-52 UG-OCL002 2016.05.02 Including Structure Data Types as Arguments in OpenCL Kernels For simplicity, if a constant variable is of a complex type, use a typedef argument, as shown in the table below: Table 1-1: Replacing File Scope __constant Variable with Pointer to __constant Parameter If your source code is structured as follows: __constant int Payoff[2][2] = {{ 1, 3}, {5, 3}}; __kernel void original(__global int * A) { *A = Payoff[1][2]; // and so on } Rewrite your code to resemble the following syntax: __kernel void modified(__global int * A, __constant Payoff_type * PayoffPtr ) { *A = (PayoffPtr)[1][2]; // and so on } Attention: Use the same type definition in both your host application and your kernel. Including Structure Data Types as Arguments in OpenCL Kernels Convert each structure parameter (struct) to a pointer that points to a structure. The table below describes how you can convert structure parameters: Table 1-2: Converting Structure Parameters to Pointers that Point to Structures If your source code is structured as follows: Rewrite your code to resemble the following syntax: struct Context { float param1; float param2; int param3; uint param4; }; struct Context { float param1; float param2; int param3; uint param4; }; __kernel void algorithm(__global float * A, struct Context c) { if ( c.param3 ) { // statements } } __kernel void algorithm(__global float * A, __global struct Context * restrict c) { if ( c->param3 ) { // Dereference through a // pointer and so on } } Attention: The __global struct declaration creates a new buffer to store the structure. To prevent pointer aliasing, include a restrict qualifier in the declaration of the pointer to the structure. Matching Data Layouts of Host and Kernel Structure Data Types If you use structure data types (struct) as arguments in OpenCL kernels, match the member data types and align the data members between the host application and the kernel code. To match member data types, use the cl_ version of the data type in your host application that corresponds to the data type in the kernel code. The cl_ version of the data type is available in the Altera Corporation Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Matching Data Layouts of Host and Kernel Structure Data Types 1-53 opencl.h header file. For example, if you have a data member of type float4 in your kernel code, the corresponding data member you declare in the host application is cl_float4. Align the structures and align the struct data members between the host and kernel applications. Manage the alignments carefully because of the variability among different host compilers. For example, if you have float 4 OpenCL data types in the struct, the alignments of these data items must satisfy the OpenCL specification (that is, 16-byte alignment for float4). The following rules apply when the Altera Offline Compiler compiles your OpenCL kernels: 1. Alignment of built-in scalar and vector types follow the rules outlined in Section 6.1.5 of the OpenCL Specification version 1.0. The AOC usually aligns a data type based on its size. However, the AOC aligns a value of a threeelement vector the same way it aligns a four-element vector. 2. An array has the same alignment as one of its elements. 3. A struct (or a union) has the same alignment as the maximum alignment necessary for any of its data members. Consider the following example: struct my_struct { char data[3]; float4 f4; int index; }; The AOC aligns the struct elements above at 16-byte boundaries because of the float4 data type. As a result, both data and index also have 16-byte alignment boundaries. 4. The AOC does not reorder data members of a struct. 5. Normally, the AOC inserts a minimum amount of data structure padding between data members of a struct to satisfy the alignment requirements for each data member. a. In your OpenCL kernel code, you may specify data packing (that is, no insertion of data structure padding) by applying the packed attribute to the struct declaration. If you impose data packing, ensure that the alignment of data members satisfies the OpenCL alignment requirements. The Altera SDK for OpenCL does not enforce these alignment requirements. Ensure that your host compiler respects the kernel attribute and sets the appropriate alignments. b. In your OpenCL kernel code, you may specify the amount of data structure padding by applying the aligned(N) attribute to a data member, where N is the amount of padding. The AOCL does not enforce these alignment requirements. Ensure that your host compiler respects the kernel attribute and sets the appropriate alignments. For Windows systems, some versions of the Microsoft Visual Studio compiler pack structure data types by default. If you do not want to apply data packing, specify an amount of data structure padding as shown below: struct my_struct { __declspec(align(16)) char data[3]; /*Note that cl_float4 is the only known float4 definition on the host*/ __declspec(align(16)) cl_float4 f4; Altera SDK for OpenCL Programming Guide Send Feedback Altera Corporation 1-54 UG-OCL002 2016.05.02 Disabling Insertion of Data Structure Padding __declspec(align(16)) int index; }; Tip: An alternative way of adding data structure padding is to insert dummy struct members of type char or array of char. Related Information • Modifying Host Program for Structure Parameter Conversion on page 1-65 • OpenCL Specification version 1.0 Disabling Insertion of Data Structure Padding You may instruct the Altera Offline Compiler to disable automatic padding insertion between members of a struct data structure. • To disable automatic padding insertion, insert the packed attribute prior to the kernel source code for a struct data structure. For example: __attribute__((packed)) struct Context { float param1; float param2; int param3; uint param4; }; __kernel void algorithm(__global float * restrict A, __global struct Context * restrict c) { if ( c->param3 ) { // Dereference through a pointer and so on } } For more information, refer to the Align a Struct with or without Padding section of the Altera SDK for OpenCL Best Practices Guide. Related Information Align a Struct with or without Padding Specifying the Alignment of a Struct You may instruct the Altera Offline Compiler to set a specific alignment of a struct data structure. Altera Corporation Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Inferring a Register 1-55 • To specify the struct alignment, insert the aligned(N) attribute prior to the kernel source code for a struct data structure. For example: __attribute__((aligned(2))) struct Context { float param1; float param2; int param3; uint param4; }; __kernel void algorithm(__global float * A, _global struct Context * restrict c) { if ( c->param3 ) { // Dereference through a pointer and so on } } For more information, refer to the Align a Struct with or without Padding section of the Altera SDK for OpenCL Best Practices Guide. Related Information Align a Struct with or without Padding Inferring a Register The Altera Offline Compiler can implement data that is in the private address space in registers or in block RAMs. In general, the AOC chooses registers if the access to a variable is fixed and does not require any dynamic indexes. Accessing an array with a variable index usually forces the array into block RAMs. Implementing private data as registers is beneficial for data access that occurs in a single cycle (for example, feedback in a single work-item loop). The AOC infers private arrays as registers either as single values or in a piecewise fashion. Piecewise implementation results in very efficient hardware; however, the AOC must be able to determine data accesses statically. To facilitate piecewise implementation, hardcode the access points into the array. You can also facilitate register inference by unrolling loops that access the array. If array accesses are not inferable statically, the AOC might infer the array as registers. However, the AOC limits the size of these arrays to 64 bytes in length for single work-item kernels. There is effectively no size limit for kernels with multiple work-items Consider the following code example: int array[SIZE]; for (int j = 0; j < N; ++j) { for (int i = 0; i < SIZE - 1; ++i) { array[i] = array[i + 1]; } } The indexing into array[i] is not inferable statically because the loop is not unrolled. If the size of array[i] is less than or equal to 64 bytes for single work-item kernels, the AOC implements array[i] in block RAMs. If the size of array[i] is greater than 64 bytes, or if the kernel has multiple work-items, the AOC implements the entire array into registers as a single value. In this case, the AOC implements data Altera SDK for OpenCL Programming Guide Send Feedback Altera Corporation 1-56 UG-OCL002 2016.05.02 Inferring a Shift Register accesses as nonconstant shifts and masks. With complicated addressing, the AOC implements the array in block RAMs and instantiates specialized hardware for each load or store operation. Inferring a Shift Register The shift register design pattern is a very important design pattern for many applications. However, the implementation of a shift register design pattern might seem counterintuitive at first. Consider the following code example: channel int in, out; #define SIZE 512 //Shift register size must be statically determinable __kernel void foo() { int shift_reg[SIZE]; //The key is that the array size is a compile time constant // Initialization loop #pragma unroll for (int i=0; i < SIZE; i++) { //All elements of the array should be initialized to the same value shift_reg[i] = 0; } while(1) { // Fully unrolling the shifting loop produces constant accesses #pragma unroll for (int j=0; j < SIZE–1; j++) { shift_reg[j] = shift_reg[j + 1]; } shift_reg[SIZE – 1] = read_channel_altera(in); // Using fixed access points of the shift register int res = (shift_reg[0] + shift_reg[1]) / 2; // ‘out’ channel will have running average of the input channel write_channel_altera(out, res); } } In each clock cycle, the kernel shifts a new value into the array. By placing this shift register into a block RAM, the Altera Offline Compiler can efficiently handle multiple access points into the array. The shift register design pattern is ideal for implementing filters (for example, image filters like a Sobel filter or time-delay filters like a finite impulse response (FIR) filter). Altera Corporation Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Enabling Double Precision Floating-Point Operations 1-57 When implementing a shift register in your kernel code, keep in mind the following key points: 1. Unroll the shifting loop so that it can access every element of the array. 2. All access points must have constant data accesses. For example, if you write a calculation in nested loops using multiple access points, unroll these loops to establish the constant access points. 3. Initialize all elements of the array to the same value. Alternatively, you may leave the elements uninitialized if you do not require a specific initial value. 4. If some accesses to a large array are not inferable statically, they force the AOC to create inefficient hardware. If these accesses are necessary, use __local memory instead of __private memory. 5. Do not shift a large shift register conditionally. The shifting must occur in very loop iteration that contains the shifting code to avoid creating inefficient hardware. Enabling Double Precision Floating-Point Operations The Altera SDK for OpenCL offers preliminary support for all double precision floating-point functions. Before declaring any double precision floating-point data type in your OpenCL kernel, include the following OPENCL EXTENSION pragma in your kernel code: #pragma OPENCL EXTENSION cl_khr_fp64 : enable Single-Cycle Floating-Point Accumulator for Single Work-Item Kernels Single work-item kernels that perform accumulation in a loop can leverage the Altera Offline Compiler's single-cycle floating-point accumulator feature. The AOC searches for these kernel instances and attempts to map an accumulation that executes in a loop into the accumulator structure. The AOC supports an accumulator that adds or subtracts a value. To leverage this feature, describe the accumulation in a way that allows the AOC to infer the accumulator. Attention: • • • • The accumulator is only available on Arria 10 devices. The accumulator must be part of a loop. The accumulator must have an initial value of 0. The accumulator cannot be conditional. Below are examples of a description that results in the correct inference of the accumulator by the AOC. #pragma OPENCL EXTENSION cl_altera_channels : enable channel float4 RANDOM_STREAM; __kernel void acc_test(__global float *a, int k) { // Simplest example of an accumulator. // In this loop, the accumulator acc is incremented by 5. int i; float acc = 0.0f; for (i = 0; i < k; i++) { acc+=5; } a[0] = acc; } __kernel void acc_test2(__global float *a, int k) { // Extended example showing that an accumulator can be // conditionally incremented. The key here is to describe the increment // as conditional, not the accumulation itself. int i; float acc = 0.0f; Altera SDK for OpenCL Programming Guide Send Feedback Altera Corporation 1-58 UG-OCL002 2016.05.02 Programming Strategies for Inferring the Accumulator for (i = 0; i < k; i++) { acc += ((i < 30) ? 5 : 0); } a[0] = acc; } __kernel void acc_test3(__global float *a, int k) { // A more complex case where the accumulator is fed // by a dot product. int i; float acc = 0.0f; for (i = 0; i < k; i++ ){ float4 v = read_channel_altera(RANDOM_STREAM); float x1 = v.x; float x2 = v.y; float y1 = v.z; float y2 = v.w; acc += (x1*y1+x2*y2); } a[0] = acc; } __kernel void loader(__global float *a, int k) { int i; float4 my_val = 0; for(i = 0; i < k; i++) { if ((i%4) == 0) write_channel_altera(RANDOM_STREAM, my_val); if ((i%4) == 0) my_val.x = a[i]; if ((i%4) == 1) my_val.y = a[i]; if ((i%4) == 2) my_val.z = a[i]; if ((i%4) == 3) my_val.w = a[i]; } } Programming Strategies for Inferring the Accumulator To leverage the single cycle floating-point accumulator feature, you can modify the accumulator description in your kernel code to improve efficiency or work around programming restrictions. Describing an Accumulator Using Multiple Loops Consider a case where you want to describe an accumulator using multiple loops, with some of the loops being unrolled: float acc = 0.0f; for (i = 0; i < k; i++) { #pragma unroll for(j=0;j < 16; j++) acc += (x[i+j]*y[i+j]); } In this situation, it is important to compile the kernel with the --fp-relaxed Altera Offline Compiler command option to enable the AOC to rearrange the operations in a way that exposes the accumulation. If you do not compile the kernel with --fp-relaxed, the resulting accumulator structure will have a high initiation interval (II). II is the launch frequency of a new loop iteration. The higher the II value, the longer the accumulator structure must wait before it can process the next loop iteration. Altera Corporation Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Designing Your Host Application 1-59 Modifying a Multi-Loop Accumulator Description In cases where you cannot compile an accumulator description using the --fp-relaxed AOC command option, rewrite the code to expose the accumulation. For the code example above, rewrite it in the following manner: float acc = 0.0f; for (i = 0; i < k; i++) { float my_dot = 0.0f; #pragma unroll for(j=0;j < 16; j++) my_dot += (x[i+j]*y[i+j]); acc += my_dot; } Modifying an Accumulator Description Containing a Variable or Non-Zero Initial Value Consider a situation where you might want to apply an offset to a description of an accumulator that begins with a non-zero value: float acc = array[0]; for (i = 0; i < k; i++) { acc += x[i]; } Because the accumulator hardware does not support variable or non-zero initial values in a description, you must rewrite the description. float acc = 0.0f; for (i = 0; i < k; i++) { acc += x[i]; } acc += array[0]; Rewriting the description in the above manner enables the kernel to use an accumulator in a loop. The loop structure is then followed by an increment of array[0]. Designing Your Host Application Altera offers guidelines on host requirements and procedures on structuring the host application. If applicable, implement these design strategies when you create or modify a host application for your OpenCL kernels. Host Programming Requirements on page 1-60 When designing your OpenCL host application for use with the Altera SDK for OpenCL, ensure that the application satisfies the following host programming requirements. Allocating OpenCL Buffer for Manual Partitioning of Global Memory on page 1-61 Collecting Profile Data During Kernel Execution on page 1-63 In cases where kernel execution finishes after the host application completes, you can query the FPGA explicitly to collect profile data during kernel execution. Accessing Custom Platform-Specific Functions on page 1-65 To reference Custom Platform-specific user-accessible functions while linking to the ACD, include the clGetBoardExtensionFunctionAddressAltera extension in your host application. Altera SDK for OpenCL Programming Guide Send Feedback Altera Corporation 1-60 UG-OCL002 2016.05.02 Host Programming Requirements Modifying Host Program for Structure Parameter Conversion on page 1-65 If you convert any structure parameters to pointers-to-constant structures in your OpenCL kernel, you must modify your host application accordingly. Allocating Shared Memory for OpenCL Kernels Targeting SoCs on page 1-66 Altera recommends that OpenCL kernels that run on Altera SoCs access shared memory instead of the FPGA DDR memory. Managing Host Application on page 1-68 The Altera SDK for OpenCL includes utility commands you can invoke to obtain information on flags and libraries necessary for compiling and linking your host application. Host Programming Requirements When designing your OpenCL host application for use with the Altera SDK for OpenCL, ensure that the application satisfies the following host programming requirements. Host Machine Memory Requirements The machine that runs the host application must have enough host memory to support several components simultaneously. The host machine must support the following components: • The host application and operating system. • The working set for the host application. • The maximum amount of OpenCL memory buffers that can be allocated at once. Every device-side cl_mem buffer is associated with a corresponding storage area in the host process. Therefore, the amount of host memory necessary might be as large as the amount of external memory supported by the FPGA. Host Binary Requirement When compiling the host application, target one of these architectures: x86-64 (64-bit), big-endian (64bit), or ARM® 32-bit ARMV7-A for devices such as the Cyclone V SoC. The Altera SDK for OpenCL host runtime does not support x86-32 (32-bit) binaries. Multiple Host Threads The Altera SDK for OpenCL host library is thread-safe. All OpenCL APIs are thread safe except the clSetKernelArg function. It is safe to call clSetKernelArg from any host thread or as an reentrant as long as concurrent calls to any combination of clSetKernelArg calls operate on different cl_kernel objects. Related Information Multi-Threaded Host Application Out-of-Order Command Queues The OpenCL host runtime command queues do not support out-of-order command execution. Requirement for Multiple Command Queues in Channels or Pipes Implementation Although the Altera SDK for OpenCL channels extension or OpenCL pipes implementation allows multiple kernels to execute in parallel, channels or pipes facilitate this concurrent behavior only when Altera Corporation Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Allocating OpenCL Buffer for Manual Partitioning of Global Memory 1-61 cl_command_queue objects are in order. To enable multiple command queues , instantiate a separate command for each kernel you wish to run concurrently. Allocating OpenCL Buffer for Manual Partitioning of Global Memory Manual partitioning of global memory buffers allows you to control memory accesses across buffers to maximize the memory bandwidth. Before you partition the memory, first you have to disable burstinterleaving during OpenCL kernel compilation. Then, in the host application, you must specify the memory bank to which you allocate the OpenCL buffer. By default, the Altera Offline Compiler configures each global memory type in a burst-interleaved fashion. Usually, the burst-interleaving configuration leads to the best load balancing between the memory banks. However, there might be situations where it is more efficient to partition the memory into non-interleaved regions. The figure below illustrates the differences between burst-interleaved and non-interleaved memory partitions. Burst-Interleaved Address 0x7FFF_FFFF 0x7FFF_FC00 0x7FFF_FBFF Bank 2 Separate Partitions Address 0x7FFF_FFFF Bank 1 Bank 2 0x7FFF_F800 0x0000_0FFF Bank 2 0x0000_0C00 0x0000_0BFF 0x4000_0000 0x3FFF_FFFF Bank 1 0x0000_0800 0x0000_07FF Bank 1 Bank 2 0x0000_0400 0x0000_03FF Bank 1 0x0000_0000 0x0000_0000 To manually partition some or all of the available global memory types, perform the following tasks: 1. Compile your OpenCL kernel using the --no-interleaving flag to configure the memory bank(s) of the specified memory type as separate addresses. Altera SDK for OpenCL Programming Guide Send Feedback Altera Corporation 1-62 UG-OCL002 2016.05.02 Allocating OpenCL Buffer for Manual Partitioning of Global Memory For more information on the usage of the --no-interleaving flag, refer to the Disabling Burst-Interleaving of Global Memory (--no-interleaving ) section. 2. Create an OpenCL buffer in your host application, and allocate the buffer to one of the banks using the CL_MEM_HETEROGENEOUS_ALTERA and CL_MEM_BANK flags. • Specify CL_MEM_BANK_1_ALTERA to allocate the buffer to the lowest available memory region. • Specify CL_MEM_BANK_2_ALTERA to allocation memory to the second bank (if available). Attention: Allocate each buffer to a single memory bank only. By default, the host allocates buffers into the main memory when you load kernels into the OpenCL runtime via the clCreateProgramWithBinary function. During kernel invocation, the host automati‐ cally relocates heterogeneous memory buffers that are bound to kernel arguments to the main memory . To avoid the initial allocation of heterogeneous memory buffers in the main memory, include the CL_MEM_HETEROGENEOUS_ALTERA flag when you call the clCreateBuffer function, as shown below: mem = clCreateBuffer(context, flags|CL_MEM_HETEROGENEOUS_ALTERA, memSize, NULL, &errNum); For example, the following clCreateBuffer call allocates memory into the lowest available memory region of a nondefault memory bank: mem = clCreateBuffer(context, (CL_MEM_HETEROGENEOUS_ALTERA|CL_MEM_BANK_1_ALTERA), memSize, NULL, &errNum); The clCreateBuffer call allocates memory into a certain global memory type based on what you specify in the kernel argument. If a memory (cl_mem) object residing in a memory type is set as a kernel argument that corresponds to a different memory technology, the host moves the memory object automatically when it queues the kernel. Do not pass a buffer as kernel arguments that associate it with multiple memory technologies. Attention: If the second bank is not available at runtime, the memory is allocated to the first bank. If no global memory is available, the clCreateBuffer call fails with the error message CL_MEM_OBJECT_ALLOCATION_FAILURE. For more information on optimizing heterogeneous global memory accesses, refer to the Heterogeneous Memory Buffers and the Manual Partitioning of Global Memory sections of the Altera SDK for OpenCL Best Practices Guide. Related Information • Disabling Burst-Interleaving of Global Memory (--no-interleaving ) on page 1-86 • Manual Partitioning of Global Memory • Heterogeneous Memory Buffers Altera Corporation Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Creating a Pipe Object in Your Host Application 1-63 Creating a Pipe Object in Your Host Application To implement OpenCL pipes in your kernel, you must create Altera SDK for OpenCL-specific pipe objects in your host application. An AOCL-specific pipe object is not a true OpenCL pipe object as described in the OpenCL Specification version 2.0. This implementation allows you to migrate away from Altera devices with a conformant solution. The AOCL-specific pipe object is a memory object (cl_mem); however, the host does not allocate any memory for the pipe itself. The following clCreatePipe host API creates a pipe object: cl_mem clCreatePipe(cl_context context, cl_mem_flags flags, cl_uint pipe_packet_size, cl_uint pipe_max_packets, const cl_pipe_properties *properties, cl_int *errcode_ret) For more information on the clCreatePipe host API function, refer to section 5.4.1 of the OpenCL Specification version 2.0. Below is an example syntax of the clCreatePipe host API function: cl_int status; cl_mem c0_pipe = clCreatePipe(context, 0, sizeof(int), 1, NULL, &status); status = clSetKernelArg(kernel, 1, sizeof(cl_mem), &c0_pipe); Caution: The AOCL does not support dynamic channel assignment at runtime. The AOCL statically links the pipes during compilation. Related Information OpenCL Specification version 2.0 (API) Collecting Profile Data During Kernel Execution In cases where kernel execution finishes after the host application completes, you can query the FPGA explicitly to collect profile data during kernel execution. When you profile your OpenCL kernel during compilation, a profile.mon file is generated automatically. The profile data is then written to profile.mon after kernel execution completes on the FPGA. However, if kernel execution completes after the host application completes, no profiling information for that kernel Altera SDK for OpenCL Programming Guide Send Feedback Altera Corporation 1-64 UG-OCL002 2016.05.02 Collecting Profile Data During Kernel Execution invocation will be available in the profile.mon file. In this case, you can modify your host code to acquire profiling information during kernel execution. • To query the FPGA to collect profile data while the kernel is running, call the following host library call: extern CL_API_ENTRY cl_int CL_API_CALL clGetProfileInfoAltera(cl_event); where cl_event is the kernel event. The kernel event you pass to this host library call must be the same one you pass to the clEnqueueNDRangeKernel call. Important: If kernel execution completes before the invocation of clGetProfileInfoAltera, the function returns an event error message. Caution: Invoking the clGetProfileInfoAltera function during kernel execution disables the profile counters momentarily so that the Profiler can collect data from the FPGA. As a result, you will lose some profiling information during this interruption. If you call this function at very short intervals, the profile data might not accurately reflect the actual performance behavior of the kernel. Consider the following example host code: int main() { ... clEnqueueNDRangeKernel (queue, kernel, ..., NULL); ... clEnqueueNDRangeKernel (queue, kernel, .. , NULL); ... } This host application runs on the assumption that a kernel launches twice and then completes. In the profile.mon file, there will be two sets of profile data, one for each kernel invocation. To collect profile data while the kernel is running, modify the host code in the following manner: int main() { ... clEnqueueNDRangeKernel (queue, kernel, ..., &event); //Get the profile data before the kernel completes clGetProfileInfoAltera (event); //Wait until the kernel completes clFinish (queue); ... clEnqueueNDRangeKernel (queue, kernel, ..., NULL); ... } The call to clGetProfileInfoAltera adds a new entry in the profile.mon file. The Profiler GUI then parses this entry in the report. For more information on the Altera SDK for OpenCL Profiler, refer to the following sections: Altera Corporation Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Accessing Custom Platform-Specific Functions 1-65 • Profile Your Kernel to Identify Performance Bottlenecks in the Altera SDK for OpenCL Best Practices Guide • Profiling Your OpenCL Kernel Related Information • Profile Your Kernel to Identify Performance Bottlenecks • Profiling Your OpenCL Kernel on page 1-97 Accessing Custom Platform-Specific Functions You have the option to include in your application user-accessible functions that are available in your Custom Platform. However, when you link your host applicaiton to the Altera Client Driver (ACD), you cannot directly reference these Custom Platform-specific functions. To reference Custom Platformspecific user-accessible functions while linking to the ACD, include the clGetBoardExtensionFunctionAddressAltera extension in your host application. The clGetBoardExtensionFunctionAddressAltera extension specifies an API that retrieves a pointer to a user-accessible function from the Custom Platform. Attention: For Linux systems, the clGetBoardExtensionFunctionAddressAltera function works with or without ACD. For Windows systems, the function only works in conjunction with ACD. Consult with your board vendor to determine if ACD is supported in your Custom Platform. Definitions of the extension interfaces are available in the ALTERAOCLSDKROOT/host/include/CL/cl_ext.h file. • To obtain a pointer to a user-accessible function in your Custom Platform, call the following function in your host application: void* clGetBoardExtensionFunctionAddressAltera ( const char* function_name, cl_device_id device ); Where: function_name is the name of the user-accessible function that your Custom Platform vendor provides, and device is the device ID returned by the clGetDeviceIDs function. After locating the user-accessible function, the clGetBoardExtensionFunctionAddressAltera function returns a pointer to the user-accessible function. If the function does not exist in the Custom Platform, clGetBoardExtensionFunctionAddressAltera returns NULL. Modifying Host Program for Structure Parameter Conversion If you convert any structure parameters to pointers-to-constant structures in your OpenCL kernel, you must modify your host application accordingly. Perform the following changes to your host application: 1. Allocate a cl_mem buffer to store the structure contents. Altera SDK for OpenCL Programming Guide Send Feedback Altera Corporation 1-66 UG-OCL002 2016.05.02 Allocating Shared Memory for OpenCL Kernels Targeting SoCs Attention: You need a separate cl_mem buffer for every kernel that uses a different structure value. 2. Set the structure kernel argument with a pointer to the structure buffer, not with a pointer to the structure contents. 3. Populate the structure buffer contents before queuing the kernel. Perform one of the following steps to ensure that the structure buffer is populated before the kernel launches: • Queue the structure buffer on the same command queue as the kernel queue. • Synchronize separate kernel queues and structure buffer queues with an event. 4. When your application no longer needs to call a kernel that uses the structure buffer, release the cl_mem buffer. Related Information • Including Structure Data Types as Arguments in OpenCL Kernels on page 1-52 • Matching Data Layouts of Host and Kernel Structure Data Types on page 1-52 Allocating Shared Memory for OpenCL Kernels Targeting SoCs Altera recommends that OpenCL kernels that run on Altera SoCs access shared memory instead of the FPGA DDR memory. FPGA DDR memory is accessible to kernels with very high bandwidths. However, read and write operations from the ARM CPU to FPGA DDR memory are very slow because they do not use direct memory access (DMA). Reserve FPGA DDR memory only for passing temporary data between kernels or within a single kernel for testing purposes. Before you begin Note: 1. Mark the shared buffers between kernels as volatile to ensure that buffer modification by one kernel is visible to the other kernel. 2. To access shared memory, you only need to modify the host code. Modifications to the kernel code are unnecessary. 3. You cannot use the library function malloc or the operator new to allocate physically shared memory. Also, the CL_MEM_USE_HOST_PTR flag does not work with shared memory. In DDR memory, shared memory must be physically contiguous. The FPGA cannot consume virtually contiguous memory without a scatter-gather direct memory access (SG-DMA) controller core. The malloc function and the new operator are for accessing memory that is virtually contiguous. 4. CPU caching is disabled for the shared memory. Altera Corporation Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Allocating Shared Memory for OpenCL Kernels Targeting SoCs 1-67 The ARM CPU and the FPGA can access the shared memory simultaneously. You do not need to include the clEnqueueReadBuffer and clEnqueueWriteBuffer calls in your host code to make data visible to either the FPGA or the CPU. • To allocate and access shared memory, structure your host code in a similar manner as the following example: cl_mem src = clCreateBuffer(…, CL_MEM_ALLOC_HOST_PTR, size, …); int *src_ptr = (int*)clEnqueueMapBuffer (…, src, size, …); *src_ptr = input_value; //host writes to ptr directly clSetKernelArg (…, src); clEnqueueNDRangeKernel(…); clFinish(); printf (“Result = %d\n”, *dst_ptr); //result is available immediately clEnqueueUnmapMemObject(…, src, src_ptr, …); clReleaseMemObject(src); // actually frees physical memory You can include the CONFIG_CMA_SIZE_MBYTES kernel configuration option to control the maximum total amount of shared memory available for allocation. In practice, the total amount of allocated shared memory is smaller than the value of CONFIG_CMA_SIZE_MBYTES. Important: 1. If your target board has multiple DDR memory banks, the clCreateBuffer(..., CL_MEM_READ_WRITE, ...) function allocates memory to the nonshared DDR memory banks. However, if the FPGA has access to a single DDR bank that is shared memory, then clCreateBuffer(..., CL_MEM_READ_WRITE, ...) allocates to shared memory, similar to using the CL_MEM_ALLOC_HOST_PTR flag. 2. The shared memory that you request with the clCreateBuffer(..., CL_MEM_ALLOC_HOST_PTR, size, ...) function is allocated in the Linux OpenCL kernel driver, and it relies on the contiguous memory allocator (CMA) feature of the Linux kernel. For detailed information on enabling and configuring the CMA, refer to the Recompiling the Linux Kernel and the OpenCL Linux Kernel Driver section of the Altera Cyclone V SoC Development Kit Reference Platform Porting Guide. • To transfer data from shared hard processor system (HPS) DDR to FPGA DDR efficiently, include a kernel that performs the memcpy function, as shown below. __attribute__((num_simd_work_items(8))) mem_stream(__global uint * src, __global uint * dst) { size_t gid = get_global_id(0); dst[gid] = src[gid]; } Attention: Allocate the src pointer in the HPS DDR as shared memory using the CL_MEM_ALLOC_HOST_PTR flag. • If the host allocates constant memory to shared HPS DDR system and then modifies it after kernel execution, the modifications might not take effect. As a result, subsequent kernel executions might use outdated data. To prevent kernel execution from using outdated constant memory, perform one of the following tasks: 1. Do not modify constant memory after its initialization. 2. Create multiple constant memory buffers if you require multiple __constant data sets. 3. If available, allocate constant memory to the FPGA DDR on your accelerator board. Related Information Recompiling the Linux Kernel and the OpenCL Linux Kernel Driver Altera SDK for OpenCL Programming Guide Send Feedback Altera Corporation 1-68 UG-OCL002 2016.05.02 Managing Host Application Managing Host Application The Altera SDK for OpenCL includes utility commands you can invoke to obtain information on flags and libraries necessary for compiling and linking your host application. Attention: To cross-compile your host application to an SoC board, include the --arm option in your utility command. Caution: For Linux systems, if you debug your host application using the GNU Project Debugger (GDB), invoke the following command prior to running the host application: handle SIG44 nostop Without this command, the GDB debugging process terminates with the following error message: Program received signal SIG44, Real-time event 44. Displaying Example Makefile Fragments (example-makefile or makefile) To display example Makefile fragments for compiling and linking a host application against host runtime libraries available with the Altera SDK for OpenCL, invoke the example-makefile or makefile utility command. Altera Corporation Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Compiling and Linking Your Host Application 1-69 • At a command prompt, invoke the aocl example-makefile or aocl makefile utility command. The software displays an output similar to the following: The following are example Makefile fragments for compiling and linking a host program against the host runtime libraries included with the Altera SDK for OpenCL. Example GNU makefile on Linux, with GCC toolchain: AOCL_COMPILE_CONFIG=$(shell aocl compile-config) AOCL_LINK_CONFIG=$(shell aocl link-config) host_prog : host_prog.o g++ -o host_prog host_prog.o $(AOCL_LINK_CONFIG) host_prog.o : host_prog.cpp g++ -c host_prog.cpp $(AOCL_COMPILE_CONFIG) Example GNU makefile on Windows, with Microsoft Visual C++ command line compiler: AOCL_COMPILE_CONFIG=$(shell aocl compile-config) AOCL_LINK_CONFIG=$(shell aocl link-config) host_prog.exe : host_prog.obj link -nologo /OUT:host_prog.exe host_prog.obj $(AOCL_LINK_CONFIG) host_prog.obj : host_prog.cpp cl /MD /Fohost_prog.obj -c host_prog.cpp $(AOCL_COMPILE_CONFIG) Example GNU makefile cross-compiling to ARM SoC from Linux or Windows, with Linaro GCC cross-compiler toolchain: CROSS-COMPILER=arm-linux-gnueabihfAOCL_COMPILE_CONFIG=$(shell aocl compile-config --arm) AOCL_LINK_CONFIG=$(shell aocl link-config --arm) host_prog : host_prog.o $(CROSS-COMPILER)g++ -o host_prog host_prog.o $(AOCL_LINK_CONFIG) host_prog.o : host_prog.cpp $(CROSS-COMPILER)g++ -c host_prog.cpp $(AOCL_COMPILE_CONFIG) Compiling and Linking Your Host Application The OpenCL host application uses standard OpenCL runtime APIs to manage device configuration, data buffers, kernel launches, and synchronization. The host application also contains functions such as file I/O, or portions of the source code that do not run on an accelerator device. The Altera SDK for OpenCL includes utility commands you can invoke to obtain information on C header files describing the OpenCL APIs, and board-specific MMD and host runtime libraries with which you must link your host application. Important: For Windows systems, you must add the /MD flag to link the host runtime libraries against the multithreaded dynamic link library (DLL) version of the Microsoft C Runtime library. You must also compile your host application with the /MD compilation flag, or use the / NODEFAULTLIB linker option to override the selection of runtime library. Altera SDK for OpenCL Programming Guide Send Feedback Altera Corporation 1-70 UG-OCL002 2016.05.02 Displaying Flags for Compiling Host Application (compile-config) Remember: Include the path to the ALTERAOCLSDKROOT/host/ /bin folder in your library search path when you run your host application. Displaying Flags for Compiling Host Application (compile-config) on page 1-70 To display a list of flags necessary for compiling a host application, invoke the compile-config utility command. Displaying Paths to OpenCL Host Runtime and MMD Libraries (ldflags) on page 1-70 To display the paths necessary for linking a host application to the OpenCL host runtime and MMD libraries, invoke the ldflags utility command. Listing OpenCL Host Runtime and MMD Libraries (ldlibs) on page 1-70 To display the names of the OpenCL host runtime and MMD libraries necessary for linking a host application, invoke the ldlibs utility command. Displaying Information on OpenCL Host Runtime and MMD Libraries (link-config or linkflags) on page 1-71 To display a list of flags necessary for linking a host application with OpenCL host runtime and MMD libraries, invoke the link-config or linkflags utility command. Displaying Flags for Compiling Host Application (compile-config) To display a list of flags necessary for compiling a host application, invoke the compile-config utility command. 1. At a command prompt, invoke the aocl compile-config utility command. The software displays the path to the folder or directory in which the OpenCL API header files reside. For example: • For Windows systems, the path is -I%ALTERAOCLSDKROOT%/host/include • For Linux systems, the path is -I$ALTERAOCLSDKROOT/host/include where ALTERAOCLSDKROOT points to the location of the software installation. 2. Add this path to your C preprocessor. Attention: In your host source, include the opencl.h OpenCL header file, located in the ALTERAOCLSDK‐ ROOT/host/include/CL folder or directory. Displaying Paths to OpenCL Host Runtime and MMD Libraries (ldflags) To display the paths necessary for linking a host application to the OpenCL host runtime and MMD libraries, invoke the ldflags utility command. • At a command prompt, invoke the aocl ldflags utility command. The software displays the paths for linking your host application with the following libraries: 1. The OpenCL host runtime libraries that provide OpenCL platform and runtime APIs. The OpenCL host runtime libraries are available in the ALTERAOCLSDKROOT/host/ /lib directory. 2. The path to the Custom Platform-specific MMD libraries. The MMD libraries are available in the / /lib directory of your Custom Platform. Listing OpenCL Host Runtime and MMD Libraries (ldlibs) To display the names of the OpenCL host runtime and MMD libraries necessary for linking a host application, invoke the ldlibs utility command. Altera Corporation Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Displaying Information on OpenCL Host Runtime and MMD Libraries... 1-71 • At a command prompt, invoke the aocl ldlibs utility command. The software lists the OpenCL host runtime libraries residing in the ALTERAOCLSDKROOT/host/ /lib directory. It also lists the Custom Platform-specific MMD libraries residing in the / / /lib directory of your Custom Platform. • For Windows systems, the output might resemble the following example: alterahalmmd.lib _ _mmd.[lib|so|a|dll] alteracl.lib acl_emulator_kernel_rt.lib pkg_editor.lib libelf.lib acl_hostxml.lib • For Linux systems, the output might resemble the following example: -lalteracl -ldl -lacl_emulator_kernel_rt -lalterahalmmd -l _ _mmd -lelf -lrt -lstdc++ Displaying Information on OpenCL Host Runtime and MMD Libraries (link-config or linkflags) To display a list of flags necessary for linking a host application with OpenCL host runtime and MMD libraries, invoke the link-config or linkflags utility command. This utility command combines the functions of the ldflags and ldlibs utility commands. 1. At a command prompt, invoke the aocl link-config or aocl linkflags command. The software displays the link options for linking your host application with the following libraries: 1. The path to and the names of OpenCL host runtime libraries that provide OpenCL platform and runtime APIs. The OpenCL host runtime libraries are available in the ALTERAOCLSDKROOT/host/ /lib directory . 2. The path to and the names of the Custom Platform-specific MMD libraries. The MMD libraries are available in the / /lib directory of your Custom Platform. Altera SDK for OpenCL Programming Guide Send Feedback Altera Corporation 1-72 UG-OCL002 2016.05.02 Linking Your Host Application to the Khronos ICD Loader Library • For Windows systems, the link options might resemble the following example output: /libpath:%ALTERAOCLSDKROOT%/board/ /windows64/lib /libpath:%ALTERAOCLSDKROOT%/host/windows64/lib alterahalmmd.lib _ _mmd.[lib|so|a|dll] alteracl.lib acl_emulator_kernel_rt.lib pkg_editor.lib libelf.lib acl_hostxml.lib • For Linux systems, the link options might resemble the following example output: -L/$ALTERAOCLSDKROOT/board/ /linux64/lib -L/$ALTERAOCLSDKROOT/host/linux64/lib -lalterac -ldl -lacl_emulator_kernel_rt -lalterahalmmd -l _ _mmd -lelf -lrt -lstdc++ Linking Your Host Application to the Khronos ICD Loader Library The Altera SDK for OpenCL supports the OpenCL Installable Client Driver (ICD) extension from the Khronos Group. The OpenCL ICD extension allows you to have multiple OpenCL implementations on your system. With the OpenCL ICD Loader Library, you may choose from a list of installed platforms and execute OpenCL API calls that are specific to your OpenCL implementation of choice. In addition to the AOCL host runtime libraries, Altera supplies a version of the ICD Loader Library that supports the OpenCL Specification version 1.0. To use an ICD library from another vendor, consult the vendor's documentation on how to link to their ICD library. Linking to the ICD Loader Library on Windows on page 1-72 To link your Windows OpenCL host application to the ICD Loader Library, modify the Makefile and set up the Altera Client Driver. Linking to the ICD Loader Library on Linux on page 1-73 To link your Linux OpenCL host application to the ICD Loader Library, modify the Makefile. For Cyclone V SoC boards, you also have to create an Altera.icd file. Linking to the ICD Loader Library on Windows To link your Windows OpenCL host application to the ICD Loader Library, modify the Makefile and set up the Altera Client Driver. Attention: For Windows systems, you must use the ICD in conjunction with the ACD. If the custom platform from your board vendor does not currently support ACD, you can set it up manually. 1. Prior to linking your host application to any Altera SDK for OpenCL host runtime libraries, link it to the OpenCL library by modifying the Makefile. Altera Corporation Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Linking to the ICD Loader Library on Linux 1-73 A modified Makefile might include the following lines: AOCL_COMPILE_CONFIG=$(shell aocl compile-config) AOCL_LDFLAGS=$(shell aocl ldflags) AOCL_LDLIBS=$(shell aocl ldlibs) host_prog.exe : host_prog.obj link -nologo /OUT:host_prog.exe host_prog.obj $(AOCL_ LDFLAGS) OpenCL.lib host_prog.obj : host_prog.cpp cl /MD /Fohost_prog.obj -c host_prog.cpp $(AOCL_COMPILE_CONFIG) 2. If you need to manually set up ACD support for your Custom Platform, perform the following tasks: a. Consult with your board vendor to identify the libraries that the ACD requires. Alternatively, you may invoke the aocl ldlibs command and identify the libraries that your OpenCL application requires. b. Specify the libraries in the registry key HKEY_LOCAL_MACHINE\SOFTWARE\Altera\OpenCL \Boards. Enter one value for each library. Each value must include the path to the library as the string value, and a DWORD setting of 0. Attention: If your board vendor provides multiple libraries, you might need to load them in a particular order. Consult with your board vendor to determine the correct order to load the libraries. List the libraries in the registry in their loading order. To enumerate board vendor-specific ICDs, the ICD Loader scans the values in the HKEY_LOCAL_MACHINE\SOFTWARE\Altera\OpenCL\Boards registry key. For each value in the key that has a DWORD value of 0, the ACD Loader opens the corresponding DLL specified in the key. Consider the following registry key value: [HKEY_LOCAL_MACHINE\SOFTWARE\Altera\OpenCL\Boards] "c:\\board_vendor a\ \my_board_mmd.dll"=dword:00000000 The ICD Loader scans this value, and then the ACD Loader opens the library my_board_mmd.dll from the board_vendor a folder. Attention: If your host application fails to run while it is linking to the ICD, ensure that the HKEY_LOCAL_MACHINE\SOFTWARE\Khronos\OpenCL\Vendors registry key contains the following value: [HKEY_LOCAL_MACHINE\SOFTWARE\Khronos\OpenCL\Vendors] "alteracl_icd.dll"=dword:00000000 Linking to the ICD Loader Library on Linux To link your Linux OpenCL host application to the ICD Loader Library, modify the Makefile. For Cyclone V SoC boards, you also have to create an Altera.icd file. 1. Prior to linking your host application to any Altera SDK for OpenCL host runtime libraries, link it to the OpenCL library by modifying the Makefile. A modified Makefile might include the following lines: AOCL_LDFLAGS=$(shell aocl ldflags) AOCL_LDLIBS=$(shell aocl ldlibs) Altera SDK for OpenCL Programming Guide Send Feedback Altera Corporation 1-74 UG-OCL002 2016.05.02 Programming an FPGA via the Host host_prog : host_prog.o g++ -o host_prog host_prog.o $(AOCL_LDFLAGS) -lOpenCL $(AOCL_LDLIBS) 2. For Cyclone V SoC boards, when you build the SD flash card image for your Custom Platform, create an Altera.icd file containing the text libalteracl.so. Store the Altera.icd file in the /etc/OpenCL/ vendors directory of your Custom Platform. Refer to Building an SD Flash Card Image section of the Altera Cyclone V SoC Development Kit Reference Platform Porting Guide for more information. Attention: If your host application fails to run while linking to the ICD, ensure that the file /etc/OpenCL/vendors/Altera.icd matches the file found in the directory that ALTERAOCLSDKROOT specifies. The environment variable ALTERAOCLSDKROOT points to the location of the AOCL installation. If the files do not match, or if it is missing from /etc/OpenCL/vendors, copy the Altera.icd file from ALTERAOCLSDKROOT to /etc/OpenCL/vendors. Related Information Building an SD Flash Card Image Programming an FPGA via the Host The Altera Offline Compiler is an offline compiler that compiles kernels independently of the host application. To load the kernels into the OpenCL runtime, include the clCreateProgramWithBinary function in your host application. Caution: If your host system consists of multiple processors, only one processor can access the FPGA at a given time. Consider an example where there are two host applications, corresponding to two processors, attempting to launch kernels onto the same FPGA at the same time. The second host application wil receive an error message indicating that the device is busy. The second host application cannot run until the first host application releases the OpenCL context. 1. Compile your OpenCL kernel with the AOC to create the .aocx file. 2. Include the clCreateProgramWithBinary function in your host application to create the cl_program OpenCL program objects from the .aocx file. 3. Include the clBuildProgram function in your host application to create the program executable for the specified device. Below is an example host code on using clCreateProgramWithBinary to program an FPGA device: size_t lengths[1]; unsigned char* binaries[1] ={NULL}; cl_int status[1]; cl_int error; cl_program program; const char options[] = ""; FILE *fp = fopen("program.aocx","rb"); fseek(fp,0,SEEK_END); lengths[0] = ftell(fp); binaries[0] = (unsigned char*)malloc(sizeof(unsigned char)*lengths[0]); rewind(fp); fread(binaries[0],lengths[0],1,fp); fclose(fp); program = clCreateProgramWithBinary(context, 1, device_list, Altera Corporation Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Programming Multiple FPGA Devices 1-75 lengths, (const unsigned char **)binaries, status, &error); clBuildProgram(program,1,device_list,options,NULL,NULL); If the clBuildProgram function executes successfully, it returns CL_SUCCESS. 4. Create kernel objects from the program executable using the clCreateKernelsInProgram or clCreateKernel function. 5. Include the kernel execution function to instruct the host runtime to execute the scheduled kernel(s) on the FPGA. • To enqueue a command to execute an NDRange kernel, use clEnqueueNDRangeKernel. • To enqueue a single work-item kernel, use clEnqueueTask. Attention: Altera recommends that you release an event object when it is not in use. The AOCL keeps an event object live until you explicitly instruct it to release the event object. Keeping an unused event object live causes unnecessary memory usage. To release an event object, call the clReleaseEvent function. You can load multiple FPGA programs into memory, which the host then uses to reprogram the FPGA as required. For more information on these OpenCL host runtime API calls, refer to the OpenCL Specification version 1.0. Related Information OpenCL Specification version 1.0 Programming Multiple FPGA Devices If you install multiple FPGA devices in your system, you can direct the host runtime to program a specific FPGA device by modifying your host code. Important: You may only program multiple FPGA devices from the same Custom Platform because the AOCL_BOARD_PACKAGE_ROOT environment variable points to the location of a single Custom Platform. You can present up to 32 FPGA devices to your system in the following manner: • Multiple FPGA accelerator boards, each consisting of a single FPGA. • Multiple FPGAs on a single accelerator board that connects to the host system via a PCIe switch. • Combinations of the above. The host runtime can load kernels onto each and every one of the FPGA devices. The FPGA devices can then operate in a parallel fashion. 1. Probing the OpenCL FPGA Devices on page 1-76 The host must identify the number of OpenCL FPGA devices installed into the system. 2. Querying Device Information on page 1-76 You can direct the host to query information on your OpenCL FPGA devices. 3. Loading Kernels for Multiple FPGA Devices on page 1-77 If your system contains multiple FPGA devices, you can create specific cl_program objects for each FPGA and load them into the OpenCL runtime. Altera SDK for OpenCL Programming Guide Send Feedback Altera Corporation 1-76 UG-OCL002 2016.05.02 Probing the OpenCL FPGA Devices Probing the OpenCL FPGA Devices The host must identify the number of OpenCL FPGA devices installed into the system. 1. To query a list of FPGA devices installed in your machine, invoke the aocl diagnose command. 2. To direct the host to identify the number of OpenCL FPGA devices, add the following lines of code to your host application: //Get the platform ciErrNum = clGetPlatformID(&cpPlatform); //Get the devices ciErrNum = clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_ALL, 0, NULL, &ciDeviceCount); cdDevices = (cl_device_id * )malloc(ciDeviceCount * sizeof(cl_device_id)); ciErrNum = clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_ALL, ciDeviceCount, cdDevices, NULL); For example, on a system with two OpenCL FPGA devices, ciDeviceCount has a value of 2, and cdDevices contains a list of two device IDs (cl_device_id). Related Information Querying the Device Name of Your FPGA Board (diagnose) on page 1-11 Querying Device Information You can direct the host to query information on your OpenCL FPGA devices. • To direct the host to output a list of OpenCL FPGA devices installed into your system, add the following lines of code to your host application: char buf[1024]; for (unsigned i = 0; i < ciDeviceCount; i++); { clGetDeviceInfo(cdDevices[i], CL_DEVICE_NAME, 1023, buf, 0); printf("Device %d: '%s'\n", i, buf); } When you query the device information, the host will list your FPGA devices in the following manner: Device : : Where: is the device number. is the board designation you use to target your FPGA device when you invoke the aoc command. is the advertised name of the FPGA board. For example, if you have two identical FPGA boards on your system, the host generates an output that resembles the following: Device 0: board_1: Stratix V FPGA Board Device 1: board_1: Stratix V FPGA Board Altera Corporation Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Loading Kernels for Multiple FPGA Devices 1-77 Note: The clGetDeviceInfo function returns the board type (for example, board_1) that the Altera Offline Compiler lists on-screen when you invoke the aoc --list-boards command. If your accelerator board contains more than one FPGA, each device is treated as a "board" and is given a unique name. Related Information Listing the Available FPGA Boards in Your Custom Platform (--list-boards) on page 1-9 Loading Kernels for Multiple FPGA Devices If your system contains multiple FPGA devices, you can create specific cl_program objects for each FPGA and load them into the OpenCL runtime. The following host code demonstrates the usage of the clCreateProgramWithBinary and createMultiDeviceProgram functions to program multiple FPGA devices: cl_program createMultiDeviceProgram(cl_context context, const cl_device_id *device_list, cl_uint num_devices, const char *aocx_name); // Utility function for loading file into Binary String // unsigned char* load_file(const char* filename, size_t *size_ret) { FILE *fp = fopen(aocx_name,"rb"); fseek(fp,0,SEEK_END); size_t len = ftell(fp); char *result = (unsigned char*)malloc(sizeof(unsigned char)*len); rewind(fp); fread(result,len,1,fp); fclose(fp); *size_ret = len; return result; } //Create a Program that is compiled for the devices in the "device_list" // cl_program createMultiDeviceProgram(cl_context context, const cl_device_id *device_list, cl_uint num_devices, const char *aocx_name) { printf("creating multi device program %s for %d devices\n", aocx_name, num_devices); const unsigned char **binaries = (const unsigned char**)malloc(num_devices*sizeof(unsigned char*)); size_t *lengths=(size_t*)malloc(num_devices*sizeof(size_t)); cl_int err; for(cl_uint i=0; i .cl [ .cl ...] command. Where [ .cl ...] are the optional space-delimited file names of kernels that you can compile in addition to .cl. The Altera Offline Compiler groups the .cl files into a temporary file. It then compiles this file to generate the .aocx file. You must specify the order of the kernels in this temporary file on the command line. Compiling a Kernel for a Big-Endian System (--big-endian) To direct the Altera Offline Compiler to compile your OpenCL kernel and generate a hardware configuration file for use in a big-endian system (for example, the IBM POWER system), include the -big-endian option in the aoc command. Altera SDK for OpenCL Programming Guide Send Feedback Altera Corporation 1-80 Compiling Your Kernel without Building Hardware (-c) UG-OCL002 2016.05.02 If you create an OpenCL kernel program that targets a big-endian architecture, you have to specify bigendian ordering for the host and global memories. If not, the AOC automatically defaults to little-endian ordering. • At a command prompt, invoke the aoc .cl --big-endian command. Compiling Your Kernel without Building Hardware (-c) To direct the Altera Offline Compiler to compile your OpenCL kernel and generate a Quartus Prime hardware design project without creating a hardware configuration file, include the -c option in your aoc command. • At a command prompt, invoke the aoc -c .cl [ .cl ...] command. Where [ .cl ...] are the optional space-delimited file names of kernels that you can compile in addition to .cl. When you invoke the aoc command with the -c flag, the AOC compiles the kernel and creates the following files and directories: • The .aoco file. The AOC creates the .aoco file in a matter of seconds to minutes. If you compile multiple kernels, their information in the .aoco file appears in the order in which you list them on the command line. • A folder or subdirectory. It contains intermediate files that the Altera SDK for OpenCL uses to build the hardware configuration file necessary for FPGA programming. Specifying the Location of Header Files (-I ) To add a directory to the list of directories that the Altera Offline Compiler searches for header files during kernel compilation, include the -I option in your aoc command. Altera Corporation Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Specifying the Name of an AOC Output File (-o ) 1-81 If the header files are in the same directory as your kernel, you do not need to include the -I option in your aoc command. The AOC automatically searches the current folder or directory for header files. • At a command prompt, invoke the aoc -I .cl command. Caution: For Windows systems, ensure that your include path does not contain any trailing slashes. The AOC considers a trailing forward slash (/) or backward slash (\) as illegal. The AOC generates an error message if you invoke the aoc command in the following manner: aoc -I \ \ ... \ \ .cl or aoc -I / / ... / / .cl The correct way to specify the include path is as follows: aoc -I \ \ ... \ .cl or aoc -I / / ... / .cl Specifying the Name of an AOC Output File (-o ) To specify the name of a .aoco file or a .aocx file, include the -o option in your aoc command. • If you implement the multistep compilation flow, specify the names of the output files in the following manner: 1. To specify the name of the .aoco file that the Altera Offline Compiler creates during an intermediate compilation step, invoke the aoc -c -o .aoco .cl command. 2. To specify the name of the .aocx file that the AOC creates during the final compilation step, invoke the aoc -o .aocx .aoco command. • If you implement the one-step compilation flow, specify the name of the .aocx file by invoking the aoc -o .aocx .cl command. Compiling a Kernel for a Specific FPGA Board (--board ) To compile your OpenCL kernel for a specific FPGA board, include the --board option in the aoc command. Altera SDK for OpenCL Programming Guide Send Feedback Altera Corporation 1-82 UG-OCL002 2016.05.02 Compiling a Kernel for a Specific FPGA Board (--board ) Before you begin To compile a kernel for a specific board in your Custom Platform, you must first set the environment variable AOCL_BOARD_PACKAGE_ROOT to point to the location of your Custom Platform. Attention: If you want to program multiple FPGA devices, you may select board types that are available in the same Custom Platform because AOCL_BOARD_PACKAGE_ROOT only points to the location of one Custom Platform. When you compile your kernel by including the --board option in the aoc command, the Altera Offline Compiler defines the preprocessor macro AOCL_BOARD_ to be 1, which allows you to compile device-optimized code in your kernel. 1. To obtain the names of the available FPGA boards in your Custom Platform, invoke the aoc -list-boards command. For example, the AOC generates the following output: Board List: FPGA_board_1 where FPGA_board_1 is the . 2. To compile your OpenCL kernel for FPGA_board_1, invoke the aoc --board FPGA_board_1 .cl command. The AOC defines the preprocessor macro AOCL_BOARD_FPGA_board_1 to be 1 and compiles kernel code that targets FPGA_board_1. Tip: To readily identify compiled kernel files that target a specific FPGA board, Altera recommends that you rename the kernel binaries by including the -o option in the aoc command. To target your kernel to FPGA_board_1 in the one-step compilation flow, invoke the following command: aoc --board FPGA_board_1 .cl -o _FPGA_board_1.aocx To target your kernel to FPGA_board_1 in the multistep compilation flow, perform the following tasks: 1. Invoke the following command to generate the .aoco file: aoc -c --board FPGA_board_1 .cl -o _FPGA_board_1.aoco 2. Invoke the following command to generate the .aocx file: aoc --board FPGA_board_1 _FPGA_board_1.aoco -o _FPGA_board_1.aocx If you have an accelerator board consisting of two FPGAs, each FPGA device has an equivalent "board" name (for example, board_fpga_1 and board_fpga_2). To target a kernel_1.cl to board_fpga_1 and a kernel_2.cl to board_fpga_2, invoke the following commands: aoc --board board_fpga1 kernel_1.cl aoc --board board_fpga2 kernel_2.cl Altera Corporation Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Resolving Hardware Generation Fitting Errors during Kernel Compilation... 1-83 Related Information Specifying the Name of an AOC Output File (-o ) on page 1-81 Resolving Hardware Generation Fitting Errors during Kernel Compilation (--higheffort) Sometimes, OpenCL kernel compilation fails during the hardware generation stage because the design fails to meet fitting constraints. In this case, recompile the kernel using the --high-effort option of the aoc command. When kernel compilation fails because of a fitting constraint problem, the Altera Offline Compiler displays the following error message: Error: Kernel fit error, recommend using --high-effort. Error: Cannot fit kernel(s) on device • To overcome this problem, recompile your kernel by invoking the following command: aoc --high-effort .cl After you invoke the command, the AOC displays the following message: High-effort hardware generation selected, compile time may increase significantly. The AOC will make three attempts to recompile your kernel and generate hardware. Modify your kernel if compilation still fails after the --high-effort attempt. Defining Preprocessor Macros to Specify Kernel Parameters (-D ) The Altera Offline Compiler supports preprocessor macros that allow you to pass macro definitions and compile code on a conditional basis. Altera SDK for OpenCL Programming Guide Send Feedback Altera Corporation 1-84 UG-OCL002 2016.05.02 Defining Preprocessor Macros to Specify Kernel Parameters (-D... • To pass a preprocessor macro definition to the AOC, invoke the aoc -D .cl command. • To override the existing value of a defined preprocessor macro, invoke the aoc -D = .cl command. Consider the following code snippet for the kernel sum: #ifndef UNROLL_FACTOR #define UNROLL_FACTOR 1 #endif __kernel void sum (__global const int * restrict x, __global int * restrict sum) { int accum = 0; #pragma unroll UNROLL_FACTOR for(size_t i = 0; i < 4; i++) { accum += x[i + get_global_id(0) * 4]; } sum[get_global_id(0)] = accum; } To override the UNROLL_FACTOR of 1 and set it to 4, invoke the aoc -D UNROLL_FACTOR=4 sum.cl command. Invoking this command is equivalent to replacing the line #define UNROLL_FACTOR 1 with #define UNROLL_FACTOR 4 in the sum kernel source code. • To use preprocessor macros to control how the AOC optimizes your kernel without modifying your kernel source code, invoke the aoc -o .aocx -D = .cl Where: -o is the AOC option you use to specify the name of the .aocx file that the AOC generates. is the name of the .aocx file that the AOC generates using the preprocessor macro value you specify. Tip: To preserve the results from both compilations on your file system, compile your kernels as separate binaries by using the -o flag of the aoc command. For example, if you want to compile the same kernel multiple times with required work-group sizes of 64 and 128, you can define a WORK_GROUP_SIZE preprocessor macro for the kernel attribute reqd_work_group_size, as shown below: __attribute__((reqd_work_group_size(WORK_GROUP_SIZE,1,1))) __kernel void myKernel(...) for (size_t i = 0; i < 1024; i++) { // statements } Compile the kernel multiple times by typing the following commands: aoc –o myKernel_64.aocx –D WORK_GROUP_SIZE=64 myKernel.cl aoc –o myKernel_128.aocx –D WORK_GROUP_SIZE=128 myKernel.cl Altera Corporation Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Generating Compilation Progress Report (-v) 1-85 Generating Compilation Progress Report (-v) To direct the Altera Offline Compiler to report on the progress of a compilation, include the -v option in your aoc command. • To direct the AOC to report on the progress of a full compilation, invoke the aoc -v .cl command. The AOC generates a compilation progress report similar to the following example: aoc: Environment checks are completed successfully. You are now compiling the full flow!! aoc: Selected target board s5_net aoc: Running OpenCL parser.... aoc: OpenCL parser completed successfully. aoc: Compiling.... aoc: Linking with IP library ... aoc: First stage compilation completed successfully. aoc: Setting up project for CvP revision flow.... aoc: Hardware generation completed successfully. • To direct the AOC to report on the progress of an intermediate compilation step that does not build hardware, invoke the aoc -c -v .cl command. The AOC generates a compilation progress report similar to the following example: aoc: aoc: aoc: aoc: aoc: aoc: aoc: aoc: Environment checks are completed successfully. Selected target board s5_net Running OpenCL parser.... OpenCL parser completed successfully. Compiling.... Linking with IP library ... First stage compilation completed successfully. To compile this project, run "aoc .aoco" • To direct the AOC to report on the progress of a compilation for emulation, invoke the aoc march=emulator -v .cl command. The AOC generates a compilation progress report similar to the following example: aoc: Environment checks are completed successfully. You are now compiling the full flow!! aoc: Selected target board s5_net aoc: Running OpenCL parser....ex aoc: OpenCL parser completed successfully. aoc: Compiling for Emulation .... aoc: Emulator Compilation completed successfully. Emulator flow is successful. Related Information • Compiling Your Kernel without Building Hardware (-c) on page 1-80 • Emulating and Debugging Your OpenCL Kernel on page 1-88 Displaying the Estimated Resource Usage Summary On-Screen (--report) By default, the Altera Offline Compiler estimates hardware resource usage during compilation . The AOC factors in the usage of external interfaces such as PCIe, memory controller, and DMA engine in its calculations. During kernel compilation, the AOC generates an estimated resource usage summary in the .log file within the directory. To review the estimated resource usage summary on-screen, include the --report option in the aoc command. Altera SDK for OpenCL Programming Guide Send Feedback Altera Corporation 1-86 UG-OCL002 2016.05.02 Suppressing AOC Warning Messages (-W) You can review the estimated resource usage summary without performing a full compilation. To review the summary on-screen prior to generating the hardware configuration file, include the -c option in your aoc command. • At a command prompt, invoke the aoc -c .cl --report command. The AOC generates an output similar to the following example: +--------------------------------------------------------------------+ ; Estimated Resource Usage Summary ; +----------------------------------------+---------------------------+ ; Resource + Usage ; +----------------------------------------+---------------------------+ ; Logic utilization ; 35% ; ; ALUTs ; 22% ; ; Dedicated logic registers ; 15% ; ; Memory blocks ; 29% ; ; DSP blocks ; 0% ; +----------------------------------------+---------------------------; Related Information Compiling Your Kernel without Building Hardware (-c) on page 1-80 Suppressing AOC Warning Messages (-W) To suppress all warning messages, include the -W option in your aoc command. • At a command prompt, invoke the aoc -W .cl command. Converting AOC Warning Messages into Error Messages (-Werror) To convert all warning messages into error messages, include the -Werror option in your aoc command. • At a command prompt, invoke the aoc -Werror .cl command. Adding Source References to Optimization Reports (-g) Include the -g option in your aoc command to add source references to compilation reports. When you compile a single work-item kernel, the Altera Offline Compiler automatically generates an optimization report in the .log file in the subfolder or subdirectory. Adding source information such as line numbers and variable names in the optimization report allows you to pinpoint the locations of loop-carried dependencies in your kernel source code. • To add source information in the optimization report, invoke the aoc -g .cl command. Disabling Burst-Interleaving of Global Memory (--no-interleaving ) The Altera Offline Compiler cannot burst-interleave global memory across different memory types. You can disable burst-interleaving for all global memory banks of the same type and manage them manually by including the --no-interleaving option in your aoc command. Manual partitioning of memory buffers overrides the default burst-interleaved configuration of global memory. Altera Corporation Altera SDK for OpenCL Programming Guide Send Feedback UG-OCL002 2016.05.02 Configuring Constant Memory Cache Size (--const-cache-bytes ) 1-87 Caution: The --no-interleaving option requires a global memory type parameter. If you do not specify a memory type, the AOC issues an error message. • To direct the AOC to disable burst-interleaving for the default global memory, invoke the aoc .cl --no-interleaving default command. Your accelerator board might include multiple global memory types. To identify the default global memory type, refer to board vendor's documentation for your Custom Platform. • For a heterogeneous memory system, to direct the AOC to disable burst-interleaving of a specific global memory type, perform the following tasks: 1. Consult the board_spec.xml file of your Custom Platform for the names of the available global memory types (for example, DDR and quad data rate (QDR)). 2. To disable burst-interleaving for one of the memory types (for example, DDR), invoke the aoc .cl --no-interleaving DDR command. The AOC enables manual partitioning for the DDR memory bank, and configures the other memory bank in a burst-interleaved fashion. 3. To disable burst-interleaving for more than one type of global memory buffers, include a --nointerleaving option for each global memory type. For example, to disable burst-interleaving for both DDR and QDR, invoke the aoc .cl --no-interleaving DDR --no-interleaving QDR command. Caution: Do not pass a buffer as kernel arguments that associate it with multiple memory technologies. Configuring Constant Memory Cache Size (--const-cache-bytes ) Include the --const-cache-bytes flag in your aoc command to direct the Altera Offline Compiler to configure the constant memory cache size (rounded up to the closest power of 2). The default constant cache size is 16 kB. • To configure the constant memory cache size, invoke the aoc --const-cache-bytes .cl command, where is the cache size in bytes. For example, to configure a 32 kB cache during compilation of the OpenCL kernel myKernel.cl, invoke the aoc --const-cache-bytes 32768 myKernel.cl command. Note: This argument has no effect if none of the kernels uses the __constant address space. Relaxing the Order of Floating-Point Operations (--fp-relaxed) Include the --fp-relaxed option in your aoc command to direct the Altera Offline Compiler to relax the order of arithmetic floating-point operations using a balanced tree hardware implementation. Implementing a balanced tree structure leads to more efficient hardware at the expense of numerical variation in results. Caution: To implement this optimization control, your program must be able to tolerate small variations in the floating-point results. • To direct the AOC to execute a balanced tree hardware implementation, invoke the aoc --fprelaxed