AMD Accelerated Parallel Processing OpenCL Programming Guide

AMD Accelerated Parallel Processing OpenCL™
Preface
Contents
Chapter 1 OpenCL Architecture and AMD Accelerated Parallel Processing
Chapter 2 Building and Running OpenCL Programs
Chapter 3 Debugging OpenCL
- 3.1 AMD gDEBugger
- 3.2 Debugging CPU Kernels with GDB
Chapter 4 OpenCL Performance and Optimization
Chapter 5 OpenCL Performance and Optimization for Southern Islands Devices
Chapter 6 OpenCL Performance and Optimization for Evergreen and Northern Islands Devices
Chapter 7 OpenCL Static C++ Programming Language
Appendix A OpenCL Optional Extensions
Appendix B The OpenCL Installable Client Driver (ICD)
- B.1 Overview
- B.2 Using ICD
Appendix C Compute Kernel
Appendix D Device Parameters
Appendix E OpenCL Binary Image Format (BIF) v2.0
- E.1 Overview
  - E.1.1 Executable and Linkable Format (ELF) Header
    - Table E.1 ELF Header Fields
  - E.1.2 Bitness
- E.2 BIF Options
Appendix F Open Decode API Tutorial
Appendix G OpenCL-OpenGL Interoperability
- G.1 Under Windows
- G.2 Linux Operating System
  - G.2.1 Single GPU Environment
    - G.2.1.1 Creating CL Context from a GL Context
  - G.2.2 Multi-GPU Configuration
    - G.2.2.1 Creating CL Context from a GL Context
Index

rev2.3

AMD Accelerated Parallel Processing

OpenCL™

Programming Guide

July 2012

AMD Accelerated Parallel Processing, the AMD Accelerated Parallel Processing logo, ATI, the ATI logo, Radeon, FireStream, FirePro, Catalyst, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Microsoft, Visual Studio, Windows, and Windows Vista are registered trademarks of Microsoft Corporation in the U.S. and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.

The contents of this document are provided in connection with Advanced Micro Devices, Inc. (“AMD”) products. AMD makes no representations or warranties with respect to the accuracy or completeness of the contents of this publication and reserves the right to make changes to specifications and product descriptions at any time without notice. The information contained herein may be of a preliminary or advance nature and is subject to change without notice. No license, whether express, implied, arising by estoppel or otherwise, to any intellectual property rights is granted by this publication. Except as set forth in AMD’s Standard Terms and Conditions of Sale, AMD assumes no liability whatsoever, and disclaims any express or implied warranty, relating to its products including, but not limited to, the implied warranty of merchantability, fitness for a particular purpose, or infringement of any intellectual property right.

AMD’s products are not designed, intended, authorized or warranted for use as components in systems intended for surgical implant into the body, or in other applications intended to support or sustain life, or in any other application in which the failure of AMD’s product could create a situation where personal injury, death, or severe property or environmental damage may occur. AMD reserves the right to discontinue or make changes to its products at any time without notice.

Advanced Micro Devices, Inc.

One AMD Place

P.O. Box 3453

Sunnyvale, CA 94088-3453

www.amd.com

For AMD Accelerated Parallel Processing:

URL: developer.amd.com/appsdk

Developing: developer.amd.com/

Support: developer.amd.com/appsdksupport

Forum: developer.amd.com/openclforum

AMD ACCELERATED PARALLEL PROCESSING

Preface

About This Document

This document provides a basic description of the AMD Accelerated Parallel Processing environment and components. It describes the basic architecture of stream processors and provides useful performance tips. This document also provides a guide for programmers who want to use AMD Accelerated Parallel Processing to accelerate their applications.

Audience

This document is intended for programmers. It assumes prior experience in writing code for CPUs and a basic understanding of threads (work-items). While a basic understanding of GPU architectures is useful, this document does not assume prior graphics knowledge. It further assumes an understanding of chapters 1, 2, and 3 of the OpenCL Specification (for the latest version, see http://www.khronos.org/registry/cl/).

Organization

This AMD Accelerated Parallel Processing document begins, in Chapter 1, with an overview of the AMD Accelerated Parallel Processing programming models, OpenCL, the AMD Compute Abstraction Layer (CAL), the AMD APP KernelAnalyzer, and the AMD APP Profiler. Chapter 2 discusses compiling and running OpenCL programs. Chapter 3 describes using the GNU debugger (GDB) to debug OpenCL programs. Chapter 4 discusses general performance and optimization considerations when programming for AMD Accelerated Parallel Processing devices. Chapter 5 details performance and optimization considerations specifically for Southern Islands devices. Chapter 6 details performance and optimization considerations for Evergreen and Northern Islands devices. Appendix A describes the supported optional OpenCL extensions. Appendix B details the installable client driver (ICD) for OpenCL. Appendix C details the compute kernel and contrasts it with a pixel shader. Appendix D lists the device parameters. Appendix E describes the OpenCL binary image format (BIF). Appendix F describes the OpenVideo Decode API. Appendix G describes the interoperability between OpenCL and OpenGL. The book ends with a glossary of acronyms and terms, followed by an index.

Conventions

The following conventions are used in this document.

Related Documents

• The OpenCL Specification, Version 1.1, published by Khronos OpenCL Working Group, Aaftab Munshi (ed.), 2010.

• AMD, R600 Technology, R600 Instruction Set Architecture, Sunnyvale, CA, est. pub. date 2007. This document includes the RV670 GPU instruction details.

• ISO/IEC 9899:TC2 - International Standard - Programming Languages - C

• Kernighan, Brian W., and Ritchie, Dennis M., The C Programming Language, Prentice-Hall, Inc., Upper Saddle River, NJ, 1978.

• I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan, “Brook for GPUs: stream computing on graphics hardware,” ACM Trans. Graph., vol. 23, no. 3, pp. 777–786, 2004.

• AMD Compute Abstraction Layer (CAL) Intermediate Language (IL) Reference Manual. Published by AMD.

• Buck, Ian; Foley, Tim; Horn, Daniel; Sugerman, Jeremy; Hanrahan, Pat; Houston, Mike; Fatahalian, Kayvon. “BrookGPU” http://graphics.stanford.edu/projects/brookgpu/

• Buck, Ian. “Brook Spec v0.2”. October 31, 2003. http://merrimac.stanford.edu/brook/brookspec-05-20-03.pdf

• OpenGL Programming Guide, at http://www.glprogramming.com/red/

• Microsoft DirectX Reference Website, at http://msdn.microsoft.com/en-us/directx

• GPGPU: http://www.gpgpu.org, and Stanford BrookGPU discussion forum http://www.gpgpu.org/forums/

Notation                     Description
mono-spaced font             A filename, file path, or code.
*                            Any number of alphanumeric characters in the name of a code format, parameter, or instruction.
[1,2)                        A range that includes the left-most value (in this case, 1) but excludes the right-most value (in this case, 2).
[1,2]                        A range that includes both the left-most and right-most values (in this case, 1 and 2).
{x | y}                      One of the multiple options listed. In this case, x or y.
0.0f                         A single-precision (32-bit) floating-point value.
0.0                          A double-precision (64-bit) floating-point value.
1011b                        A binary value, in this example a 4-bit value.
7:4                          A bit range, from bit 7 to 4, inclusive. The high-order bit is shown first.
italicized word or phrase    The first use of a term or concept basic to the understanding of stream computing.
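
The range and bit-range notations listed above can be read as small operations. The following C sketch is an illustration only and is not taken from the guide; the helper names `bit_range` and `in_half_open` are hypothetical, and the code assumes 32-bit unsigned values with `hi >= lo`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the guide): returns bits hi:lo of value,
 * inclusive at both ends, with the high-order bit named first, matching
 * the "7:4" notation. Assumes 31 >= hi >= lo. */
static uint32_t bit_range(uint32_t value, unsigned hi, unsigned lo)
{
    unsigned width = hi - lo + 1;
    /* Shifting a 32-bit value by 32 is undefined in C, so the full-width
     * case is handled separately. */
    uint32_t mask = (width >= 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return (value >> lo) & mask;
}

/* Hypothetical helper: membership test for the half-open range [lo, hi),
 * which includes lo but excludes hi. */
static int in_half_open(double x, double lo, double hi)
{
    return x >= lo && x < hi;
}
```

With these definitions, `bit_range(0xB5, 7, 4)` yields the 4-bit value 1011b (decimal 11), and `in_half_open(2.0, 1.0, 2.0)` is false because [1,2) excludes 2.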

Contact Information

URL: developer.amd.com/appsdk

Developing: developer.amd.com/

Support: developer.amd.com/appsdksupport

Forum: developer.amd.com/openclforum

REVISION HISTORY

Rev     Description
1.3e    Deleted encryption reference.
1.3f    Added basic guidelines to CL-GL Interop appendix.
        Corrected code in two samples in Chpt. 4.
1.3g    Numerous changes to CL-GL Interop appendix.
        Added subsections to Additional Performance Guidance on CPU Programmers Using OpenCL to Program CPUs and Using Special CPU Instructions in the Optimizing Kernel Code subsection.
2.0     Added ELF Header section in Appendix E.
2.1     New Profiler and KernelAnalyzer sections in chapter 4.
        New AMD gDEBugger section in chapter 3.
        Added extensions to Appendix A.
        Numerous updates throughout for Southern Islands, especially in Chapters 1 and 5.
        Split original chapter 4 into three chapters. Now, chapter 4 is general considerations for Evergreen, Northern Islands, and Southern Islands; chapter 5 is specifically for Southern Islands devices; chapter 6 is for Evergreen and Northern Islands devices.
        Update of Device Parameters in Appendix D.
2.1a    Reinstated some supplemental compiler options in Section 2.1.4.
        Changes/additions to Table 4.3.
2.1b    Minor change in Section 1.8.3, indicating that LDS model has not changed from previous GPU families.
2.2     Addition of channel mapping information (chpt 5). Minor corrections throughout. Deletion of duplicate material from chpt 6.
2.3     Inclusion of upgraded index. Minor rewording and corrections.

Contents

Preface

Contents

Chapter 1 OpenCL Architecture and AMD Accelerated Parallel Processing

1.1 Software Overview
    1.1.1 Data-Parallel Programming Model
    1.1.2 Task-Parallel Programming Model
    1.1.3 Synchronization
1.2 Hardware Overview for Southern Islands Devices
1.3 Hardware Overview for Evergreen and Northern Islands Devices
1.4 The AMD Accelerated Parallel Processing Implementation of OpenCL
    1.4.1 Work-Item Processing
    1.4.2 Flow Control
    1.4.3 Work-Item Creation
1.5 Memory Architecture and Access
    1.5.1 Memory Access
    1.5.2 Global Buffer
    1.5.3 Image Read/Write
    1.5.4 Memory Load/Store
1.6 Communication Between Host and GPU in a Compute Device
    1.6.1 PCI Express Bus
    1.6.2 Processing API Calls: The Command Processor
    1.6.3 DMA Transfers
    1.6.4 Masking Visible Devices
1.7 GPU Compute Device Scheduling
1.8 Terminology
    1.8.1 Compute Kernel
    1.8.2 Wavefronts and Work-groups
    1.8.3 Local Data Store (LDS)
1.9 Programming Model
1.10 Example Programs
    1.10.1 First Example: Simple Buffer Write
    1.10.2 Example: Parallel Min() Function

Chapter 2 Building and Running OpenCL Programs

2.1 Compiling the Program
    2.1.1 Compiling on Windows
    2.1.2 Compiling on Linux
    2.1.3 Supported Standard OpenCL Compiler Options
    2.1.4 AMD-Developed Supplemental Compiler Options
2.2 Running the Program
    2.2.1 Running Code on Windows
    2.2.2 Running Code on Linux
2.3 Calling Conventions

Chapter 3 Debugging OpenCL

3.1 AMD gDEBugger
3.2 Debugging CPU Kernels with GDB
    3.2.1 Setting the Environment
    3.2.2 Setting the Breakpoint in an OpenCL Kernel
    3.2.3 Sample GDB Session
    3.2.4 Notes

Chapter 4 OpenCL Performance and Optimization

4.1 AMD APP Profiler
    4.1.1 Collecting OpenCL Application Trace
    4.1.2 Collecting OpenCL GPU Kernel Performance Counters
    4.1.3 OpenCL Kernel Occupancy Modeler
4.2 AMD APP KernelAnalyzer
4.3 Analyzing Processor Kernels
    4.3.1 Intermediate Language and GPU Disassembly
    4.3.2 Generating IL and ISA Code
4.4 Estimating Performance
    4.4.1 Measuring Execution Time
    4.4.2 Using the OpenCL timer with Other System Timers
    4.4.3 Estimating Memory Bandwidth
4.5 OpenCL Memory Objects
    4.5.1 Types of Memory Used by the Runtime
    4.5.2 Placement
    4.5.3 Memory Allocation
    4.5.4 Mapping
    4.5.5 Reading, Writing, and Copying
    4.5.6 Command Queue
4.6 OpenCL Data Transfer Optimization
    4.6.1 Definitions
    4.6.2 Buffers

4.7 Using Multiple OpenCL Devices
    4.7.1 CPU and GPU Devices
    4.7.2 When to Use Multiple Devices
    4.7.3 Partitioning Work for Multiple Devices
    4.7.4 Synchronization Caveats
    4.7.5 GPU and CPU Kernels
    4.7.6 Contexts and Devices

Chapter 5 OpenCL Performance and Optimization for Southern Islands Devices

5.1 Global Memory Optimization
    5.1.1 Channel Conflicts
    5.1.2 Coalesced Writes
    5.1.3 Hardware Variations
5.2 Local Memory (LDS) Optimization
5.3 Constant Memory Optimization
5.4 OpenCL Memory Resources: Capacity and Performance
5.5 Using LDS or L1 Cache
5.6 NDRange and Execution Range Optimization
    5.6.1 Hiding ALU and Memory Latency
    5.6.2 Resource Limits on Active Wavefronts
    5.6.3 Partitioning the Work
    5.6.4 Summary of NDRange Optimizations
5.7 Instruction Selection Optimizations
    5.7.1 Instruction Bandwidths
    5.7.2 AMD Media Instructions
    5.7.3 Math Libraries
    5.7.4 Compiler Optimizations
5.8 Additional Performance Guidance
    5.8.1 Loop Unroll pragma
    5.8.2 Memory Tiling
    5.8.3 General Tips
    5.8.4 Guidance for CUDA Programmers Using OpenCL
    5.8.5 Guidance for CPU Programmers Using OpenCL to Program GPUs
    5.8.6 Optimizing Kernel Code
    5.8.7 Optimizing Kernels for Southern Island GPUs
5.9 Specific Guidelines for Southern Islands GPUs

Chapter 6 OpenCL Performance and Optimization for Evergreen and Northern Islands Devices

6.1 Global Memory Optimization
    6.1.1 Two Memory Paths
    6.1.2 Channel Conflicts
    6.1.3 Float4 Or Float1

    6.1.4 Coalesced Writes
    6.1.5 Alignment
    6.1.6 Summary of Copy Performance
    6.1.7 Hardware Variations
6.2 Local Memory (LDS) Optimization
6.3 Constant Memory Optimization
6.4 OpenCL Memory Resources: Capacity and Performance
6.5 Using LDS or L1 Cache
6.6 NDRange and Execution Range Optimization
    6.6.1 Hiding ALU and Memory Latency
    6.6.2 Resource Limits on Active Wavefronts
    6.6.3 Partitioning the Work
    6.6.4 Optimizing for Cedar
    6.6.5 Summary of NDRange Optimizations
6.7 Using Multiple OpenCL Devices
    6.7.1 CPU and GPU Devices
    6.7.2 When to Use Multiple Devices
    6.7.3 Partitioning Work for Multiple Devices
    6.7.4 Synchronization Caveats
    6.7.5 GPU and CPU Kernels
    6.7.6 Contexts and Devices
6.8 Instruction Selection Optimizations
    6.8.1 Instruction Bandwidths
    6.8.2 AMD Media Instructions
    6.8.3 Math Libraries
    6.8.4 VLIW and SSE Packing
    6.8.5 Compiler Optimizations
6.9 Clause Boundaries
6.10 Additional Performance Guidance
    6.10.1 Loop Unroll pragma
    6.10.2 Memory Tiling
    6.10.3 General Tips
    6.10.4 Guidance for CUDA Programmers Using OpenCL
    6.10.5 Guidance for CPU Programmers Using OpenCL to Program GPUs
    6.10.6 Optimizing Kernel Code
    6.10.7 Optimizing Kernels for Evergreen and 69XX-Series GPUs

Chapter 7 OpenCL Static C++ Programming Language

7.1 Overview
    7.1.1 Supported Features
    7.1.2 Unsupported Features
    7.1.3 Relations with ISO/IEC C++

7.2 Additions and Changes to Section 5 - The OpenCL C Runtime
    7.2.1 Additions and Changes to Section 5.7.1 - Creating Kernel Objects
    7.2.2 Passing Classes between Host and Device
7.3 Additions and Changes to Section 6 - The OpenCL C Programming Language
    7.3.1 Building C++ Kernels
    7.3.2 Classes and Derived Classes
    7.3.3 Namespaces
    7.3.4 Overloading
    7.3.5 Templates
    7.3.6 Exceptions
    7.3.7 Libraries
    7.3.8 Dynamic Operation
7.4 Examples
    7.4.1 Passing a Class from the Host to the Device and Back
    7.4.2 Kernel Overloading
    7.4.3 Kernel Template

Appendix A OpenCL Optional Extensions

A.1 Extension Name Convention

A.2 Querying Extensions for a Platform

A.3 Querying Extensions for a Device

A.4 Using Extensions in Kernel Programs

A.5 Getting Extension Function Pointers

A.6 List of Supported Extensions that are Khronos-Approved

A.7 cl_ext Extensions

A.8 AMD Vendor-Specific Extensions

A.8.1 cl_amd_fp64

A.8.2 cl_amd_vec3

A.8.3 cl_amd_device_persistent_memory

A.8.4 cl_amd_device_attribute_query

A.8.5 cl_amd_device_profiling_timer_offset

A.8.6 cl_amd_device_topology

A.8.7 cl_amd_device_board_name

A.8.8 cl_amd_compile_options

A.8.9 cl_amd_offline_devices

A.8.10 cl_amd_event_callback

A.8.11 cl_amd_popcnt

A.8.12 cl_amd_media_ops

A.8.13 cl_amd_media_ops2

A.8.14 cl_amd_printf

A.9 cl_amd_predefined_macros

A.10 Supported Functions for cl_amd_fp64 / cl_khr_fp64

A.11 Extension Support by Device

Appendix B The OpenCL Installable Client Driver (ICD)

B.1 Overview

B.2 Using ICD

Appendix C Compute Kernel

C.1 Differences from a Pixel Shader

C.2 Indexing

C.3 Performance Comparison

C.4 Pixel Shader

C.5 Compute Kernel

C.6 LDS Matrix Transpose

C.7 Results Comparison

Appendix D Device Parameters

Appendix E OpenCL Binary Image Format (BIF) v2.0

E.1 Overview

E.1.1 Executable and Linkable Format (ELF) Header

E.1.2 Bitness

E.2 BIF Options

Appendix F Open Decode API Tutorial

F.1 Overview

F.2 Initializing

F.3 Creating the Context

F.4 Creating the Session

F.5 Decoding

F.6 Destroying Session and Context

Appendix G OpenCL-OpenGL Interoperability

G.1 Under Windows

G.1.1 Single GPU Environment

G.1.2 Multi-GPU Environment

G.1.3 Limitations

G.2 Linux Operating System

G.2.1 Single GPU Environment

G.2.2 Multi-GPU Configuration

Index

Figures

1.1 Generalized AMD GPU Compute Device Structure for Southern Islands Devices

1.2 AMD Radeon™ HD 79XX Device Partial Block Diagram

1.3 Generalized AMD GPU Compute Device Structure

1.4 Simplified Block Diagram of an Evergreen-Family GPU

1.5 AMD Accelerated Parallel Processing Software Ecosystem

1.6 Simplified Mapping of OpenCL onto AMD Accelerated Parallel Processing for Evergreen and Northern Islands Devices

1.7 Work-Item Grouping Into Work-Groups and Wavefronts

1.8 Interrelationship of Memory Domains for Southern Islands Devices

1.9 Dataflow between Host and GPU

1.10 Simplified Execution Of Work-Items On A Single Stream Core

1.11 Stream Core Stall Due to Data Dependency

1.12 OpenCL Programming Model

2.1 OpenCL Compiler Toolchain

2.2 Runtime Processing Structure

4.1 Timeline and API Trace View in Microsoft Visual Studio 2010

4.2 Context Summary Page View in Microsoft Visual Studio 2010

4.3 Warning(s) and Error(s) Page

4.4 Sample Session View in Microsoft Visual Studio 2010

4.5 Sample Kernel Occupancy Modeler Screen

4.6 AMD APP Kernel Analyzer

5.1 Memory System

5.2 Channel Remapping/Interleaving

5.3 Transformation to Staggered Offsets

5.4 One Example of a Tiled Layout Format

5.5 Northern Islands Compute Unit Arrangement

5.6 Southern Islands Compute Unit Arrangement

6.1 Memory System

6.2 FastPath (blue) vs CompletePath (red) Using float1

6.3 Transformation to Staggered Offsets

6.4 Two Kernels: One Using float4 (blue), the Other float1 (red)

6.5 Effect of Varying Degrees of Coalescing - Coal (blue), NoCoal (red), Split (green)

6.6 Unaligned Access Using float1

6.7 Unmodified Loop

6.8 Kernel Unrolled 4X

6.9 Unrolled Loop with Stores Clustered

6.10 Unrolled Kernel Using float4 for Vectorization

6.11 One Example of a Tiled Layout Format

C.1 Pixel Shader Matrix Transpose

C.2 Compute Kernel Matrix Transpose

C.3 LDS Matrix Transpose

F.1 Open Decode with Optional Post-Processing

Tables

4.1 Memory Bandwidth in GB/s (R = read, W = write)

4.2 OpenCL Memory Object Properties

4.3 Transfer policy on clEnqueueMapBuffer / clEnqueueMapImage / clEnqueueUnmapMemObject for Copy Memory Objects

4.4 CPU and GPU Performance Characteristics

5.1 Hardware Performance Parameters

5.2 Effect of LDS Usage on Wavefronts/CU

5.3 Instruction Throughput (Operations/Cycle for Each Stream Processor)

5.4 Resource Limits for Northern Islands and Southern Islands

6.1 Bandwidths for 1D Copies

6.2 Bandwidths for Different Launch Dimensions

6.3 Bandwidths Including float1 and float4

6.4 Bandwidths Including Coalesced Writes

6.5 Bandwidths Including Unaligned Access

6.6 Hardware Performance Parameters

6.7 Impact of Register Type on Wavefronts/CU

6.8 Effect of LDS Usage on Wavefronts/CU

6.9 CPU and GPU Performance Characteristics

6.10 Instruction Throughput (Operations/Cycle for Each Stream Processor)

6.11 Native Speedup Factor

A.1 Extension Support for AMD GPU Devices 1

A.2 Extension Support for Older AMD GPUs and CPUs

D.1 Parameters for 7xxx Devices

D.2 Parameters for 68xx and 69xx Devices

D.3 Parameters for 65xx, 66xx, and 67xx Devices

D.4 Parameters for 64xx Devices

D.5 Parameters for Zacate and Ontario Devices

D.6 Parameters for 56xx, 57xx, 58xx, Eyefinity6, and 59xx Devices

D.7 Parameters for Exxx, Cxx, 54xx, and 55xx Devices

E.1 ELF Header Fields

Chapter 1

OpenCL Architecture and AMD Accelerated Parallel Processing

This chapter provides a general software and hardware overview of the AMD

Accelerated Parallel Processing implementation of the OpenCL standard. It

explains the memory structure and gives simple programming examples.

1.1 Software Overview

OpenCL supports data-parallel and task-parallel programming models, as well as

hybrids of these models. Of the two, the primary one is the data-parallel model.

1.1.1 Data-Parallel Programming Model

In the data-parallel programming model, a computation is defined in terms of a sequence of instructions that executes at each point in an N-dimensional index space. It is a common, though not required, formulation of an algorithm that each computation index maps to an element in an input data set.

The OpenCL data-parallel programming model is hierarchical. The hierarchical

subdivision can be specified in two ways:

• Explicitly - the developer defines the total number of work-items to execute in parallel, as well as the division of work-items into specific work-groups.

• Implicitly - the developer specifies the total number of work-items to execute in parallel, and OpenCL manages the division into work-groups.
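The arithmetic behind an explicit division can be sketched in a few lines of C. This is a minimal illustration, not part of the OpenCL API: the helper names `round_up_global_size` and `work_group_count` are hypothetical, and the sketch assumes the OpenCL 1.x rule that an explicitly chosen work-group size must evenly divide the total number of work-items in that dimension. In host code, the explicit form corresponds to passing a local work size to clEnqueueNDRangeKernel; the implicit form corresponds to passing NULL for that argument.

```c
#include <stddef.h>

/* Hypothetical helper: pad the total work-item count up to the next
 * multiple of the chosen work-group size, so that an explicit division
 * into work-groups is valid. Returns 0 for a zero work-group size. */
size_t round_up_global_size(size_t global, size_t local)
{
    if (local == 0)
        return 0;
    return ((global + local - 1) / local) * local;
}

/* Hypothetical helper: number of work-groups produced by an explicit
 * division. Returns 0 when local does not evenly divide global
 * (assumed here to be an invalid configuration). */
size_t work_group_count(size_t global, size_t local)
{
    if (local == 0 || global % local != 0)
        return 0;
    return global / local;
}
```

For example, 1000 work-items with a work-group size of 64 would be padded to 1024 work-items, giving 16 work-groups; the kernel would then need to ignore the 24 padding work-items.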

1.1.2 Task-Parallel Programming Model

In this model, a kernel instance is executed independently of any index space.

This is equivalent to executing a kernel on a compute device with a work-group

and NDRange containing a single work-item. Parallelism is expressed using

vector data types implemented by the device, enqueuing multiple tasks, and/or

enqueuing native kernels developed using a programming model orthogonal to

OpenCL.

1.1.3 Synchronization

The two domains of synchronization in OpenCL are work-items in a single work-

group and command-queue(s) in a single context. Work-group barriers enable

synchronization of work-items in a work-group. Each work-item in a work-group

must first execute the barrier before executing any instruction beyond this barrier.

Either all of, or none of, the work-items in a work-group must encounter the barrier. Barrier and mem_fence operations do not have global scope; each is relevant only to the work-group in which it executes. However, atomic operations on global memory do have global scope and hence may provide a way to perform global synchronization.

There are two types of synchronization between commands in a command-

queue:

• command-queue barrier - enforces ordering within a single queue. Any resulting changes to memory are available to the following commands in the queue.

• events - enforce ordering between, or within, queues. Enqueued commands in OpenCL return an event identifying the command as well as the memory object updated by it. This ensures that following commands waiting on that event see the updated memory objects before they execute.

1.2 Hardware Overview for Southern Islands Devices

A general OpenCL device comprises compute units (CUs), each of which has sub-modules that ultimately have ALUs. A work-item (or SPMD kernel instance) executes on an ALU, as shown in Figure 1.1.

Figure 1.1 Generalized AMD GPU Compute Device Structure for Southern Islands Devices

For AMD Radeon™ HD 79XX devices, each of the 32 CUs has one Scalar Unit and four Vector Units, each of which contains an array of 16 processing elements (PEs). Each PE consists of one ALU. Figure 1.2 shows only two compute units of the array that comprises the compute device of the AMD Radeon™ HD 7XXX family. The four Vector Units use SIMD execution of a scalar instruction. This makes it possible to run four separate instructions at once, but they are dynamically scheduled (as opposed to those for the AMD Radeon™ HD 69XX devices, which are statically scheduled).

Figure 1.2 AMD Radeon™ HD 79XX Device Partial Block Diagram

In Figure 1.2, there are two command processors, which can process two

command queues concurrently. The Scalar Unit, Vector Unit, Level 1 data cache

(L1), and Local Data Share (LDS) are the components of one compute unit, of

which there are 32. The SC cache is the scalar unit data cache, and the Level

2 cache consists of instructions and data.

As noted, the AMD Radeon™ HD 79XX devices also have a scalar unit, and the instruction stream contains both scalar and vector instructions. On each cycle, the compute unit selects a scalar instruction and a vector instruction (as well as a memory operation and a branch operation, if available) and issues one to the scalar unit and the other to the vector unit. Issue across the four vector units takes four cycles (the same four cycles over which the 16 processing elements execute 64 work-items).

The number of compute units in an AMD GPU, and the way they are structured, varies with the device family, as well as device designations within a family. Each of these vector units possesses ALUs (processing elements). For devices in the Northern Islands (AMD Radeon™ HD 69XX) and Southern Islands (AMD Radeon™ HD 7XXX) families, these ALUs are arranged in four SIMD arrays (five in the Evergreen family) consisting of 16 processing elements each. (See Section 1.3, “Hardware Overview for Evergreen and Northern Islands Devices.”) Each of these arrays executes a single instruction across each lane for each of a block of 16 work-items. That instruction is repeated over four cycles to make the 64-element vector called a wavefront. On devices in the Southern Islands family, the four SIMD arrays execute code from four different wavefronts.
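The wavefront size quoted above follows from simple arithmetic, sketched here in C. The helper names are hypothetical; the figures used in the comments (16 lanes per SIMD array, four cycles per instruction, four SIMD arrays per compute unit) are the ones given in the text for Southern Islands devices.

```c
/* Hypothetical helper: work-items in one wavefront, given the number
 * of lanes in a SIMD array and the number of cycles over which one
 * instruction is repeated (16 and 4 in the text). */
unsigned wavefront_size(unsigned lanes_per_array, unsigned cycles_per_instruction)
{
    return lanes_per_array * cycles_per_instruction;
}

/* Hypothetical helper: work-items covered by one instruction issue on
 * every SIMD array of a compute unit, assuming each array runs its own
 * wavefront as described for Southern Islands devices. */
unsigned work_items_per_issue(unsigned arrays_per_cu,
                              unsigned lanes_per_array,
                              unsigned cycles_per_instruction)
{
    return arrays_per_cu * wavefront_size(lanes_per_array, cycles_per_instruction);
}
```

With the values from the text, `wavefront_size(16, 4)` gives the 64-element wavefront, and four arrays cover 256 work-items across four wavefronts.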

1.3 Hardware Overview for Evergreen and Northern Islands Devices

A general OpenCL device comprises compute units, each of which can have

multiple processing elements. A work-item (or SPMD kernel instance) executes

on a single processing element. The processing elements within a compute unit

can execute in lock-step using SIMD execution. Compute units, however,

execute independently (see Figure 1.3).

AMD GPUs consist of multiple compute units. The number of them and the way they are structured varies with the device family, as well as device designations within a family. Each of these processing elements possesses ALUs. For devices in the Northern Islands and Southern Islands families, these ALUs are arranged in four (in the Evergreen family, there are five) processing elements with arrays of 16 ALUs. Each of these arrays executes a single instruction across each lane for each of a block of 16 work-items. That instruction is repeated over four cycles to make the 64-element vector called a wavefront. On devices in the Southern Islands family, the four processing elements execute code from four different wavefronts. On Northern Islands and Evergreen family devices, the four arrays execute instructions from one wavefront, so that each work-item issues four (for Northern Islands) or five (for Evergreen) instructions at once in a very-long-instruction-word (VLIW) packet.

Figure 1.3 shows a simplified block diagram of a generalized AMD GPU compute

device.

Figure 1.3 Generalized AMD GPU Compute Device Structure

(Diagram labels: GPU Compute Device; Compute Unit; Processing Elements; ALUs)

Figure 1.4 is a simplified diagram of an AMD GPU compute device. Different

GPU compute devices have different characteristics (such as the number of

compute units), but follow a similar design pattern.

Figure 1.4 Simplified Block Diagram of an Evergreen-Family GPU¹

(Diagram labels: Compute Unit; Ultra-Threaded Dispatch Processor (UTDP); General-Purpose Registers; Branch Execution Unit; Instruction and Control Flow; ALUs; Processing Element)

1. Much of this is transparent to the programmer.

GPU compute devices comprise groups of compute units. Each compute unit contains numerous processing elements, which are responsible for executing kernels, each operating on an independent data stream. Processing elements, in turn, contain several ALUs, which are the fundamental, programmable units that perform integer, single-precision floating-point, double-precision floating-point, and transcendental operations. All processing elements within a compute unit execute the same instruction sequence in lock-step for Evergreen and Northern Islands devices; different compute units can execute different instructions.

A processing element is arranged as a five-way or four-way (depending on the

GPU type) very long instruction word (VLIW) processor (see bottom of

Figure 1.4). Up to five scalar operations (or four, depending on the GPU type)

can be co-issued in a VLIW instruction, each of which is executed on one of

the corresponding five ALUs. ALUs can execute single-precision floating point or

integer operations. One of the five ALUs also can perform transcendental

operations (sine, cosine, logarithm, etc.). Double-precision floating point

operations are processed (where supported) by connecting two or four of the

ALUs (excluding the transcendental core) to perform a single double-precision

operation. The processing element also contains one branch execution unit to

handle branch instructions.

Different GPU compute devices have different numbers of processing elements.

For example, the ATI Radeon™ HD 5870 GPU has 20 compute units, each with

16 processing elements, and each processing element contains five ALUs; this

yields 1600 physical ALUs.
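The ALU count in this example is the product of the three figures quoted. A minimal C sketch, using the hypothetical helper name `physical_alus`:

```c
/* Hypothetical helper: total physical ALUs on a VLIW-style device,
 * computed as compute units x processing elements per compute unit
 * x ALUs per processing element. */
unsigned physical_alus(unsigned compute_units,
                       unsigned pes_per_compute_unit,
                       unsigned alus_per_pe)
{
    return compute_units * pes_per_compute_unit * alus_per_pe;
}
```

For the ATI Radeon™ HD 5870 figures given above, `physical_alus(20, 16, 5)` evaluates to 1600.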

1.4 The AMD Accelerated Parallel Processing Implementation of OpenCL

AMD Accelerated Parallel Processing harnesses the tremendous processing

power of GPUs for high-performance, data-parallel computing in a wide range of

applications. The AMD Accelerated Parallel Processing system includes a

software stack and the AMD GPUs. Figure 1.5 illustrates the relationship of the

AMD Accelerated Parallel Processing components.

Figure 1.5 AMD Accelerated Parallel Processing Software Ecosystem

(Diagram labels: Stream Applications; Third-Party Tools; Libraries; OpenCL Runtime; Multicore CPUs; AMD GPUs)

The AMD Accelerated Parallel Processing software stack provides end-users and developers with a complete, flexible suite of tools to leverage the processing power in AMD GPUs. AMD Accelerated Parallel Processing software embraces open-systems, open-platform standards. The AMD Accelerated Parallel Processing open platform strategy enables AMD technology partners to develop and provide third-party development tools.

The software includes the following components:

• OpenCL compiler and runtime

• Performance Profiling Tools – AMD APP Profiler and AMD APP KernelAnalyzer.

• Performance Libraries – AMD Core Math Library (ACML) for optimized NDRange-specific algorithms.

The latest generations of AMD GPUs use unified shader architectures capable

of running different kernel types interleaved on the same hardware.

Programmable GPU compute devices execute various user-developed programs,

known to graphics programmers as shaders and to compute programmers as

kernels. These GPU compute devices can execute non-graphics functions using

a data-parallel programming model that maps executions onto compute units. In

this programming model, known as AMD Accelerated Parallel Processing, arrays

of input data elements stored in memory are accessed by a number of compute

units.

Each instance of a kernel running on a compute unit is called a work-item. A

specified rectangular region of the output buffer to which work-items are mapped

is known as the n-dimensional index space, called an NDRange.

The GPU schedules the range of work-items onto a group of processing

elements, until all work-items have been processed. Subsequent kernels then

can be executed, until the application completes. A simplified view of the AMD

Accelerated Parallel Processing programming model and the mapping of work-

items to processing elements is shown in Figure 1.6.

Figure 1.6 Simplified Mapping of OpenCL onto AMD Accelerated Parallel Processing for Evergreen and Northern Islands Devices

OpenCL maps the total number of work-items to be launched onto an n-dimensional grid (NDRange). The developer can specify how to divide these items into work-groups. AMD GPUs execute on wavefronts (groups of work-items executed in lock-step in a compute unit); there is an integer number of wavefronts in each work-group. Thus, as shown in Figure 1.7, hardware that schedules work-items for execution in the AMD Accelerated Parallel Processing environment includes the intermediate step of specifying wavefronts within a work-group. This permits achieving maximum performance from AMD GPUs. For a more detailed discussion of wavefronts, see Section 1.8.2, “Wavefronts and Work-groups.”

(Figure 1.6 diagram labels: Scheduler maps work-item (i, j) onto Stream Core k; Memory; Memory Interface; Memory Controller; Scheduler (UTDP); Compute Device; Compute Unit n; Processing Element 0, Processing Element k, Processing Element n-1; Registers/Constants/Literals; ALU 0, ALU 1, ALU n-1; Input Data; Output Data)

Figure 1.7 Work-Item Grouping Into Work-Groups and Wavefronts

1.4.1 Work-Item Processing

All stream cores within a compute unit execute the same instruction on each

cycle. A work-item can issue one VLIW instruction per clock cycle. The block of

work-items that are executed together is called a wavefront. To hide latencies

due to memory accesses and processing element operations, up to four work-

items from the same wavefront are pipelined on the same stream core. For

example, on the AMD Radeon™ HD 6970 GPU compute device, the 16

processing elements execute the same instructions for four cycles, which

effectively appears as a 64-wide compute unit in execution width.

The size of wavefronts can differ on different GPU compute devices. For

example, the AMD Radeon™ HD 54XX series graphics cards have a wavefront

size of 32 work-items. Higher-end AMD GPUs have a wavefront size of 64 work-items.

Compute units operate independently of each other, so it is possible for different

compute units to execute different instructions.

(Figure 1.7 diagram labels: Range; WORK-GROUP; WORK-ITEM; MVECTOR (HW SPECIFIC SIZE); Dimension X; Dim Y; Dimension Z)

1.4.2 Flow Control

Before discussing flow control, it is necessary to clarify the relationship of a

wavefront to a work-group. If a user defines a work-group, it consists of one or

more wavefronts. A wavefront is a hardware thread with its own program counter;

it is capable of following control flow independently of other wavefronts. On a
device with a wavefront size of 64, a work-group of 64 or fewer work-items
occupies one wavefront, a work-group of 65 to 128 work-items occupies two
wavefronts, and so on. For optimum hardware usage, an integer multiple of 64
work-items is recommended.
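The relationship between work-group size and wavefront count can be sketched as a ceiling division. This is a minimal illustration, not part of the AMD runtime; the function name wavefronts_per_group is hypothetical, and the wavefront size is passed in because it differs between devices.

```c
#include <assert.h>
#include <stddef.h>

/* Number of wavefronts needed to hold one work-group.
 * Ceiling division: a partially filled wavefront still counts as one. */
size_t wavefronts_per_group(size_t group_size, size_t wavefront_size)
{
    assert(wavefront_size > 0);
    return (group_size + wavefront_size - 1) / wavefront_size;
}
```

With a wavefront size of 64, a work-group of 64 work-items needs one wavefront, and a work-group of 65 or of 128 work-items needs two.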

Flow control, such as branching, is done by combining all necessary paths as a

wavefront. If work-items within a wavefront diverge, all paths are executed

serially. For example, if a work-item contains a branch with two paths, the

wavefront first executes one path, then the second path. The total time to

execute the branch is the sum of each path time. An important point is that even

if only one work-item in a wavefront diverges, the rest of the work-items in the

wavefront execute the branch. The number of work-items that must be executed

during a branch is called the branch granularity. On AMD hardware, the branch

granularity is the same as the wavefront granularity.

Masking of wavefronts is effected by constructs such as:

if(x)
{
    // items within these braces = A
}
else
{
    // items within these braces = B
}

The wavefront mask is set true for lanes (elements/items) in which x is true, and
A is executed. The mask then is inverted, and B is executed.

Example 1: If two branches, A and B, take the same amount of time t to execute

over a wavefront, the total time of execution, if any work-item diverges, is 2t.

Loops execute in a similar fashion, where the wavefront occupies a compute unit

as long as there is at least one work-item in the wavefront still being processed.

Thus, the total execution time for the wavefront is determined by the work-item

with the longest execution time.

Example 2: If t is the time it takes to execute a single iteration of a loop; and

within a wavefront all work-items execute the loop one time, except for a single

work-item that executes the loop 100 times, the time it takes to execute that

entire wavefront is 100t.
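Examples 1 and 2 can be reproduced with a small host-side cost model. This is only a sketch of the timing rule described above, not a simulation of the hardware; the function names and the per-lane arrays are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Time for one wavefront to execute an if/else.
 * A path costs its full time as soon as at least one lane takes it. */
double branch_time(const int *takes_a, size_t lanes, double t_a, double t_b)
{
    int any_a = 0, any_b = 0;
    assert(lanes > 0);
    for (size_t i = 0; i < lanes; i++) {
        if (takes_a[i])
            any_a = 1;
        else
            any_b = 1;
    }
    return (any_a ? t_a : 0.0) + (any_b ? t_b : 0.0);
}

/* Time for one wavefront to execute a loop.
 * The wavefront stays resident until its slowest lane finishes. */
double loop_time(const unsigned *trips, size_t lanes, double t_iter)
{
    unsigned max_trips = 0;
    assert(lanes > 0);
    for (size_t i = 0; i < lanes; i++) {
        if (trips[i] > max_trips)
            max_trips = trips[i];
    }
    return (double)max_trips * t_iter;
}
```

With t_a = t_b = t, branch_time returns t when all lanes agree and 2t as soon as one lane diverges. With one lane at 100 iterations and the rest at one, loop_time returns 100t.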


1.4.3 Work-Item Creation

For each work-group, the GPU compute device spawns the required number of

wavefronts on a single compute unit. If there are non-active work-items within a

wavefront, the stream cores that would have been mapped to those work-items

are idle. This occurs when the work-group size is not a multiple of the wavefront
size (for example, with a wavefront size of 64 and a work-group size of 32, the
wavefront is half empty and unused).
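The number of idle lanes in the last wavefront of a work-group follows from the remainder of the division. The helper below is a hypothetical illustration of that arithmetic only.

```c
#include <assert.h>
#include <stddef.h>

/* Lanes left idle in the last wavefront of a work-group. */
size_t idle_lanes(size_t group_size, size_t wavefront_size)
{
    size_t remainder;
    assert(wavefront_size > 0);
    remainder = group_size % wavefront_size;
    return remainder == 0 ? 0 : wavefront_size - remainder;
}
```

For a wavefront size of 64, a work-group of 32 leaves 32 lanes idle, a work-group of 100 leaves 28 lanes idle in its second wavefront, and a work-group of 128 leaves none.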

1.5 Memory Architecture and Access

OpenCL has four memory domains: private, local, global, and constant; the AMD

Accelerated Parallel Processing system also recognizes host (CPU) and PCI

Express® (PCIe®) memory.

• private memory - specific to a work-item; it is not visible to other work-items.

• local memory - specific to a work-group; accessible only by work-items belonging to that work-group.

• global memory - accessible to all work-items executing in a context, as well as to the host (read, write, and map commands).

• constant memory - read-only region for host-allocated and -initialized objects that are not changed during kernel execution.

• host (CPU) memory - host-accessible region for an application’s data structures and program data.

• PCIe memory - part of host (CPU) memory accessible from, and modifiable by, the host program and the GPU compute device. Modifying this memory requires synchronization between the GPU compute device and the CPU.

Figure 1.8 illustrates the interrelationship of the memories. (Note that the

referenced color buffer is a write-only output buffer in a pixel shader that has a

predetermined location based on the pixel location.)


Figure 1.8 Interrelationship of Memory Domains for Southern Islands Devices

Figure 1.9 illustrates the standard dataflow between host (CPU) and GPU.

Figure 1.9 Dataflow between Host and GPU

There are two ways to copy data from the host to the GPU compute device

memory:

• Implicitly, by using clEnqueueMapBuffer and clEnqueueUnmapMemObject.

• Explicitly, through clEnqueueReadBuffer and clEnqueueWriteBuffer (or clEnqueueReadImage and clEnqueueWriteImage for images).


When using these interfaces, it is important to consider the amount of copying

involved. This is a two-copy process: between host and PCIe, and between

PCIe and GPU compute device. This is why there is a large performance

difference between the system GFLOPS and the kernel GFLOPS.

With proper memory transfer management and the use of system pinned

memory (host/CPU memory remapped to the PCIe memory space), copying

between host (CPU) memory and PCIe memory can be skipped. Note that this

is not an easy API call to use and comes with many constraints, such as page

boundary and memory alignment.

Double copying lowers the overall system memory bandwidth. In GPU compute

device programming, pipelining and other techniques help reduce these

bottlenecks. See Chapter 4, Chapter 5, and Chapter 6 for more specifics about

optimization techniques.
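The cost of double copying can be estimated with a simple model. The sketch below assumes the two copies run one after the other with no overlap and ignores latency and API overhead; the function name and any bandwidth values passed to it are hypothetical, not measured figures.

```c
#include <assert.h>

/* Effective bandwidth when the same bytes cross two links back to back.
 * Moving B bytes takes B/bw_first + B/bw_second seconds in total, so
 * the effective rate is B divided by that sum. */
double serial_copy_bandwidth(double bw_first, double bw_second)
{
    assert(bw_first > 0.0 && bw_second > 0.0);
    return 1.0 / (1.0 / bw_first + 1.0 / bw_second);
}
```

Under this model, two stages that each sustain 8 GBps deliver only 4 GBps end to end, which is why skipping one copy through pinned memory, or overlapping the stages by pipelining, can matter.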

1.5.1 Memory Access

Accessing local memory (known as local data store, or LDS, as shown in Figure 1.8)
typically is an order of magnitude faster than accessing global memory (VRAM),
which in turn is an order of magnitude faster than accessing host memory over PCIe.

However, stream cores do not directly access memory; instead, they issue

memory requests through dedicated hardware units. When a work-item tries to

access memory, the work-item is transferred to the appropriate fetch unit. The

work-item then is deactivated until the access unit finishes accessing memory.

Meanwhile, other work-items can be active within the compute unit, contributing

to better performance. The data fetch units handle three basic types of memory

operations: loads, stores, and streaming stores. GPU compute devices can store

writes to random memory locations using global buffers.

1.5.2 Global Buffer

The global buffer lets applications read from, and write to, arbitrary locations in

memory. When using a global buffer, memory-read and memory-write operations

from the stream kernel are done using regular GPU compute device instructions

with the global buffer used as the source or destination for the instruction. The

programming interface is similar to load/store operations used with CPU

programs, where the relative address in the read/write buffer is specified.

1.5.3 Image Read/Write

Image reads are done by addressing the desired location in the input memory

using the fetch unit. The fetch units can process either 1D or 2D addresses.

These addresses can be normalized or un-normalized. Normalized coordinates

are between 0.0 and 1.0 (inclusive). For the fetch units to handle 2D addresses

and normalized coordinates, pre-allocated memory segments must be bound to

the fetch unit so that the correct memory address can be computed. For a single

kernel invocation, up to 128 images can be bound at once for reading, and eight

for writing. The maximum number of 2D addresses is 8192 x 8192.


Image reads are cached through the texture system (corresponding to the L2 and

L1 caches).
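The relation between normalized and un-normalized coordinates can be sketched as follows. This assumes nearest-texel addressing with clamping to the image edge, which is one common convention; the sampler's addressing and filter modes determine the actual mapping on a device. The function name texel_index is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Map a normalized coordinate in [0.0, 1.0] to a texel index in
 * [0, size - 1], assuming nearest-texel addressing with clamping. */
size_t texel_index(float coord, size_t size)
{
    size_t i;
    assert(size > 0);
    if (!(coord > 0.0f))          /* also catches NaN */
        return 0;
    if (coord >= 1.0f)
        return size - 1;
    i = (size_t)(coord * (float)size);   /* truncation equals floor here */
    return i >= size ? size - 1 : i;
}
```

For an image dimension of 8192, the coordinate 0.0 maps to texel 0, 0.5 maps to texel 4096, and 1.0 is clamped to texel 8191.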

1.5.4 Memory Load/Store

When using a global buffer, each work-item can write to an arbitrary location

within the global buffer. Global buffers use a linear memory layout. If consecutive

addresses are written, the compute unit issues a burst write for more efficient

memory access. Only read-only buffers, such as constants, are cached.

1.6 Communication Between Host and GPU in a Compute Device

The following subsections discuss the communication between the host (CPU)

and the GPU in a compute device. This includes an overview of the PCIe bus,

processing API calls, and DMA transfers.

1.6.1 PCI Express Bus

Communication and data transfers between the system and the GPU compute

device occur on the PCIe channel. AMD Accelerated Parallel Processing

graphics cards use PCIe 2.0 x16 (second generation, 16 lanes). Generation 1

x16 has a theoretical maximum throughput of 4 GBps in each direction.

Generation 2 x16 doubles the throughput to 8 GBps in each direction. Southern

Islands AMD GPUs support PCIe 3.0 with a theoretical peak performance of

16 GBps. Actual transfer performance is CPU and chipset dependent.
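The theoretical figures above follow from the per-lane signaling rate and the line encoding. The sketch below uses the published PCIe parameters (2.5 GT/s and 5 GT/s with 8b/10b encoding for generations 1 and 2, 8 GT/s with 128b/130b encoding for generation 3), which are not stated in this guide; the function name is hypothetical.

```c
#include <assert.h>

/* Theoretical one-direction PCIe throughput in GB/s (1 GB = 1e9 bytes).
 * gtransfers_per_s: raw signaling rate per lane in GT/s
 * payload_bits / line_bits: encoding efficiency (8/10 or 128/130) */
double pcie_gbytes_per_s(double gtransfers_per_s, double payload_bits,
                         double line_bits, unsigned lanes)
{
    assert(line_bits > 0.0);
    return gtransfers_per_s * (payload_bits / line_bits) / 8.0 * lanes;
}
```

For 16 lanes this gives 4 GB/s for generation 1, 8 GB/s for generation 2, and about 15.75 GB/s for generation 3, which the text rounds to 16 GBps.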

Transfers from the system to the GPU compute device are done either by the

command processor or by the DMA engine. The GPU compute device also can

read and write system memory directly from the compute unit through kernel

instructions over the PCIe bus.

1.6.2 Processing API Calls: The Command Processor

The host application does not interact with the GPU compute device directly. A

driver layer translates and issues commands to the hardware on behalf of the

application.

Most commands to the GPU compute device are buffered in a command queue

on the host side. The command queue is sent to the GPU compute device, and

the commands are processed by it. There is no guarantee as to when commands

from the command queue are executed, only that they are executed in order.

Unless the GPU compute device is busy, commands are executed immediately.

Command queue elements include:

• Kernel execution calls

• Kernels

• Constants

• Transfers between device and host


1.6.3 DMA Transfers

Direct Memory Access (DMA) memory transfers can be executed separately from

the command queue using the DMA engine on the GPU compute device. DMA

calls are executed immediately; and the order of DMA calls and command queue

flushes is guaranteed.

DMA transfers can occur asynchronously. This means that a DMA transfer is

executed concurrently with other system or GPU compute operations when there

are no dependencies. However, data is not guaranteed to be ready until the DMA

engine signals that the event or transfer is completed. The application can query

the hardware for DMA event completion. If used carefully, DMA transfers are

another source of parallelization.

1.6.4 Masking Visible Devices

By default, OpenCL applications are exposed to all GPUs installed in the system;

this allows applications to use multiple GPUs to run the compute task.

In some cases, the user might want to mask the visibility of the GPUs seen by

the OpenCL application. One example is to dedicate one GPU for regular

graphics operations and the other three (in a four-GPU system) for Compute. To

do that, set the GPU_DEVICE_ORDINAL environment parameter, which is a comma-

separated list variable:

• Under Windows: set GPU_DEVICE_ORDINAL=1,2,3

• Under Linux: export GPU_DEVICE_ORDINAL=1,2,3

Another example is a system with eight GPUs, where two distinct OpenCL

applications are running at the same time. The administrator might want to set

GPU_DEVICE_ORDINAL to 0,1,2,3 for the first application, and 4,5,6,7 for the

second application; thus, partitioning the available GPUs so that both

applications can run at the same time.

1.7 GPU Compute Device Scheduling

GPU compute devices are very efficient at parallelizing large numbers of work-

items in a manner transparent to the application. Each GPU compute device

uses the large number of wavefronts to hide memory access latencies by having

the resource scheduler switch the active wavefront in a given compute unit

whenever the current wavefront is waiting for a memory access to complete.

Hiding memory access latencies requires that each work-item contain a large

number of ALU operations per memory load/store.

Figure 1.10 shows the timing of a simplified execution of work-items in a single

stream core. At time 0, the work-items are queued and waiting for execution. In

this example, only four work-items (T0…T3) are scheduled for the compute unit.

The hardware limit for the number of active work-items is dependent on the

resource usage (such as the number of active registers used) of the program


being executed. An optimally programmed GPU compute device typically has

thousands of active work-items.

Figure 1.10 Simplified Execution Of Work-Items On A Single Stream Core

At runtime, work-item T0 executes until cycle 20; at this time, a stall occurs due

to a memory fetch request. The scheduler then begins execution of the next

work-item, T1. Work-item T1 executes until it stalls or completes. New work-items

execute, and the process continues until the available number of active work-

items is reached. The scheduler then returns to the first work-item, T0.

If the data work-item T0 is waiting for has returned from memory, T0 continues

execution. In the example in Figure 1.10, the data is ready, so T0 continues.

Since there were enough work-items and processing element operations to cover

the long memory latencies, the stream core does not idle. This method of

memory latency hiding helps the GPU compute device achieve maximum

performance.
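The latency-hiding argument can be checked with a round-robin model of one stream core. The sketch assumes every work-item runs for the same number of cycles before stalling and that every memory access has the same latency; both function names and any cycle counts passed to them are hypothetical.

```c
#include <assert.h>

/* Cycles the stream core sits idle per round when n_items work-items
 * each run for `run` cycles and then wait `latency` cycles for memory.
 * While one work-item waits, the other n_items - 1 can execute. */
unsigned idle_cycles(unsigned n_items, unsigned run, unsigned latency)
{
    unsigned long covered;
    assert(n_items > 0);
    covered = (unsigned long)(n_items - 1) * run;
    return covered >= latency ? 0u : (unsigned)(latency - covered);
}

/* Smallest number of work-items that fully hides the latency. */
unsigned items_to_hide_latency(unsigned run, unsigned latency)
{
    assert(run > 0);
    return 1u + (latency + run - 1u) / run;
}
```

With four work-items that each run for 20 cycles, a 60-cycle latency is fully covered, while a 100-cycle latency leaves the core idle for 40 cycles per round and would need six work-items to hide.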

If none of T0 – T3 are runnable, the stream core waits (stalls) until one of T0 –

T3 is ready to execute. In the example shown in Figure 1.11, T0 is the first to

continue execution.


Figure 1.11 Stream Core Stall Due to Data Dependency

The causes for this situation are discussed in the following sections.

1.8 Terminology

1.8.1 Compute Kernel

To define a compute kernel, it is first necessary to define a kernel. A kernel is a

small unit of execution that performs a clearly defined function and that can be

executed in parallel. Such a kernel can be executed on each element of an input

stream (called an NDRange), or simply at each point in an arbitrary index space.

A kernel is analogous, and on some devices identical, to what graphics

programmers call a shader program. This kernel is not to be confused with an

OS kernel, which controls hardware. The most basic form of an NDRange is

simply mapped over input data and produces one output item for each input

tuple. Subsequent extensions of the basic model provide random-access

functionality, variable output counts, and reduction/accumulation operations.

Kernels are specified using the kernel keyword.

A compute kernel is a specific type of kernel that is not part of the traditional

graphics pipeline. The compute kernel type can be used for graphics, but its

strength lies in using it for non-graphics fields such as physics, AI, modeling,

HPC, and various other computationally intensive applications.


1.8.1.1 Work-Item Spawn Order

In a compute kernel, the work-item spawn order is sequential. This means that

on a chip with N work-items per wavefront, the first N work-items go to wavefront
0, the second N work-items go to wavefront 1, etc. Thus, the work-item IDs for
wavefront K are in the range (K•N) to ((K+1)•N) - 1.
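The range formula can be written out directly. The sketch below numbers wavefronts from zero, as the formula implies; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t first;   /* lowest work-item ID in the wavefront  */
    size_t last;    /* highest work-item ID in the wavefront */
} id_range;

/* Work-item IDs covered by wavefront k, for n work-items per wavefront. */
id_range wavefront_ids(size_t k, size_t n)
{
    id_range r;
    assert(n > 0);
    r.first = k * n;
    r.last = (k + 1) * n - 1;
    return r;
}

/* Inverse mapping: the wavefront that holds a given work-item ID. */
size_t wavefront_of(size_t work_item_id, size_t n)
{
    assert(n > 0);
    return work_item_id / n;
}
```

With n = 64, wavefront 0 holds IDs 0 to 63, wavefront 2 holds IDs 128 to 191, and work-item 130 belongs to wavefront 2.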

1.8.2 Wavefronts and Work-groups

Wavefronts and work-groups are two concepts relating to compute kernels that

provide data-parallel granularity. A wavefront executes a number of work-items

in lock step relative to each other. Sixteen work-items are executed in parallel

across the vector unit, and the whole wavefront is covered over four clock cycles.

It is the lowest level that flow control can affect. This means that if two work-items

inside a wavefront take divergent paths of flow control, all work-items in the
wavefront execute both paths.

Grouping is a higher-level granularity of data parallelism that is enforced in

software, not hardware. Synchronization points in a kernel guarantee that all

work-items in a work-group reach that point (barrier) in the code before the next

statement is executed.

Work-groups are composed of wavefronts. Best performance is attained when

the group size is an integer multiple of the wavefront size.

1.8.3 Local Data Store (LDS)

The LDS is a high-speed, low-latency memory private to each compute unit. It is

a full gather/scatter model: a work-group can write anywhere in its allocated

space. This model is unchanged for the AMD Radeon™ HD 7XXX series. The

constraints of the current LDS model are:

1. The LDS size is allocated per work-group. Each work-group specifies how

much of the LDS it requires. The hardware scheduler uses this information

to determine which work-groups can share a compute unit.

2. Data can be shared only among work-items in the same work-group.

3. Memory accesses outside of the work-group result in undefined behavior.
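Constraint 1 implies an upper bound on how many work-groups can be resident on one compute unit at a time. The sketch below considers the LDS limit only; registers and other resources impose their own limits, and the LDS capacity is a parameter here because it is device-specific. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Upper bound on concurrently resident work-groups per compute unit,
 * considering only the LDS each work-group requests. */
size_t groups_per_cu_by_lds(size_t lds_bytes_per_cu, size_t lds_bytes_per_group)
{
    assert(lds_bytes_per_group > 0);
    return lds_bytes_per_cu / lds_bytes_per_group;
}
```

Assuming, for illustration, 32768 bytes of LDS per compute unit, work-groups that each request 8192 bytes can share a compute unit four at a time, while a request of 20000 bytes allows only one.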

1.9 Programming Model

The OpenCL programming model is based on the notion of a host device,

supported by an application API, and a number of devices connected through a

bus. These are programmed using OpenCL C. The host API is divided into

platform and runtime layers. OpenCL C is a C-like language with extensions for

parallel programming such as memory fence operations and barriers. Figure 1.12

illustrates this model with queues of commands, reading/writing data, and

executing kernels for specific devices.


Figure 1.12 OpenCL Programming Model

The devices are capable of running data- and task-parallel work. A kernel can be

executed as a function of multi-dimensional domains of indices. Each element is

called a work-item; the total number of indices is defined as the global work-size.

The global work-size can be divided into sub-domains, called work-groups, and

individual work-items within a group can communicate through global or locally

shared memory. Work-items are synchronized through barrier or fence

operations. Figure 1.12 is a representation of the host/device architecture with a

single platform, consisting of a GPU and a CPU.

An OpenCL application is built by first querying the runtime to determine which

platforms are present. There can be any number of different OpenCL

implementations installed on a single system. The desired OpenCL platform can

be selected by matching the platform vendor string to the desired vendor name,

such as “Advanced Micro Devices, Inc.” The next step is to create a context. As

shown in Figure 1.12, an OpenCL context has associated with it a number of

compute devices (for example, CPU or GPU devices). Within a context, OpenCL

guarantees a relaxed consistency between these devices. This means that

memory objects, such as buffers or images, are allocated per context; but

changes made by one device are only guaranteed to be visible by another device

at well-defined synchronization points. For this, OpenCL provides events, with the

ability to synchronize on a given event to enforce the correct order of execution.

Many operations are performed with respect to a given context; there also are

many operations that are specific to a device. For example, program compilation

and kernel execution are done on a per-device basis. Performing work with a

device, such as executing kernels or moving data to and from the device’s local

memory, is done using a corresponding command queue. A command queue is

associated with a single device and a given context; all work for a specific device

is done through this interface. Note that while a single command queue can be

associated with only a single device, there is no limit to the number of command

queues that can point to the same device. For example, it is possible to have


one command queue for executing kernels and a command queue for managing

data transfers between the host and the device.

Most OpenCL programs follow the same pattern. Given a specific platform, select

a device or devices to create a context, allocate memory, create device-specific

command queues, and perform data transfers and computations. Generally, the

platform is the gateway to accessing specific devices; given these devices and

a corresponding context, the application is independent of the platform. Given a

context, the application can:

• Create one or more command queues.

• Create programs to run on one or more associated devices.

• Create kernels within those programs.

• Allocate memory buffers or images, either on the host or on the device(s). (Memory can be copied between the host and device.)

• Write data to the device.

• Submit the kernel (with appropriate arguments) to the command queue for execution.

• Read data back to the host from the device.

The relationship between context(s), device(s), buffer(s), program(s), kernel(s),

and command queue(s) is best seen by looking at sample code.

1.10 Example Programs

The following subsections provide simple programming examples with

explanatory comments.

1.10.1 First Example: Simple Buffer Write

This sample shows a minimalist OpenCL C program that sets a given buffer to

some value. It illustrates the basic programming steps with a minimum amount

of code. This sample contains no error checks and the code is not generalized.

Yet, many simple test programs might look very similar. The entire code for this

sample is provided at the end of this section.

1. The host program must select a platform, which is an abstraction for a given

OpenCL implementation. Implementations by multiple vendors can coexist on

a host, and the sample uses the first one available.

2. A device id for a GPU device is requested. A CPU device could be requested

by using CL_DEVICE_TYPE_CPU instead. The device can be a physical device,

such as a given GPU, or an abstracted device, such as the collection of all

CPU cores on the host.

3. On the selected device, an OpenCL context is created. A context ties

together a device, memory buffers related to that device, OpenCL programs,

and command queues. Note that buffers related to a device can reside on


either the host or the device. Many OpenCL programs have only a single

context, program, and command queue.

4. Before an OpenCL kernel can be launched, its program source is compiled,

and a handle to the kernel is created.

5. A memory buffer is allocated in the context.

6. The kernel is launched. While it is necessary to specify the global work size,

OpenCL determines a good local work size for this device. Since the kernel

was launched asynchronously, clFinish() is used to wait for completion.

7. The data is mapped to the host for examination. Calling

clEnqueueMapBuffer ensures the visibility of the buffer on the host, which in

this case probably includes a physical transfer. Alternatively, we could use

clEnqueueReadBuffer(), which requires a pre-allocated host-side buffer.

Example Code 1 –

// A minimalist OpenCL program.

#include <CL/cl.h>
#include <stdio.h>

#define NWITEMS 512

// A simple memset kernel
const char *source =
"__kernel void memset( __global uint *dst )             \n"
"{                                                      \n"
"    dst[get_global_id(0)] = get_global_id(0);          \n"
"}                                                      \n";

int main(int argc, char ** argv)
{
    // 1. Get a platform.
    cl_platform_id platform;
    clGetPlatformIDs( 1, &platform, NULL );

    // 2. Find a gpu device (request one entry).
    cl_device_id device;
    clGetDeviceIDs( platform, CL_DEVICE_TYPE_GPU,
                    1,
                    &device,
                    NULL);

    // 3. Create a context and command queue on that device.
    cl_context context = clCreateContext( NULL,
                                          1,
                                          &device,
                                          NULL, NULL, NULL);

    cl_command_queue queue = clCreateCommandQueue( context,
                                                   device,
                                                   0, NULL );

    // 4. Perform runtime source compilation, and obtain kernel entry point.
    cl_program program = clCreateProgramWithSource( context,
                                                    1,
                                                    &source,
                                                    NULL, NULL );
    clBuildProgram( program, 1, &device, NULL, NULL, NULL );
    cl_kernel kernel = clCreateKernel( program, "memset", NULL );

    // 5. Create a data buffer.
    cl_mem buffer = clCreateBuffer( context,
                                    CL_MEM_WRITE_ONLY,
                                    NWITEMS * sizeof(cl_uint),
                                    NULL, NULL );

    // 6. Launch the kernel (one dimension). Let OpenCL pick the local work size.
    size_t global_work_size = NWITEMS;
    clSetKernelArg(kernel, 0, sizeof(buffer), (void*) &buffer);
    clEnqueueNDRangeKernel( queue,
                            kernel,
                            1,
                            NULL,
                            &global_work_size,
                            NULL, 0, NULL, NULL);
    clFinish( queue );

    // 7. Look at the results via synchronous buffer map (offset 0).
    cl_uint *ptr;
    ptr = (cl_uint *) clEnqueueMapBuffer( queue,
                                          buffer,
                                          CL_TRUE,
                                          CL_MAP_READ,
                                          0,
                                          NWITEMS * sizeof(cl_uint),
                                          0, NULL, NULL, NULL );

    int i;
    for(i=0; i < NWITEMS; i++)
        printf("%d %d\n", i, ptr[i]);

    return 0;
}


1.10.2 Example: Parallel Min() Function

This medium-complexity sample shows how to implement an efficient parallel

min() function.

The code is written so that it performs very well on either CPU or GPU. The

number of threads launched depends on how many hardware processors are

available. Each thread walks the source buffer, using a device-optimal access

pattern selected at runtime. A multi-stage reduction using __local and __global

atomics produces the single result value.

The sample includes a number of programming techniques useful for simple

tests. Only minimal error checking and resource tear-down are used.

Runtime Code –

1. The source memory buffer is allocated, and initialized with a random pattern.

Also, the actual min() value for this data set is serially computed, in order to

later verify the parallel result.

2. The compiler is instructed to dump the intermediate IL and ISA files for

further analysis.

3. The main section of the code, including device setup, CL data buffer creation,

and code compilation, is executed for each device, in this case for CPU and

GPU. Since the source memory buffer exists on the host, it is shared. All

other resources are device-specific.

4. The global work size is computed for each device. A simple heuristic is used

to ensure an optimal number of threads on each device. For the CPU, a

given CL implementation can translate one work-item per CL compute unit

into one thread per CPU core.

On the GPU, an initial multiple of the wavefront size is used, which is

adjusted to ensure even divisibility of the input data over all threads. The

value of 7 is a minimum value to keep all independent hardware units of the

compute units busy, and to provide a minimum amount of memory latency

hiding for a kernel with little ALU activity.

5. After the kernels are built, the code prints errors that occurred during kernel

compilation and linking.

6. The main loop is set up so that the measured timing reflects the actual kernel

performance. If a sufficiently large NLOOPS is chosen, effects from kernel

launch time and delayed buffer copies to the device by the CL runtime are

minimized. Note that while only a single clFinish() is executed at the end

of the timing run, the two kernels are always linked using an event to ensure

serial execution.

The bandwidth is expressed as “number of input bytes processed.” For high-

end graphics cards, the bandwidth of this algorithm is about an order of

magnitude higher than that of the CPU, due to the parallelized memory

subsystem of the graphics card.


7. The results then are checked against the comparison value. This also

establishes that the result is the same on both CPU and GPU, which can

serve as the first verification test for newly written kernel code.

8. Note the use of the debug buffer to obtain some runtime variables. Debug

buffers also can be used to create short execution traces for each thread,

assuming the device has enough memory.

9. You can use the Timer.cpp and Timer.h files from the TransferOverlap

sample, which is in the SDK samples.

Kernel Code –

10. The code uses four-component vectors (uint4) so the compiler can identify

concurrent execution paths as often as possible. On the GPU, this can be

used to further optimize memory accesses and distribution across ALUs. On

the CPU, it can be used to enable SSE-like execution.

11. The kernel sets up a memory access pattern based on the device. For the

CPU, the source buffer is chopped into contiguous buffers: one per thread.

Each CPU thread serially walks through its buffer portion, which results in

good cache and prefetch behavior for each core.

On the GPU, each thread walks the source buffer using a stride of the total

number of threads. As many threads are executed in parallel, the result is a

maximally coalesced memory pattern requested from the memory back-end.

For example, if each compute unit has 16 physical processors, 16 uint4

requests are produced in parallel, per clock, for a total of 256 bytes per clock.

12. The kernel code uses a reduction consisting of three stages: __global to

__private, __private to __local, which is flushed to __global, and finally

__global to __global. In the first loop, each thread walks __global

memory, and reduces all values into a min value in __private memory

(typically, a register). This is the bulk of the work, and is mainly bound by

__global memory bandwidth. The subsequent reduction stages are brief in

comparison.

13. Next, all per-thread minimum values inside the work-group are reduced to a

__local value, using an atomic operation. Access to the __local value is

serialized; however, the number of these operations is very small compared

to the work of the previous reduction stage. The threads within a work-group

are synchronized through a local barrier(). The reduced min value is

stored in __global memory.

14. After all work-groups are finished, a second kernel reduces all work-group

values into a single value in __global memory, using an atomic operation.

This is a minor contributor to the overall runtime.
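The access patterns and reduction stages described in the kernel steps above can be modeled serially on the host. The sketch below is not the sample's kernel: it replaces work-items with loops and the atomic operations with plain comparisons, and the function name and parameters are hypothetical. It assumes the item count divides evenly by four times the global size, as the sample's work-size heuristic is meant to ensure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Serial model of the three-stage min() reduction.
 * src holds nitems 32-bit values, read as 4-component vectors.
 * is_cpu != 0: each work-item walks its own contiguous block.
 * is_cpu == 0: each work-item strides by the global size. */
uint32_t model_parallel_min(const uint32_t *src, size_t nitems,
                            size_t gsize, size_t lsize, int is_cpu)
{
    size_t count, group;
    uint32_t gmin = UINT32_MAX;

    assert(gsize > 0 && lsize > 0 && gsize % lsize == 0);
    assert(nitems % (4 * gsize) == 0);
    count = (nitems / 4) / gsize;      /* vectors per work-item */

    for (group = 0; group < gsize / lsize; group++) {
        uint32_t lmin = UINT32_MAX;    /* per-work-group value */
        size_t lid;
        for (lid = 0; lid < lsize; lid++) {
            size_t gid = group * lsize + lid;
            size_t idx = is_cpu ? gid * count : gid;
            size_t stride = is_cpu ? 1 : gsize;
            uint32_t pmin = UINT32_MAX;    /* per-work-item value */
            size_t n, c;
            /* Stage 1: global to private. */
            for (n = 0; n < count; n++, idx += stride)
                for (c = 0; c < 4; c++)
                    if (src[4 * idx + c] < pmin)
                        pmin = src[4 * idx + c];
            /* Stage 2: private to local (atomic min in the kernel). */
            if (pmin < lmin)
                lmin = pmin;
        }
        /* Stage 3: per-group values to one global value. */
        if (lmin < gmin)
            gmin = lmin;
    }
    return gmin;
}
```

Both access patterns visit every vector exactly once, so the model returns the same minimum for either setting of is_cpu; only the order of the memory accesses differs.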


Example Code 3 –

#include <CL/cl.h>

#include <stdio.h>

#include <stdlib.h>

#include <time.h>

#include "Timer.h"

#define NDEVS 2

// A parallel min() kernel that works well on CPU and GPU

const char *kernel_source =

" \n"

"#pragma OPENCL EXTENSION cl_khr_local_int32_extended_atomics : enable \n"

"#pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable \n"

" \n"

" // 10. The source buffer is accessed as 4-vectors. \n"

" \n"

"__kernel void minp( __global uint4 *src, \n"

" __global uint *gmin, \n"

" __local uint *lmin, \n"

" __global uint *dbg, \n"

" int nitems, \n"

" uint dev ) \n"

"{ \n"

" // 11. Set up __global memory access pattern. \n"

" \n"

" uint count = ( nitems / 4 ) / get_global_size(0); \n"

" uint idx = (dev == 0) ? get_global_id(0) * count \n"

" : get_global_id(0); \n"

" uint stride = (dev == 0) ? 1 : get_global_size(0); \n"

" uint pmin = (uint) -1; \n"

" \n"

" // 11. First, compute private min, for this work-item. \n"

" \n"

" for( int n=0; n < count; n++, idx += stride ) \n"

" { \n"

" pmin = min( pmin, src[idx].x ); \n"

" pmin = min( pmin, src[idx].y ); \n"

" pmin = min( pmin, src[idx].z ); \n"

" pmin = min( pmin, src[idx].w ); \n"

" } \n"

" \n"

" // 12. Reduce min values inside work-group. \n"

" \n"

" if( get_local_id(0) == 0 ) \n"

" lmin[0] = (uint) -1; \n"

" \n"

" barrier( CLK_LOCAL_MEM_FENCE ); \n"

" \n"

" (void) atom_min( lmin, pmin ); \n"

" \n"

" barrier( CLK_LOCAL_MEM_FENCE ); \n"

" \n"

" // Write out to __global. \n"

" \n"

" if( get_local_id(0) == 0 ) \n"

" gmin[ get_group_id(0) ] = lmin[0]; \n"

" \n"

" // Dump some debug information. \n"

" \n"

" if( get_global_id(0) == 0 ) \n"

" { \n"

" dbg[0] = get_num_groups(0); \n"

" dbg[1] = get_global_size(0); \n"

" dbg[2] = count; \n"

" dbg[3] = stride; \n"

" } \n"

"} \n"

" \n"

"// 13. Reduce work-group min values from __global to __global. \n"

" \n"

"__kernel void reduce( __global uint4 *src, \n"

" __global uint *gmin ) \n"

"{ \n"

" (void) atom_min( gmin, gmin[get_global_id(0)] ) ; \n"

"} \n";

int main(int argc, char ** argv)

{

cl_platform_id platform;

int dev, nw;

cl_device_type devs[NDEVS] = { CL_DEVICE_TYPE_CPU,

CL_DEVICE_TYPE_GPU };

cl_uint *src_ptr;

unsigned int num_src_items = 4096*4096;

// 1. quick & dirty MWC random init of source buffer.

// Random seed (portable).

time_t ltime;

time(&ltime);

src_ptr = (cl_uint *) malloc( num_src_items * sizeof(cl_uint) );

cl_uint a = (cl_uint) ltime,

b = (cl_uint) ltime;

cl_uint min = (cl_uint) -1;

// Do serial computation of min() for result verification.

for( int i=0; i < num_src_items; i++ )

{

src_ptr[i] = (cl_uint) (b = ( a * ( b & 65535 )) + ( b >> 16 ));

min = src_ptr[i] < min ? src_ptr[i] : min;

}

// Get a platform.

clGetPlatformIDs( 1, &platform, NULL );

// 3. Iterate over devices.

for(dev=0; dev < NDEVS; dev++)

{

cl_device_id device;

cl_context context;

cl_command_queue queue;

cl_program program;

cl_kernel minp;

cl_kernel reduce;

cl_mem src_buf;

cl_mem dst_buf;

cl_mem dbg_buf;

cl_uint *dst_ptr,

*dbg_ptr;

printf("\n%s: ", dev == 0 ? "CPU" : "GPU");

// Find the device.

clGetDeviceIDs( platform,

devs[dev],

1,

&device,

NULL);

// 4. Compute work sizes.

cl_uint compute_units;

size_t global_work_size;

size_t local_work_size;

size_t num_groups;

clGetDeviceInfo( device,

CL_DEVICE_MAX_COMPUTE_UNITS,

sizeof(cl_uint),

&compute_units,

NULL);

if( devs[dev] == CL_DEVICE_TYPE_CPU )

{

global_work_size = compute_units * 1; // 1 thread per core

local_work_size = 1;

}

else

{

cl_uint ws = 64;

global_work_size = compute_units * 7 * ws; // 7 wavefronts per SIMD

while( (num_src_items / 4) % global_work_size != 0 )

global_work_size += ws;

local_work_size = ws;

}

num_groups = global_work_size / local_work_size;

// Create a context and command queue on that device.

context = clCreateContext( NULL,

1,

&device,

NULL, NULL, NULL);

queue = clCreateCommandQueue(context,

device,

0, NULL);

// Minimal error check.

if( queue == NULL )

{

printf("Compute device setup failed\n");

return(-1);

}

// Perform runtime source compilation, and obtain kernel entry point.

program = clCreateProgramWithSource( context,

1,

&kernel_source,

NULL, NULL );

// Tell compiler to dump intermediate .il and .isa GPU files.

cl_int ret = clBuildProgram( program, 1, &device, "-save-temps", NULL, NULL );

// 5. Print compiler error messages

if(ret != CL_SUCCESS)

{

printf("clBuildProgram failed: %d\n", ret);

char buf[0x10000];

clGetProgramBuildInfo( program,

device,

CL_PROGRAM_BUILD_LOG,

0x10000,

buf,

NULL);

printf("\n%s\n", buf);

return(-1);

}

minp = clCreateKernel( program, "minp", NULL );

reduce = clCreateKernel( program, "reduce", NULL );

// Create input, output and debug buffers.

src_buf = clCreateBuffer( context,

CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,

num_src_items * sizeof(cl_uint),

src_ptr,

NULL );

dst_buf = clCreateBuffer( context,

CL_MEM_READ_WRITE,

num_groups * sizeof(cl_uint),

NULL, NULL );

dbg_buf = clCreateBuffer( context,

CL_MEM_WRITE_ONLY,

global_work_size * sizeof(cl_uint),

NULL, NULL );

clSetKernelArg(minp, 0, sizeof(void *), (void*) &src_buf);

clSetKernelArg(minp, 1, sizeof(void *), (void*) &dst_buf);

clSetKernelArg(minp, 2, 1*sizeof(cl_uint), (void*) NULL);

clSetKernelArg(minp, 3, sizeof(void *), (void*) &dbg_buf);

clSetKernelArg(minp, 4, sizeof(num_src_items), (void*) &num_src_items);

clSetKernelArg(minp, 5, sizeof(dev), (void*) &dev);

clSetKernelArg(reduce, 0, sizeof(void *), (void*) &src_buf);

clSetKernelArg(reduce, 1, sizeof(void *), (void*) &dst_buf);

CPerfCounter t;

t.Reset();

t.Start();

// 6. Main timing loop.

#define NLOOPS 500

cl_event ev;

int nloops = NLOOPS;

while(nloops--)

{

clEnqueueNDRangeKernel( queue,

minp,

1,

NULL,

&global_work_size,

&local_work_size,

0, NULL, &ev);

clEnqueueNDRangeKernel( queue,

reduce,

1,

NULL,

&num_groups,

NULL, 1, &ev, NULL);

}

clFinish( queue );

t.Stop();

printf("B/W %.2f GB/sec, ", ((float) num_src_items *

sizeof(cl_uint) * NLOOPS) /

t.GetElapsedTime() / 1e9 );

// 7. Look at the results via synchronous buffer map.

dst_ptr = (cl_uint *) clEnqueueMapBuffer( queue,

dst_buf,

CL_TRUE,

CL_MAP_READ,

0,

num_groups * sizeof(cl_uint),

0, NULL, NULL, NULL );

dbg_ptr = (cl_uint *) clEnqueueMapBuffer( queue,

dbg_buf,

CL_TRUE,

CL_MAP_READ,

0,

global_work_size *

sizeof(cl_uint),

0, NULL, NULL, NULL );

// 8. Print some debug info.

printf("%d groups, %d threads, count %d, stride %d\n", dbg_ptr[0],

dbg_ptr[1],

dbg_ptr[2],

dbg_ptr[3] );

if( dst_ptr[0] == min )

printf("result correct\n");

else

printf("result INcorrect\n");

}

printf("\n");

return 0;

}

Chapter 2

Building and Running OpenCL Programs

The compiler tool-chain provides a common framework for both CPUs and

GPUs, sharing the front-end and some high-level compiler transformations. The

back-ends are optimized for the device type (CPU or GPU). Figure 2.1 is a high-

level diagram showing the general compilation path of applications using

OpenCL. Functions of an application that benefit from acceleration are re-written

in OpenCL and become the OpenCL source. The code calling these functions

is changed to use the OpenCL API. The rest of the application remains

unchanged. The kernels are compiled by the OpenCL compiler to either CPU

binaries or GPU binaries, depending on the target device.

Figure 2.1 OpenCL Compiler Toolchain

For CPU processing, the OpenCL runtime uses the LLVM AS to generate x86

binaries. The OpenCL runtime automatically determines the number of

processing elements, or cores, present in the CPU and distributes the OpenCL

kernel between them.

For GPU processing, the OpenCL runtime post-processes the incomplete AMD
IL from the OpenCL compiler and turns it into complete AMD IL. This adds
macros (from a macro database, similar to the built-in library) specific to the
GPU. The OpenCL Runtime layer then removes unneeded functions and passes
the complete IL to the CAL compiler for compilation to GPU-specific binaries.

(Figure 2.1 shows, for the OpenCL compiler: OpenCL source → front-end → LLVM IR →
linker, with the built-in library → LLVM optimizer → LLVM IR → LLVM AS for the CPU,
or AMD IL for the GPU.)

2.1 Compiling the Program

An OpenCL application consists of a host program (C/C++) and an optional

kernel program (.cl). To compile an OpenCL application, the host program must

be compiled; this can be done using an off-the-shelf compiler such as g++ or

MSVC++. The application kernels are compiled into device-specific binaries

using the OpenCL compiler.

This compiler uses a standard C front-end, as well as the low-level virtual

machine (LLVM) framework, with extensions for OpenCL. The compiler starts

with the OpenCL source that the user program passes through the OpenCL

runtime interface (Figure 2.1). The front-end translates the OpenCL source to

LLVM IR. It keeps OpenCL-specific information as metadata structures. (For

example, to debug kernels, the front end creates metadata structures to hold the

debug information; also, a pass is inserted to translate this into LLVM debug

nodes, which includes the line numbers and source code mapping.) The front-

end supports additional data-types (int4, float8, etc.), additional keywords (kernel,

global, etc.) and built-in functions (get_global_id(), barrier(), etc.). Also, it

performs additional syntactic and semantic checks to ensure the kernels meet

the OpenCL specification. The input to the LLVM linker is the output of the front-

end and the library of built-in functions. This links in the built-in OpenCL functions

required by the source and transfers the data to the optimizer, which outputs

optimized LLVM IR.

For GPU processing, the LLVM IR-to-CAL IL module receives LLVM IR and

generates optimized IL for a specific GPU type in an incomplete format, which is

passed to the OpenCL runtime, along with some metadata for the runtime layer

to finish processing.

For CPU processing, LLVM AS generates x86 binary.

2.1.1 Compiling on Windows

Compiling OpenCL applications on Windows requires that Visual Studio 2008

Professional Edition (or later) or the Intel C (C++) compiler is installed. All C++

files must be added to the project, which must have the following settings.

•Project Properties → C/C++ → Additional Include Directories

These must include $(ATISTREAMSDKROOT)/include for OpenCL headers.

Optionally, they can include $(ATISTREAMSDKSAMPLESROOT)/include for

SDKUtil headers.

•Project Properties → C/C++ → Preprocessor Definitions

These must define ATI_OS_WIN.

•Project Properties → Linker → Additional Library Directories

These must include $(ATISTREAMSDKROOT)/lib/x86 for OpenCL libraries.

Optionally, they can include $(ATISTREAMSDKSAMPLESROOT)/lib/x86 for

SDKUtil libraries.

•Project Properties → Linker → Input → Additional Dependencies

These must include OpenCL.lib. Optionally, they can include SDKUtil.lib.

2.1.2 Compiling on Linux

Compiling OpenCL applications on Linux requires that gcc or the Intel C

compiler is installed. There are two major steps: compiling and linking.

1. Compile all the C++ files (Template.cpp), and get the object files.

For 32-bit object files on a 32-bit system, or 64-bit object files on 64-bit

system:

g++ -o Template.o -DATI_OS_LINUX -c Template.cpp -I$ATISTREAMSDKROOT/include

For building 32-bit object files on a 64-bit system:

g++ -m32 -o Template.o -DATI_OS_LINUX -c Template.cpp -I$ATISTREAMSDKROOT/include

2. Link all the object files generated in the previous step to the OpenCL library

and create an executable.

For linking to a 64-bit library:

g++ -o Template Template.o -lOpenCL -L$ATISTREAMSDKROOT/lib/x86_64

For linking to a 32-bit library (on a 64-bit system, also pass -m32 to g++):

g++ -o Template Template.o -lOpenCL -L$ATISTREAMSDKROOT/lib/x86

The OpenCL samples in the SDK provided by AMD Accelerated Parallel

Processing depend on the SDKUtil library. In Linux, the samples use the shipped

SDKUtil.lib, whether or not the sample is built for release or debug. When

compiling all samples from the samples/opencl folder, the SDKUtil.lib is

created first; then, the samples use this generated library. When compiling the

SDKUtil library, the created library replaces the shipped library.

The following are linking options if the samples depend on the SDKUtil Library

(assuming the SDKUtil library is created in $ATISTREAMSDKROOT/lib/x86_64 for

64-bit libraries, or $ATISTREAMSDKROOT/lib/x86 for 32-bit libraries).

g++ -o Template Template.o -lSDKUtil -lOpenCL -L$ATISTREAMSDKROOT/lib/x86_64

g++ -o Template Template.o -lSDKUtil -lOpenCL -L$ATISTREAMSDKROOT/lib/x86

2.1.3 Supported Standard OpenCL Compiler Options

The currently supported options are:

•-I dir — Add the directory dir to the list of directories to be searched for
header files. When parsing #include directives, the OpenCL compiler
resolves relative paths using the current working directory of the application.

•-D name — Predefine name as a macro, with definition = 1. For
-D name=definition, the contents of definition are tokenized and processed
as if they appeared during translation phase three in a #define directive.
In particular, the definition is truncated by embedded newline characters.

-D options are processed in the order they are given in the options argument
to clBuildProgram.

2.1.4 AMD-Developed Supplemental Compiler Options

The following supported options are not part of the OpenCL specification:

•-g — This is an experimental feature that lets you use the GNU project

debugger, GDB, to debug kernels on x86 CPUs running Linux or

cygwin/minGW under Windows. For more details, see Chapter 3, “Debugging

OpenCL.” This option does not affect the default optimization of the OpenCL

code.

•-O0 — Specifies to the compiler not to optimize. This is equivalent to the

OpenCL standard option -cl-opt-disable.

•-f[no-]bin-source — Does [not] generate OpenCL source in the .source

section. For more information, see Appendix E, “OpenCL Binary Image

Format (BIF) v2.0.”

•-f[no-]bin-llvmir — Does [not] generate LLVM IR in the .llvmir section.

For more information, see Appendix E, “OpenCL Binary Image Format (BIF)

v2.0.”

•-f[no-]bin-amdil — Does [not] generate AMD IL in the .amdil section.

For more information, see Appendix E, “OpenCL Binary Image Format (BIF)

v2.0.”

•-f[no-]bin-exe — Does [not] generate the executable (ISA) in .text

section. For more information, see Appendix E, “OpenCL Binary Image

Format (BIF) v2.0.”

•-save-temps[=<prefix>] — This option dumps intermediate temporary
files, such as IL and ISA code, for each OpenCL kernel. If <prefix> is not
given, temporary files are saved in the default temporary directory (the
current directory for Linux, C:\Users\<user>\AppData\Local for Windows).
If <prefix> is given, those temporary files are saved with the given
<prefix>. If <prefix> is an absolute path prefix, such as
C:\your\work\dir\mydumpprefix, those temporaries are saved under
C:\your\work\dir, with mydumpprefix as prefix to all temporary names. For
example,

-save-temps
    under the default directory
    _temp_nn_xxx_yyy.il, _temp_nn_xxx_yyy.isa

-save-temps=aaa
    under the default directory
    aaa_nn_xxx_yyy.il, aaa_nn_xxx_yyy.isa

-save-temps=C:\you\dir\bbb
    under C:\you\dir
    bbb_nn_xxx_yyy.il, bbb_nn_xxx_yyy.isa

where xxx and yyy are the device name and kernel name for this build,
respectively, and nn is an internal number that identifies a build, to avoid
overwriting temporary files. Note that this naming convention is subject to
change.

To avoid source changes, there are two environment variables that can be used

to change CL options during the runtime.

•AMD_OCL_BUILD_OPTIONS — Overrides the CL options specified in

clBuildProgram().

•AMD_OCL_BUILD_OPTIONS_APPEND — Appends options to those specified in

clBuildProgram().

2.2 Running the Program

The runtime system assigns the work in the command queues to the underlying

devices. Commands are placed into the queue using the clEnqueue commands

shown in the listing below.

OpenCL API Function             Description

clCreateCommandQueue()          Create a command queue for a specific device
                                (CPU, GPU).

clCreateProgramWithSource(),    Create a program object using the source code
clCreateProgramWithBinary()     of the application kernels.

clBuildProgram()                Compile and link to create a program executable
                                from the program source or binary.

clCreateKernel()                Creates a kernel object from the program object.

clCreateBuffer()                Creates a buffer object for use via OpenCL
                                kernels.

clSetKernelArg(),               Set the kernel arguments, and enqueue the kernel
clEnqueueNDRangeKernel()        in a command queue.

clEnqueueReadBuffer(),          Enqueue a command in a command queue to read
clEnqueueWriteBuffer()          from a buffer object to host memory, or write to
                                the buffer object from host memory.

clEnqueueWaitForEvents()        Wait for the specified events to complete.

The commands can be broadly classified into three categories.

•Kernel commands (for example, clEnqueueNDRangeKernel(), etc.),

•Memory commands (for example, clEnqueueReadBuffer(), etc.), and

•Event commands (for example, clEnqueueWaitForEvents(), etc.).

As illustrated in Figure 2.2, the application can create multiple command queues

(some in libraries, for different components of the application, etc.). These

queues are muxed into one queue per device type. The figure shows command

queues 1 and 3 merged into one CPU device queue (blue arrows); command

queue 2 (and possibly others) are merged into the GPU device queue (red

arrow). The device queue then schedules work onto the multiple compute

resources present in the device. Here, K = kernel commands, M = memory

commands, and E = event commands.

2.2.1 Running Code on Windows

The following steps ensure the execution of OpenCL applications on Windows.

1. The path to OpenCL.lib ($ATISTREAMSDKROOT/lib/x86) must be included in

the PATH environment variable.

2. Generally, the path to the kernel file (Template_Kernel.cl) specified in the

host program is relative to the executable. Unless an absolute path is

specified, the kernel file must be in the same directory as the executable.

Figure 2.2 Runtime Processing Structure

(Figure 2.2 shows command queues 1, 2, and 3 in the programming layer feeding
one device command queue for the CPU and one for the GPU; a scheduler then
dispatches the kernel (K), memory (M), and event (E) commands to CPU cores 1
and 2 and GPU cores 1 and 2.)

2.2.2 Running Code on Linux

The following steps ensure the execution of OpenCL applications on Linux.

1. The path to libOpenCL.so ($ATISTREAMSDKROOT/lib/x86) must be included

in $LD_LIBRARY_PATH.

2. /usr/lib/OpenCL/vendors/ must have libatiocl32.so and/or

libatiocl64.so.

3. Generally, the path to the kernel file (Template_Kernel.cl) specified in the

host program is relative to the executable. Unless an absolute path is

specified, the kernel file must be in the same directory as the executable.

2.3 Calling Conventions

For all Windows platforms, the __stdcall calling convention is used. Function

names are undecorated.

For Linux, the calling convention is __cdecl.

Chapter 3

Debugging OpenCL

This chapter discusses how to debug OpenCL programs running on AMD

Accelerated Parallel Processing GPU and CPU compute devices. The first,

preferred, method is to debug with the AMD gDEBugger, as described in

Section 3.1, “AMD gDEBugger.” The second method, described in Section 3.2,

“Debugging CPU Kernels with GDB,” is to use experimental features provided by

AMD Accelerated Parallel Processing (GNU project debugger, GDB) to debug

kernels on x86 CPUs running Linux or cygwin/minGW under Windows.

3.1 AMD gDEBugger

gDEBugger 6.2 is available as an extension to Microsoft® Visual Studio®, as a

stand-alone version for Windows, and as a stand-alone version for Linux.

gDEBugger offers real-time OpenCL kernel debugging and memory analysis on

GPU devices, allowing developers to access the kernel execution directly from

the API call that issues it, debug inside the kernel, and view all variable values

across the different work-groups and work-items. For Microsoft® Visual Studio®,

it also provides OpenGL debugging and memory analysis. For information on

downloading and installing gDEBugger, see:

http://developer.amd.com/tools/gDEBugger/Pages/default.aspx

After installing gDEBugger for Visual Studio, launch Visual Studio, and open the

solution to be worked on. In the Visual Studio menu bar, note the new

gDEBugger menu, which contains all the required controls.

Select a Visual C/C++ project, and set its debugging properties as normal. To

add a breakpoint, either select New gDEBugger Breakpoint from the gDEBugger

menu, or navigate to a kernel file used in the application and set a breakpoint on

the desired source line. Then, select Launch OpenCL/OpenGL Debugging

from the gDEBugger menu to start debugging.

gDEBugger currently supports only API-level debugging and OpenCL kernel

debugging; stepping through C/C++ code is not yet possible. However, the

C/C++ call stack can be seen in the Visual Studio call stack view, which shows

what led to the API function call.

To start kernel debugging, you can choose one of several options; one of these

is to Step Into (F11) the appropriate clEnqueueNDRangeKernel function call.

Once the kernel starts executing, debug it like C/C++ code, stepping into, out of,

or over function calls in the kernel, setting source breakpoints, and inspecting the

locals, autos, watch, and call stack views.

To view OpenCL and OpenGL objects and their information, use the gDEBugger

Explorer and gDEBugger Properties view. Additional views and features provide

more detailed and advanced information on the OpenCL and OpenGL runtimes,

their states, and the objects created within them.

For further information and more detailed usage instructions, see the gDEBugger

User Guide:

http://developer.amd.com/tools/gDEBugger/webhelp/index.html

or the online help provided with gDEBugger.

3.2 Debugging CPU Kernels with GDB

This section describes an experimental feature for using the GNU project

debugger, GDB, to debug kernels on x86 CPUs running Linux or cygwin/minGW

under Windows.

3.2.1 Setting the Environment

The OpenCL program to be debugged is first compiled by passing the “-g -O0”

(or “-g -cl-opt-disable”) option to the compiler through the options string to

clBuildProgram. For example, using the C++ API:

err = program.build(devices,"-g -O0");

To avoid source changes, set the environment variable as follows:

AMD_OCL_BUILD_OPTIONS_APPEND="-g -O0" or

AMD_OCL_BUILD_OPTIONS="-g -O0"

Section 3.2.3 shows a sample GDB session that debugs the kernel of the

BitonicSort sample. Ensure that

the program is configured to be executed on the CPU. It is important to set

CPU_MAX_COMPUTE_UNITS=1. This ensures that the program is executed

deterministically.

3.2.2 Setting the Breakpoint in an OpenCL Kernel

To set a breakpoint, use:

b [N | function | kernel_name]

where N is the line number in the source code, function is the function name,
and kernel_name is constructed as follows: if the name of the kernel is
bitonicSort, the kernel_name is __OpenCL_bitonicSort_kernel.

Note that if no breakpoint is set, the program does not stop until execution is

complete.

Also note that OpenCL kernel symbols are not visible in the debugger until the

kernel is loaded. A simple way to check for known OpenCL symbols is to set a

breakpoint in the host code at clEnqueueNDRangeKernel, and to use the GDB

info functions __OpenCL command, as shown in the example below.

3.2.3 Sample GDB Session

The following is a sample debugging session. Note that two separate breakpoints

are set. The first is set in the host code, at clEnqueueNDRangeKernel(). The

second breakpoint is set at the actual CL kernel function.

$ export AMD_OCL_BUILD_OPTIONS_APPEND="-g -O0"

$ export CPU_MAX_COMPUTE_UNITS=1

$ gdb BitonicSort

GNU gdb (GDB) 7.1-ubuntu

License GPLv3+: GNU GPL version 3 or later

<http://gnu.org/licenses/gpl.html>

This is free software: you are free to change and redistribute it.

There is NO WARRANTY, to the extent permitted by law. Type "show copying"

and "show warranty" for details.

This GDB was configured as "x86_64-linux-gnu".

For bug reporting instructions, please see:

<http://www.gnu.org/software/gdb/bugs/>...

Reading symbols from /home/himanshu/Desktop/ati-stream-sdk-v2.3-lnx64/samples/opencl/bin/x86_64/BitonicSort...done.

(gdb) b clEnqueueNDRangeKernel

Breakpoint 1 at 0x403228

(gdb) r --device cpu

Starting program: /home/himanshu/Desktop/ati-stream-sdk-v2.3-lnx64/samples/opencl/bin/x86_64/BitonicSort --device cpu

[Thread debugging using libthread_db enabled]

Unsorted Input

53 5 199 15 120 9 71 107 71 242 84 150 134 180 26 128 196 9 98 4 102 65

206 35 224 2 52 172 160 94 2 214 99 .....

Platform Vendor : Advanced Micro Devices, Inc.

Device 0 : AMD Athlon(tm) II X4 630 Processor

[New Thread 0x7ffff7e6b700 (LWP 1894)]

[New Thread 0x7ffff2fcc700 (LWP 1895)]

Executing kernel for 1 iterations

-------------------------------------------

Breakpoint 1, 0x00007ffff77b9b20 in clEnqueueNDRangeKernel () from

/home/himanshu/Desktop/ati-stream-sdk-v2.3-lnx64/lib/x86_64/libOpenCL.so

(gdb) info functions __OpenCL

All functions matching regular expression "__OpenCL":

File OCLm2oVFr.cl:

void __OpenCL_bitonicSort_kernel(uint *, const uint, const uint, const

uint, const uint);

Non-debugging symbols:

0x00007ffff23c2dc0 __OpenCL_bitonicSort_kernel@plt

0x00007ffff23c2f40 __OpenCL_bitonicSort_stub

(gdb) b __OpenCL_bitonicSort_kernel

Breakpoint 2 at 0x7ffff23c2de9: file OCLm2oVFr.cl, line 32.

(gdb) c

Continuing.

[Switching to Thread 0x7ffff2fcc700 (LWP 1895)]

Breakpoint 2, __OpenCL_bitonicSort_kernel (theArray=0x615ba0, stage=0,

passOfStage=0, width=1024, direction=0) at OCLm2oVFr.cl:32

32 uint sortIncreasing = direction;

(gdb) p get_global_id(0)

$1 = 0

(gdb) c

Continuing.

Breakpoint 2, __OpenCL_bitonicSort_kernel (theArray=0x615ba0, stage=0,

passOfStage=0, width=1024, direction=0) at OCLm2oVFr.cl:32

32 uint sortIncreasing = direction;

(gdb) p get_global_id(0)

$2 = 1

(gdb)

3.2.4 Notes

1. To make a breakpoint in a working thread with some particular ID in

dimension N, one technique is to set a conditional breakpoint when the

get_global_id(N) == ID. To do this, use:

b [ N | function | kernel_name ] if (get_global_id(N)==ID)

where N can be 0, 1, or 2.

2. For complete GDB documentation, see

http://www.gnu.org/software/gdb/documentation/ .

3. For debugging OpenCL kernels in Windows, a developer can use GDB

running in cygwin or minGW. It is done in the same way as described in

Sections 3.2.1 and 3.2.2.

Notes:

– Only OpenCL kernels are visible to GDB when running cygwin or

minGW. GDB under cygwin/minGW currently does not support host code

debugging.

– It is not possible to use two debuggers attached to the same process.

Do not try to attach Visual Studio to a process, and concurrently GDB to

the kernels of that process.

– Continue to develop the application code using Visual Studio. Currently,

gcc running in cygwin or minGW is not supported.

Chapter 4

OpenCL Performance and Optimization

This chapter discusses performance and optimization when programming for

AMD Accelerated Parallel Processing (APP) GPU compute devices, as well as

CPUs and multiple devices. Details specific to the Southern Islands series of

GPUs are at the end of the chapter.

4.1 AMD APP Profiler

The AMD APP Profiler (hereafter Profiler) is a performance analysis tool that

gathers data from the OpenCL run-time and AMD Radeon™ GPUs during the

execution of an OpenCL application. This information is used to discover

bottlenecks in the application and find ways to optimize the application’s

performance for AMD platforms. The Profiler can be installed as part of the AMD

APP SDK installation, or separately using its own installer package. It is

downloadable from:

http://developer.amd.com/tools/AMDAPPProfiler/Pages/default.aspx.

This section describes the major features of Profiler version 2.4. Because the

Profiler is still being developed, please see the documentation for the latest

features of the tool at the same URL provided above.

The Profiler supports two usage models.

•Plug-in for Microsoft Visual Studio 2008 or 2010 (recommended). This lets

you visualize and analyze the results in multiple ways.

•Command-line utility tool for both Windows and Linux platforms. This is a

way to collect data for applications without source code access. The results

can be analyzed directly in the text format or visualized in the Visual Studio

plug-in.

The Profiler supports two modes of operations.

•Collecting OpenCL application traces.

•Collecting OpenCL kernel GPU performance counters.

These are described in the following subsections.

4.1.1 Collecting OpenCL Application Trace

The OpenCL application trace lists all the OpenCL API calls made by the

application. For each of the API calls, the input parameters and output results

are recorded, in addition to the CPU timestamps for the host code and device

timestamps retrieved from the OpenCL run-time. The output data is recorded in

an AMD custom application trace profile (*.atp) file format. See the Profiler

documentation for more information.

This mode is especially useful for investigating the high-level structure of a

complex application.

From the OpenCL application trace data, it is possible to:

•Reveal the high-level structure of the application with the Timeline view. This

lets you investigate the number of OpenCL contexts and command queues

created, as well as the relationships of these items in the application. The

timeline shows the host code, kernel, and data transfer execution. See

Section 4.1.1.1, “Timeline View,” page 4-2.

•Identify whether the application is bound by kernel execution or data transfer

time; find the top ten most expensive kernels and data transfers; find the API

hot spots (most frequently called or expensive API call) in the application with

the Summary Pages view. See Section 4.1.1.2, “Summary Pages View,”

page 4-4.

•View and debug the input parameters and output results for all API calls in

the application with the API Trace view. See Section 4.1.1.3, “API Trace

View,” page 4-5.

•An OpenCL Performance Marker (CLPerfMarker) library is also provided for

visualizing and analyzing non-OpenCL host code on the Timeline. Users can

instrument their code with calls to clBeginPerfMarkerAMD() and

clEndPerfMarkerAMD(). These calls are then used by the Profiler to

annotate the host-code timeline hierarchically. For more information, see the

CLPerfMarkerAMD.pdf in the CLPerfMarker/Doc subdirectory under the

Profiler installation directory, typically

$AMDAPPSDKROOT/Tools/AMDAPPProfiler-vx.x/.

4.1.1.1 Timeline View

The timeline view (the top half of Figure 4.1) provides a visual representation of

the execution of the application.

Figure 4.1 Timeline and API Trace View in Microsoft Visual Studio 2010

Along the top of the timeline is the time grid, which shows the total elapsed time

of the application, in milliseconds. Timing begins when the first OpenCL call is

made by the application, and ends when the final OpenCL call is made. Directly

below the time grid, each host (OS) thread that made at least one OpenCL call

is listed. For each host thread, the OpenCL API calls are plotted along the time

grid, showing the start time and duration of each call. Below the host threads, an

OpenCL tree shows all contexts and queues created by the application, along

with data transfer operations and kernel execution operations for each queue.

The Timeline View can be navigated by zooming, panning, collapsing/expanding,

and selecting a region of interest. From the Timeline View, you can also navigate

to the corresponding API call in the API Trace View and vice versa.

The Timeline View can be useful for debugging your OpenCL application. Some

examples are:

•Easily confirm that the high-level structure of the algorithm is correct (the

number of queues and contexts created match your expectation).

•Confirm that synchronization has been performed properly in the application.

For example, if kernel A execution is dependent on a buffer write or copy

and/or outputs from kernel B execution, then, if the synchronization has been

set up correctly, kernel A execution appears after the completion of the buffer

operation and kernel B execution in the timeline grid. It can be hard to find

this type of synchronization error using traditional debugging techniques.

•Confirm that the kernel and data transfer execution from all the queues have

been performed efficiently. This is easily verified by ensuring that non-

dependent kernel and data transfer execution happens concurrently in the

timeline grid.

4.1.1.2 Summary Pages View

The Summary Pages View (Figure 4.2) shows the statistics of your OpenCL

application. It can provide a general idea of the location of the program's

bottlenecks. It also provides useful information such as the number of buffers and

number of images created on each context, most expensive kernel call, etc.

Figure 4.2 Context Summary Page View in Microsoft Visual Studio 2010

The Summary Pages View consists of the following pages.

1. API Summary shows the useful statistics for all OpenCL API calls made in

the application for API hot spots identification.

2. Context Summary shows the timing information for all the kernel dispatches

and data transfers for each context. This permits identifying whether the

application is bound by the kernel execution or data transfer. If the

application is bound by the data transfers, this page permits finding the most

expensive data transfer type (read, write, copy, or map) in the application.

3. Kernel Summary lists all the kernels that are created in the application. If the

application is bound by the kernel execution, it is possible to find the device

causing the bottleneck. If the kernel execution on the GPU device is the

bottleneck, use the GPU performance counters (see Section 4.1.2,

“Collecting OpenCL GPU Kernel Performance Counters,” page 4-5) to

investigate the bottleneck inside the kernel.

4. Top 10 Data Transfer Summary shows the top ten most expensive individual

data transfers.

5. Top 10 Kernel Summary shows the top ten most expensive individual kernel

executions.

6. Warning(s) and Error(s) shows potential problems in your OpenCL

application.

In order to minimize expensive data transfers, the algorithm/application may have

to be modified. With the help of the Timeline View, you can investigate whether the

data transfers are executed efficiently (that is, concurrently with a kernel

execution).

The Warning(s) and Error(s) page (Figure 4.3) shows potential problems and

optimization hints in your OpenCL application, including unreleased OpenCL

resources, OpenCL API failures, non-optimized work size, non-optimized data

transfer, and excessive synchronization; it also provides suggestions to achieve

better performance. Clicking on a hyperlink takes you to the corresponding

OpenCL API that generated the message.

Figure 4.3 Warning(s) and Error(s) Page

4.1.1.3 API Trace View

The API Trace View (the bottom half in Figure 4.1) lists all the OpenCL API calls

made by the application. Each host thread that makes at least one OpenCL call

is listed in a separate tab. Each tab contains a list of all the API calls made by

that particular thread. For each call, it shows the index of the call (representing

execution order), the name of the API function, a semicolon-delimited list of

parameters passed to the function, and the value returned by the function.

Double-clicking an item in the API Trace view displays and zooms into that API

call in the Host Thread row in the Timeline View. If stack trace is enabled while

collecting the API trace, and the application contains debug information, it is

possible to navigate from the API trace to source code.

The API Trace View lets you analyze and debug the input parameters and output results for each

API call. For example, it is easy to check that all the API calls are returning

CL_SUCCESS, all the buffers are created with the correct flags, as well as to

identify redundant API calls. The API Trace shows additional information about

data transfers using clEnqueueMapBuffer/clEnqueueMapImage; this includes the

source, destination, and transfer type of the map operation. Some devices can

take advantage of zero copy to save on data transfer time.

4.1.2 Collecting OpenCL GPU Kernel Performance Counters

The GPU kernel performance counters can be used to find the possible

bottlenecks in the kernel execution. You can find the list of performance counters

supported by AMD Radeon™ GPUs in the Profiler documentation.

After determining the most expensive kernel to be optimized using the trace data,

collect the GPU performance counters to drill down to the kernel execution on

the GPU devices. Using the performance counters, it is possible to:

•Find the number of resources (VGPR, SGPR [if applicable], and Local

Memory size) allocated for the kernel. These resources affect the possible

number of in-flight wavefronts in the GPU (a higher number is needed to hide

data latency). The Occupancy Modeler identifies the limiting factor for

achieving a higher count of in-flight wavefronts.

•Identify the number of ALU, global, and local memory instructions executed

in the GPU.

•Identify the number of bytes fetched from and written to the global memory.

•View use of the SIMD engine and memory units in the system.

•View the efficiency of the Shader Compiler to pack ALU instructions into the

VLIW instructions in AMD GPUs.

•View the local memory (LDS) bank conflicts.

The Session view (Figure 4.4) shows the resulting performance counters for a

Profiler session. The output data is recorded in CSV format.

Figure 4.4 Sample Session View in Microsoft Visual Studio 2010

4.1.3 OpenCL Kernel Occupancy Modeler

Figure 4.5 shows a sample screen shot of the OpenCL Kernel Occupancy

Modeler.

The top part of the page shows four graphs (only three on non-GCN devices)

that provide a visual indication of how kernel resources affect the theoretical

number of in-flight wavefronts on a compute unit. The graph representing the

limiting resource has its title displayed in red text. If there is more than one

limiting resource, more than one graph can have a red title. In each graph, the

actual usage of the particular resource being graphed is highlighted with an

orange square. Hovering the mouse over a point in the graph causes a popup

hint to be displayed; this shows the current X and Y values at that location.

Figure 4.5 Sample Kernel Occupancy Modeler Screen

The first graph, titled Number of waves limited by Work-group size, shows

how the number of active wavefronts is affected by the size of the work-group

for the dispatched kernel. In the above screen shot, note that the highest number

of wavefronts is achieved when the work-group size is in the range of 64 to 128.

The second graph, titled Number of waves limited by VGPRs, shows how the

number of active wavefronts is affected by the number of vector GPRs used by

the dispatched kernel. In the above screen shot, note that as the number of

VGPRs used increases, the number of active wavefronts decreases in steps. Note

that the graph extends beyond 62 GPRs, although 62 is the maximum the Shader

Compiler allocates by default, because it assumes a work-group size of 256

work-items (the largest possible work-group size). For the Shader

Compiler to allocate more than 62 GPRs, the kernel source code must be

marked with the reqd_work_group_size kernel attribute. This attribute can tell

the Shader Compiler that the kernel is to be launched with a work-group size

smaller than the maximum, allowing it to allocate more GPRs. Thus, for X-axis

values greater than 62, the GPR graph shows the theoretical number of

wavefronts that can be launched if the kernel specified a smaller work-group size

using that attribute.
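
As a sketch of how the attribute is attached, the host-side C fragment below holds a small kernel as a source string. The kernel name scale, its arguments, the work-group size of 64, and the helper function are illustrative assumptions and are not taken from this guide; only the attribute syntax is standard OpenCL C.

```c
#include <string.h>

/* Hypothetical kernel source; "scale" and its arguments are placeholders.
 * The reqd_work_group_size attribute states that the kernel is always
 * launched with a 64x1x1 work-group. */
static const char *const kScaleKernelSource =
    "__kernel __attribute__((reqd_work_group_size(64, 1, 1)))\n"
    "void scale(__global float *data, float factor)\n"
    "{\n"
    "    size_t i = get_global_id(0);\n"
    "    data[i] *= factor;\n"
    "}\n";

/* Returns 1 if the source text declares a required work-group size. */
static int declares_reqd_work_group_size(const char *source)
{
    return strstr(source, "reqd_work_group_size") != NULL;
}
```

A kernel built from such a source must be enqueued with a local work size that matches the attribute; otherwise clEnqueueNDRangeKernel returns CL_INVALID_WORK_GROUP_SIZE.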

If running on an AMD Radeon™ HD 7XXX series (GCN) device, the third graph,

titled Number of waves limited by SGPR, shows how the number of active

wavefronts is affected by the number of scalar GPRs used by the dispatched

kernel. In the above screen shot, note that as the number of used SGPRs

increases, the number of active wavefronts decreases in steps.

The fourth graph, titled Number of waves limited by LDS, shows how the

number of active wavefronts is affected by the amount of LDS used by the

dispatched kernel. In the above screen shot, note that as the amount of LDS

usage increases, the number of active wavefronts decreases in steps.

Below the graphs is a table that provides information about the device, the

kernel, and the kernel occupancy. In the Kernel Occupancy section, note the

limits imposed by each kernel resource, as well as which resource is currently

limiting the number of waves for the kernel dispatch. This also displays the kernel

occupancy percentage.
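
The relationship between the per-resource limits and the reported occupancy can be sketched as follows. This is a simplified model, not the Profiler's implementation: the struct, its field names, and the assumption that occupancy is the limiting wave count divided by the device maximum are all illustrative.

```c
#include <stddef.h>

/* Per-resource wavefront limits for one kernel dispatch, as read from the
 * Kernel Occupancy table. All names here are hypothetical. */
typedef struct {
    unsigned by_work_group_size;
    unsigned by_vgprs;
    unsigned by_sgprs;   /* 0 on devices that have no SGPR graph */
    unsigned by_lds;
} WaveLimits;

/* The number of in-flight wavefronts is bounded by the most restrictive
 * resource. A limit of 0 is treated as "not applicable" and skipped. */
static unsigned limiting_wave_count(const WaveLimits *w)
{
    const unsigned limits[] = {
        w->by_work_group_size, w->by_vgprs, w->by_sgprs, w->by_lds
    };
    unsigned best = 0;
    for (size_t i = 0; i < sizeof limits / sizeof limits[0]; ++i) {
        if (limits[i] != 0 && (best == 0 || limits[i] < best))
            best = limits[i];
    }
    return best;
}

/* Occupancy as a percentage of the device maximum (max_waves must be > 0). */
static double occupancy_percent(unsigned waves, unsigned max_waves)
{
    return 100.0 * (double)waves / (double)max_waves;
}
```

In this model, a kernel whose VGPR usage allows fewer wavefronts than any other resource is VGPR-limited, which corresponds to the graph whose title is shown in red.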

4.2 AMD APP KernelAnalyzer

The AMD APP KernelAnalyzer is a static analysis tool to compile, analyze, and

disassemble an OpenCL kernel for AMD Radeon GPUs. It is commonly used for

rapid prototyping of OpenCL kernels since it quickly compiles the kernel for

multiple GPU device targets.

It can be used as a GUI tool for interactive tuning of an OpenCL kernel or in

command-line mode to generate detailed reports.

The KernelAnalyzer can be installed as part of the AMD APP SDK installation,

or individually using its own installer package. The KernelAnalyzer package can

be downloaded from:

http://developer.amd.com/tools/AMDAPPKernelAnalyzer/Pages/default.aspx.

To use the KernelAnalyzer, the AMD OpenCL run-time is required to be installed

in the system; however, no GPU is required in the system.

To compile an OpenCL kernel in the KernelAnalyzer, drop the OpenCL kernel

source into the source code panel in the GUI (Figure 4.6). The entire OpenCL

application is not required to compile or analyze the OpenCL kernel.

With the KernelAnalyzer, it is possible to:

•Compile and disassemble the OpenCL kernel for multiple Catalyst driver

versions and GPU device targets.

•View the OpenCL kernel compilation’s error messages from the OpenCL run-

time.

•View the AMD Intermediate Language (IL) code generated by the OpenCL

run-time.

•View the ISA code generated by the AMD Shader Compiler. Typically, hard

core kernel optimizations are performed by analyzing the ISA code.

•View the statistics generated by analyzing the ISA code.

•View General-Purpose Register usage and spill registers allocated for the

kernel.

Figure 4.6 AMD APP Kernel Analyzer

4.3 Analyzing Processor Kernels

4.3.1 Intermediate Language and GPU Disassembly

The AMD Accelerated Parallel Processing software exposes the Intermediate

Language (IL) and instruction set architecture (ISA) code generated for

OpenCL™ kernels through the compiler options -save-temps[=prefix].

The AMD Intermediate Language (IL) is an abstract representation for hardware

vertex, pixel, and geometry shaders, as well as compute kernels that can be

taken as input by other modules implementing the IL. An IL compiler uses an IL

shader or kernel in conjunction with driver state information to translate these

shaders into hardware instructions or a software emulation layer. For a complete

description of IL, see the AMD Intermediate Language (IL) Specification v2.

The instruction set architecture (ISA) defines the instructions and formats

accessible to programmers and compilers for the AMD GPUs. The Northern

Islands-family ISA instructions and microcode are documented in the AMD

Northern Islands-Family ISA Instructions and Microcode.

4.3.2 Generating IL and ISA Code

In Microsoft Visual Studio, the AMD APP Profiler provides an integrated tool to

view IL and ISA code. (The AMD APP KernelAnalyzer also can show the IL and

ISA code.) After running the Profiler, single-click the name of the kernel for

detailed programming and disassembly information. The associated ISA

disassembly is shown in a new tab. A drop-down menu provides the option to

view the IL, ISA, or source OpenCL for the selected kernel.

Developers also can generate IL and ISA code from their OpenCL kernel by

setting the environment variable AMD_OCL_BUILD_OPTIONS_APPEND=-save-temps

(see Section 2.1.4, “AMD-Developed Supplemental Compiler Options,” page 2-4).
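
The variable is normally set in the shell before the application is launched. As an alternative, a host program can set it for its own process, as in the minimal sketch below. The helper name is hypothetical, and the sketch assumes that the variable only needs to be present before the first OpenCL program is built; exactly when the runtime reads it is not stated here.

```c
#include <stdlib.h>

/* Requests IL/ISA dumps by setting the environment variable named in the
 * text. setenv() is POSIX; Windows would need _putenv_s() instead.
 * Returns 0 on success, -1 on failure. */
static int request_save_temps(void)
{
    /* The last argument (1) overwrites any value already present. */
    return setenv("AMD_OCL_BUILD_OPTIONS_APPEND", "-save-temps", 1);
}
```

Calling the helper at the start of main(), before any OpenCL call, keeps the setting local to that one process.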

4.4 Estimating Performance

4.4.1 Measuring Execution Time

The OpenCL runtime provides a built-in mechanism for timing the execution of

kernels; it is enabled by setting the CL_QUEUE_PROFILING_ENABLE flag when the queue is

created. Once profiling is enabled, the OpenCL runtime automatically records

timestamp information for every kernel and memory operation submitted to the

queue.

OpenCL provides four timestamps:

•CL_PROFILING_COMMAND_QUEUED - Indicates when the command is enqueued

into a command-queue on the host. This is set by the OpenCL runtime when

the user calls a clEnqueue* function.

•CL_PROFILING_COMMAND_SUBMIT - Indicates when the command is submitted

to the device. For AMD GPU devices, this time is only approximately defined

and is not detailed in this section.

•CL_PROFILING_COMMAND_START - Indicates when the command starts

execution on the requested device.

•CL_PROFILING_COMMAND_END - Indicates when the command finishes

execution on the requested device.

The sample code below shows how to compute the kernel execution time

(End - Start):

cl_event myEvent;
cl_ulong startTime, endTime;

// Profiling must be enabled when the command queue is created.
cl_command_queue myCommandQ =
    clCreateCommandQueue (…, CL_QUEUE_PROFILING_ENABLE, NULL);
clEnqueueNDRangeKernel(…, &myEvent);
clFinish(myCommandQ); // wait for all events to finish

// Both timestamps are reported in nanoseconds.
clGetEventProfilingInfo(myEvent, CL_PROFILING_COMMAND_START,
                        sizeof(cl_ulong), &startTime, NULL);
clGetEventProfilingInfo(myEvent, CL_PROFILING_COMMAND_END,
                        sizeof(cl_ulong), &endTime, NULL);
cl_ulong kernelExecTimeNs = endTime-startTime;

The AMD APP Profiler also can record the execution time for a kernel

automatically. The Kernel Time metric reported in the Profiler output uses the

built-in OpenCL timing capability and reports the same result as the

kernelExecTimeNs calculation shown above.

Another interesting metric to track is the kernel launch time (Start – Queue). The

kernel launch time includes both the time spent in the user application (after

enqueuing the command, but before it is submitted to the device), as well as the

time spent in the runtime to launch the kernel. For CPU devices, the kernel

launch time is fast (tens of μs), but for discrete GPU devices it can be several

hundred μs. Enabling profiling on a command queue adds approximately 10 μs

to 40 μs overhead to all clEnqueue calls. Much of the profiling overhead affects

the start time; thus, it is visible in the launch time. Be careful when interpreting

this metric. To reduce the launch overhead, the AMD OpenCL runtime combines

several command submissions into a batch. Commands submitted as a batch

report similar start times and the same end time.
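
The two metrics discussed above, execution time (End - Start) and launch time (Start - Queue), can be derived from the four timestamps once they have been read with clGetEventProfilingInfo. The struct and function names below are hypothetical helpers; only the timestamp names come from OpenCL.

```c
#include <stdint.h>

/* The four profiling timestamps of one command, in nanoseconds. */
typedef struct {
    uint64_t queued;  /* CL_PROFILING_COMMAND_QUEUED */
    uint64_t submit;  /* CL_PROFILING_COMMAND_SUBMIT */
    uint64_t start;   /* CL_PROFILING_COMMAND_START  */
    uint64_t end;     /* CL_PROFILING_COMMAND_END    */
} CommandTimestamps;

/* Kernel execution time: End - Start. */
static uint64_t exec_time_ns(const CommandTimestamps *t)
{
    return t->end - t->start;
}

/* Kernel launch time: Start - Queue. */
static uint64_t launch_time_ns(const CommandTimestamps *t)
{
    return t->start - t->queued;
}
```

Because batched commands report similar start times and the same end time, these values are best compared across whole batches rather than between commands inside one batch.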

4.4.2 Using the OpenCL timer with Other System Timers

The resolution of the timer, given in ns, can be obtained from:

clGetDeviceInfo(…,CL_DEVICE_PROFILING_TIMER_RESOLUTION…);

AMD CPUs and GPUs report a timer resolution of 1 ns. AMD OpenCL devices

are required to correctly track time across changes in frequency and power

states. Also, the AMD OpenCL SDK uses the same time-domain for all devices

in the platform; thus, the profiling timestamps can be directly compared across

the CPU and GPU devices.

The sample code below can be used to read the current value of the OpenCL

timer clock. The clock is the same routine used by the AMD OpenCL runtime to

generate the profiling timestamps. This function is useful for correlating other

program events with the OpenCL profiling timestamps.

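#include <stdint.h>
#ifdef __linux__
#include <time.h>
#else
#include <windows.h>
#endif

uint64_t
timeNanos(void)
{
#ifdef __linux__
    struct timespec tp;
    clock_gettime(CLOCK_MONOTONIC, &tp);
    return (uint64_t) tp.tv_sec * (1000ULL * 1000ULL * 1000ULL) +
           (uint64_t) tp.tv_nsec;
#else
    // ticksPerSec is the performance-counter frequency in ticks per second.
    LARGE_INTEGER current, ticksPerSec;
    QueryPerformanceFrequency(&ticksPerSec);
    QueryPerformanceCounter(&current);
    return (uint64_t)((double)current.QuadPart /
                      (double)ticksPerSec.QuadPart * 1e9);
#endif
}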

Normal CPU time-of-day routines can provide a rough measure of the elapsed

time of a GPU kernel. GPU kernel execution is non-blocking, that is, calls to

enqueue*Kernel return to the CPU before the work on the GPU is finished. For

an accurate time value, ensure that the GPU is finished. In OpenCL, you can

force the CPU to wait for the GPU to become idle by inserting calls to

clFinish() before and after the sequence you want to time; this increases the

timing accuracy of the CPU routines. The routine clFinish() blocks the CPU

until all previously enqueued OpenCL commands have finished.

For more information, see section 5.9, “Profiling Operations on Memory Objects

and Kernels,” of the OpenCL 1.0 Specification.

4.4.3 Estimating Memory Bandwidth

The memory bandwidth required by a kernel is perhaps the most important

performance consideration. To calculate this:

Effective Bandwidth = (Br + Bw)/T

where:

Br = total number of bytes read from global memory.

Bw = total number of bytes written to global memory.

T = time required to run kernel, specified in nanoseconds.

If Br and Bw are specified in bytes, and T in ns, the resulting effective bandwidth

is measured in GB/s, which is appropriate for current CPUs and GPUs for which

the peak bandwidth range is 20-260 GB/s. Computing Br and Bw requires a

thorough understanding of the kernel algorithm; it also can be a highly effective

way to optimize performance. For illustration purposes, consider a simple matrix

addition: each element in the two source arrays is read once, added together,

then stored to a third array. The effective bandwidth for a 1024x1024 matrix

addition is calculated as:

Br = 2 x (1024 x 1024 x 4 bytes) = 8388608 bytes ;; 2 arrays, 1024x1024, each

element 4-byte float

Bw = 1 x (1024 x 1024 x 4 bytes) = 4194304 bytes ;; 1 array, 1024x1024, each

element 4-byte float.

If the elapsed time for this addition as reported by the profiling timers is 1000000 ns

(1 million ns, or .001 sec), the effective bandwidth is:

(Br+Bw)/T = (8388608+4194304)/1000000 = 12.6 GB/s
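
The same calculation can be written as a small helper. The function names are illustrative; the arithmetic follows the formula above, with bytes divided by nanoseconds giving GB/s directly.

```c
#include <stdint.h>

/* Effective bandwidth in GB/s. One byte per nanosecond equals 1e9 bytes
 * per second, so no further scaling is needed. time_ns must be > 0. */
static double effective_bandwidth_gbps(uint64_t bytes_read,
                                       uint64_t bytes_written,
                                       uint64_t time_ns)
{
    return (double)(bytes_read + bytes_written) / (double)time_ns;
}

/* Bytes moved by an element-wise addition of two n x n float matrices:
 * two source arrays are read and one result array is written. */
static uint64_t matrix_add_bytes_read(uint64_t n)
{
    return 2u * n * n * sizeof(float);
}

static uint64_t matrix_add_bytes_written(uint64_t n)
{
    return n * n * sizeof(float);
}
```

For n = 1024 and an elapsed time of 1000000 ns, the helper reproduces the figure of roughly 12.6 GB/s.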

The AMD APP Profiler can report the number of dynamic instructions per thread

that access global memory through the FetchInsts and WriteInsts counters. The

Fetch and Write reports average the per-thread counts; these can be fractions if

the threads diverge. The Profiler also reports the dimensions of the global

NDRange for the kernel in the GlobalWorkSize field. The total number of threads

can be determined by multiplying together the three components of the range. If

all (or most) global accesses are the same size, the counts from the Profiler and

the approximate size can be used to estimate Br and Bw:

Br = Fetch * GlobalWorkitems * ElementSize

Bw = Write * GlobalWorkitems * ElementSize

where GlobalWorkitems is the dispatch size and ElementSize is the size, in bytes, of each global access.
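
The two estimates can be computed with one helper, sketched below. The function name and parameter names are hypothetical, and the sketch assumes that every global access has the same size.

```c
#include <stdint.h>

/* Estimated bytes moved through global memory, from a Profiler per-work-item
 * average instruction count (FetchInsts or WriteInsts), the three
 * GlobalWorkSize components, and a uniform access size in bytes. */
static double estimated_bytes(double avg_insts_per_item,
                              const uint32_t global_work_size[3],
                              uint32_t access_size_bytes)
{
    double work_items = (double)global_work_size[0]
                      * (double)global_work_size[1]
                      * (double)global_work_size[2];
    return avg_insts_per_item * work_items * (double)access_size_bytes;
}
```

With the values of the Profiler example in this section (Fetch 70.8, Write 0.5, GlobalWorkSize {192; 144; 1}, four-byte accesses), the helper gives about 7829914 bytes read and 55296 bytes written.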

An example Profiler output and bandwidth calculation:

Method              GlobalWorkSize    Time (ms)   Fetch   Write
runKernel_Cypress   {192; 144; 1}     0.9522      70.8    0.5

GlobalWorkSize = 192*144*1 = 27648 global work items.

In this example, assume we know that all accesses in the kernel are four bytes;

then, the number of bytes read and written can be calculated as:

Br = 70.8 * 27648 * 4 = 7829914 bytes

Bw = 0.5 * 27648 * 4 = 55296 bytes

The bandwidth then can be calculated as:

(Br + Bw)/T = (7829914 bytes + 55296 bytes) / .9522 ms / 1000000

= 8.3 GB/s

4.5 OpenCL Memory Objects

This section explains the AMD OpenCL runtime policy for memory objects. It also

recommends practices for best performance.

OpenCL uses memory objects to pass data to kernels. These can be either

buffers or images. Space for these is managed by the runtime, which uses

several types of memory, each with different performance characteristics. Each

type of memory is suitable for a different usage pattern. The following

subsections describe:

•the memory types used by the runtime;

•how to control which memory kind is used for a memory object;

•how the runtime maps memory objects for host access;

•how the runtime performs memory object reading, writing and copying;

•how best to use command queues; and

•some recommended usage patterns.

4.5.1 Types of Memory Used by the Runtime

Memory is used to store memory objects that are accessed by kernels executing

on the device, as well as to hold memory object data when they are mapped for

access by the host application. This section describes the different memory kinds

used by the runtime. Table 4.1 lists the performance of each memory type given

a PCIe3-capable platform and a high-end AMD Radeon™ 7XXX discrete GPU. In

Table 4.1, when host memory is accessed by the GPU shader, it is of type

CL_MEM_ALLOC_HOST_PTR. When GPU memory is accessed by the CPU, it is of

type CL_MEM_USE_PERSISTENT_MEM_AMD.

Table 4.1 Memory Bandwidth in GB/s (R = read, W = write)

              CPU R     CPU W     GPU Shader R   GPU Shader W   GPU DMA R   GPU DMA W
Host Memory   10 - 20   10 - 20   9 - 10         2.5            11 - 12     11 - 12
GPU Memory    .01       9 - 10    230            120 - 150      n/a         n/a

Host memory and device memory in the above table consist of one of the

subtypes given below.

4.5.1.1 Host Memory

This regular CPU memory can be accessed by the CPU at full memory bandwidth;

however, it is not directly accessible by the GPU. For the GPU to transfer host

memory to device memory (for example, as a parameter to

clEnqueueReadBuffer or clEnqueueWriteBuffer), it first must be pinned (see

section 4.5.1.2). Pinning takes time, so avoid incurring pinning costs where CPU

overhead must be avoided.

When host memory is copied to device memory, the OpenCL runtime uses the

following transfer methods.

•<=32 kB: For transfers from the host to device, the data is copied by the CPU

to a runtime pinned host memory buffer, and the DMA engine transfers the

data to device memory. The opposite is done for transfers from the device to

the host.

•>32 kB and <=16 MB: The host memory physical pages containing the data

are pinned, the GPU DMA engine is used, and the pages then are unpinned.

•>16 MB: Runtime pinned host memory staging buffers are used. The CPU

copies the data in pieces, which then are transferred to the device using the

GPU DMA engine. Double buffering is used to overlap the CPU copies with

the DMA.

Due to the cost of copying to staging buffers, or pinning/unpinning host memory,

host memory does not offer the best transfer performance.

4.5.1.2 Pinned Host Memory

This is host memory that the operating system has bound to a fixed physical

address and that the operating system ensures is resident. The CPU can access

pinned host memory at full memory bandwidth. The runtime limits the total

amount of pinned host memory that can be used for memory objects. (See

Section 4.5.2, “Placement,” page 4-16, for information about pinning memory.)

If the runtime knows the data is in pinned host memory, it can be transferred to,

and from, device memory without requiring staging buffers or having to perform

pinning/unpinning on each transfer. This offers improved transfer performance.
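
To contrast this direct path with the size-based methods listed in Section 4.5.1.1 for regular host memory, the selection can be modeled as below. This is a descriptive sketch only, not the runtime's code: the enum and function names are hypothetical, and kB and MB are assumed to be binary units (1024 and 1024*1024 bytes).

```c
#include <stddef.h>

typedef enum {
    TRANSFER_DIRECT_DMA,        /* data is in a pinned host memory object   */
    TRANSFER_COPY_TO_PINNED,    /* <= 32 kB: CPU copy to pinned buffer, DMA */
    TRANSFER_PIN_IN_PLACE,      /* > 32 kB, <= 16 MB: pin, DMA, unpin       */
    TRANSFER_STAGED_DOUBLE_BUF  /* > 16 MB: double-buffered staging + DMA   */
} TransferPath;

/* Models which transfer method applies to a host/device copy. The second
 * argument is nonzero when the runtime knows the data lives in a memory
 * object that it allocated in pinned host memory. */
static TransferPath select_transfer_path(size_t bytes,
                                         int in_pinned_memory_object)
{
    const size_t small_limit = 32u * 1024u;
    const size_t pin_limit   = 16u * 1024u * 1024u;

    if (in_pinned_memory_object) return TRANSFER_DIRECT_DMA;
    if (bytes <= small_limit)    return TRANSFER_COPY_TO_PINNED;
    if (bytes <= pin_limit)      return TRANSFER_PIN_IN_PLACE;
    return TRANSFER_STAGED_DOUBLE_BUF;
}
```

The model makes the cost structure visible: every path except the first involves either an extra CPU copy or a pin/unpin step per transfer.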

Currently, the runtime recognizes that data is in pinned host memory only for

operation arguments that are memory objects it has allocated in pinned host

memory; for example, the buffer argument of

clEnqueueReadBuffer/clEnqueueWriteBuffer and the image argument of

clEnqueueReadImage/clEnqueueWriteImage. It does not detect that the ptr

arguments of these operations address pinned host memory, even if they are

the result of clEnqueueMapBuffer/clEnqueueMapImage on a memory object that

is in pinned host memory.

The runtime can make pinned host memory directly accessible from the GPU.

Like regular host memory, the CPU uses caching when accessing pinned host

memory. Thus, GPU accesses to this memory must use the CPU cache coherency

protocol. For discrete devices, the GPU access to this memory is through

the PCIe bus, which also limits bandwidth. For fusion devices that do not have

the PCIe overhead, GPU access is significantly slower than accessing device-

visible host memory (see section 4.5.1.3), which does not use the cache

coherency protocol.

4.5.1.3 Device-Visible Host Memory

The runtime allocates a limited amount of pinned host memory that is accessible

by the GPU without using the CPU cache coherency protocol. This allows the

GPU to access the memory at a higher bandwidth than regular pinned host

memory.

A portion of this memory is also configured to be accessible by the CPU as

uncached memory. Thus, reads by the CPU are significantly slower than those

from regular host memory. However, these pages are also configured to use the

memory system write combining buffers. The size, alignment, and number of

write combining buffers are chip-set dependent. Typically, there are 4 to 8 buffers

of 64 bytes, each aligned to start on a 64-byte memory address. These allow

writes to adjacent memory locations to be combined into a single memory

access. This allows CPU streaming writes to perform reasonably well. Scattered

writes that do not fill the write combining buffers before they have to be flushed

do not perform as well.

Fusion devices have no device memory and use device-visible host memory for

their global device memory.

4.5.1.4 Device Memory

Discrete GPU devices have their own dedicated memory, which provides the

highest bandwidth for GPU access. The CPU cannot directly access device

memory on a discrete GPU (except for the host-visible device memory portion

described in section 4.5.1.5).

On an APU, the system memory is shared between the GPU and the CPU; it is

visible to either the CPU or the GPU at any given time. A significant benefit of

this is that buffers can be zero copied between the devices by using map/unmap

operations to logically move the buffer between the CPU and the GPU address

space. See Section 4.5.4, “Mapping,” page 4-18, for more information on zero

copy.

4.5.1.5 Host-Visible Device Memory

A limited portion of discrete GPU device memory is configured to be directly

accessible by the CPU. It can be accessed by the GPU at full bandwidth, but

CPU access is over the PCIe bus; thus, it is much slower than host memory

access. The memory is mapped into the CPU address space as uncached,

but using the memory system write combining buffers. This results in slow CPU

reads and scattered writes, but streaming CPU writes perform much better

because they reduce PCIe overhead.

4.5.2 Placement

Every OpenCL memory object has a location that is defined by the flags passed

to clCreateBuffer/clCreateImage. A memory object can be located either on

a device, or (as of SDK 2.4) it can be located on the host and accessed directly

by all the devices. The Location column of Table 4.2 gives the memory type used

for each of the allocation flag values for different kinds of devices. When a device

kernel is executed, it accesses the contents of memory objects from this location.

The performance of these accesses is determined by the memory kind used.

An OpenCL context can have multiple devices, and a memory object that is

located on a device has a location on each device. To avoid over-allocating

device memory for memory objects that are never used on that device, space is

not allocated until first used on a device-by-device basis. For this reason, the first

use of a memory object after it is created can be slower than subsequent uses.

Table 4.2 OpenCL Memory Object Properties

The Flags Argument is the one passed to clCreateBuffer/clCreateImage. Map Mode and
Map Location apply to clEnqueueMapBuffer/clEnqueueMapImage/clEnqueueUnmapMemObject.

Flags Argument: Default (none of the following flags)

  Device Type    Location                      Map Mode
  Discrete GPU   Device memory                 Copy
  Fusion APU     Device-visible host memory    Copy
  CPU            Use Map Location directly     Zero copy

  Map Location, by mapped data size:
  •<=32MiB: Pinned host memory
  •>32MiB: Host memory (different memory area can be used on each map)

Flags Argument: CL_MEM_ALLOC_HOST_PTR (clCreateBuffer on Windows 7 and Vista for
Evergreen, Northern Islands, and Southern Islands; Southern Islands devices have
this functionality on Linux as well.)

  Device Type:   Discrete GPU, Fusion APU, CPU
  Location:      Pinned host memory shared by all devices in context (unless
                 only device in context is CPU; then, host memory)
  Map Mode:      Zero copy
  Map Location:  Use Location directly (same memory area is used on each map)

Flags Argument: CL_MEM_ALLOC_HOST_PTR (clCreateImage on Windows 7, Vista and
Linux; clCreateBuffer on Linux)

  Device Type    Location                      Map Mode
  Discrete GPU   Device memory                 Copy
  Fusion APU     Device-visible memory         Copy
  CPU                                          Zero copy

  Map Location: Pinned host memory, unless only device in context is CPU; then,
  host memory (same memory area is used on each map)

Flags Argument: CL_MEM_USE_HOST_PTR

  Device Type    Location                      Map Mode
  Discrete GPU   Device memory                 Copy
  Fusion APU     Device-visible host memory    Copy
  CPU            Use Map Location directly     Zero copy

  Map Location: Pinned host memory, unless only device in context is CPU; then,
  host memory (same memory area is used on each map)

Flags Argument: CL_MEM_USE_PERSISTENT_MEM_AMD (on Windows 7 and Vista for
Evergreen, Northern Islands, and Southern Islands; Southern Islands devices have
this functionality on Linux as well.)

  Device Type    Location                      Map Mode
  Discrete GPU   Host-visible device memory    Zero copy
  Fusion APU     Device-visible host memory    Zero copy
  CPU            Host memory                   Zero copy

  Map Location: Use Location directly (different memory area can be used on
  each map)

Flags Argument: CL_MEM_USE_PERSISTENT_MEM_AMD (Linux for Evergreen and Northern
Islands)

  Same as default.

4.5.3 Memory Allocation

4.5.3.1 Using the CPU

Create memory objects with CL_MEM_ALLOC_HOST_PTR, and use map/unmap; do not use read/write. The reason is that when the object is created with CL_MEM_USE_HOST_PTR, the CPU runs the kernel directly on the buffer provided by the application (a shortcut that all vendors use). This results in zero copy between the CPU and the application buffer: the kernel updates the application buffer, and in this case a map/unmap is actually a no-op. Also, when allocating the buffer on the host, ensure that it is created with the correct alignment. For example, a buffer to be used as float4* must be 128-bit (16-byte) aligned.
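The alignment requirement above can be met with a standard C11 allocation call. The following is a minimal sketch, not part of the SDK: the helper names alloc_float4_elements and is_aligned are hypothetical, and the code assumes a C11 library that provides aligned_alloc.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* 128 bits = 16 bytes: the alignment needed for data accessed as float4*. */
#define FLOAT4_ALIGNMENT 16u

/* Hypothetical helper: returns nonzero if ptr lies on an alignment-byte
 * boundary. The alignment must be a power of two. */
static int is_aligned(const void *ptr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}

/* Hypothetical helper: allocates storage for count float4 elements.
 * C11 aligned_alloc requires the size to be a multiple of the alignment,
 * which holds here because each element occupies exactly 16 bytes.
 * The caller releases the memory with free(). */
static float *alloc_float4_elements(size_t count)
{
    return aligned_alloc(FLOAT4_ALIGNMENT, count * FLOAT4_ALIGNMENT);
}
```

A pointer obtained this way could then be handed to the OpenCL runtime as the host-side storage for a buffer whose kernel argument is declared as float4*.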

4.5.3.2 Using Both CPU and GPU Devices, or using an APU Device

When creating memory objects, create them with

CL_MEM_USE_PERSISTENT_MEM_AMD. This enables the zero copy feature, as

explained in Section 4.5.3.1, “Using the CPU.”

4.5.3.3 Buffers vs Images

Unlike GPUs, CPUs do not contain dedicated hardware (samplers) for accessing

images. Instead, image access is emulated in software. Thus, a developer may

prefer using buffers instead of images if no sampling operation is needed.

4.5.3.4 Choosing Execution Dimensions

Note the following guidelines.

•Make the number of work-groups a multiple of the number of logical CPU

cores (device compute units) for maximum use.

•When the number of work-groups exceeds the number of CPU cores, the CPU cores

execute the work-groups sequentially.
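The first guideline amounts to rounding the work-group count up to a multiple of the compute-unit count. The sketch below shows that arithmetic only; the function names are hypothetical, and the compute-unit count is assumed to come from clGetDeviceInfo with CL_DEVICE_MAX_COMPUTE_UNITS.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: rounds desired_groups up to the next multiple of
 * compute_units so that every CPU core receives the same number of
 * work-groups. */
static size_t round_up_work_groups(size_t desired_groups, size_t compute_units)
{
    assert(compute_units > 0);
    size_t remainder = desired_groups % compute_units;
    return remainder == 0 ? desired_groups
                          : desired_groups + (compute_units - remainder);
}

/* Hypothetical helper: one-dimensional global work size for a given number
 * of work-groups of work_group_size work-items each. */
static size_t global_work_size(size_t work_groups, size_t work_group_size)
{
    return work_groups * work_group_size;
}
```

If the rounded-up global size exceeds the real problem size, the kernel needs a bounds check so that the extra work-items do nothing.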

4.5.4 Mapping

The host application can use clEnqueueMapBuffer/clEnqueueMapImage to

obtain a pointer that can be used to access the memory object data. When

finished accessing, clEnqueueUnmapMemObject must be used to make the data

available to device kernel access. When a memory object is located on a device,

the data either can be transferred to, and from, the host, or (as of SDK 2.4) be

accessed directly from the host. Memory objects that are located on the host, or

located on the device but accessed directly by the host, are termed zero copy

memory objects. The data is never transferred, but is accessed directly by both

the host and device. Memory objects that are located on the device and

transferred to, and from, the host when mapped and unmapped are termed

copy memory objects. The Map Mode column of Table 4.2 specifies the transfer

mode used for each kind of memory object, and the Map Location column

indicates the kind of memory referenced by the pointer returned by the map

operations.

4.5.4.1 Zero Copy Memory Objects

CL_MEM_USE_PERSISTENT_MEM_AMD, CL_MEM_USE_HOST_PTR, and

CL_MEM_ALLOC_HOST_PTR support zero copy memory objects. The first provides

device-resident zero copy memory objects; the other two provide host-resident

zero copy memory objects.

Zero copy memory objects can be used by an application to optimize data

movement. When clEnqueueMapBuffer / clEnqueueMapImage /

clEnqueueUnmapMemObject are used, no runtime transfers are performed, and

the operations are very fast; however, the runtime can return a different pointer

value each time a zero copy memory object is mapped. Note that only images

created with CL_MEM_USE_PERSISTENT_MEM_AMD can be zero copy.

Southern Islands devices support zero copy memory objects under Linux;

however, only images created with CL_MEM_USE_PERSISTENT_MEM_AMD can be

zero copy.

Zero copy host resident memory objects can boost performance when host

memory is accessed by the device in a sparse manner or when a large host

memory buffer is shared between multiple devices and the copies are too

expensive. Choose this option only when the cost of the transfer would exceed the

extra cost of the slower accesses.

Streaming writes by the host to zero copy device resident memory objects are

about as fast as the transfer rates, so this can be a good choice when the host

does not read the memory object, because it avoids the host having to make a copy of the

data to transfer. Memory objects requiring partial updates between kernel

executions can also benefit. If the contents of the memory object must be read

by the host, use clEnqueueCopyBuffer to transfer the data to a separate

CL_MEM_ALLOC_HOST_PTR buffer.

4.5.4.2 Copy Memory Objects

For memory objects with copy map mode, the memory object location is on the

device, and it is transferred to, and from, the host when clEnqueueMapBuffer /

clEnqueueMapImage / clEnqueueUnmapMemObject are called. Table 4.3 shows

how the map_flags argument affects transfers. The runtime transfers only the

portion of the memory object requested in the offset and cb arguments. When

accessing only a portion of a memory object, map only that portion for improved

performance.

Table 4.3 Transfer policy on clEnqueueMapBuffer / clEnqueueMapImage /

clEnqueueUnmapMemObject for Copy Memory Objects

For default memory objects, the pointer returned by clEnqueueMapBuffer /

clEnqueueMapImage may not be to the same memory area each time because

different runtime buffers may be used.

For CL_MEM_USE_HOST_PTR and CL_MEM_ALLOC_HOST_PTR the same map location

is used for all maps; thus, the pointer returned is always in the same memory

area. For other copy memory objects, the pointer returned may not always be to

the same memory region.

For CL_MEM_USE_HOST_PTR and the CL_MEM_ALLOC_HOST_PTR cases that use

copy map mode, the runtime tracks if the map location contains an up-to-date

copy of the memory object contents and avoids doing a transfer from the device

when mapping as CL_MAP_READ. This determination is based on whether an

operation such as clEnqueueWriteBuffer/clEnqueueCopyBuffer or a kernel

execution has modified the memory object. If a memory object is created with

CL_MEM_READ_ONLY, then a kernel execution with the memory object as an

argument is not considered as modifying the memory object. Default memory

objects cannot be tracked because the map location changes between map calls;

thus, they are always transferred on the map.

For CL_MEM_USE_HOST_PTR, clCreateBuffer/clCreateImage pins the host

memory passed to the host_ptr argument. It is unpinned when the memory

object is deleted. To minimize pinning costs, align the memory to 4KiB. This

avoids the runtime having to pin/unpin on every map/unmap transfer, but does

add to the total amount of pinned memory.
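One way to obtain host memory aligned to 4 KiB is sketched below. This is an illustration under stated assumptions, not the runtime's own method: the helper names are hypothetical, and C11 aligned_alloc is assumed to be available. The returned pointer would be passed as the host_ptr argument of clCreateBuffer together with CL_MEM_USE_HOST_PTR.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* 4 KiB, the alignment recommended in the text to minimize pinning cost. */
#define PIN_ALIGNMENT 4096u

/* Hypothetical helper: rounds a byte count up to a multiple of 4 KiB,
 * because C11 aligned_alloc requires size to be a multiple of alignment. */
static size_t round_up_to_pin_alignment(size_t bytes)
{
    return (bytes + PIN_ALIGNMENT - 1) / PIN_ALIGNMENT * PIN_ALIGNMENT;
}

/* Hypothetical helper: allocates at least bytes of host memory starting on
 * a 4 KiB boundary. The caller releases the memory with free(), but only
 * after the OpenCL memory object that uses it has been released. */
static void *alloc_host_buffer_4k(size_t bytes)
{
    assert(bytes > 0);
    return aligned_alloc(PIN_ALIGNMENT, round_up_to_pin_alignment(bytes));
}
```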

For CL_MEM_USE_HOST_PTR, the host memory passed as the host_ptr argument of clCreateBuffer/clCreateImage is used as the map location. As mentioned in Section 4.5.1.1, host memory transfers incur considerable cost in pinning/unpinning on every transfer. If this flag is used, minimize the pinning cost by ensuring the memory is aligned to 4 KiB. If host memory that is updated once is required, use CL_MEM_ALLOC_HOST_PTR with the CL_MEM_COPY_HOST_PTR flag instead. If device memory is needed, use CL_MEM_USE_PERSISTENT_MEM_AMD and clEnqueueWriteBuffer.

| clEnqueueMapBuffer / clEnqueueMapImage map_flags argument | Transfer on clEnqueueMapBuffer / clEnqueueMapImage | Transfer on clEnqueueUnmapMemObject |
|---|---|---|
| CL_MAP_READ | Device to host, if map location is not current. | None. |
| CL_MAP_WRITE | Device to host, if map location is not current. | Host to device. |
| CL_MAP_READ and CL_MAP_WRITE | Device to host, if map location is not current. | Host to device. |
| CL_MAP_WRITE_INVALIDATE_REGION | None. | Host to device. |
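The transfer policy of Table 4.3 can be restated as two small decision functions. The C sketch below is only a model of the table, not runtime code: the enum names are local stand-ins for the OpenCL map flags so that it compiles without OpenCL headers, and the map_location_current parameter represents the runtime's tracking of whether the map location already holds an up-to-date copy.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins for CL_MAP_READ, CL_MAP_WRITE, and
 * CL_MAP_WRITE_INVALIDATE_REGION (illustrative names, not the OpenCL ones). */
enum map_flags {
    MAP_READ = 1 << 0,
    MAP_WRITE = 1 << 1,
    MAP_WRITE_INVALIDATE_REGION = 1 << 2
};

enum transfer {
    TRANSFER_NONE,
    TRANSFER_DEVICE_TO_HOST,
    TRANSFER_HOST_TO_DEVICE
};

/* Transfer performed when a copy memory object is mapped. */
static enum transfer transfer_on_map(unsigned flags, bool map_location_current)
{
    if (flags & MAP_WRITE_INVALIDATE_REGION)
        return TRANSFER_NONE;   /* old contents are discarded */
    return map_location_current ? TRANSFER_NONE : TRANSFER_DEVICE_TO_HOST;
}

/* Transfer performed when the same object is unmapped. */
static enum transfer transfer_on_unmap(unsigned flags)
{
    if (flags & (MAP_WRITE | MAP_WRITE_INVALIDATE_REGION))
        return TRANSFER_HOST_TO_DEVICE;  /* host may have modified the data */
    return TRANSFER_NONE;                /* read-only map: nothing to send */
}
```

The model makes the cost difference visible: a write-only producer that maps with the invalidate flag pays for one transfer per map/unmap pair, while a plain write map can pay for two.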

If CL_MEM_COPY_HOST_PTR is specified with CL_MEM_ALLOC_HOST_PTR when

creating a memory object, the memory is allocated in pinned host memory and

initialized with the passed data. For other kinds of memory objects, the deferred

allocation means the memory is not yet allocated on a device, so the runtime has

to copy the data into a temporary runtime buffer. The memory is allocated on the

device when the device first accesses the resource. At that time, any data that

must be transferred to the resource is copied. For example, this would apply

when a buffer was allocated with the flag CL_MEM_COPY_HOST_PTR. Using

CL_MEM_COPY_HOST_PTR for these buffers is not recommended because of the

extra copy. Instead, create the buffer without CL_MEM_COPY_HOST_PTR, and

initialize with clEnqueueWriteBuffer/clEnqueueWriteImage.

When images are transferred, additional costs are involved because the image

must be converted to, and from, linear address mode for host access. The

runtime does this by executing kernels on the device.

4.5.5 Reading, Writing, and Copying

There are numerous OpenCL commands to read, write, and copy buffers and

images. The runtime performs transfers depending on the memory kind of the

source and destination. When transferring between host memory and device

memory, the methods described in Section 4.5.1.1, “Host Memory,”

page 4-14, are used. memcpy() is used to transfer between the various kinds of

host memory; this may be slow if reading from device-visible host memory, as

described in Section 4.5.1.3, “Device-Visible Host Memory,” page 4-15.

Finally, device kernels are used to copy between device memory. For images,

device kernels are used to convert to and from the linear address mode when

necessary.

4.5.6 Command Queue

It is best to use non-blocking commands to allow multiple commands to be

queued before the command queue is flushed to the GPU. This sends larger

batches of commands, which amortizes the cost of preparing and submitting

work to the GPU. Use event tracking to specify the dependence between

operations. It is recommended to queue operations that do not depend on the

results of previous copy and map operations. This can help keep the GPU busy

with kernel execution and DMA transfers. Note that if a non-blocking copy or map

is queued, it does not start until the command queue is flushed. Use clFlush if

necessary, but avoid unnecessary flushes because they result in small command

batches.

4.6 OpenCL Data Transfer Optimization

The AMD OpenCL implementation offers several optimized paths for data

transfer to, and from, the device. The following sections describe buffer and

image paths, as well as how they map to common application scenarios. To find

out where the application’s buffers are stored (and understand how the data

transfer behaves), use the APP Profiler’s API Trace View, and look at the tool

tips of the clEnqueueMapBuffer calls.

4.6.1 Definitions

•Deferred allocation — The CL runtime attempts to minimize resource

consumption by delaying buffer allocation until first use. As a side effect, the

first accesses to a buffer may be more expensive than subsequent accesses.

•Peak interconnect bandwidth — As used in the text below, this is the transfer

bandwidth between host and device that is available under optimal conditions

at the application level. It is dependent on the type of interconnect, the

chipset, and the graphics chip. As an example, a high-performance PC with

a PCIe 3.0 x16 bus and a GCN architecture (AMD Radeon™ HD 7XXX

series) graphics card has a nominal interconnect bandwidth of 16 GB/s.

•Pinning — When a range of host memory is prepared for transfer to the

GPU, its pages are locked into system memory. This operation is called

pinning; it can impose a high cost, proportional to the size of the memory

range. One of the goals of optimizing data transfer is to use pre-pinned

buffers whenever possible. However, if pre-pinned buffers are used

excessively, it can reduce the available system memory and result in

excessive swapping. Host side zero copy buffers provide easy access to pre-

pinned memory.

•WC — Write Combine is a feature of the CPU write path to a select region

of the address space. Multiple adjacent writes are combined into cache lines

(for example, 64 bytes) before being sent to the external bus. This path

typically provides fast streamed writes, but slower scattered writes.

Depending on the chip set, scattered writes across a graphics interconnect

can be very slow. Also, some platforms require multi-core CPU writes to

saturate the WC path over an interconnect.

•Uncached accesses — Host memory and I/O regions can be configured as

uncached. CPU read accesses are typically very slow; for example:

uncached CPU reads of graphics memory over an interconnect.

•USWC

— Host memory from the Uncached Speculative Write Combine heap

can be accessed by the GPU without causing CPU cache coherency traffic.

Due to the uncached WC access path, CPU streamed writes are fast, while

CPU reads are very slow. On Fusion devices, this memory provides the

fastest possible route for CPU writes followed by GPU reads.

4.6.2 Buffers

OpenCL buffers currently offer the widest variety of specialized buffer types and

optimized paths, as well as slightly higher transfer performance.

4.6.2.1 Regular Device Buffers

Buffers allocated using the flags CL_MEM_READ_ONLY, CL_MEM_WRITE_ONLY, or

CL_MEM_READ_WRITE are placed on the GPU device. These buffers can be

accessed by a GPU kernel at very high bandwidths. For example, on a high-end

graphics card, the OpenCL kernel read/write performance is significantly higher

than 100 GB/s. When device buffers are accessed by the host through any of

the OpenCL read/write/copy and map/unmap API calls, the result is an explicit

transfer across the hardware interconnect.

4.6.2.2 Zero Copy Buffers

AMD APP SDK 2.4 on Windows 7 and Vista introduces a new feature called zero

copy buffers.

If a buffer is of the zero copy type, the runtime tries to leave its content in place,

unless the application explicitly triggers a transfer (for example, through

clEnqueueCopyBuffer()). Depending on its type, a zero copy buffer resides on

the host or the device. Independent of its location, it can be accessed directly by

the host CPU or a GPU device kernel, at a bandwidth determined by the

capabilities of the hardware interconnect.

Calling clEnqueueMapBuffer() and clEnqueueUnmapMemObject() on a zero

copy buffer is typically a low-cost operation.

Since not all possible read and write paths perform equally, check the application

scenarios below for recommended usage. To assess performance on a given

platform, use the BufferBandwidth sample.

If a given platform supports the zero copy feature, the following buffer types are

available:

•The CL_MEM_ALLOC_HOST_PTR and CL_MEM_USE_HOST_PTR buffers are:

– zero copy buffers that reside on the host.

– directly accessible by the host at host memory bandwidth.

– directly accessible by the device across the interconnect.

– pre-pinned sources or destinations for CL read, write, and copy

commands into device memory at peak interconnect bandwidth.

Note that buffers created with the flag CL_MEM_ALLOC_HOST_PTR together with

CL_MEM_READ_ONLY may reside in uncached write-combined memory. As a

result, the CPU can have high streamed write bandwidth, but low read and

potentially low write scatter bandwidth, due to the uncached WC path.

•The CL_MEM_USE_PERSISTENT_MEM_AMD buffer is

– a zero copy buffer that resides on the GPU device.

– directly accessible by the GPU device at GPU memory bandwidth.

– directly accessible by the host across the interconnect (typically with high

streamed write bandwidth, but low read and potentially low write scatter

bandwidth, due to the uncached WC path).

– copyable to, and from, the device at peak interconnect bandwidth using

CL read, write, and copy commands.

There is a platform-dependent limit on the maximum size per buffer, as well as on the total size of all buffers of that type (a good working assumption is 64 MB for the per-buffer limit, and 128 MB for the total).
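An application can check a request against these working assumptions before asking the runtime for a CL_MEM_USE_PERSISTENT_MEM_AMD buffer. The sketch below is hypothetical bookkeeping on the application side: the constants encode the 64 MB and 128 MB working assumptions (taken here as MiB, which is itself an assumption), and the actual limits are platform-dependent, so a failed clCreateBuffer call must still be handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Working assumptions from the text; real limits are platform-dependent. */
#define PERSISTENT_PER_BUFFER_LIMIT ((size_t)64 * 1024 * 1024)
#define PERSISTENT_TOTAL_LIMIT      ((size_t)128 * 1024 * 1024)

/* Hypothetical helper: returns true if a new buffer of request bytes is
 * likely to fit, given already_allocated bytes of this buffer type that
 * the application has created so far. */
static bool persistent_alloc_likely_fits(size_t request, size_t already_allocated)
{
    if (request > PERSISTENT_PER_BUFFER_LIMIT)
        return false;
    if (already_allocated > PERSISTENT_TOTAL_LIMIT)
        return false;
    return request <= PERSISTENT_TOTAL_LIMIT - already_allocated;
}
```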

Zero copy buffers work well on Fusion APU devices. SDK 2.5 introduced an

optimization that is of particular benefit on Fusion APUs. The runtime uses

USWC memory for buffers allocated as CL_MEM_ALLOC_HOST_PTR |

CL_MEM_READ_ONLY. On Fusion systems, this type of zero copy buffer can be

written to by the CPU at very high data rates, then handed over to the GPU at

minimal cost for equally high GPU read-data rates over the Radeon memory bus.

This path provides the highest data transfer rate for the CPU-to-GPU path. The

use of multiple CPU cores may be necessary to achieve peak write performance.

1. buffer = clCreateBuffer(CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY)

2. address = clEnqueueMapBuffer( buffer )

3. memset( address ) or memcpy( address ) (if possible, using multiple CPU

cores)

4. clEnqueueUnmapMemObject( buffer )

5. clEnqueueNDRangeKernel( buffer )

As this memory is not cacheable, CPU read operations are very slow. This type

of buffer also exists on discrete platforms, but transfer performance typically is

limited by PCIe bandwidth.

Zero copy buffers can provide low latency for small transfers, depending on the

transfer path. For small buffers, the combined latency of map/CPU memory

access/unmap can be smaller than the corresponding DMA latency.

4.6.2.3 Pre-pinned Buffers

AMD APP SDK 2.5 introduces a new feature called pre-pinned buffers. This

feature is supported on Windows 7, Windows Vista, and Linux.

Buffers of type CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_HOST_PTR are pinned at

creation time. These buffers can be used directly as a source or destination for

clEnqueueCopyBuffer to achieve peak interconnect bandwidth. Mapped buffers

also can be used as a source or destination for clEnqueueRead/WriteBuffer

calls, again achieving peak interconnect bandwidth. Note that using

CL_MEM_USE_HOST_PTR permits turning an existing user memory region into pre-

pinned memory. However, in order to stay on the fast path, that memory must be

aligned to 256 bytes. Buffers of type CL_MEM_USE_HOST_PTR remain pre-pinned

as long as they are used only for data transfer, but not as kernel arguments. If

the buffer is used in a kernel, the runtime creates a cached copy on the device,

and subsequent copies are not on the fast path. The same restriction applies to

CL_MEM_ALLOC_HOST_PTR allocations under Linux.

See usage examples described for various options below.

The pre-pinned path is supported for the following calls.

•clEnqueueRead/WriteBuffer

•clEnqueueRead/WriteImage

•clEnqueueRead/WriteBufferRect (Windows only)

Offsets into mapped buffer addresses are supported, too.

Note that the CL image calls must use pre-pinned mapped buffers on the host

side, and not pre-pinned images.

4.6.2.4 Application Scenarios and Recommended OpenCL Paths

The following section describes various application scenarios, and the

corresponding paths in the OpenCL API that are known to work well on AMD

platforms. The various cases are listed, ordered from generic to more

specialized.

From an application point of view, two fundamental use cases exist, and they can

be linked to the various options, described below.

•An application wants to transfer a buffer that was already allocated through

malloc() or mmap(). In this case, options 2), 3) and 4) below always consist

of a memcpy() plus a device transfer. Option 1) does not require a memcpy().

•If an application is able to let OpenCL allocate the buffer, options 2) and 4)

below can be used to avoid the extra memcpy(). In the case of option 5),

memcpy() and transfer are identical.

Note that the OpenCL runtime uses deferred allocation to maximize memory

resources. This means that a complete roundtrip chain, including data transfer

and kernel compute, might take one or two iterations to reach peak performance.

A code sample named BufferBandwidth can be used to investigate and

benchmark the various transfer options in combination with different buffer types.

Option 1 - clEnqueueWriteBuffer() and clEnqueueReadBuffer()

This option is the easiest to use on the application side.

CL_MEM_USE_HOST_PTR is an ideal choice if the application wants to transfer

a buffer that has already been allocated through malloc() or mmap().

There are two ways to use this option. The first uses

clEnqueueRead/WriteBuffer on a pre-pinned, mapped host-side buffer:

a. pinnedBuffer = clCreateBuffer( CL_MEM_ALLOC_HOST_PTR or

CL_MEM_USE_HOST_PTR )

b. deviceBuffer = clCreateBuffer()

c. void *pinnedMemory = clEnqueueMapBuffer( pinnedBuffer )

d. clEnqueueRead/WriteBuffer( deviceBuffer, pinnedMemory )

e. clEnqueueUnmapMemObject( pinnedBuffer, pinnedMemory )

The pinning cost is incurred at step a. Step d does not incur any pinning cost.

Typically, an application performs steps a, b, c, and e once. It then

repeatedly reads or modifies the data in pinnedMemory, followed by step d.

For the second way to use this option, clEnqueueRead/WriteBuffer is used

directly on a user memory buffer. The standard clEnqueueRead/Write calls

require memory pages to be pinned (locked in memory) before they can be copied (by

the DMA engine). This creates a performance penalty that is proportional to

the buffer size. The performance of this path is currently about two-thirds of

peak interconnect bandwidth.
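The difference between the two ways of using Option 1 can be estimated with simple arithmetic. The sketch below is a rough model only: the function names are hypothetical, the two-thirds factor is the approximate figure quoted above, and the 16 GB/s peak used in the usage note is the nominal example from Section 4.6.1, not a measurement.

```c
#include <assert.h>

/* Hypothetical helper: time in seconds to move bytes at the given
 * bandwidth (in bytes per second), ignoring latency and setup costs. */
static double transfer_seconds(double bytes, double bandwidth_bytes_per_s)
{
    assert(bandwidth_bytes_per_s > 0.0);
    return bytes / bandwidth_bytes_per_s;
}

/* Hypothetical helper: effective bandwidth of the non-pre-pinned path,
 * modeled as two-thirds of the peak interconnect bandwidth. */
static double unpinned_bandwidth(double peak_bytes_per_s)
{
    return peak_bytes_per_s * 2.0 / 3.0;
}
```

For a 256 MB transfer at a nominal 16 GB/s peak, this model gives about 16 ms on the pre-pinned path and about 24 ms when the runtime has to pin user memory for the transfer.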

Option 2 - clEnqueueCopyBuffer() on a pre-pinned host buffer (requires

pre-pinned buffer support)

This is analogous to Option 1. Performing a CL copy of a pre-pinned buffer

to a device buffer (or vice versa) runs at peak interconnect bandwidth.

a. pinnedBuffer = clCreateBuffer( CL_MEM_ALLOC_HOST_PTR or

CL_MEM_USE_HOST_PTR )

b. deviceBuffer = clCreateBuffer()

This is followed either by:

c. void *memory = clEnqueueMapBuffer( pinnedBuffer )

d. Application writes or modifies memory.

e. clEnqueueUnmapMemObject( pinnedBuffer, memory )

f. clEnqueueCopyBuffer( pinnedBuffer, deviceBuffer )

or by:

g. clEnqueueCopyBuffer( deviceBuffer, pinnedBuffer )

h. void *memory = clEnqueueMapBuffer( pinnedBuffer )

i. Application reads memory.

j. clEnqueueUnmapMemObject( pinnedBuffer, memory )

Since the pinnedBuffer resides in host memory, the clEnqueueMapBuffer() and clEnqueueUnmapMemObject()

calls do not result in data transfers, and they are of very low latency. Sparse

or dense memory operations by the application take place at host memory

bandwidth.

Option 3 - clEnqueueMapBuffer() and clEnqueueUnmapMemObject() of a

Device Buffer

This is a good choice if the application fills in the data on the fly, or requires

a pointer for calls to other library functions (such as fread() or fwrite()).

An optimized path exists for regular device buffers; this path provides peak

interconnect bandwidth at map/unmap time.

For buffers already allocated through malloc() or mmap(), the total transfer

cost includes a memcpy() into the mapped device buffer, in addition to the

interconnect transfer. Typically, this is slower than option 1), above.

The transfer sequence is as follows:

a. Data transfer from host to device buffer.

1. ptr = clEnqueueMapBuffer( .., buf, .., CL_MAP_WRITE, .. )

Since the buffer is mapped write-only, no data is transferred from

device buffer to host. The map operation is very low cost. A pointer

to a pinned host buffer is returned.

2. The application fills in the host buffer through memset( ptr ),

memcpy ( ptr, srcptr ), fread( ptr ), or direct CPU writes.

This happens at host memory bandwidth.

3. clEnqueueUnmapMemObject( .., buf, ptr, .. )

The pre-pinned buffer is transferred to the GPU device, at peak

interconnect bandwidth.

b. Data transfer from device buffer to host.

1. ptr = clEnqueueMapBuffer(.., buf, .., CL_MAP_READ, .. )

This command triggers a transfer from the device to host memory,

into a pre-pinned temporary buffer, at peak interconnect bandwidth.

A pointer to the pinned memory is returned.

2. The application reads and processes the data, or executes a

memcpy( dstptr, ptr ), fwrite (ptr), or similar function. Since

the buffer resides in host memory, this happens at host memory

bandwidth.

3. clEnqueueUnmapMemObject( .., buf, ptr, .. )

Since the buffer was mapped as read-only, no transfer takes place,

and the unmap operation is very low cost.

Option 4 - Direct host access to a zero copy device buffer (requires zero

copy support)

This option allows overlapping of data transfers and GPU compute. It is also

useful for sparse write updates under certain constraints.

a. A zero copy buffer on the device is created using the following command:

buf = clCreateBuffer ( .., CL_MEM_USE_PERSISTENT_MEM_AMD, .. )

This buffer can be directly accessed by the host CPU, using the

uncached WC path. This can take place at the same time the GPU

executes a compute kernel. A common double buffering scheme has the

kernel process data from one buffer while the CPU fills a second buffer.

See the TransferOverlap code sample.

A zero copy device buffer can also be used for sparse updates, such

as assembling sub-rows of a larger matrix into a smaller, contiguous

block for GPU processing. Due to the WC path, it is a good design

choice to try to align writes to the cache line size, and to pick the write

block size as large as possible.

b. Transfer from the host to the device.

1. ptr = clEnqueueMapBuffer( .., buf, .., CL_MAP_WRITE, .. )

This operation is low cost because the zero copy device buffer is

directly mapped into the host address space.

2. The application transfers data via memset( ptr ), memcpy( ptr,

srcptr ), or direct CPU writes.

The CPU writes directly across the interconnect into the zero copy

device buffer. Depending on the chipset, the bandwidth can be of

the same order of magnitude as the interconnect bandwidth,

although it typically is lower than peak.

3. clEnqueueUnmapMemObject( .., buf, ptr, .. )

As with the preceding map, this operation is low cost because the

buffer continues to reside on the device.

c. If the buffer content must be read back later, use

clEnqueueReadBuffer( .., buf, ..) or

clEnqueueCopyBuffer( .., buf, zero copy host buffer, .. ).

This bypasses slow host reads through the uncached path.
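The sparse-update use mentioned in step a of Option 4, assembling sub-rows of a larger matrix into a smaller contiguous block, can be sketched in plain C. In this illustration the function name and parameters are hypothetical, and dst stands in for the pointer returned by clEnqueueMapBuffer on a zero copy device buffer. Each destination row is written once, front to back, and never read, which suits the write-combined path.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies a rows x cols sub-block that starts at
 * (row0, col0) of a row-major source matrix with src_stride floats per
 * row into the contiguous destination dst. The destination is only
 * written, in increasing address order, one full sub-row per memcpy. */
static void gather_sub_block(float *dst, const float *src, size_t src_stride,
                             size_t row0, size_t col0,
                             size_t rows, size_t cols)
{
    assert(col0 + cols <= src_stride);
    for (size_t r = 0; r < rows; ++r) {
        const float *src_row = src + (row0 + r) * src_stride + col0;
        memcpy(dst + r * cols, src_row, cols * sizeof(float));
    }
}
```

Choosing cols so that each sub-row is a multiple of the cache line size keeps the individual writes aligned and as large as possible, in line with the guidance above.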

Option 5 - Direct GPU access to a zero copy host buffer (requires zero

copy support)

This option allows direct reads or writes of host memory by the GPU. A GPU

kernel can import data from the host without explicit transfer, and write data

directly back to host memory. An ideal use is to perform small I/Os straight

from the kernel, or to integrate the transfer latency directly into the kernel

execution time.

a. The application creates a zero copy host buffer.

buf = clCreateBuffer( .., CL_MEM_ALLOC_HOST_PTR, .. )

b. Next, the application modifies or reads the zero copy host buffer.

1. ptr = clEnqueueMapBuffer( .., buf, .., CL_MAP_READ |

CL_MAP_WRITE, .. )

This operation is very low cost because it is a map of a buffer

already residing in host memory.

2. The application modifies the data through memset( ptr ),

memcpy() (in either direction), or sparse or dense CPU reads or writes.

Since the application is modifying a host buffer, these operations

take place at host memory bandwidth.

3. clEnqueueUnmapMemObject( .., buf, ptr, .. )

As with the preceding map, this operation is very low cost because

the buffer continues to reside in host memory.

c. The application runs clEnqueueNDRangeKernel(), using buffers of this

type as input or output. GPU kernel reads and writes go across the

interconnect to host memory, and the data transfer becomes part of the

kernel execution.

The achievable bandwidth depends on the platform and chipset, but can

be of the same order of magnitude as the peak interconnect bandwidth.

For discrete graphics cards, it is important to note that the resulting GPU

kernel bandwidth is an order of magnitude lower compared to a kernel

accessing a regular device buffer located on the device.

d. Following kernel execution, the application can access data in the host

buffer in the same manner as described above.

4.7 Using Multiple OpenCL Devices

The AMD OpenCL runtime supports both CPU and GPU devices. This section

introduces techniques for appropriately partitioning the workload and balancing it

across the devices in the system.

4.7.1 CPU and GPU Devices

Table 4.4 lists some key performance characteristics of two exemplary CPU and

GPU devices: a quad-core AMD Phenom II X4 processor running at 2.8 GHz,

and a mid-range AMD Radeon™ HD 7770 GPU running at 1 GHz. The “best”

device in each characteristic is highlighted, and the ratio of the best/other device

is shown in the final column.

The GPU excels at high throughput: per Table 4.4, the peak execution rate (measured in

FLOPS) is 14X higher than the CPU, and the memory bandwidth is 2.8X higher

than the CPU. The GPU also consumes approximately 85% of the power of the

CPU; thus, for this comparison, the power efficiency in flops/watt is 18X higher.

While power efficiency can vary significantly with different devices, GPUs

generally provide greater power efficiency (flops/watt) than CPUs because they

optimize for throughput and eliminate hardware designed to hide latency.
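The throughput ratios can be recomputed from the raw entries of Table 4.4. The sketch below is illustrative arithmetic only: the struct and function names are hypothetical, and the CPU power is taken as 95 W, the value consistent with the 1.2X power ratio in the table.

```c
#include <assert.h>

/* Hypothetical record holding the Table 4.4 entries used here. */
struct device_profile {
    double peak_gflops;        /* peak single-precision GFLOPS */
    double power_watts;        /* approximate power */
    double mem_bandwidth_gbs;  /* memory bandwidth in GB/s */
};

/* Power efficiency in GFLOPS per watt. */
static double gflops_per_watt(const struct device_profile *d)
{
    assert(d->power_watts > 0.0);
    return d->peak_gflops / d->power_watts;
}

/* Ratio of one characteristic between two devices. */
static double ratio(double best, double other)
{
    assert(other != 0.0);
    return best / other;
}
```

With the CPU entered as {90, 95, 26} and the GPU as {1280, 80, 72}, the FLOPS ratio is about 14.2 and the bandwidth ratio about 2.8. The unrounded efficiency ratio comes out near 17; the table's 18X results from dividing 16 by the rounded CPU value of 0.9.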

Table 4.4 CPU and GPU Performance Characteristics

Conversely, CPUs excel at latency-sensitive tasks. For example, an integer add

is 10X faster on the CPU than on the GPU. This is a product of both the CPU's

higher clock rate (2800 MHz vs 1000 MHz for this comparison), as well as the

operation latency; the CPU is optimized to perform an integer add in just one

cycle, while the GPU requires four cycles. The CPU also has a latency-optimized

path to DRAM, while the GPU optimizes for bandwidth and relies on many in-

flight threads to hide the latency. The AMD Radeon™ HD 7770 GPU, for example,

supports more than 25,000 in-flight work-items and can switch to a new

wavefront (containing up to 64 work-items) in a single cycle. The CPU supports

only four hardware threads, and thread-switching requires saving and restoring

the CPU registers from memory. The GPU requires many active threads both to

keep the execution resources busy and to provide enough threads to hide

the long latency of cache misses.
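The latency figures in this comparison follow directly from cycle counts and clock rates. The sketch below reproduces that arithmetic; the function names are hypothetical, and the inputs are the clock rates, cycle counts, and work-item counts quoted above.

```c
#include <assert.h>

/* Hypothetical helper: latency in nanoseconds of an operation that takes
 * the given number of cycles at the given clock rate in MHz. */
static double op_latency_ns(double cycles, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return cycles * 1000.0 / clock_mhz;
}

/* Hypothetical helper: number of wavefronts needed to hold the given
 * number of in-flight work-items. */
static unsigned wavefronts_in_flight(unsigned work_items, unsigned wavefront_size)
{
    assert(wavefront_size > 0);
    return work_items / wavefront_size;
}
```

A one-cycle integer add at 2800 MHz takes about 0.36 ns, and a four-cycle add at 1000 MHz takes 4 ns, which is the roughly 10X gap described above. Likewise, 25,600 in-flight work-items at 64 work-items per wavefront correspond to 400 wavefronts.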

Each GPU thread has its own register state, which enables the fast single-cycle

switching between threads. Also, GPUs can be very efficient at gather/scatter

operations: each thread can load from any arbitrary address, and the registers

are completely decoupled from the other threads. This is substantially more

flexible and higher-performing than a classic Vector ALU-style architecture (such

as SSE on the CPU), which typically requires that data be accessed from

contiguous and aligned memory locations. SSE supports instructions that write

parts of a register (for example, MOVLPS and MOVHPS, which write the lower and

upper halves, respectively, of an SSE register), but these instructions generate

additional microarchitecture dependencies and frequently require additional pack

instructions to format the data correctly.

                                     CPU                  GPU                   Winner Ratio
Example Device                       AMD Phenom™ II X4    AMD Radeon™ HD 7770
Core Frequency                       2800 MHz             1 GHz                 3 X
Compute Units                        4                    10                    2.5 X
Approx. Power¹                       95 W                 80 W                  1.2 X
Approx. Power/Compute Unit           19 W                 8 W                   2.4 X
Peak Single-Precision Billion
  Floating-Point Ops/Sec             90                   1280                  14 X
Approx GFLOPS/Watt                   0.9                  16                    18 X
Max In-flight HW Threads             4                    25600                 6400 X
Simultaneous Executing Threads       4                    640                   160 X
Memory Bandwidth                     26 GB/s              72 GB/s               2.8 X
Int Add Latency                      0.4 ns               4 ns                  10 X
FP Add Latency                       1.4 ns               4 ns                  2.9 X
Approx DRAM Latency                  50 ns                270 ns                5.4 X
L2+L3 (GPU only L2) Cache Capacity   8192 kB              128 kB                64 X
Approx Kernel Launch Latency         25 μs                50 μs                 2 X

1. For the power specifications of the AMD Phenom™ II x4, see http://www.amd.com/us/products/desktop/processors/phenom-ii/Pages/phenom-ii-model-number-comparison.aspx.

In contrast, each GPU thread shares the same program counter with 63 other

threads in a wavefront. Divergent control-flow on a GPU can be quite expensive

and can lead to significant under-utilization of the GPU device. When control flow

substantially narrows the number of valid work-items in a wave-front, it can be

faster to use the CPU device.

CPUs also tend to provide significantly more on-chip cache than GPUs. In this

example, the CPU device contains 512 kB L2 cache/core plus a 6 MB L3 cache

that is shared among all cores, for a total of 8 MB of cache. In contrast, the GPU

device contains only 128 kB of cache shared by the ten compute units. The larger

CPU cache serves both to reduce the average memory latency and to reduce

memory bandwidth in cases where data can be re-used from the caches.

Finally, note the approximate 2X difference in kernel launch latency. The GPU

launch time includes both the latency through the software stack and the

time to transfer the compiled kernel and associated arguments across the PCI-

express bus to the discrete GPU. Notably, the launch time does not include the

time to compile the kernel. The CPU can be the device-of-choice for small, quick-

running problems when the overhead to launch the work on the GPU outweighs

the potential speedup. Often, the work size is data-dependent, and the choice of

device can be data-dependent as well. For example, an image-processing

algorithm may run faster on the GPU if the images are large, but faster on the

CPU when the images are small.

The differences in performance characteristics present interesting optimization

opportunities. Workloads that are large and data parallel can run orders of

magnitude faster on the GPU, and at higher power efficiency. Serial or small

parallel workloads (too small to efficiently use the GPU resources) often run

significantly faster on the CPU devices. In some cases, the same algorithm can

exhibit both types of workload. A simple example is a reduction operation such

as a sum of all the elements in a large array. The beginning phases of the

operation can be performed in parallel and run much faster on the GPU. The end

of the operation requires summing together the partial sums that were computed

in parallel; eventually, the width becomes small enough so that the overhead to

parallelize outweighs the computation cost, and it makes sense to perform a

serial add. For these serial operations, the CPU can be significantly faster than

the GPU.
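The two phases of the reduction described above can be sketched in plain C. This is only an illustrative model, not OpenCL code: the names `partial_sums`, `serial_add`, `two_phase_sum`, and `MAX_GROUPS` are hypothetical, and the outer loop of the first phase merely stands in for work that would run in parallel (for example, one partial sum per work-group on the GPU).

```c
#include <stddef.h>

#define MAX_GROUPS 64   /* hypothetical upper bound on the number of partial sums */

/* Phase 1: each chunk is summed independently. On a real system these
 * partial sums would be computed in parallel; here the loop over g only
 * models that parallel phase. */
static void partial_sums(const int *data, size_t n, size_t groups, long *partial)
{
    size_t chunk = (n + groups - 1) / groups;   /* ceiling division */
    for (size_t g = 0; g < groups; ++g) {
        size_t begin = g * chunk;
        size_t end = (begin + chunk < n) ? begin + chunk : n;
        long sum = 0;
        for (size_t i = begin; i < end; ++i)
            sum += data[i];
        partial[g] = sum;
    }
}

/* Phase 2: the few remaining partial sums are added serially, which is the
 * step the text suggests leaving to the CPU. */
static long serial_add(const long *partial, size_t groups)
{
    long total = 0;
    for (size_t g = 0; g < groups; ++g)
        total += partial[g];
    return total;
}

long two_phase_sum(const int *data, size_t n, size_t groups)
{
    long partial[MAX_GROUPS];
    if (groups == 0 || groups > MAX_GROUPS)
        groups = MAX_GROUPS;
    partial_sums(data, n, groups, partial);
    return serial_add(partial, groups);
}
```

In a real application, the point at which to stop parallelizing and switch to the serial add depends on the devices and the launch overhead, and is best found by measurement.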

4.7.2 When to Use Multiple Devices

One of the features of GPU computing is that some algorithms can run

substantially faster and at better energy efficiency compared to a CPU device.

Also, once an algorithm has been coded in the data-parallel task style for

OpenCL, the same code typically can scale to run on GPUs with increasing

compute capability (that is, more compute units) or even multiple GPUs (with a

little more work).

For some algorithms, the advantages of the GPU (high computation throughput,

latency hiding) are offset by the advantages of the CPU (low latency, caches, fast

launch time), so that the performance on either device is similar. This case is

more common for mid-range GPUs and when running more mainstream

algorithms. If the CPU and the GPU deliver similar performance, the user can

get the benefit of either improved power efficiency (by running on the GPU) or

higher peak performance (by using both devices).

Usually, when the data size is small, it is faster to use the CPU: its start-up time is shorter than the GPU's because the driver overhead is smaller and buffers do not need to be copied from the host to the device.

4.7.3 Partitioning Work for Multiple Devices

By design, each OpenCL command queue can only schedule work on a single

OpenCL device. Thus, using multiple devices requires the developer to create a

separate queue for each device, then partition the work between the available

command queues.

A simple scheme for partitioning work between devices would be to statically

determine the relative performance of each device, partition the work so that

faster devices receive more work, launch all the kernels, and then wait for them

to complete. In practice, however, this rarely yields optimal performance. The

relative performance of devices can be difficult to determine, in particular for

kernels whose performance depends on the data input. Further, the device

performance can be affected by dynamic frequency scaling, OS thread

scheduling decisions, or contention for shared resources, such as shared caches

and DRAM bandwidth. Simple static partitioning algorithms which “guess wrong”

at the beginning can result in significantly lower performance, since some

devices finish and become idle while the whole system waits for the single,

unexpectedly slow device.

For these reasons, a dynamic scheduling algorithm is recommended. In this

approach, the workload is partitioned into smaller parts that are periodically

scheduled onto the hardware. As each device completes a part of the workload,

it requests a new part to execute from the pool of remaining work. Faster

devices, or devices which work on easier parts of the workload, request new

input faster, resulting in a natural workload balancing across the system. The

approach creates some additional scheduling and kernel submission overhead,

but dynamic scheduling generally helps avoid the performance cliff from a single

bad initial scheduling decision, and it generally delivers higher performance in

real-world system environments (since it can adapt to system conditions as the

algorithm runs).
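The difference between static and dynamic partitioning can be illustrated with a small simulation. The sketch below is a plain C model, not OpenCL host code; `dynamic_makespan`, `static_even_makespan`, `MAX_DEVICES`, and the per-chunk `cost` values are all hypothetical. It assumes equal-sized chunks and that each device takes a fixed time per chunk.

```c
#define MAX_DEVICES 8   /* hypothetical limit for this model */

/* Dynamic scheduling: whenever a device becomes free, it takes the next
 * chunk from the shared pool. cost[d] is the time device d needs per chunk.
 * Returns the time at which the last chunk finishes and records in taken[]
 * how many chunks each device processed. */
long dynamic_makespan(int chunks, const long *cost, int ndev, int *taken)
{
    long free_at[MAX_DEVICES] = {0};
    long finish = 0;
    if (ndev > MAX_DEVICES)
        ndev = MAX_DEVICES;
    for (int d = 0; d < ndev; ++d)
        taken[d] = 0;
    for (int c = 0; c < chunks; ++c) {
        int next = 0;                        /* device that is free earliest */
        for (int d = 1; d < ndev; ++d)
            if (free_at[d] < free_at[next])
                next = d;
        free_at[next] += cost[next];
        taken[next]++;
        if (free_at[next] > finish)
            finish = free_at[next];
    }
    return finish;
}

/* Static scheduling with an even split decided up front: the slowest
 * device determines the completion time. */
long static_even_makespan(int chunks, const long *cost, int ndev)
{
    long finish = 0;
    for (int d = 0; d < ndev; ++d) {
        int share = chunks / ndev + (d < chunks % ndev ? 1 : 0);
        long t = share * cost[d];
        if (t > finish)
            finish = t;
    }
    return finish;
}
```

With two devices whose per-chunk costs are 1 and 4 time units and ten chunks of work, the dynamic model finishes at time 8 (the fast device takes eight chunks, the slow one two), while the even static split finishes at time 20. The model ignores scheduling and submission overhead, which the text notes is the price of the dynamic approach.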

Multi-core runtimes, such as Cilk, have already introduced dynamic scheduling

algorithms for multi-core CPUs, and it is natural to consider extending these

scheduling algorithms to GPUs as well as CPUs. A GPU introduces several new

aspects to the scheduling process:

• Heterogeneous Compute Devices

Most existing multi-core schedulers target only homogeneous computing

devices. When scheduling across both CPU and GPU devices, the scheduler

must be aware that the devices can have very different performance

characteristics (10X or more) for some algorithms. To some extent, dynamic

scheduling is already designed to deal with heterogeneous workloads (based

on data input the same algorithm can have very different performance, even

when run on the same device), but a system with heterogeneous devices

makes these cases more common and more extreme. Here are some

suggestions for these situations.

– The scheduler should support sending different workload sizes to

different devices. GPUs typically prefer larger grain sizes, and higher-

performing GPUs prefer still larger grain sizes.

– The scheduler should be conservative about allocating work until after it

has examined how the work is being executed. In particular, it is

important to avoid the performance cliff that occurs when a slow device

is assigned an important long-running task. One technique is to use small

grain allocations at the beginning of the algorithm, then switch to larger

grain allocations when the device characteristics are well-known.

– As a special case of the above rule, when the devices are substantially

different in performance (perhaps 10X), load-balancing has only a small

potential performance upside, and the overhead of scheduling the load

probably eliminates the advantage. In the case where one device is far

faster than everything else in the system, use only the fast device.

– The scheduler must balance small grain sizes (which increase the

adaptiveness of the schedule and can efficiently use heterogeneous

devices) with larger grain sizes (which reduce scheduling overhead).

Note that the grain size must be large enough to efficiently use the GPU.

• Asynchronous Launch

OpenCL devices are designed to be scheduled asynchronously from a

command-queue. The host application can enqueue multiple kernels, flush

the kernels so they begin executing on the device, then use the host core for

other work. The AMD OpenCL implementation uses a separate thread for

each command-queue, so work can be transparently scheduled to the GPU

in the background.

Avoid starving the high-performance GPU devices. This can occur if the

physical CPU core, which must re-fill the device queue, is itself being used

as a device. A simple approach to this problem is to dedicate a physical CPU

core for scheduling chores. The device fission extension (see Section A.7,

“cl_ext Extensions,” page A-4) can be used to reserve a core for scheduling.

For example, on a quad-core device, device fission can be used to create an

OpenCL device with only three cores.

Another approach is to schedule enough work to the device so that it can

tolerate latency in additional scheduling. Here, the scheduler maintains a

watermark of uncompleted work that has been sent to the device, and refills

the queue when it drops below the watermark. This effectively increases the

grain size, but can be very effective at reducing or eliminating device

starvation. Developers cannot directly query the list of commands in the

OpenCL command queues; however, it is possible to pass an event to each

clEnqueue call that can be queried, in order to determine the execution

status (in particular the command completion time); developers also can

maintain their own queue of outstanding requests.

For many algorithms, this technique can be effective enough at hiding

latency so that a core does not need to be reserved for scheduling. In

particular, algorithms where the work-load is largely known up-front often

work well with a deep queue and watermark. Algorithms in which work is

dynamically created may require a dedicated thread to provide low-latency

scheduling.

• Data Location

Discrete GPUs use dedicated high-bandwidth memory that exists in a

separate address space. Moving data between the device address space

and the host requires time-consuming transfers over a relatively slow PCI-

Express bus. Schedulers should be aware of this cost and, for example,

attempt to schedule work that consumes the result on the same device

producing it.

CPU and GPU devices share the same memory bandwidth, which results in

additional interactions of kernel executions.
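The watermark scheme described under Asynchronous Launch can be sketched as a small bookkeeping routine. This is a plain C model under stated assumptions: `DeviceQueueModel` and `refill_if_below_watermark` are hypothetical names, and in a real application `in_flight` would be maintained by the host from event status queries (for example, with clGetEventInfo) rather than set by hand.

```c
/* Host-side bookkeeping for one device queue (hypothetical model). */
typedef struct {
    int in_flight;   /* commands submitted but not yet observed complete */
    int remaining;   /* work items still waiting in the host-side pool   */
} DeviceQueueModel;

/* If the uncompleted work has dropped below the watermark, submit more work
 * until the queue reaches target_depth or the pool is empty. Returns the
 * number of newly submitted work items; each one corresponds to an enqueue
 * call in real host code. */
int refill_if_below_watermark(DeviceQueueModel *q, int watermark, int target_depth)
{
    int submitted = 0;
    if (q->in_flight >= watermark)
        return 0;                       /* enough work queued; nothing to do */
    while (q->in_flight < target_depth && q->remaining > 0) {
        q->in_flight++;
        q->remaining--;
        submitted++;
    }
    return submitted;
}
```

A deeper target depth tolerates more scheduling latency but, as the text notes, effectively increases the grain size, so both values are tuning parameters.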

4.7.4 Synchronization Caveats

The OpenCL functions that enqueue work (clEnqueueNDRangeKernel) merely

enqueue the requested work in the command queue; they do not cause it to

begin executing. Execution begins when the user executes a synchronizing

command, such as clFlush or clWaitForEvents. Enqueuing several commands

before flushing can enable the host CPU to batch together the command

submission, which can reduce launch overhead.

Command-queues that are configured to execute in-order are guaranteed to

complete execution of each command before the next command begins. This

synchronization guarantee can often be leveraged to avoid explicit

clWaitForEvents() calls between command submissions. Using

clWaitForEvents() requires intervention by the host CPU and additional

synchronization cost between the host and the GPU; by leveraging the in-order

queue property, back-to-back kernel executions can be efficiently handled

directly on the GPU hardware.

AMD Southern Islands GPUs can execute multiple kernels simultaneously when

there are no dependencies.

The AMD OpenCL implementation spawns a new thread to manage each

command queue. Thus, the OpenCL host code is free to manage multiple

devices from a single host thread. Note that clFinish is a blocking operation;

the thread that calls clFinish blocks until all commands in the specified

command-queue have been processed and completed. If the host thread is

managing multiple devices, it is important to call clFlush for each command-

queue before calling clFinish, so that the commands are flushed and execute

in parallel on the devices. Otherwise, the first call to clFinish blocks, the

commands on the other devices are not flushed, and the devices appear to

execute serially rather than in parallel.

For low-latency CPU response, it can be more efficient to use a dedicated spin

loop and not call clFinish(). Calling clFinish() indicates that the application

wants to wait for the GPU, putting the thread to sleep. For low latency, the

application should use clFlush(), followed by a loop to wait for the event to

complete. This is also true for blocking maps. The application should use non-

blocking maps followed by a loop waiting on the event. The following provides

sample code for this.

if (sleep)
{
    // This puts the host thread to sleep; useful if power is a consideration
    // or overhead is not a concern.
    clFinish(cmd_queue_);
}
else
{
    // This keeps the host thread awake; useful if latency is a concern.
    clFlush(cmd_queue_);
    error_ = clGetEventInfo(event, CL_EVENT_COMMAND_EXECUTION_STATUS,
                            sizeof(cl_int), &eventStatus, NULL);
    while (eventStatus > 0)
    {
        error_ = clGetEventInfo(event, CL_EVENT_COMMAND_EXECUTION_STATUS,
                                sizeof(cl_int), &eventStatus, NULL);
        // Be nice to other threads; allow the scheduler to find other work
        // if possible. Choose your favorite way to yield (SwitchToThread(),
        // for example) in place of Sleep(0).
        Sleep(0);
    }
}

4.7.5 GPU and CPU Kernels

While OpenCL provides functional portability so that the same kernel can run on

any device, peak performance for each device is typically obtained by tuning the

OpenCL kernel for the target device.

Code optimized for the Tahiti device (the AMD Radeon™ HD 7970 GPU) typically

runs well across other members of the Southern Islands family.

CPUs and GPUs have very different performance characteristics, and some of

these impact how one writes an optimal kernel. Notable differences include:

• The Vector ALU floating point resources in a CPU (SSE/AVX) require the use of vectorized types (such as float4) to enable packed SSE code generation and extract good performance from the Vector ALU hardware. The GPU Vector ALU hardware is more flexible and can efficiently use the floating-point hardware; however, code that can use float4 often generates high-quality code for both the CPU and the AMD GPUs.

• The AMD OpenCL CPU implementation runs work-items from the same

work-group back-to-back on the same physical CPU core. For optimally

coalesced memory patterns, a common access pattern for GPU-optimized

algorithms is for work-items in the same wavefront to access memory

locations from the same cache line. On a GPU, these work-items execute in

parallel and generate a coalesced access pattern. On a CPU, the first work-

item runs to completion (or until hitting a barrier) before switching to the next.

Generally, if the working set for the data used by a work-group fits in the CPU

caches, this access pattern can work efficiently: the first work-item brings a

line into the cache hierarchy, which the other work-items later hit. For large

working-sets that exceed the capacity of the cache hierarchy, this access

pattern does not work as efficiently; each work-item refetches cache lines

that were already brought in by earlier work-items but were evicted from the

cache hierarchy before being used. Note that AMD CPUs typically provide

512 kB to 2 MB of L2+L3 cache for each compute unit.

• CPUs do not contain any hardware resources specifically designed to

accelerate local memory accesses. On a CPU, local memory is mapped to

the same cacheable DRAM used for global memory, and there is no

performance benefit from using the __local qualifier. The additional memory

operations to write to LDS, and the associated barrier operations can reduce

performance. One notable exception is when local memory is used to pack

values to avoid non-coalesced memory patterns.

• CPU devices only support a small number of hardware threads, typically two to eight. Keeping the number of active work-groups small reduces the CPU switching overhead, although for larger kernels this is a second-order effect.

For a balanced solution that runs reasonably well on both devices, developers

are encouraged to write the algorithm using float4 vectorization. The GPU is

more sensitive to algorithm tuning; it also has higher peak performance potential.

Thus, one strategy is to target optimizations to the GPU and aim for reasonable

performance on the CPU. For peak performance on all devices, developers can

choose to use conditional compilation for key code loops in the kernel, or in some

cases even provide two separate kernels. Even with device-specific kernel

optimizations, the surrounding host code for allocating memory, launching

kernels, and interfacing with the rest of the program generally only needs to be

written once.

Another approach is to leverage a CPU-targeted routine written in a standard

high-level language, such as C++. In some cases, this code path may already

exist for platforms that do not support an OpenCL device. The program uses

OpenCL for GPU devices, and the standard routine for CPU devices. Load-

balancing between devices can still leverage the techniques described in

Section 4.7.3, “Partitioning Work for Multiple Devices,” page 4-32.

4.7.6 Contexts and Devices

The AMD OpenCL program creates at least one context, and each context can

contain multiple devices. Thus, developers must choose whether to place all

devices in the same context or create a new context for each device. Generally,

it is easier to extend a context to support additional devices rather than

duplicating the context for each device: buffers are allocated at the context level

(and automatically across all devices), programs are associated with the context,

and kernel compilation (via clBuildProgram) can easily be done for all devices

in a context. However, with current OpenCL implementations, creating a separate

context for each device provides more flexibility, especially in that buffer

allocations can be targeted to occur on specific devices. Generally, placing the

devices in the same context is the preferred solution.

Chapter 5

OpenCL Performance and Optimization for Southern Islands Devices

This chapter discusses performance and optimization when programming for

AMD Accelerated Parallel Processing GPU compute devices that are part of the

Southern Islands family, as well as CPUs and multiple devices. Details specific

to the Evergreen and Northern Islands families of GPUs are provided in

Chapter 6, “OpenCL Performance and Optimization for Evergreen and Northern

Islands Devices.”

5.1 Global Memory Optimization

Figure 5.1 is a block diagram of the GPU memory system. The up arrows are

read paths, the down arrows are write paths. WC is the write combine cache.

Figure 5.1 Memory System

[Block diagram: compute units, each with 16 processing elements (pe), an LDS, and an L1 cache, connect through the Compute Unit <> Memory Channel Xbar to the L2 caches and to memory channels 0 through n-1.]

The GPU consists of multiple compute units. Each compute unit contains local

(on-chip) memory, L1 cache, registers, and 16 processing elements (PEs).

Individual work-items execute on a single processing element; one or more work-

groups execute on a single compute unit. On a GPU, hardware schedules groups

of work-items, called wavefronts, onto compute units; thus, work-items within a

wavefront execute in lock-step; the same instruction is executed on different

data.

Each compute unit contains 64 kB local memory, 16 kB of read/write L1 cache,

four vector units, and one scalar unit. The maximum local memory allocation is

32 kB per work-group. Each vector unit contains 512 scalar registers (SGPRs)

for handling branching, constants, and other data constant across a wavefront.

Vector units also contain 256 vector registers (VGPRs). VGPRs actually are

scalar registers, but they are replicated across the whole wavefront. Vector units

contain 16 processing elements (PEs). Each PE is scalar.

Since the L1 cache is 16 kB per compute unit, the total L1 cache size is

16 kB * (# of compute units). For the AMD Radeon™ HD 7970, this means a total

of 512 kB L1 cache. L1 bandwidth can be computed as:

L1 peak bandwidth = Compute Units * (4 threads/clock) * (128 bits per thread) *

(1 byte / 8 bits) * Engine Clock

For the AMD Radeon™ HD 7970, this is ~1.9 TB/s.
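As a worked check of the formula above, the following C sketch evaluates it directly. The function name `l1_peak_bandwidth` is hypothetical. The 32 compute units follow from the 512 kB total and 16 kB per compute unit stated above; the 925 MHz engine clock used in the usage note is an assumption, so check Appendix D, “Device Parameters,” for the value that applies to your device.

```c
/* L1 peak bandwidth in bytes per second:
 *   Compute Units * (4 threads/clock) * (128 bits per thread)
 *                 * (1 byte / 8 bits) * Engine Clock               */
double l1_peak_bandwidth(int compute_units, double engine_clock_hz)
{
    const double threads_per_clock = 4.0;
    const double bits_per_thread   = 128.0;
    const double bytes_per_bit     = 1.0 / 8.0;
    return compute_units * threads_per_clock * bits_per_thread
           * bytes_per_bit * engine_clock_hz;
}
```

Under the assumed clock, `l1_peak_bandwidth(32, 925.0e6)` evaluates to 32 * 64 bytes per clock * 925 MHz, which is about 1.89e12 bytes/s and is consistent with the ~1.9 TB/s figure quoted above.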

The peak memory bandwidth of your device is available in Appendix D, “Device

Parameters.”

If two memory access requests are directed to the same controller, the hardware

serializes the access. This is called a channel conflict. Similarly, if two memory

access requests go to the same memory bank, hardware serializes the access.

This is called a bank conflict. From a developer’s point of view, there is not much

difference between channel and bank conflicts. Often, a large power of two stride

results in a channel conflict. The size of the power of two stride that causes a

specific type of conflict depends on the chip. A stride that results in a channel

conflict on a machine with eight channels might result in a bank conflict on a

machine with four.

In this document, the term bank conflict is used to refer to either kind of conflict.

5.1.1 Channel Conflicts

The important concept is memory stride: the increment in memory address,

measured in elements, between successive elements fetched or stored by

consecutive work-items in a kernel. Many important kernels do not exclusively

use simple stride one accessing patterns; instead, they feature large non-unit

strides. For instance, many codes perform similar operations on each dimension

of a two- or three-dimensional array. Performing computations on the low

dimension can often be done with unit stride, but the strides of the computations

in the other dimensions are typically large values. This can result in significantly

degraded performance when the codes are ported unchanged to GPU systems.

A CPU with caches presents the same problem: large power-of-two strides force

data into only a few cache lines.

One solution is to rewrite the code to employ array transpositions between the

kernels. This allows all computations to be done at unit stride. Ensure that the

time required for the transposition is relatively small compared to the time to

perform the kernel calculation.

For many kernels, the reduction in performance is sufficiently large that it is

worthwhile to try to understand and solve this problem.

In GPU programming, it is best to have adjacent work-items read or write

adjacent memory addresses. This is one way to avoid channel conflicts.

When the application has complete control of the access pattern and address

generation, the developer must arrange the data structures to minimize bank

conflicts. Accesses that differ in the lower bits can run in parallel; those that differ

only in the upper bits can be serialized.

In this example:

    for (ptr = base; ptr < max; ptr += 16 * 1024)  /* 16 kB stride; ptr is a byte pointer */
        R0 = *ptr;

where the lower bits are all the same, the memory requests all access the same

bank on the same channel and are processed serially.

This is a low-performance pattern to be avoided. When the stride is a power of

2 (and larger than the channel interleave), the loop above only accesses one

channel of memory.

The hardware byte address bits are:

    | 31:x | bank | channel | 7:0 address |

• On all AMD Radeon™ HD 79XX-series GPUs, there are 12 channels. A crossbar distributes the load to the appropriate memory channel. Each memory channel has a read/write global L2 cache, with 64 kB per channel. The cache line size is 64 bytes.

Because 12 is not a power of two, the power-of-two channel and bank addressing does not apply directly, so the mapping is not straightforward for the AMD Radeon™ HD 79XX series. The memory channels are grouped in four quadrants, each of which consists of three channels. Bits 8, 9, and 10 of the address select a “virtual pipe.” The top two bits of this pipe select the quadrant; then, the channel within the quadrant is selected using the low bit of the pipe and the row and bank address modulo three, according to the following conditional equation.

    if (({row, bank} % 3) == 1)
        channel_within_quadrant = 1
    else
        channel_within_quadrant = 2 * pipe[0]

Figure 5.2 illustrates the memory channel mapping.

Figure 5.2 Channel Remapping/Interleaving

Note that an increase of the address by 2048 results in a 1/3 probability that the

same channel is hit; increasing the address by 256 results in a 1/6 probability

that the same channel is hit, etc.

• On all AMD Radeon™ HD 77XX- and 78XX-series GPUs, the lower eight bits select an element within a channel.

• The next set of bits selects the channel. The number of channel bits varies, since the number of channels is not the same on all parts. With eight channels, three bits are used to select the channel; with two channels, a single bit is used.

• The next set of bits selects the memory bank. The number of bits used depends on the number of memory banks.

• The remaining bits are the rest of the address.

On AMD Radeon™ HD 78XX GPUs, the channel is selected by bits 10:8 of the

byte address. For the AMD Radeon™ HD 77XX, the channel is selected by bits

9:8 of the byte address. This means a linear burst switches channels every 256

bytes. Since the wavefront size is 64, channel conflicts are avoided if each work-

item in a wave reads a different address from a 64-word region. All AMD

Radeon™ HD 7XXX series GPUs have the same layout: channel ends at bit 8,

and the memory bank is to the left of the channel.

For AMD Radeon™ HD 77XX and 78XX GPUs, a burst of 2 kB (# of channels *

256 bytes) cycles through all the channels.

For AMD Radeon™ HD 77XX and 78XX GPUs, when calculating an address as

y*width+x, but reading a burst on a column (incrementing y), only one memory

channel of the system is used, since the width is likely a multiple of 256 words

= 2048 bytes. If the width is an odd multiple of 256B, then it cycles through all

channels.
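The effect of the stride on channel usage for the HD 77XX and 78XX layout described above can be modeled in a few lines of C. This is a simplified sketch: `channel_of` and `channels_touched` are hypothetical helper names, the model only looks at the channel bits that start at bit 8 of the byte address, and it ignores memory banks and the HD 79XX remapping.

```c
#include <stdint.h>

/* Channel index for a byte address when the channel field starts at bit 8.
 * channel_bits is 3 for bits 10:8 and 2 for bits 9:8. */
unsigned channel_of(uint64_t byte_addr, unsigned channel_bits)
{
    return (unsigned)((byte_addr >> 8) & ((1u << channel_bits) - 1u));
}

/* Number of distinct channels touched by `steps` accesses that start at
 * address 0 and advance by stride_bytes each time. */
unsigned channels_touched(uint64_t stride_bytes, unsigned steps, unsigned channel_bits)
{
    unsigned seen = 0;    /* bit c is set once channel c has been accessed */
    unsigned count = 0;
    for (unsigned i = 0; i < steps; ++i)
        seen |= 1u << channel_of((uint64_t)i * stride_bytes, channel_bits);
    for (; seen != 0; seen >>= 1)
        count += seen & 1u;
    return count;
}
```

With three channel bits, eight accesses at a 2048-byte stride all fall on one channel, while a stride of 2304 bytes (nine times 256, an odd multiple) touches all eight channels. This mirrors the column-walk case described above.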

If every work-item in a work-group references consecutive memory addresses

and the address of work-item 0 is aligned to 256 bytes and each work-item

fetches 32 bits, the entire wavefront accesses one channel. Although this seems

slow, it actually is a fast pattern because it is necessary to consider the memory

access over the entire device, not just a single wavefront.

One or more work-groups execute on each compute unit. On the AMD Radeon™

HD 7000-series GPUs, work-groups are dispatched in a linear order, with x

changing most rapidly. For a single dimension, this is:

DispatchOrder = get_group_id(0)

For two dimensions, this is:

DispatchOrder = get_group_id(0) + get_group_id(1) * get_num_groups(0)

This is row-major-ordering of the blocks in the index space. Once all compute

units are in use, additional work-groups are assigned to compute units as

needed. Work-groups retire in order, so active work-groups are contiguous.
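The two-dimensional dispatch-order formula above, and its inverse, can be written as plain C helpers. The function names are hypothetical; inside a kernel the inputs would come from get_group_id() and get_num_groups().

```c
#include <stddef.h>

/* Linear dispatch order of work-group (group_x, group_y) when x changes
 * most rapidly:
 *   DispatchOrder = get_group_id(0) + get_group_id(1) * get_num_groups(0) */
size_t dispatch_order(size_t group_x, size_t group_y, size_t num_groups_x)
{
    return group_x + group_y * num_groups_x;
}

/* Inverse mapping: recover the 2D group coordinates from the linear order. */
void group_from_dispatch_order(size_t order, size_t num_groups_x,
                               size_t *group_x, size_t *group_y)
{
    *group_x = order % num_groups_x;
    *group_y = order / num_groups_x;
}
```

For example, with eight work-groups per row, work-group (3, 2) has dispatch order 3 + 2 * 8 = 19, so it is dispatched after all of rows 0 and 1 and after the first three groups of row 2.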

At any time, each compute unit is executing an instruction from a single

wavefront. In memory intensive kernels, it is likely that the instruction is a

memory access. Since there are 12 channels on the AMD Radeon™ HD 7970

GPU, at most 12 of the compute units can issue a memory access operation in

one cycle. It is most efficient if the accesses from 12 wavefronts go to different

channels. One way to achieve this is for each wavefront to access consecutive

groups of 256 = 64 * 4 bytes.

An inefficient access pattern is if each wavefront accesses all the channels. This

is likely to happen if consecutive work-items access data with a large

power-of-two stride.

In the next example of a kernel for copying, the input and output buffers are

interpreted as though they were 2D, and the work-group size is organized as 2D.

The kernel code is:

#define WIDTH 1024
#define DATA_TYPE float
#define A(y, x) A[(y) * WIDTH + (x)]
#define C(y, x) C[(y) * WIDTH + (x)]

kernel void copy_float(__global const DATA_TYPE *A,
                       __global DATA_TYPE *C)
{
    int idx = get_global_id(0);
    int idy = get_global_id(1);
    C(idy, idx) = A(idy, idx);
}

By changing the width, the data type and the work-group dimensions, we get a

set of kernels out of this code.

Given a 64x1 work-group size, consecutive work-items read consecutive 32-bit addresses. Given a 1x64 work-group size, consecutive work-items read values separated by the row width, which is a power-of-two number of bytes.

To avoid power-of-two strides:

• Add an extra column to the data matrix.

• Change the work-group size so that it is not a power of 2.¹

• It is best to use a width that causes a rotation through all of the memory channels, instead of using the same one repeatedly.

• Change the kernel to access the matrix with a staggered offset.

5.1.1.1 Staggered Offsets

Staggered offsets apply a coordinate transformation to the kernel so that the data

is processed in a different order. Unlike adding a column, this technique does not

use extra space. It is also relatively simple to add to existing code.

Figure 5.3 illustrates the transformation to staggered offsets.

1. Generally, it is not a good idea to make the work-group size something other than an integer multiple

of the wavefront size, but that usually is less important than avoiding channel conflicts.

Figure 5.3 Transformation to Staggered Offsets

The global ID values reflect the order that the hardware initiates work-groups.

The values of get_group_id are in ascending launch order.

global_id(0) = get_group_id(0) * get_local_size(0) + get_local_id(0)

global_id(1) = get_group_id(1) * get_local_size(1) + get_local_id(1)

The hardware launch order is fixed, but it is possible to change the order in which

the work-groups process the data, as shown in the following example.

Assume a work-group size of k x k, where k is a power of two, and a large 2D

matrix of size 2^n x 2^m in row-major order. If each work-group must process a

block in column-order, the launch order does not work out correctly: consecutive

work-groups execute down the columns, and the columns are a large power-of-

two apart; so, consecutive work-groups access the same channel.

By introducing a transformation, it is possible to stagger the work-groups to avoid

channel conflicts. Since we are executing 2D work-groups, each work group is

identified by four numbers.

1. get_group_id(0) - the x coordinate or the block within the column of the

matrix.

2. get_group_id(1) - the y coordinate or the block within the row of the matrix.

3. get_global_id(0) - the x coordinate or the column of the matrix.

4. get_global_id(1) - the y coordinate or the row of the matrix.

[Figure 5.3: a grid of work-groups (0,0), (1,0), (2,0), each of size k by k, laid over a matrix in row-major order. In linear format, each group is a power of two apart. After the transform (offset format), each group is not a power of two apart; the offsets are labeled K + 2N and 2K + 2N.]

To transform the code, add the following four lines to the top of the kernel.

get_group_id_0 = get_group_id(0);

get_group_id_1 = (get_group_id(0) + get_group_id(1)) % get_local_size(0);

get_global_id_0 = get_group_id_0 * get_local_size(0) + get_local_id(0);

get_global_id_1 = get_group_id_1 * get_local_size(1) + get_local_id(1);

Then, change the global IDs and group IDs to the staggered form. The result is:

__kernel void
copy_float (
    __global const DATA_TYPE * A,
    __global DATA_TYPE * C)
{
    size_t get_group_id_0 = get_group_id(0);
    size_t get_group_id_1 = (get_group_id(0) + get_group_id(1)) %
                            get_local_size(0);
    size_t get_global_id_0 = get_group_id_0 * get_local_size(0) +
                             get_local_id(0);
    size_t get_global_id_1 = get_group_id_1 * get_local_size(1) +
                             get_local_id(1);
    int idx = get_global_id_0; // changed to staggered form
    int idy = get_global_id_1; // changed to staggered form
    C(idy , idx) = A( idy , idx);
}
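The index arithmetic of the transformation can also be checked on the host. The following is a minimal sketch in plain C, not kernel code; the function names and the modulus parameter are hypothetical and are used only for illustration. The kernel listing above passes get_local_size(0) as the modulus.

```c
#include <assert.h>
#include <stddef.h>

/* Staggered group ID in dimension 1: rotate the dimension-1 group ID by the
   dimension-0 group ID. `modulus` is a free parameter of this sketch; the
   kernel listing in the guide uses get_local_size(0) here. */
size_t staggered_group_id_1(size_t group_id_0, size_t group_id_1, size_t modulus)
{
    assert(modulus > 0);
    return (group_id_0 + group_id_1) % modulus;
}

/* Global ID rebuilt from a (possibly staggered) group ID:
   global_id = group_id * local_size + local_id. */
size_t global_id_from_group(size_t group_id, size_t local_size, size_t local_id)
{
    return group_id * local_size + local_id;
}
```

With a modulus of 4, the groups (0,0), (1,0), (2,0) map to staggered dimension-1 IDs 0, 1, 2, so consecutive groups no longer land on the same row of blocks.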

5.1.1.2 Reads Of The Same Address

Under certain conditions, reading from the same address causes a channel

conflict, even on the FastPath. This case is easy to overlook.

This does not happen on the read-only memories, such as constant buffers,

textures, or shader resource view (SRV); but it is possible on the read/write UAV

memory or OpenCL global memory.

From a hardware standpoint, reads from a fixed address have the same upper

bits, so they collide and are serialized. To read in a single value, read the value

in a single work-item, place it in local memory, and then use that location:

Avoid:

temp = input[3]; // if input is from global space

Use:

if (get_local_id(0) == 0) {
    localValue = input[3]; // localValue is declared __local
}
barrier(CLK_LOCAL_MEM_FENCE);
temp = localValue;

5.1.2 Coalesced Writes

Southern Islands devices do not support coalesced writes; however, contiguous

addresses within work-groups provide maximum performance.

Each compute unit accesses the memory system in quarter-wavefront units. The

compute unit transfers a 32-bit address and one element-sized piece of data for

each work-item. This results in a total of 16 elements + 16 addresses per

quarter-wavefront. On GCN-based devices, processing a quarter-wavefront requires

two cycles before the data is transferred to the memory controller.

5.1.3 Hardware Variations

For a listing of the AMD GPU hardware variations, see Appendix D, “Device

Parameters.” This appendix includes information on the number of memory

channels, compute units, and the L2 size per device.

5.2 Local Memory (LDS) Optimization

AMD Southern Islands GPUs include a Local Data Store (LDS) cache, which

accelerates local memory accesses. LDS provides high-bandwidth access (more

than 10X higher than global memory), efficient data transfers between work-items

in a work-group, and high-performance atomic support. LDS is much faster than

L1 cache access as it has twice the peak bandwidth and far lower latency.

Additionally, using LDS memory can reduce global memory bandwidth usage.

Local memory offers significant advantages when the data is re-used; for

example, subsequent accesses can read from local memory, thus reducing

global memory bandwidth. Another advantage is that local memory does not

require coalescing.

To determine local memory size:

clGetDeviceInfo( …, CL_DEVICE_LOCAL_MEM_SIZE, … );

All AMD Southern Islands GPUs contain a 64 kB LDS for each compute unit,

although only 32 kB can be allocated per work-group. The LDS contains 32

banks; each bank is four bytes wide and 256 bytes deep. The bank address is

determined by bits 6:2 in the address. Appendix D, “Device Parameters” shows

how many LDS banks are present on the different AMD Southern Islands devices.

As shown below, programmers must carefully control the bank bits to avoid bank

conflicts as much as possible. Bank conflicts are determined by what addresses

are accessed on each half wavefront boundary. Threads 0 through 31 are

checked for conflicts as are threads 32 through 63 within a wavefront.
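The bank mapping described above can be sketched on the host as follows. This is a simplified model in plain C, not a description of the actual hardware arbiter: it assumes each of the 32 work-items in a half-wavefront reads one 4-byte word at a fixed non-zero stride, so all addresses are distinct. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define LDS_NUM_BANKS 32u  /* from the text: 32 banks, each four bytes wide */

/* Bank index taken from address bits 6:2. */
unsigned lds_bank(uint32_t byte_address)
{
    return (byte_address >> 2) & (LDS_NUM_BANKS - 1u);
}

/* Largest number of accesses that fall on a single bank when 32 work-items
   (one half-wavefront) each read a 4-byte word at base + lane * stride_bytes.
   A result of 1 means the pattern is conflict-free in this model. */
unsigned max_bank_conflict_degree(uint32_t base, uint32_t stride_bytes)
{
    unsigned count[LDS_NUM_BANKS] = {0};
    unsigned worst = 0;
    assert(stride_bytes != 0 && stride_bytes % 4u == 0);
    for (uint32_t lane = 0; lane < 32u; ++lane) {
        unsigned bank = lds_bank(base + lane * stride_bytes);
        if (++count[bank] > worst) {
            worst = count[bank];
        }
    }
    return worst;
}
```

In this model a stride of 128 bytes places every access in the same bank, while padding the stride to 132 bytes (33 words) spreads the same accesses over all 32 banks.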

In a single cycle, local memory can service a request for each bank (up to 32

accesses each cycle on the AMD Radeon™ HD 7970 GPU). For an AMD

Radeon™ HD 7970 GPU, this delivers a memory bandwidth of over 100 GB/s for

each compute unit, and more than 3.5 TB/s for the whole chip. This is more than

14X the global memory bandwidth. However, accesses that map to the same

bank are serialized and serviced on consecutive cycles. LDS operations do not

stall; however, the compiler inserts wait operations prior to issuing operations that

depend on the results. A wavefront that generated bank conflicts does not stall

implicitly, but may stall explicitly in the kernel if the compiler has inserted a wait

command for the outstanding memory access. The GPU reprocesses the

wavefront on subsequent cycles, enabling only the lanes receiving data, until all

the conflicting accesses complete. The bank with the most conflicting accesses

determines the latency for the wavefront to complete the local memory operation.

The worst case occurs when all 64 work-items map to the same bank, since each

access then is serviced at a rate of one per clock cycle; this case takes 64 cycles

to complete the local memory access for the wavefront. A program with a large

number of bank conflicts (as measured by the LDSBankConflict performance

counter in the AMD APP Profiler statistics) might benefit from using the constant

or image memory rather than LDS.

Thus, the key to effectively using the local cache memory is to control the access

pattern so that accesses generated on the same cycle map to different banks in

the local memory. One notable exception is that accesses to the same address

(even though they have the same bits 6:2) can be broadcast to all requestors

and do not generate a bank conflict. The LDS hardware examines the requests

generated over two cycles (32 work-items of execution) for bank conflicts.

Ensure, as much as possible, that the memory requests generated from a

quarter-wavefront avoid bank conflicts by using unique address bits 6:2. A simple

sequential address pattern, where each work-item reads a float2 value from LDS,

generates a conflict-free access pattern on the AMD Radeon™ HD 7XXX GPU.

Note that a sequential access pattern, where each work-item reads a float4 value

from LDS, uses only half the banks on each cycle on the AMD Radeon™ HD

7XXX GPU and delivers half the performance of the float access pattern.

Each stream processor can generate up to two 4-byte LDS requests per cycle.

Byte and short reads consume four bytes of LDS bandwidth. Developers can use

the large register file: each compute unit has 256 kB of register space available

(8X the LDS size) and can provide up to twelve 4-byte values/cycle (6X the LDS

bandwidth). Registers do not offer the same indexing flexibility as does the LDS,

but for some algorithms this can be overcome with loop unrolling and explicit

addressing.

LDS reads require one ALU operation to initiate them. Each operation can initiate

two loads of up to four bytes each.

The AMD APP Profiler provides the following performance counter to help

optimize local memory usage:

LDSBankConflict: The percentage of time accesses to the LDS are stalled

due to bank conflicts relative to GPU Time. In the ideal case, there are no

bank conflicts in the local memory access, and this number is zero.

Local memory is software-controlled “scratchpad” memory. In contrast, caches

typically used on CPUs monitor the access stream and automatically capture

recent accesses in a tagged cache. The scratchpad allows the kernel to explicitly

load items into the memory; they exist in local memory until the kernel replaces

them, or until the work-group ends. To declare a block of local memory, use the

__local keyword; for example:

__local float localBuffer[64];

These declarations can be either in the parameters to the kernel call or in the

body of the kernel. The __local syntax allocates a single block of memory,

which is shared across all work-items in the work-group.

To write data into local memory, write it into an array allocated with __local. For

example:

localBuffer[i] = 5.0;

A typical access pattern is for each work-item to collaboratively write to the local

memory: each work-item writes a subsection, and as the work-items execute in

parallel they write the entire array. Combined with proper consideration for the

access pattern and bank alignment, these collaborative write approaches can

lead to highly efficient memory access.

The following example is a simple kernel section that collaboratively writes, then

reads from, local memory:

__kernel void localMemoryExample (__global float *In, __global float *Out) {
    __local float localBuffer[64];
    uint tx = get_local_id(0);
    uint gx = get_global_id(0);

    // Initialize local memory:
    // Copy from this work-group's section of global memory to local:
    // Each work-item writes one element; together they write it all
    localBuffer[tx] = In[gx];

    // Ensure writes have completed:
    barrier(CLK_LOCAL_MEM_FENCE);

    // Toy computation of a partial product; shows re-use from local memory
    float f = localBuffer[tx];
    for (uint i=tx+1; i<64; i++) {
        f *= localBuffer[i];
    }
    Out[gx] = f;
}

Note that the host code cannot read from, or write to, local memory. Only the kernel

can access local memory.

Local memory is consistent across work-items only at a work-group barrier; thus,

before reading the values written collaboratively, the kernel must include a

barrier() instruction. An important optimization is the case where the local

work-group size is less than, or equal to, the wavefront size. Because the

wavefront executes as an atomic unit, the explicit barrier operation is not

required. The compiler automatically removes these barriers if the kernel

specifies a reqd_work_group_size (see section 5.8 of the OpenCL

Specification) that is less than the wavefront size. Developers are strongly

encouraged to include the barriers where appropriate, and rely on the compiler

to remove the barriers when possible, rather than manually removing the

barrier() calls. This technique results in more portable code, including the ability to

run kernels on CPU devices.

5.3 Constant Memory Optimization

Constants (data from read-only buffers shared by a wavefront) are loaded to

SGPRs from memory through the L1 (and L2) cache using scalar memory read

instructions. The scalar instructions can use up to two SGPR sources per cycle;

vector instructions can use one SGPR source per cycle. (There are 512 SGPRs

per SIMD, 4 SIMDs per CU; so a 32 CU configuration like Tahiti has 256 kB of

SGPRs.)

Southern Islands hardware supports specific inline literal constants. These

constants are “free” in that they do not increase code size:

integers 1 .. 64

integers -1 .. -16

single or double floats: 0.5, -0.5, 1.0, -1.0, 2.0, -2.0, 4.0, -4.0

Any other literal constant increases the code size by at least 32 bits.
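The list of inline constants can be expressed as a pair of predicates. This is a minimal sketch in plain C that encodes only the ranges and values listed above; the function names are hypothetical and the code is not part of any AMD tool.

```c
#include <assert.h>
#include <stdbool.h>

/* True if the integer literal is in one of the listed inline ranges
   (1 .. 64 or -1 .. -16). */
bool is_inline_int_constant(int value)
{
    return (value >= 1 && value <= 64) || (value >= -16 && value <= -1);
}

/* True if the floating-point literal is one of the listed inline values
   (plus or minus 0.5, 1.0, 2.0, 4.0). */
bool is_inline_float_constant(double value)
{
    static const double magnitudes[] = {0.5, 1.0, 2.0, 4.0};
    for (unsigned i = 0; i < sizeof magnitudes / sizeof magnitudes[0]; ++i) {
        if (value == magnitudes[i] || value == -magnitudes[i]) {
            return true;
        }
    }
    return false;
}
```

For example, the literals 64 and -2.0 are inline, whereas 65 and 3.0 each add at least 32 bits to the code size.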

The AMD implementation of OpenCL provides three levels of performance for the

“constant” memory type.

1. Simple Direct-Addressing Patterns

Very high bandwidth can be attained when the compiler has available the

constant address at compile time and can embed the constant address into

the instruction. Each processing element can load up to 4x4-byte direct-

addressed constant values each cycle. Typically, these cases are limited to

simple non-array constants and function parameters. The executing kernel

loads the constants into scalar registers and concurrently populates the

constant cache. The cache is a tagged cache; typically, each 8k block is

shared among four compute units. If the constant data is already present in

the constant cache, the load is serviced by the cache and does not require

any global memory bandwidth. The constant cache size for each device is

given in Appendix D, “Device Parameters”; it varies from 4k to 48k per GPU.

2. Same Index

Hardware acceleration also takes place when all work-items in a wavefront

reference the same constant address. In this case, the data is loaded from

memory one time, stored in the L1 cache, and then broadcast to all wavefronts.

This can significantly reduce the required memory bandwidth.

3. Varying Index

More sophisticated addressing patterns, including the case where each work-

item accesses different indices, are not hardware accelerated and deliver the

same performance as a global memory read with the potential for cache hits.

To further improve the performance of the AMD OpenCL stack, two methods

allow users to take advantage of hardware constant buffers. These are:

1. Globally scoped constant arrays. These arrays are initialized, globally

scoped, and in the constant address space (as specified in section 6.5.3 of

the OpenCL specification). If the size of an array is below 64 kB, it is placed

in hardware constant buffers; otherwise, it uses global memory. An example

of this is a lookup table for math functions.

2. Per-pointer attribute specifying the maximum pointer size. This is specified

using the max_constant_size(N) attribute. The attribute form conforms to

section 6.10 of the OpenCL 1.0 specification. This attribute is restricted to

top-level kernel function arguments in the constant address space. This

restriction prevents a pointer of one size from being passed as an argument

to a function that declares a different size. It informs the compiler that indices

into the pointer remain inside this range and it is safe to allocate a constant

buffer in hardware, if it fits. Using a constant pointer that goes outside of this

range results in undefined behavior. All allocations are aligned on the 16-byte

boundary. For example:

kernel void mykernel(global int* a,
                     constant int* b __attribute__((max_constant_size (65536)))
                     )
{
    size_t idx = get_global_id(0);
    a[idx] = b[idx & 0x3FFF];
}

A kernel that uses constant buffers must use CL_DEVICE_MAX_CONSTANT_ARGS to

query the device for the maximum number of constant buffers the kernel can

support. This value might differ from the maximum number of hardware constant

buffers available. In this case, if the number of hardware constant buffers is less

than the CL_DEVICE_MAX_CONSTANT_ARGS, the compiler allocates the largest

constant buffers in hardware first and allocates the rest of the constant buffers in

global memory. As an optimization, if a constant pointer A uses n bytes of

memory, where n is less than 64 kB, and constant pointer B uses m bytes of

memory, where m is less than (64 kB – n) bytes of memory, the compiler can

allocate the constant buffer pointers in a single hardware constant buffer. This

optimization can be applied recursively by treating the resulting allocation as a

single allocation and finding the next smallest constant pointer that fits within the

space left in the constant buffer.
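The packing rule can be illustrated with a small greedy routine. This is only a sketch under stated assumptions, not the compiler's actual algorithm: it visits the pointers in the order given, rounds each size up to the 16-byte alignment mentioned earlier, and uses the strict "less than" test from the text. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define HW_CONSTANT_BUFFER_BYTES (64u * 1024u) /* 64 kB, from the text */
#define CONSTANT_ALIGNMENT 16u                 /* 16-byte alignment, from the text */

/* Round a size up to the next 16-byte boundary. */
size_t align_up_16(size_t bytes)
{
    return (bytes + CONSTANT_ALIGNMENT - 1u) & ~(size_t)(CONSTANT_ALIGNMENT - 1u);
}

/* Greedy packing into one hardware constant buffer: add each pointer whose
   aligned size is less than the space still left. Returns how many pointers
   were placed and stores the bytes consumed in *bytes_used. */
size_t pack_constant_pointers(const size_t *sizes, size_t count, size_t *bytes_used)
{
    size_t used = 0;
    size_t placed = 0;
    assert(sizes != NULL && bytes_used != NULL);
    for (size_t i = 0; i < count; ++i) {
        size_t need = align_up_16(sizes[i]);
        if (need < HW_CONSTANT_BUFFER_BYTES - used) { /* strict "<", as in the text */
            used += need;
            ++placed;
        }
    }
    *bytes_used = used;
    return placed;
}
```

With pointer sizes of 40000, 20000, 10000, and 4000 bytes, the sketch places the first two, skips the third because only 5536 bytes remain, and then places the fourth.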

5.4 OpenCL Memory Resources: Capacity and Performance

Table 5.1 summarizes the hardware capacity and associated performance for the

structures associated with the five OpenCL Memory Types. This information is

specific to the AMD Radeon™ HD 7970 GPUs with 3 GB video memory. See

Appendix D, “Device Parameters” for more details about other GPUs.

Table 5.1 Hardware Performance Parameters

The compiler tries to map private memory allocations to the pool of GPRs in the

GPU. In the event GPRs are not available, private memory is mapped to the

“scratch” region, which has the same performance as global memory.

Section 5.6.2, “Resource Limits on Active Wavefronts,” page 5-17, has more

information on register allocation and identifying when the compiler uses the

scratch region. GPRs provide the highest-bandwidth access of any hardware

resource. In addition to reading up to 12 bytes/cycle per processing element from

the register file, the hardware can access results produced in the previous cycle

without consuming any register file bandwidth.

Same-indexed constants can be cached in the L1 and L2 cache. Note that

“same-indexed” refers to the case where all work-items in the wavefront

reference the same constant index on the same cycle. The performance shown

assumes an L1 cache hit.

Varying-indexed constants, which are cached only in L2, use the same path as

global memory access and are subject to the same bank and alignment

constraints described in Section 5.1, “Global Memory Optimization,” page 5-1.

The L1 and L2 read/write caches are constantly enabled. As of SDK 2.4, read

only buffers can be cached in L1 and L2.

The L1 cache can service up to four address requests per cycle, each delivering

up to 16 bytes. The bandwidth shown assumes an access size of 16 bytes;

smaller access sizes/requests result in a lower peak bandwidth for the L1 cache.

Using float4 with images increases the request size and can deliver higher L1

cache bandwidth.

OpenCL Memory Type | Hardware Resource         | Size/CU | Size/GPU | Peak Read Bandwidth/Stream Core
Private            | GPRs                      | 256k    | 8192k    | 12 bytes/cycle
Local              | LDS                       | 64k     | 2048k    | 8 bytes/cycle
Constant           | Direct-addressed constant |         | 48k      | 4 bytes/cycle
                   | Same-indexed constant     |         |          | 4 bytes/cycle
                   | Varying-indexed constant  |         |          | ~0.14 bytes/cycle
Images             | L1 Cache                  | 16k     | 512k (1) | 4 bytes/cycle
                   | L2 Cache                  |         | 768k (2) | ~0.4 bytes/cycle
Global             | Global Memory             |         | 3G       | ~0.14 bytes/cycle

(1) Applies to images and buffers.

(2) Applies to images and buffers.

Each memory channel on the GPU contains an L2 cache that can deliver up to

64 bytes/cycle. The AMD Radeon™ HD 7970 GPU has 12 memory channels;

thus, it can deliver up to 768 bytes/cycle; divided among 2048 stream cores, this

provides up to ~0.4 bytes/cycle for each stream core.

Global Memory bandwidth is limited by external pins, not internal bus bandwidth.

The AMD Radeon™ HD 7970 GPU supports up to 264 GB/s of memory

bandwidth, which is an average of 0.14 bytes/cycle for each stream core.
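The two per-stream-core figures can be reproduced with a short calculation. The sketch below is plain C with hypothetical function names. The channel count, bytes per channel, stream-core count, and 264 GB/s figure come from the text; the 0.925 GHz engine clock used in the usage note is an assumption not stated in this section.

```c
#include <assert.h>

/* L2 bandwidth per stream core, in bytes/cycle:
   (channels * bytes per channel per cycle) / stream cores. */
double l2_bytes_per_cycle_per_core(unsigned channels,
                                   unsigned bytes_per_channel,
                                   unsigned stream_cores)
{
    assert(stream_cores > 0);
    return (double)(channels * bytes_per_channel) / stream_cores;
}

/* Global memory bandwidth per stream core, in bytes/cycle:
   GB/s divided by the engine clock in GHz gives bytes/cycle for the chip,
   which is then shared among the stream cores. */
double global_bytes_per_cycle_per_core(double bandwidth_gb_s,
                                       double engine_clock_ghz,
                                       unsigned stream_cores)
{
    assert(engine_clock_ghz > 0.0 && stream_cores > 0);
    return bandwidth_gb_s / engine_clock_ghz / stream_cores;
}
```

Calling l2_bytes_per_cycle_per_core(12, 64, 2048) gives 0.375, which rounds to the ~0.4 bytes/cycle shown in Table 5.1. Calling global_bytes_per_cycle_per_core(264.0, 0.925, 2048) gives approximately 0.139, matching the ~0.14 bytes/cycle entry under the assumed clock.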

Note that Table 5.1 shows the performance for the AMD Radeon™ HD 7970

GPU. The “Size/Compute Unit” column and many of the bandwidths/processing

element apply to all Southern Islands-class GPUs; however, the “Size/GPU”

column and the bandwidths for varying-indexed constant, L2, and global memory

vary across different GPU devices. The resource capacities and peak bandwidth

for other AMD GPU devices can be found in Appendix D, “Device Parameters.”

5.5 Using LDS or L1 Cache

There are a number of considerations when deciding between LDS and L1 cache

for a given algorithm.

LDS supports read/modify/write operations, as well as atomics. It is well-suited

for code that requires fast read/write, read/modify/write, or scatter operations that

otherwise are directed to global memory. On current AMD hardware, L1 is part

of the read path; hence, it is suited to cache-read-sensitive algorithms, such as

matrix multiplication or convolution.

LDS is typically larger than L1 (for example: 64 kB vs 16 kB on Southern Islands

devices). If it is not possible to obtain a high L1 cache hit rate for an algorithm,

the larger LDS size can help. On the AMD Radeon™ HD 7970 device, the

theoretical LDS peak bandwidth is 3.8 TB/s, compared to L1 at 1.9 TB/s.
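These two peak figures follow from the per-cycle widths given earlier in this chapter. The sketch below is plain C with a hypothetical function name. It takes 128 bytes/cycle per compute unit for LDS (32 banks of four bytes) and 64 bytes/cycle for L1 (four requests of 16 bytes), with 32 compute units; the 0.925 GHz engine clock is an assumption not stated in this section.

```c
#include <assert.h>

/* Peak bandwidth in GB/s for a per-compute-unit resource:
   bytes/cycle per CU * number of CUs * engine clock in GHz. */
double peak_bandwidth_gb_s(unsigned bytes_per_cycle_per_cu,
                           unsigned compute_units,
                           double engine_clock_ghz)
{
    assert(engine_clock_ghz > 0.0);
    return (double)(bytes_per_cycle_per_cu * compute_units) * engine_clock_ghz;
}
```

Under these assumptions, peak_bandwidth_gb_s(128, 32, 0.925) is about 3789 GB/s for LDS and peak_bandwidth_gb_s(64, 32, 0.925) is about 1894 GB/s for L1, which agree with the 3.8 TB/s and 1.9 TB/s quoted above.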

The native data type for L1 is a four-vector of 32-bit words. On L1, fill and read

addressing are linked. It is important that L1 is initially filled from global memory

with a coalesced access pattern; once filled, random accesses come at no extra

processing cost.

Currently, the native format of LDS is a 32-bit word. The theoretical LDS peak

bandwidth is achieved when each thread operates on a two-vector of 32-bit

words (16 threads per clock operate on 32 banks). If an algorithm requires

coalesced 32-bit quantities, it maps well to LDS. The use of four-vectors or larger

can lead to bank conflicts, although the compiler can mitigate some of these.

From an application point of view, filling LDS from global memory, and reading

from it, are independent operations that can use independent addressing. Thus,

LDS can be used to explicitly convert a scattered access pattern to a coalesced

pattern for read and write to global memory. Or, by taking advantage of the LDS

read broadcast feature, LDS can be filled with a coalesced pattern from global

memory, followed by all threads iterating through the same LDS words

simultaneously.

LDS reuses the data already pulled into cache by other wavefronts. Sharing

across work-groups is not possible because OpenCL does not guarantee that

LDS is in a particular state at the beginning of work-group execution. L1 content,

on the other hand, is independent of work-group execution, so that successive

work-groups can share the content in the L1 cache of a given Vector ALU.

However, it currently is not possible to explicitly control L1 sharing across work-

groups.

The use of LDS is linked to GPR usage and wavefront-per-Vector ALU count.

Better sharing efficiency requires a larger work-group, so that more work-items

share the same LDS. Compiling kernels for larger work-groups typically results

in increased register use, so that fewer wavefronts can be scheduled

simultaneously per Vector ALU. This, in turn, reduces memory latency hiding.

Requesting larger amounts of LDS per work-group results in fewer wavefronts

per Vector ALU, with the same effect.

LDS typically involves the use of barriers, with a potential performance impact.

This is true even for read-only use cases, as LDS must be explicitly filled in from

global memory (after which a barrier is required before reads can commence).

5.6 NDRange and Execution Range Optimization

Probably the most effective way to exploit the potential performance of the GPU

is to provide enough threads to keep the device completely busy. The

programmer specifies a three-dimensional NDRange over which to execute the

kernel; bigger problems with larger NDRanges certainly help to more effectively

use the machine. The programmer also controls how the global NDRange is

divided into local ranges, as well as how much work is done in each work-item,

and which resources (registers and local memory) are used by the kernel. All of

these can play a role in how the work is balanced across the machine and how

well it is used. This section introduces the concept of latency hiding, how many

wavefronts are required to hide latency on AMD GPUs, how the resource usage

in the kernel can impact the active wavefronts, and how to choose appropriate

global and local work-group dimensions.

5.6.1 Hiding ALU and Memory Latency

The read-after-write latency for most arithmetic operations (a floating-point add,

for example) is only four cycles. For most Southern Islands devices, each CU can

execute 64 vector ALU instructions per cycle, 16 per wavefront. Also, a wavefront

can issue a scalar ALU instruction every four cycles. To achieve peak ALU

power, a minimum of four wavefronts must be scheduled for each CU.

Global memory reads generate a reference to the off-chip memory and

experience a latency of 300 to 600 cycles. The wavefront that generates the

global memory access is made idle until the memory request completes. During

this time, the compute unit can process other independent wavefronts, if they are

available.

Kernel execution time also plays a role in hiding memory latency: longer chains

of ALU instructions keep the functional units busy and effectively hide more

latency. To better understand this concept, consider a global memory access

which takes 400 cycles to execute. Assume the compute unit contains many

other wavefronts, each of which performs five ALU instructions before generating

another global memory reference. As discussed previously, the hardware

executes each instruction in the wavefront in four cycles; thus, all five instructions

occupy the ALU for 20 cycles. Note the compute unit interleaves two of these

wavefronts and executes the five instructions from both wavefronts (10 total

instructions) in 40 cycles. To fully hide the 400 cycles of latency, the compute

unit requires (400/40) = 10 pairs of wavefronts, or 20 total wavefronts. If the

wavefront contains 10 instructions rather than 5, the wavefront pair would

consume 80 cycles of latency, and only 10 wavefronts would be required to hide

the 400 cycles of latency.
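The arithmetic in this example can be written as a small helper. This is a sketch in plain C with a hypothetical function name; it assumes, as the example does, four cycles per ALU instruction and two interleaved wavefronts, and it rounds up when the latency is not an exact multiple.

```c
#include <assert.h>

/* Number of wavefronts needed to cover `latency_cycles` when each wavefront
   issues `alu_instructions` ALU instructions between memory references. */
unsigned wavefronts_to_hide_latency(unsigned latency_cycles,
                                    unsigned alu_instructions)
{
    const unsigned cycles_per_instruction = 4; /* from the text */
    const unsigned interleaved = 2;            /* wavefronts per pair */
    unsigned cycles_per_pair;
    unsigned pairs;
    assert(alu_instructions > 0);
    cycles_per_pair = alu_instructions * cycles_per_instruction * interleaved;
    pairs = (latency_cycles + cycles_per_pair - 1) / cycles_per_pair; /* round up */
    return pairs * interleaved;
}
```

wavefronts_to_hide_latency(400, 5) returns 20 and wavefronts_to_hide_latency(400, 10) returns 10, matching the two cases worked through above.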

Generally, it is not possible to predict how the compute unit schedules the

available wavefronts, and thus it is not useful to try to predict exactly which ALU

block executes when trying to hide latency. Instead, consider the overall ratio of

ALU operations to fetch operations – this metric is reported by the AMD APP

Profiler in the ALUFetchRatio counter. Each ALU operation keeps the compute

unit busy for four cycles, so you can roughly divide 500 cycles of latency by

(4*ALUFetchRatio) to determine how many wavefronts must be in-flight to hide

that latency. Additionally, a low value for the ALUBusy performance counter can

indicate that the compute unit is not providing enough wavefronts to keep the

execution resources in full use. (This counter also can be low if the kernel

exhausts the available DRAM bandwidth. In this case, generating more

wavefronts does not improve performance; it can reduce performance by creating

more contention.)
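The rule of thumb based on ALUFetchRatio can be sketched the same way. The helper below is plain C with a hypothetical name; it uses the 500-cycle latency and four cycles per ALU operation quoted above and rounds the result up to a whole wavefront. It is a rough estimate only, as the text notes.

```c
#include <assert.h>

/* Rough number of in-flight wavefronts needed to hide 500 cycles of latency
   for a kernel with the given ALU-to-fetch ratio. */
unsigned wavefronts_for_alu_fetch_ratio(double alu_fetch_ratio)
{
    const double latency_cycles = 500.0;  /* from the text */
    const double cycles_per_alu_op = 4.0; /* from the text */
    double needed;
    unsigned whole;
    assert(alu_fetch_ratio > 0.0);
    needed = latency_cycles / (cycles_per_alu_op * alu_fetch_ratio);
    whole = (unsigned)needed;
    return (needed > (double)whole) ? whole + 1u : whole; /* round up */
}
```

For example, a ratio of 5 gives 25 wavefronts, and a ratio of 10 gives 12.5, which rounds up to 13.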

Increasing the wavefronts/compute unit does not indefinitely improve

performance; once the GPU has enough wavefronts to hide latency, additional

active wavefronts provide little or no performance benefit. A closely related metric

to wavefronts/compute unit is “occupancy,” which is defined as the ratio of active

wavefronts to the maximum number of possible wavefronts supported by the

hardware. Many of the important optimization targets and resource limits are

expressed in wavefronts/compute units, so this section uses this metric rather

than the related “occupancy” term.

6.6.2 Resource Limits on Active Wavefronts

AMD GPUs have two important global resource constraints that limit the number

of in-flight wavefronts:

•Each compute unit supports a maximum of eight work-groups. Recall that

AMD OpenCL supports up to 256 work-items (four wavefronts) per work-group;

effectively, this means each compute unit can support up to 32 wavefronts.

•Each GPU has a global (across all compute units) limit on the number of

active wavefronts. The GPU hardware is generally effective at balancing the

load across available compute units. Thus, it is useful to convert this global

limit into an average wavefront/compute unit so that it can be compared to

the other limits discussed in this section. For example, the ATI Radeon™ HD

5870 GPU has a global limit of 496 wavefronts, shared among 20 compute

units. Thus, it supports an average of 24.8 wavefronts/compute unit.

Appendix D, “Device Parameters” contains information on the global number

of wavefronts supported by other AMD GPUs. Some AMD GPUs support up

to 96 wavefronts/compute unit.
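Both limits reduce to simple arithmetic. The following plain C sketch, with hypothetical function names, reproduces the two numbers quoted in the list above.

```c
#include <assert.h>

/* Work-group limit expressed in wavefronts per compute unit. */
unsigned work_group_limited_wavefronts(unsigned max_work_groups,
                                       unsigned wavefronts_per_group)
{
    return max_work_groups * wavefronts_per_group;
}

/* Global wavefront limit expressed as an average per compute unit. */
double average_wavefronts_per_cu(unsigned global_limit, unsigned compute_units)
{
    assert(compute_units > 0);
    return (double)global_limit / compute_units;
}
```

work_group_limited_wavefronts(8, 4) returns 32, and average_wavefronts_per_cu(496, 20) returns 24.8 for the ATI Radeon HD 5870 GPU.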

These limits are largely properties of the hardware and, thus, difficult for

developers to control directly. Fortunately, these are relatively generous limits.

Frequently, the register and LDS usage in the kernel determines the limit on the

number of active wavefronts/compute unit, and these can be controlled by the

developer.

6.6.2.1 GPU Registers

Each compute unit provides 16384 GP registers, and each register contains

4x32-bit values (either single-precision floating point or a 32-bit integer). The total

register size is 256 kB of storage per compute unit. These registers are shared

among all active wavefronts on the compute unit; each kernel allocates only the

registers it needs from the shared pool. This is unlike a CPU, where each thread

is assigned a fixed set of architectural registers. However, using many registers

in a kernel depletes the shared pool and eventually causes the hardware to

throttle the maximum number of active wavefronts.

Table 6.7 shows how the registers used in the kernel impact the register-limited

wavefronts/compute unit.

For example, a kernel that uses 30 registers (120x32-bit values) can run with

eight active wavefronts on each compute unit. Because of the global limits

described earlier, each compute unit is limited to 32 wavefronts; thus, kernels can

use up to seven registers (28 values) without affecting the number of

wavefronts/compute unit. Finally, note that in addition to the GPRs shown in the

table, each kernel has access to four clause temporary registers.

Table 6.7 Impact of Register Type on Wavefronts/CU

AMD provides the following tools to examine the number of general-purpose

registers (GPRs) used by the kernel.

•The AMD APP Profiler displays the number of GPRs used by the kernel.

•Alternatively, the AMD APP Profiler generates the ISA dump (described in

Section 4.3, “Analyzing Processor Kernels,” page 4-9), which then can be

searched for the string :NUM_GPRS.

•The AMD APP KernelAnalyzer also shows the GPRs used by the kernel,

across a wide variety of GPU compilation targets.

GP Registers used by Kernel    Wavefronts / Compute-Unit
0-1                            248
2                              124
3                              82
4                              62
5                              49
6                              41
7                              35
8                              31
9                              27
10                             24
11                             22
12                             20
13                             19
14                             17
15                             16
16                             15
17                             14
18-19                          13
20                             12
21-22                          11
23-24                          10
25-27                          9
28-31                          8
32-35                          7
36-41                          6
42-49                          5
50-62                          4
63-82                          3
83-124                         2
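The entries in Table 6.7 appear to be consistent with dividing 248 by the register count and rounding down. This is an observation about the table values, not a formula stated in this guide, so the following plain C sketch should be read as a convenience for reproducing the table rather than as a hardware specification. The function names are hypothetical.

```c
#include <assert.h>

/* Register-limited wavefronts per compute unit, reproducing Table 6.7
   under the assumption that each entry equals floor(248 / registers). */
unsigned register_limited_wavefronts(unsigned gprs_per_work_item)
{
    /* The table groups 0 and 1 registers into one row. */
    unsigned regs = (gprs_per_work_item == 0) ? 1u : gprs_per_work_item;
    assert(regs <= 124u); /* the table ends at 124 registers */
    return 248u / regs;
}

/* Largest register count that still allows `wavefronts` wavefronts per
   compute unit, under the same assumption. */
unsigned max_gprs_for_wavefronts(unsigned wavefronts)
{
    assert(wavefronts >= 2u && wavefronts <= 248u);
    return 248u / wavefronts;
}
```

register_limited_wavefronts(30) returns 8 and register_limited_wavefronts(70) returns 3, matching the examples in the text; max_gprs_for_wavefronts(4) returns 62 and max_gprs_for_wavefronts(32) returns 7.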

The compiler generates spill code (shuffling values to, and from, memory) if it

cannot fit all the live values into registers. Spill code uses long-latency global

memory and can have a large impact on performance. The AMD APP Profiler

reports the static number of register spills in the ScratchReg field. Generally, it

is a good idea to re-write the algorithm to use fewer GPRs, or tune the work-

group dimensions specified at launch time to expose more registers/kernel to the

compiler, in order to reduce the scratch register usage to 0.

6.6.2.2 Specifying the Default Work-Group Size at Compile-Time

The number of registers used by a work-item is determined when the kernel is

compiled. The user later specifies the size of the work-group. Ideally, the OpenCL

compiler knows the size of the work-group at compile-time, so it can make

optimal register allocation decisions. Without knowing the work-group size, the

compiler must assume an upper-bound size to avoid allocating more registers in

the work-item than the hardware actually contains.

For example, if the compiler allocates 70 registers for the work-item, Table 6.7

shows that only three wavefronts (192 work-items) are supported. If the user later

launches the kernel with a work-group size of four wavefronts (256 work-items),

the launch fails because the work-group requires 70*256=17920 registers, which

is more than the hardware allows. To prevent this from happening, the compiler

performs the register allocation with the conservative assumption that the kernel

is launched with the largest work-group size (256 work-items). The compiler

guarantees that the kernel does not use more than 62 registers (the maximum

number of registers that supports a work-group with four wavefronts), and

generates low-performing register spill code, if necessary.
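The arithmetic in this example can be checked with a short C sketch. It assumes the 16384 GP registers per compute unit listed for Cypress-class devices in Section 6.6.4 and the 248-register budget implied by Table 6.7. All function and macro names are hypothetical.

```c
#include <assert.h>

#define GPRS_PER_CU     16384u /* GP registers per compute unit (Cypress-class) */
#define WAVEFRONT_SIZE  64u    /* work-items per wavefront */
#define GPR_BUDGET      248u   /* assumption: first row of Table 6.7 */

/* Registers needed by one work-group. */
unsigned work_group_gprs(unsigned gprs_per_item, unsigned work_group_size)
{
    return gprs_per_item * work_group_size;
}

/* Nonzero if the whole work-group fits in one compute unit's register file. */
int work_group_fits(unsigned gprs_per_item, unsigned work_group_size)
{
    return work_group_gprs(gprs_per_item, work_group_size) <= GPRS_PER_CU;
}

/* Largest per-work-item GPR count that Table 6.7 allows for a work-group
 * made of whole wavefronts. */
unsigned max_gprs_per_item(unsigned work_group_size)
{
    assert(work_group_size > 0 && work_group_size % WAVEFRONT_SIZE == 0);
    return GPR_BUDGET / (work_group_size / WAVEFRONT_SIZE);
}
```

With 70 registers, `work_group_gprs(70, 256)` gives 17920, which exceeds 16384, while `max_gprs_per_item(256)` gives the 62-register ceiling quoted above. The two checks are not equivalent: the table allows slightly fewer registers than the raw register-file size would suggest.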

Fortunately, OpenCL provides a mechanism to specify a work-group size that the

compiler can use to optimize the register allocation. In particular, specifying a

smaller work-group size at compile time allows the compiler to allocate more

registers for each kernel, which can avoid spill code and improve performance.

The kernel attribute syntax is:

__attribute__((reqd_work_group_size(X, Y, Z)))

Section 6.7.2 of the OpenCL specification explains the attribute in more detail.
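As an illustration, the C sketch below holds the OpenCL C source of a kernel that carries the attribute, as a host program might pass it to clCreateProgramWithSource. The kernel name `scale`, its body, the 64x1x1 size, and the helper `local_size_matches` are all hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* OpenCL C source for a hypothetical kernel named "scale". The attribute
 * tells the compiler that every launch uses a 64x1x1 work-group. */
const char *scale_kernel_source =
    "__kernel __attribute__((reqd_work_group_size(64, 1, 1)))\n"
    "void scale(__global float *data, float factor)\n"
    "{\n"
    "    size_t i = get_global_id(0);\n"
    "    data[i] = data[i] * factor;\n"
    "}\n";

/* Host-side check: the local size later passed to clEnqueueNDRangeKernel
 * has to equal the size named in the attribute. */
int local_size_matches(const size_t local[3], size_t x, size_t y, size_t z)
{
    assert(local != NULL);
    return local[0] == x && local[1] == y && local[2] == z;
}
```

Checking the local size on the host before the launch is a design convenience: the OpenCL specification requires clEnqueueNDRangeKernel to reject a mismatched local size with CL_INVALID_WORK_GROUP_SIZE.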

6.6.2.3 Local Memory (LDS) Size

In addition to registers, shared memory can also serve to limit the active

wavefronts/compute unit. Each compute unit has 32 KB of LDS, which is shared

among all active work-groups. LDS is allocated on a per-work-group granularity,

so it is possible (and useful) for multiple wavefronts to share the same local

memory allocation. However, large LDS allocations eventually limit the number

of work-groups that can be active. Table 6.8 provides more details about how LDS

usage can impact the wavefronts/compute unit.
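The rows of Table 6.8 are consistent with dividing the 32K of LDS by the per-work-group allocation, rounding down, and capping the result at eight work-groups. The C sketch below encodes that reading. The function names are hypothetical, and the cap of eight is taken from the largest value in the table rather than from a stated hardware limit.

```c
#include <assert.h>

#define LDS_BYTES_PER_CU  32768u /* 32K of LDS per compute unit */
#define MAX_WORK_GROUPS   8u     /* assumption: largest count in Table 6.8 */

/* Work-groups that fit in one compute unit's LDS. */
unsigned lds_limited_work_groups(unsigned lds_bytes_per_group)
{
    unsigned groups;
    assert(lds_bytes_per_group <= LDS_BYTES_PER_CU);
    if (lds_bytes_per_group == 0)
        return MAX_WORK_GROUPS;
    groups = LDS_BYTES_PER_CU / lds_bytes_per_group;
    return groups < MAX_WORK_GROUPS ? groups : MAX_WORK_GROUPS;
}

/* LDS-limited wavefronts per compute unit for a given work-group shape. */
unsigned lds_limited_wavefronts(unsigned lds_bytes_per_group,
                                unsigned wavefronts_per_group)
{
    return lds_limited_work_groups(lds_bytes_per_group) * wavefronts_per_group;
}
```

For example, a work-group that allocates 8192 bytes and runs four wavefronts gives `lds_limited_wavefronts(8192, 4)`, which is 16.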

Table 6.8 Effect of LDS Usage on Wavefronts/CU¹

                                LDS-Limited Wavefronts/Compute-Unit
Local Memory    LDS-Limited     (4 Wavefronts/   (3 Wavefronts/   (2 Wavefronts/
/ Work-Group    Work-Groups     Work-Group)      Work-Group)      Work-Group)
<=4K            8               32               24               16
4.0K-4.6K       7               28               21               14
4.6K-5.3K       6               24               18               12
5.3K-6.4K       5               20               15               10
6.4K-8.0K       4               16               12               8
8.0K-10.7K      3               12               9                6
10.7K-16.0K     2               8                6                4
16.0K-32.0K     1               4                3                2

1. Assumes each work-group uses four wavefronts (the maximum supported by the AMD
OpenCL SDK).

AMD provides the following tools to examine the amount of LDS used by the
kernel:

• The AMD APP Profiler displays the LDS usage. See the LocalMem counter.

• Alternatively, use the AMD APP Profiler to generate the ISA dump (described
in Section 4.3, “Analyzing Processor Kernels,” page 4-9), then search for the
string SQ_LDS_ALLOC:SIZE in the ISA dump. Note that the value is shown in
hexadecimal format.

6.6.3 Partitioning the Work

In OpenCL, each kernel instance (work-item) executes at an index point in a
global NDRange. The partition of the NDRange can have a significant impact on
performance; thus, it is recommended that the developer explicitly specify the
global (#work-groups) and local (#work-items/work-group) dimensions, rather
than rely on OpenCL to set these automatically (by setting local_work_size to
NULL in clEnqueueNDRangeKernel). This section explains the guidelines for
partitioning at the global, local, and work/kernel levels.

6.6.3.1 Global Work Size

OpenCL does not explicitly limit the number of work-groups that can be submitted
with a clEnqueueNDRangeKernel command. The hardware limits the available
in-flight threads, but the OpenCL SDK automatically partitions a large number of
work-groups into smaller pieces that the hardware can process. For some large
workloads, the amount of memory available to the GPU can be a limitation; the
problem might require so much memory capacity that the GPU cannot hold it all.
In these cases, the programmer must partition the workload into multiple
clEnqueueNDRangeKernel commands. The available device memory can be
obtained by querying clGetDeviceInfo (for example, with
CL_DEVICE_GLOBAL_MEM_SIZE).
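One way to split an oversized workload is to compute a sequence of (offset, count) pairs and submit one clEnqueueNDRangeKernel command per pair. The C sketch below shows only that bookkeeping. The type `chunk_t`, the function names, and the per-launch limit are hypothetical, and the limit is chosen by the caller, for example from the memory footprint of each work-item.

```c
#include <assert.h>
#include <stddef.h>

/* One slice of the global range: where it starts and how many
 * work-items it covers. */
typedef struct {
    size_t offset;
    size_t count;
} chunk_t;

/* Number of launches needed to cover `total` work-items. */
size_t chunk_count(size_t total, size_t per_launch)
{
    assert(per_launch > 0);
    return (total + per_launch - 1) / per_launch;
}

/* The i-th slice; the last slice may be shorter than the others. */
chunk_t chunk_at(size_t total, size_t per_launch, size_t i)
{
    chunk_t c;
    assert(i < chunk_count(total, per_launch));
    c.offset = i * per_launch;
    c.count = (total - c.offset < per_launch) ? total - c.offset : per_launch;
    return c;
}
```

Assuming an OpenCL 1.1 or later runtime, each offset could be passed as the global_work_offset argument of the launch; otherwise it can be passed to the kernel as an ordinary argument. In either case the count of the final slice usually needs padding to a multiple of the local work size.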

At a minimum, ensure that the workload contains at least as many work-groups

as the number of compute units in the hardware. Work-groups cannot be split

across multiple compute units, so if the number of work-groups is less than the

available compute units, some units are idle. Evergreen and Northern Islands

GPUs have 2-24 compute units. (See Appendix D, “Device Parameters” for a

table of device parameters, including the number of compute units, or use

clGetDeviceInfo(…CL_DEVICE_MAX_COMPUTE_UNITS) to determine the value

dynamically).

6.6.3.2 Local Work Size (#Work-Items per Work-Group)

OpenCL limits the number of work-items in each group. Call clGetDeviceInfo with

CL_DEVICE_MAX_WORK_GROUP_SIZE to determine the maximum number of work-items

per work-group supported by the hardware. Currently, AMD GPUs with SDK 2.1

return 256 as the maximum number of work-items per work-group. Note the

number of work-items is the product of all work-group dimensions; for example,

a work-group with dimensions 32x16 requires 512 work-items, which is not

allowed with the current AMD OpenCL SDK.

The fundamental unit of work on AMD GPUs is called a wavefront. Each

wavefront consists of 64 work-items; thus, the optimal local work size is an

integer multiple of 64 (specifically 64, 128, 192, or 256) work-items per work-

group.

Work-items in the same work-group can share data through LDS memory and

also use high-speed local atomic operations. Thus, larger work-groups enable

more work-items to efficiently share data, which can reduce the amount of slower

global communication. However, larger work-groups reduce the number of global

work-groups, which, for small workloads, could result in idle compute units.

Generally, larger work-groups are better as long as the global range is big

enough to provide one or two work-groups for each compute unit in the system; for

small workloads it generally works best to reduce the work-group size in order to

avoid idle compute units. Note that it is possible to make the decision

dynamically, when the kernel is launched, based on the launch dimensions and

the target device characteristics.
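The dynamic decision described above can be sketched in C as follows. The function name `choose_local_size` is hypothetical, and the policy (prefer the largest wavefront multiple that still yields at least one work-group per compute unit) is only one reasonable reading of the guideline. In a real program the compute-unit count would come from clGetDeviceInfo with CL_DEVICE_MAX_COMPUTE_UNITS.

```c
#include <assert.h>
#include <stddef.h>

/* Pick the largest wavefront-multiple work-group size that still yields
 * at least one work-group per compute unit; fall back to 64. */
size_t choose_local_size(size_t global_items, size_t compute_units)
{
    static const size_t candidates[] = { 256, 192, 128, 64 };
    size_t i;

    assert(compute_units > 0);
    for (i = 0; i < sizeof candidates / sizeof candidates[0]; ++i) {
        /* Work-groups needed to cover the global range, rounded up. */
        size_t groups = (global_items + candidates[i] - 1) / candidates[i];
        if (groups >= compute_units)
            return candidates[i];
    }
    return 64;
}
```

On a device with 20 compute units, a global range of 5120 work-items keeps the full 256-item work-group, while a range of 2560 drops to 128 so that all 20 units receive work. The caller still has to pad the global size to a multiple of the returned value, because OpenCL 1.x requires the global size to be evenly divisible by the local size.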

6.6.3.3 Moving Work to the Kernel

Often, work can be moved from the work-group into the kernel. For example, a

matrix multiply where each work-item computes a single element in the output

array can be written so that each work-item generates multiple elements. This

technique can be important for effectively using the processing elements

available in the five-wide (or four-wide, depending on the GPU type) VLIW

processing engine (see the ALUPacking performance counter reported by the

AMD APP Profiler). The mechanics of this technique are often as simple as adding

a for loop around the kernel, so that the kernel body is run multiple times inside

this loop, then adjusting the global work size to reduce the work-items. Typically,

the local work-group is unchanged, and the net effect of moving work into the

kernel is that each work-group does more effective processing, and fewer global

work-groups are required.

When moving work to the kernel, often it is best to combine work-items that are

separated by 16 in the NDRange index space, rather than combining adjacent

work-items. Combining the work-items in this fashion preserves the memory

access patterns optimal for global and local memory accesses. For example,

consider a kernel where each work-item accesses one four-byte element in array A.
The resulting access pattern is:

Work-item   0      1      2      3      …
Cycle0      A+0    A+1    A+2    A+3

If we naively combine four adjacent work-items to increase the work processed
per kernel, so that the first work-item accesses array elements A+0 to A+3 on
successive cycles, the overall access pattern is:

Work-item   0      1      2      3      4      5      …
Cycle0      A+0    A+4    A+8    A+12   A+16   A+20
Cycle1      A+1    A+5    A+9    A+13   A+17   A+21
Cycle2      A+2    A+6    A+10   A+14   A+18   A+22
Cycle3      A+3    A+7    A+11   A+15   A+19   A+23

This pattern shows that on the first cycle the access pattern contains “holes.”
Also, this pattern results in bank conflicts on the LDS. A better access pattern is
to combine four work-items so that the first work-item accesses array elements
A+0, A+16, A+32, and A+48. The resulting access pattern is:

Work-item   0      1      2      3      4      5      …
Cycle0      A+0    A+1    A+2    A+3    A+4    A+5
Cycle1      A+16   A+17   A+18   A+19   A+20   A+21
Cycle2      A+32   A+33   A+34   A+35   A+36   A+37
Cycle3      A+48   A+49   A+50   A+51   A+52   A+53

Note that this access pattern preserves the sequentially-increasing addressing
of the original kernel and generates efficient global and LDS memory references.

Increasing the processing done by the kernels can allow more processing to be
done on the fixed pool of local memory available to work-groups. For example,
consider a case where an algorithm requires 32x32 elements of shared memory.
If each work-item processes only one element, it requires 1024
work-items/work-group, which exceeds the maximum limit. Instead, each kernel can be written to

process four elements, and a work-group of 16x16 work-items could be launched

to process the entire array. A related example is a blocked algorithm, such as a

matrix multiply; the performance often scales with the size of the array that can

be cached and used to block the algorithm. By moving processing tasks into the

kernel, the kernel can use the available local memory rather than being limited

by the work-items/work-group.
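The two ways of combining four elements per work-item discussed in this section differ only in how the element index is computed. The C sketch below states both index functions; the names are hypothetical. The handling of work-items 16 and above (each block of 16 work-items owning 64 consecutive elements) is an assumption, since the access-pattern tables only show work-items 0 to 5.

```c
#include <assert.h>

#define ELEMS_PER_ITEM 4u   /* elements processed by each combined work-item */
#define STRIDE         16u  /* separation recommended in the text */

/* Naive combination: each work-item owns four adjacent elements. */
unsigned naive_index(unsigned work_item, unsigned cycle)
{
    assert(cycle < ELEMS_PER_ITEM);
    return work_item * ELEMS_PER_ITEM + cycle;
}

/* Strided combination: within a block of 16 work-items, each cycle
 * touches 16 consecutive elements. */
unsigned strided_index(unsigned work_item, unsigned cycle)
{
    unsigned block = work_item / STRIDE;  /* which group of 16 work-items */
    unsigned lane  = work_item % STRIDE;  /* position inside that group */

    assert(cycle < ELEMS_PER_ITEM);
    return block * STRIDE * ELEMS_PER_ITEM + cycle * STRIDE + lane;
}
```

On cycle 3, work-item 5 reads A+23 with the naive scheme and A+53 with the strided scheme, which matches the last rows of the two combined access patterns.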

6.6.3.4 Work-Group Dimensions vs Size

The local NDRange can contain up to three dimensions, here labeled X, Y, and

Z. The X dimension is returned by get_local_id(0), Y is returned by

get_local_id(1), and Z is returned by get_local_id(2). The GPU hardware

schedules the kernels so that the X dimension moves fastest as the work-items
are packed into wavefronts. For example, the 128 threads in a 2D work-group of
dimension 32x4 (X=32 and Y=4) would be packed into two wavefronts as follows
(notation shown in X,Y order):

WaveFront0
 0,0  1,0  2,0  3,0  4,0  5,0  6,0  7,0  8,0  9,0 10,0 11,0 12,0 13,0 14,0 15,0
16,0 17,0 18,0 19,0 20,0 21,0 22,0 23,0 24,0 25,0 26,0 27,0 28,0 29,0 30,0 31,0
 0,1  1,1  2,1  3,1  4,1  5,1  6,1  7,1  8,1  9,1 10,1 11,1 12,1 13,1 14,1 15,1
16,1 17,1 18,1 19,1 20,1 21,1 22,1 23,1 24,1 25,1 26,1 27,1 28,1 29,1 30,1 31,1

WaveFront1
 0,2  1,2  2,2  3,2  4,2  5,2  6,2  7,2  8,2  9,2 10,2 11,2 12,2 13,2 14,2 15,2
16,2 17,2 18,2 19,2 20,2 21,2 22,2 23,2 24,2 25,2 26,2 27,2 28,2 29,2 30,2 31,2
 0,3  1,3  2,3  3,3  4,3  5,3  6,3  7,3  8,3  9,3 10,3 11,3 12,3 13,3 14,3 15,3
16,3 17,3 18,3 19,3 20,3 21,3 22,3 23,3 24,3 25,3 26,3 27,3 28,3 29,3 30,3 31,3

The total number of work-items in the work-group is typically the most important
parameter to consider, in particular when optimizing to hide latency by increasing
wavefronts/compute unit. However, the choice of XYZ dimensions for the same
overall work-group size can have the following second-order effects.

• Work-items in the same quarter-wavefront execute on the same cycle in the
processing engine. Thus, global memory coalescing and local memory bank
conflicts can be impacted by dimension, particularly if the fast-moving X
dimension is small. Typically, it is best to choose an X dimension of at least
16, then optimize the memory patterns for a block of 16 work-items which
differ by 1 in the X dimension.

• Work-items in the same wavefront have the same program counter and
execute the same instruction on each cycle. The packing order can be
important if the kernel contains divergent branches. If possible, pack together
work-items that are likely to follow the same direction when control-flow is
encountered. For example, consider an image-processing kernel where each
work-item processes one pixel, and the control-flow depends on the color of
the pixel. It might be more likely that a square of 8x8 pixels is the same color
than a 64x1 strip; thus, the 8x8 would see less divergence and higher
performance.

• When in doubt, a square 16x16 work-group size is a good start.
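The packing order of the 32x4 example can be expressed as a short C sketch. The function names are hypothetical, and the treatment of the Z dimension (moving slowest) is an assumption extrapolated from the 2D example; the sketch also assumes 64 work-items per wavefront.

```c
#include <assert.h>

#define WAVEFRONT_SIZE 64u  /* work-items per wavefront */
#define QUARTER_SIZE   16u  /* work-items per quarter-wavefront */

/* Linear position of a work-item in its work-group, X moving fastest. */
unsigned linear_local_id(unsigned x, unsigned y, unsigned z,
                         unsigned dim_x, unsigned dim_y)
{
    assert(x < dim_x && y < dim_y);
    return (z * dim_y + y) * dim_x + x;
}

/* Wavefront that holds the work-item at the given linear position. */
unsigned wavefront_index(unsigned linear_id)
{
    return linear_id / WAVEFRONT_SIZE;
}

/* Quarter-wavefront (0 to 3) inside that wavefront. */
unsigned quarter_wavefront_index(unsigned linear_id)
{
    return (linear_id % WAVEFRONT_SIZE) / QUARTER_SIZE;
}
```

For the 32x4 work-group, work-item (31,1) has linear position 63 and is the last entry of the first wavefront, while work-item (0,2) has linear position 64 and opens the second.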


6.6.4 Optimizing for Cedar

To focus the discussion, this section has used specific hardware characteristics

that apply to most of the Evergreen series. The value Evergreen part, referred to

as Cedar and used in products such as the ATI Radeon™ HD 5450 GPU, has

different architecture characteristics, as shown below.

                           Evergreen             Evergreen
                           Cypress, Juniper,     Cedar
                           Redwood
Work-items/Wavefront       64                    32
Stream Cores / CU          16                    8
GP Registers / CU          16384                 8192
Local Memory Size          32K                   32K
Maximum Work-Group Size    256                   128

Note the maximum work-group size can be obtained with
clGetDeviceInfo(..., CL_DEVICE_MAX_WORK_GROUP_SIZE, ...).
Applications must ensure that the requested kernel launch dimensions are
no larger than the limit reported by this API call.

The difference in total register size can impact the compiled code and cause
register spill code for kernels that were tuned for other devices. One technique
that can be useful is to specify the required work-group size as 128 (half the
default of 256). In this case, the compiler has the same number of registers
available as for other devices and uses the same number of registers. The
developer must ensure that the kernel is launched with the reduced work size
(128) on Cedar-class devices.

6.6.5 Summary of NDRange Optimizations

As shown above, execution range optimization is a complex topic with many
interacting variables, and it frequently requires some experimentation to
determine the optimal values. Some general guidelines are:

• Select the work-group size to be a multiple of 64, so that the wavefronts are
fully populated.

• Always provide at least two wavefronts (128 work-items) per compute unit.
For an ATI Radeon™ HD 5870 GPU, this implies 40 wavefronts or 2560
work-items. If necessary, reduce the work-group size (but not below 64
work-items) to provide work-groups for all compute units in the system.

• Latency hiding depends on both the number of wavefronts/compute unit, as
well as the execution time for each kernel. Generally, two to eight
wavefronts/compute unit is desirable, but this can vary significantly,
depending on the complexity of the kernel and the available memory
bandwidth. The AMD APP Profiler and associated performance counters can
help to select an optimal value.
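The first two guidelines reduce to simple arithmetic, sketched below in C. The function names are hypothetical, and the sketch assumes 64 work-items per wavefront, which does not hold for Cedar-class devices (32 work-items per wavefront).

```c
#include <assert.h>
#include <stddef.h>

#define WAVEFRONT_SIZE 64u  /* assumption: not valid for Cedar */

/* Round a requested work-group size up to a whole number of wavefronts. */
size_t round_up_to_wavefront(size_t work_items)
{
    assert(work_items > 0);
    return (work_items + WAVEFRONT_SIZE - 1) / WAVEFRONT_SIZE * WAVEFRONT_SIZE;
}

/* Smallest global size that gives every compute unit two wavefronts. */
size_t min_global_work_items(size_t compute_units)
{
    return compute_units * 2 * WAVEFRONT_SIZE;
}
```

The figure of 2560 work-items quoted for the ATI Radeon™ HD 5870 GPU corresponds to `min_global_work_items(20)`, that is, 40 wavefronts spread over 20 compute units.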


6.7 Using Multiple OpenCL Devices

The AMD OpenCL runtime supports both CPU and GPU devices. This section

introduces techniques for appropriately partitioning the workload and balancing it

across the devices in the system.

6.7.1 CPU and GPU Devices

Table 6.9 lists some key performance characteristics of two exemplary CPU and

GPU devices: a quad-core AMD Phenom II X4 processor running at 2.8 GHz,

and a mid-range ATI Radeon™ 5670 GPU running at 750 MHz. The “best” device

in each characteristic is highlighted, and the ratio of the best/other device is

shown in the final column.

Table 6.9 CPU and GPU Performance Characteristics