AMD Accelerated Parallel Processing
OpenCL™ Programming Guide

July 2012
rev. 2.3

© 2012 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo,
AMD Accelerated Parallel Processing, the AMD Accelerated Parallel Processing logo, ATI,
the ATI logo, Radeon, FireStream, FirePro, Catalyst, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Microsoft, Visual Studio, Windows, and Windows
Vista are registered trademarks of Microsoft Corporation in the U.S. and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their
respective owners. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by
permission by Khronos.
The contents of this document are provided in connection with Advanced Micro Devices,
Inc. (“AMD”) products. AMD makes no representations or warranties with respect to the
accuracy or completeness of the contents of this publication and reserves the right to
make changes to specifications and product descriptions at any time without notice. The
information contained herein may be of a preliminary or advance nature and is subject to
change without notice. No license, whether express, implied, arising by estoppel or otherwise, to any intellectual property rights is granted by this publication. Except as set forth
in AMD’s Standard Terms and Conditions of Sale, AMD assumes no liability whatsoever,
and disclaims any express or implied warranty, relating to its products including, but not
limited to, the implied warranty of merchantability, fitness for a particular purpose, or
infringement of any intellectual property right.
AMD’s products are not designed, intended, authorized or warranted for use as components in systems intended for surgical implant into the body, or in other applications
intended to support or sustain life, or in any other application in which the failure of AMD’s
product could create a situation where personal injury, death, or severe property or environmental damage may occur. AMD reserves the right to discontinue or make changes to
its products at any time without notice.

Advanced Micro Devices, Inc.
One AMD Place
P.O. Box 3453
Sunnyvale, CA 94088-3453
www.amd.com

For AMD Accelerated Parallel Processing:

URL:         developer.amd.com/appsdk
Developing:  developer.amd.com/
Support:     developer.amd.com/appsdksupport
Forum:       developer.amd.com/openclforum


Preface

About This Document
This document provides a basic description of the AMD Accelerated Parallel
Processing environment and components. It describes the basic architecture of
stream processors and provides useful performance tips. This document also
provides a guide for programmers who want to use AMD Accelerated Parallel
Processing to accelerate their applications.

Audience
This document is intended for programmers. It assumes prior experience in
writing code for CPUs and a basic understanding of threads (work-items). While
a basic understanding of GPU architectures is useful, this document does not
assume prior graphics knowledge. It further assumes an understanding of
chapters 1, 2, and 3 of the OpenCL Specification (for the latest version, see
http://www.khronos.org/registry/cl/).

Organization
This AMD Accelerated Parallel Processing document begins, in Chapter 1, with
an overview of: the AMD Accelerated Parallel Processing programming models,
OpenCL, the AMD Compute Abstraction Layer (CAL), the AMD APP Kernel
Analyzer, and the AMD APP Profiler. Chapter 2 discusses the compiling and
running of OpenCL programs. Chapter 3 describes using GNU debugger (GDB)
to debug OpenCL programs. Chapter 4 is a discussion of general performance
and optimization considerations when programming for AMD Accelerated Parallel
Processing devices. Chapter 5 details performance and optimization
considerations specifically for Southern Islands devices. Chapter 6 details
performance and optimization considerations for Evergreen and Northern Islands
devices. Appendix A describes the supported optional OpenCL extensions.
Appendix B details the installable client driver (ICD) for OpenCL. Appendix C
details the compute kernel and contrasts it with a pixel shader. Appendix D lists
the device parameters. Appendix E describes the OpenCL binary image format
(BIF). Appendix F describes the OpenVideo Decode API. Appendix G describes
the interoperability between OpenCL and OpenGL. The last section of this book
is a glossary of acronyms and terms, as well as an index.


Conventions
The following conventions are used in this document.
mono-spaced font            A filename, file path, or code.

*                           Any number of alphanumeric characters in the name of a
                            code format, parameter, or instruction.

[1,2)                       A range that includes the left-most value (in this case, 1)
                            but excludes the right-most value (in this case, 2).

[1,2]                       A range that includes both the left-most and right-most
                            values (in this case, 1 and 2).

{x | y}                     One of the multiple options listed. In this case, x or y.

0.0f                        A single-precision (32-bit) floating-point value.

0.0                         A double-precision (64-bit) floating-point value.

1011b                       A binary value, in this example a 4-bit value.

7:4                         A bit range, from bit 7 to 4, inclusive. The high-order bit
                            is shown first.

italicized word or phrase   The first use of a term or concept basic to the
                            understanding of stream computing.

Related Documents

•   The OpenCL Specification, Version 1.1, Published by Khronos OpenCL
    Working Group, Aaftab Munshi (ed.), 2010.

•   AMD, R600 Technology, R600 Instruction Set Architecture, Sunnyvale, CA,
    est. pub. date 2007. This document includes the RV670 GPU instruction
    details.

•   ISO/IEC 9899:TC2 - International Standard - Programming Languages - C

•   Kernighan, Brian W., and Ritchie, Dennis M., The C Programming Language,
    Prentice-Hall, Inc., Upper Saddle River, NJ, 1978.

•   I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P.
    Hanrahan, “Brook for GPUs: stream computing on graphics hardware,” ACM
    Trans. Graph., vol. 23, no. 3, pp. 777–786, 2004.

•   AMD Compute Abstraction Layer (CAL) Intermediate Language (IL)
    Reference Manual. Published by AMD.

•   Buck, Ian; Foley, Tim; Horn, Daniel; Sugerman, Jeremy; Hanrahan, Pat;
    Houston, Mike; Fatahalian, Kayvon. “BrookGPU”
    http://graphics.stanford.edu/projects/brookgpu/

•   Buck, Ian. “Brook Spec v0.2”. October 31, 2003.
    http://merrimac.stanford.edu/brook/brookspec-05-20-03.pdf

•   OpenGL Programming Guide, at http://www.glprogramming.com/red/

•   Microsoft DirectX Reference Website, at
    http://msdn.microsoft.com/en-us/directx

•   GPGPU: http://www.gpgpu.org, and Stanford BrookGPU discussion forum
    http://www.gpgpu.org/forums/


Contact Information

URL:         developer.amd.com/appsdk
Developing:  developer.amd.com/
Support:     developer.amd.com/appsdksupport
Forum:       developer.amd.com/openclforum

REVISION HISTORY

Rev     Description

1.3e    Deleted encryption reference.

1.3f    Added basic guidelines to CL-GL Interop appendix.
        Corrected code in two samples in Chpt. 4.

1.3g    Numerous changes to CL-GL Interop appendix.
        Added subsections to Additional Performance Guidance: CPU
        Programmers Using OpenCL to Program CPUs, and Using Special CPU
        Instructions in the Optimizing Kernel Code subsection.

2.0     Added ELF Header section in Appendix E.

2.1     New Profiler and KernelAnalyzer sections in chapter 4.
        New AMD gDEBugger section in chapter 3.
        Added extensions to Appendix A.
        Numerous updates throughout for Southern Islands, especially in
        Chapters 1 and 5.
        Split original chapter 4 into three chapters. Now, chapter 4 covers
        general considerations for Evergreen, Northern Islands, and Southern
        Islands; chapter 5 is specifically for Southern Islands devices;
        chapter 6 is for Evergreen and Northern Islands devices.
        Update of Device Parameters in Appendix D.

2.1a    Reinstated some supplemental compiler options in Section 2.1.4.
        Changes/additions to Table 4.3.

2.1b    Minor change in Section 1.8.3, indicating that the LDS model has not
        changed from previous GPU families.

2.2     Addition of channel mapping information (chpt 5). Minor corrections
        throughout. Deletion of duplicate material from chpt 6.

2.3     Inclusion of upgraded index. Minor rewording and corrections.


Contents

Preface
Contents

Chapter 1  OpenCL Architecture and AMD Accelerated Parallel Processing
    1.1   Software Overview
          1.1.1   Data-Parallel Programming Model
          1.1.2   Task-Parallel Programming Model
          1.1.3   Synchronization
    1.2   Hardware Overview for Southern Islands Devices
    1.3   Hardware Overview for Evergreen and Northern Islands Devices
    1.4   The AMD Accelerated Parallel Processing Implementation of OpenCL
          1.4.1   Work-Item Processing
          1.4.2   Flow Control
          1.4.3   Work-Item Creation
    1.5   Memory Architecture and Access
          1.5.1   Memory Access
          1.5.2   Global Buffer
          1.5.3   Image Read/Write
          1.5.4   Memory Load/Store
    1.6   Communication Between Host and GPU in a Compute Device
          1.6.1   PCI Express Bus
          1.6.2   Processing API Calls: The Command Processor
          1.6.3   DMA Transfers
          1.6.4   Masking Visible Devices
    1.7   GPU Compute Device Scheduling
    1.8   Terminology
          1.8.1   Compute Kernel
          1.8.2   Wavefronts and Work-groups
          1.8.3   Local Data Store (LDS)
    1.9   Programming Model
    1.10  Example Programs
          1.10.1  First Example: Simple Buffer Write
          1.10.2  Example: Parallel Min() Function

Chapter 2  Building and Running OpenCL Programs
    2.1   Compiling the Program
          2.1.1   Compiling on Windows
          2.1.2   Compiling on Linux
          2.1.3   Supported Standard OpenCL Compiler Options
          2.1.4   AMD-Developed Supplemental Compiler Options
    2.2   Running the Program
          2.2.1   Running Code on Windows
          2.2.2   Running Code on Linux
    2.3   Calling Conventions

Chapter 3  Debugging OpenCL
    3.1   AMD gDEBugger
    3.2   Debugging CPU Kernels with GDB
          3.2.1   Setting the Environment
          3.2.2   Setting the Breakpoint in an OpenCL Kernel
          3.2.3   Sample GDB Session
          3.2.4   Notes

Chapter 4  OpenCL Performance and Optimization
    4.1   AMD APP Profiler
          4.1.1   Collecting OpenCL Application Trace
          4.1.2   Collecting OpenCL GPU Kernel Performance Counters
          4.1.3   OpenCL Kernel Occupancy Modeler
    4.2   AMD APP KernelAnalyzer
    4.3   Analyzing Processor Kernels
          4.3.1   Intermediate Language and GPU Disassembly
          4.3.2   Generating IL and ISA Code
    4.4   Estimating Performance
          4.4.1   Measuring Execution Time
          4.4.2   Using the OpenCL timer with Other System Timers
          4.4.3   Estimating Memory Bandwidth
    4.5   OpenCL Memory Objects
          4.5.1   Types of Memory Used by the Runtime
          4.5.2   Placement
          4.5.3   Memory Allocation
          4.5.4   Mapping
          4.5.5   Reading, Writing, and Copying
          4.5.6   Command Queue
    4.6   OpenCL Data Transfer Optimization
          4.6.1   Definitions
          4.6.2   Buffers
    4.7   Using Multiple OpenCL Devices
          4.7.1   CPU and GPU Devices
          4.7.2   When to Use Multiple Devices
          4.7.3   Partitioning Work for Multiple Devices
          4.7.4   Synchronization Caveats
          4.7.5   GPU and CPU Kernels
          4.7.6   Contexts and Devices

Chapter 5  OpenCL Performance and Optimization for Southern Islands Devices
    5.1   Global Memory Optimization
          5.1.1   Channel Conflicts
          5.1.2   Coalesced Writes
          5.1.3   Hardware Variations
    5.2   Local Memory (LDS) Optimization
    5.3   Constant Memory Optimization
    5.4   OpenCL Memory Resources: Capacity and Performance
    5.5   Using LDS or L1 Cache
    5.6   NDRange and Execution Range Optimization
          5.6.1   Hiding ALU and Memory Latency
          5.6.2   Resource Limits on Active Wavefronts
          5.6.3   Partitioning the Work
          5.6.4   Summary of NDRange Optimizations
    5.7   Instruction Selection Optimizations
          5.7.1   Instruction Bandwidths
          5.7.2   AMD Media Instructions
          5.7.3   Math Libraries
          5.7.4   Compiler Optimizations
    5.8   Additional Performance Guidance
          5.8.1   Loop Unroll pragma
          5.8.2   Memory Tiling
          5.8.3   General Tips
          5.8.4   Guidance for CUDA Programmers Using OpenCL
          5.8.5   Guidance for CPU Programmers Using OpenCL to Program GPUs
          5.8.6   Optimizing Kernel Code
          5.8.7   Optimizing Kernels for Southern Islands GPUs
    5.9   Specific Guidelines for Southern Islands GPUs

Chapter 6  OpenCL Performance and Optimization for Evergreen and Northern
           Islands Devices
    6.1   Global Memory Optimization
          6.1.1   Two Memory Paths
          6.1.2   Channel Conflicts
          6.1.3   Float4 Or Float1
          6.1.4   Coalesced Writes
          6.1.5   Alignment
          6.1.6   Summary of Copy Performance
          6.1.7   Hardware Variations
    6.2   Local Memory (LDS) Optimization
    6.3   Constant Memory Optimization
    6.4   OpenCL Memory Resources: Capacity and Performance
    6.5   Using LDS or L1 Cache
    6.6   NDRange and Execution Range Optimization
          6.6.1   Hiding ALU and Memory Latency
          6.6.2   Resource Limits on Active Wavefronts
          6.6.3   Partitioning the Work
          6.6.4   Optimizing for Cedar
          6.6.5   Summary of NDRange Optimizations
    6.7   Using Multiple OpenCL Devices
          6.7.1   CPU and GPU Devices
          6.7.2   When to Use Multiple Devices
          6.7.3   Partitioning Work for Multiple Devices
          6.7.4   Synchronization Caveats
          6.7.5   GPU and CPU Kernels
          6.7.6   Contexts and Devices
    6.8   Instruction Selection Optimizations
          6.8.1   Instruction Bandwidths
          6.8.2   AMD Media Instructions
          6.8.3   Math Libraries
          6.8.4   VLIW and SSE Packing
          6.8.5   Compiler Optimizations
    6.9   Clause Boundaries
    6.10  Additional Performance Guidance
          6.10.1  Loop Unroll pragma
          6.10.2  Memory Tiling
          6.10.3  General Tips
          6.10.4  Guidance for CUDA Programmers Using OpenCL
          6.10.5  Guidance for CPU Programmers Using OpenCL to Program GPUs
          6.10.6  Optimizing Kernel Code
          6.10.7  Optimizing Kernels for Evergreen and 69XX-Series GPUs

Chapter 7  OpenCL Static C++ Programming Language
    7.1   Overview
          7.1.1   Supported Features
          7.1.2   Unsupported Features
          7.1.3   Relations with ISO/IEC C++
    7.2   Additions and Changes to Section 5 - The OpenCL C Runtime
          7.2.1   Additions and Changes to Section 5.7.1 - Creating Kernel Objects
          7.2.2   Passing Classes between Host and Device
    7.3   Additions and Changes to Section 6 - The OpenCL C Programming Language
          7.3.1   Building C++ Kernels
          7.3.2   Classes and Derived Classes
          7.3.3   Namespaces
          7.3.4   Overloading
          7.3.5   Templates
          7.3.6   Exceptions
          7.3.7   Libraries
          7.3.8   Dynamic Operation
    7.4   Examples
          7.4.1   Passing a Class from the Host to the Device and Back
          7.4.2   Kernel Overloading
          7.4.3   Kernel Template

Appendix A  OpenCL Optional Extensions
    A.1   Extension Name Convention
    A.2   Querying Extensions for a Platform
    A.3   Querying Extensions for a Device
    A.4   Using Extensions in Kernel Programs
    A.5   Getting Extension Function Pointers
    A.6   List of Supported Extensions that are Khronos-Approved
    A.7   cl_ext Extensions
    A.8   AMD Vendor-Specific Extensions
          A.8.1   cl_amd_fp64
          A.8.2   cl_amd_vec3
          A.8.3   cl_amd_device_persistent_memory
          A.8.4   cl_amd_device_attribute_query
          A.8.5   cl_amd_device_profiling_timer_offset
          A.8.6   cl_amd_device_topology
          A.8.7   cl_amd_device_board_name
          A.8.8   cl_amd_compile_options
          A.8.9   cl_amd_offline_devices
          A.8.10  cl_amd_event_callback
          A.8.11  cl_amd_popcnt
          A.8.12  cl_amd_media_ops
          A.8.13  cl_amd_media_ops2
          A.8.14  cl_amd_printf
    A.9   cl_amd_predefined_macros
    A.10  Supported Functions for cl_amd_fp64 / cl_khr_fp64
    A.11  Extension Support by Device

Appendix B  The OpenCL Installable Client Driver (ICD)
    B.1   Overview
    B.2   Using ICD

Appendix C  Compute Kernel
    C.1   Differences from a Pixel Shader
    C.2   Indexing
    C.3   Performance Comparison
    C.4   Pixel Shader
    C.5   Compute Kernel
    C.6   LDS Matrix Transpose
    C.7   Results Comparison

Appendix D  Device Parameters

Appendix E  OpenCL Binary Image Format (BIF) v2.0
    E.1   Overview
          E.1.1   Executable and Linkable Format (ELF) Header
          E.1.2   Bitness
    E.2   BIF Options

Appendix F  Open Decode API Tutorial
    F.1   Overview
    F.2   Initializing
    F.3   Creating the Context
    F.4   Creating the Session
    F.5   Decoding
    F.6   Destroying Session and Context

Appendix G  OpenCL-OpenGL Interoperability
    G.1   Under Windows
          G.1.1   Single GPU Environment
          G.1.2   Multi-GPU Environment
          G.1.3   Limitations
    G.2   Linux Operating System
          G.2.1   Single GPU Environment
          G.2.2   Multi-GPU Configuration

Index


Figures

1.1   Generalized AMD GPU Compute Device Structure for Southern Islands Devices
1.2   AMD Radeon™ HD 79XX Device Partial Block Diagram
1.3   Generalized AMD GPU Compute Device Structure
1.4   Simplified Block Diagram of an Evergreen-Family GPU
1.5   AMD Accelerated Parallel Processing Software Ecosystem
1.6   Simplified Mapping of OpenCL onto AMD Accelerated Parallel Processing
      for Evergreen and Northern Island Devices
1.7   Work-Item Grouping Into Work-Groups and Wavefronts
1.8   Interrelationship of Memory Domains for Southern Islands Devices
1.9   Dataflow between Host and GPU
1.10  Simplified Execution Of Work-Items On A Single Stream Core
1.11  Stream Core Stall Due to Data Dependency
1.12  OpenCL Programming Model
2.1   OpenCL Compiler Toolchain
2.2   Runtime Processing Structure
4.1   Timeline and API Trace View in Microsoft Visual Studio 2010
4.2   Context Summary Page View in Microsoft Visual Studio 2010
4.3   Warning(s) and Error(s) Page
4.4   Sample Session View in Microsoft Visual Studio 2010
4.5   Sample Kernel Occupancy Modeler Screen
4.6   AMD APP KernelAnalyzer
5.1   Memory System
5.2   Channel Remapping/Interleaving
5.3   Transformation to Staggered Offsets
5.4   One Example of a Tiled Layout Format
5.5   Northern Islands Compute Unit Arrangement
5.6   Southern Islands Compute Unit Arrangement
6.1   Memory System
6.2   FastPath (blue) vs CompletePath (red) Using float1
6.3   Transformation to Staggered Offsets
6.4   Two Kernels: One Using float4 (blue), the Other float1 (red)
6.5   Effect of Varying Degrees of Coalescing - Coal (blue), NoCoal (red), Split (green)
6.6   Unaligned Access Using float1
6.7   Unmodified Loop
6.8   Kernel Unrolled 4X
6.9   Unrolled Loop with Stores Clustered
6.10  Unrolled Kernel Using float4 for Vectorization
6.11  One Example of a Tiled Layout Format
C.1   Pixel Shader Matrix Transpose
C.2   Compute Kernel Matrix Transpose
C.3   LDS Matrix Transpose
F.1   Open Decode with Optional Post-Processing

Tables

4.1   Memory Bandwidth (R = read, W = write) in GB/s
4.2   OpenCL Memory Object Properties
4.3   Transfer Policy on clEnqueueMapBuffer / clEnqueueMapImage /
      clEnqueueUnmapMemObject for Copy Memory Objects
4.4   CPU and GPU Performance Characteristics
5.1   Hardware Performance Parameters
5.2   Effect of LDS Usage on Wavefronts/CU
5.3   Instruction Throughput (Operations/Cycle for Each Stream Processor)
5.4   Resource Limits for Northern Islands and Southern Islands
6.1   Bandwidths for 1D Copies
6.2   Bandwidths for Different Launch Dimensions
6.3   Bandwidths Including float1 and float4
6.4   Bandwidths Including Coalesced Writes
6.5   Bandwidths Including Unaligned Access
6.6   Hardware Performance Parameters
6.7   Impact of Register Type on Wavefronts/CU
6.8   Effect of LDS Usage on Wavefronts/CU
6.9   CPU and GPU Performance Characteristics
6.10  Instruction Throughput (Operations/Cycle for Each Stream Processor)
6.11  Native Speedup Factor
A.1   Extension Support for AMD GPU Devices
A.2   Extension Support for Older AMD GPUs and CPUs
D.1   Parameters for 7xxx Devices
D.2   Parameters for 68xx and 69xx Devices
D.3   Parameters for 65xx, 66xx, and 67xx Devices
D.4   Parameters for 64xx Devices
D.5   Parameters for Zacate and Ontario Devices
D.6   Parameters for 56xx, 57xx, 58xx, Eyfinity6, and 59xx Devices
D.7   Parameters for Exxx, Cxx, 54xx, and 55xx Devices
E.1   ELF Header Fields


Chapter 1
OpenCL Architecture and AMD Accelerated Parallel Processing
This chapter provides a general software and hardware overview of the AMD
Accelerated Parallel Processing implementation of the OpenCL standard. It
explains the memory structure and gives simple programming examples.

1.1 Software Overview
OpenCL supports data-parallel and task-parallel programming models, as well as
hybrids of these models. Of the two, the primary one is the data-parallel model.

1.1.1  Data-Parallel Programming Model

In the data-parallel programming model, a computation is defined in terms of a
sequence of instructions that executes at each point in an N-dimensional index
space. It is a common, though not required, formulation of an algorithm in which
each computation index maps to an element in an input data set.

The OpenCL data-parallel programming model is hierarchical. The hierarchical
subdivision can be specified in two ways (a host-side sketch follows this list):

•   Explicitly - the developer defines the total number of work-items to execute
    in parallel, as well as the division of work-items into specific work-groups.

•   Implicitly - the developer specifies the total number of work-items to execute
    in parallel, and OpenCL manages the division into work-groups.
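As an illustration, the following host-side sketch shows both forms (it assumes
a valid command queue and kernel already exist; error handling is omitted):

    // Launch 4096 work-items over a one-dimensional index space.
    size_t global_work_size[1] = { 4096 };

    // Explicit: the developer also fixes the work-group size, here 64
    // work-items (the size of one wavefront on AMD GPUs).
    size_t local_work_size[1] = { 64 };
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           global_work_size, local_work_size, 0, NULL, NULL);

    // Implicit: passing NULL for the local size lets the OpenCL runtime
    // choose the division into work-groups.
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           global_work_size, NULL, 0, NULL, NULL);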

1.1.2  Task-Parallel Programming Model

In this model, a kernel instance is executed independently of any index space.
This is equivalent to executing a kernel on a compute device with a work-group
and NDRange containing a single work-item. Parallelism is expressed using
vector data types implemented by the device, enqueuing multiple tasks, and/or
enqueuing native kernels developed using a programming model orthogonal to
OpenCL.
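For example, in OpenCL 1.x a task is simply a kernel enqueued without an
index space; a minimal sketch (using the same hypothetical queue and kernel
as above):

    // clEnqueueTask is equivalent to clEnqueueNDRangeKernel with
    // work_dim = 1, global_work_size[0] = 1, and local_work_size[0] = 1.
    cl_event task_done;
    clEnqueueTask(queue, kernel, 0, NULL, &task_done);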

1.1.3  Synchronization

The two domains of synchronization in OpenCL are work-items in a single
work-group and command-queue(s) in a single context. Work-group barriers
enable synchronization of work-items in a work-group. Each work-item in a
work-group must first execute the barrier before executing any instruction
beyond this barrier. Either all of, or none of, the work-items in a work-group
must encounter the barrier. A barrier or mem_fence operation does not have
global scope; it is relevant only to the local work-group on which it operates.
However, atomic operations done on global memory do have a global scope,
hence they may provide a way to do global synchronization.
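For example, a kernel that stages data through local memory must place a
barrier between the writes and the reads; a minimal sketch:

    // Reverse the elements within each work-group, staging them in LDS.
    __kernel void reverse_in_groups(__global const float *in,
                                    __global float *out,
                                    __local float *tile)
    {
        size_t gid = get_global_id(0);
        size_t lid = get_local_id(0);
        size_t lsz = get_local_size(0);

        tile[lid] = in[gid];            // each work-item writes one element
        barrier(CLK_LOCAL_MEM_FENCE);   // all writes visible before any read
        out[gid] = tile[lsz - 1 - lid]; // safe to read another lane's element
    }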
There are two types of synchronization between commands in a command-queue
(the event-based form is sketched after this list):

•   command-queue barrier - enforces ordering within a single queue. Any
    resulting changes to memory are available to the following commands in
    the queue.

•   events - enforce ordering between, or within, queues. Enqueued commands
    in OpenCL return an event identifying the command as well as the memory
    object updated by it. This ensures that following commands waiting on that
    event see the updated memory objects before they execute.
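A host-side sketch of the event-based form (hypothetical queue, buffer, kernel,
and size names; error handling omitted):

    cl_event write_done, kernel_done;

    // The kernel is made to wait on the event returned by the write...
    clEnqueueWriteBuffer(queue, buffer, CL_FALSE, 0, nbytes, host_ptr,
                         0, NULL, &write_done);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global_work_size, NULL,
                           1, &write_done, &kernel_done);

    // ...and the host blocks until the kernel itself has completed.
    clWaitForEvents(1, &kernel_done);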

1.2 Hardware Overview for Southern Islands Devices
A general OpenCL device comprises compute units (CUs), each of which has
sub-modules that ultimately have ALUs. A work-item (or SPMD kernel instance)
executes on an ALU, as shown in Figure 1.1.

Figure 1.1   Generalized AMD GPU Compute Device Structure for Southern
             Islands Devices
(Diagram: a GPU compute device is an array of compute units; each compute
unit comprises one Scalar Unit plus four Vector Units of 16 processing elements
each.)

For AMD Radeon™ HD 79XX devices, each of the 32 CUs has one Scalar Unit
and four Vector Units, each of which contains an array of 16 PEs. Each PE
consists of one ALU. Figure 1.2 shows only two compute units of the array that
comprises the compute device of the AMD Radeon™ HD 7XXX family. The four
Vector Units use SIMD execution of a scalar instruction. This makes it possible
to run four separate instructions at once, but they are dynamically scheduled (as
opposed to those for the AMD Radeon™ HD 69XX devices, which are statically
scheduled).

Figure 1.2   AMD Radeon™ HD 79XX Device Partial Block Diagram
(Diagram: a read/write memory interface serves the compute units, each of
which contains an LDS, an L1 cache, four Vector Units, and a Scalar Unit
backed by shared scalar (SC) and instruction (I) caches; two Asynchronous
Compute Engines/Command Processors feed the array, which shares a Level 2
cache and a GDDR5 memory system.)

In Figure 1.2, there are two command processors, which can process two
command queues concurrently. The Scalar Unit, Vector Unit, Level 1 data cache
(L1), and Local Data Share (LDS) are the components of one compute unit, of
which there are 32. The SC cache is the scalar unit data cache, and the Level
2 cache consists of instructions and data.
As noted, the AMD Radeon™ HD 79XX devices also have a scalar unit, and the
instruction stream contains both scalar and vector instructions. On each cycle, it
selects a scalar instruction and a vector instruction (as well as a memory
operation and a branch operation, if available); it issues one to the scalar unit,
the other to the vector unit; this takes four cycles to issue over the four vector
cores (the same four cycles over which the 16 units execute 64 work-items).
The number of compute units in an AMD GPU, and the way they are structured,
varies with the device family, as well as device designations within a family. Each
of these vector units possesses ALUs (processing elements). For devices in the
Northern Islands (AMD Radeon™ HD 69XX) and Southern Islands (AMD
Radeon™ HD 7XXX) families, these ALUs are arranged in four (in the Evergreen
family, there are five) SIMD arrays consisting of 16 processing elements each.
(See Section 1.3, “Hardware Overview for Evergreen and Northern Islands
Devices.”) Each of these arrays executes a single instruction across each lane
for each of a block of 16 work-items. That instruction is repeated over four cycles
to make the 64-element vector called a wavefront. On devices in the Southern
Islands family, the four stream cores execute code from four different wavefronts.
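Because these counts vary from device to device, portable code should query
them at run time rather than hard-coding them; a minimal sketch using the
standard device-info queries (error handling omitted):

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id platform;
        cl_device_id device;
        cl_uint num_cus, clock_mhz;

        // Take the first platform and the first GPU device on it.
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        // Query the compute-unit count and peak engine clock.
        clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(num_cus), &num_cus, NULL);
        clGetDeviceInfo(device, CL_DEVICE_MAX_CLOCK_FREQUENCY,
                        sizeof(clock_mhz), &clock_mhz, NULL);

        printf("%u compute units at %u MHz\n", num_cus, clock_mhz);
        return 0;
    }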

1.3 Hardware Overview for Evergreen and Northern Islands Devices
A general OpenCL device comprises compute units, each of which can have
multiple processing elements. A work-item (or SPMD kernel instance) executes
on a single processing element. The processing elements within a compute unit
can execute in lock-step using SIMD execution. Compute units, however,
execute independently (see Figure 1.3).
AMD GPUs consist of multiple compute units. The number of them and the way
they are structured varies with the device family, as well as device designations
within a family. Each of these processing elements possesses ALUs. For devices
in the Northern Islands and Southern Islands families, these ALUs are arranged
in four (in the Evergreen family, there are five) processing elements with arrays
of 16 ALUs. Each of these arrays executes a single instruction across each lane
for each of a block of 16 work-items. That instruction is repeated over four cycles
to make the 64-element vector called a wavefront. On devices in the Southern
Islands family, the four processing elements execute code from four different
wavefronts. On Northern Islands and Evergreen family devices, the four arrays
execute instructions from one wavefront, so that each work-item issues four (for
Northern Islands) or five (for Evergreen) instructions at once in a
very-long-instruction-word (VLIW) packet.
Figure 1.3 shows a simplified block diagram of a generalized AMD GPU compute
device.

Figure 1.3   Generalized AMD GPU Compute Device Structure
(Diagram: a GPU compute device is an array of compute units; each compute
unit contains processing elements, which in turn are built from ALUs.)

Figure 1.4 is a simplified diagram of an AMD GPU compute device. Different
GPU compute devices have different characteristics (such as the number of
compute units), but follow a similar design pattern.

Figure 1.4	Simplified Block Diagram of an Evergreen-Family GPU (1)

1. Much of this is transparent to the programmer.

GPU compute devices comprise groups of compute units. Each compute unit
contains numerous stream cores, which are responsible for executing kernels,
each operating on an independent data stream. Stream cores, in turn, contain
numerous processing elements, which are the fundamental,
programmable ALUs that perform integer, single-precision floating-point, double-precision floating-point, and transcendental operations. All processing elements
within a compute unit execute the same instruction sequence in lock-step for
Evergreen and Northern Islands devices; different compute units can execute
different instructions.
A processing element is arranged as a five-way or four-way (depending on the
GPU type) very long instruction word (VLIW) processor (see bottom of
Figure 1.4). Up to five scalar operations (or four, depending on the GPU type)
can be co-issued in a VLIW instruction, each of which is executed on one of
the corresponding five ALUs. ALUs can execute single-precision floating point or
integer operations. One of the five ALUs also can perform transcendental
operations (sine, cosine, logarithm, etc.). Double-precision floating point
operations are processed (where supported) by connecting two or four of the
ALUs (excluding the transcendental core) to perform a single double-precision
operation. The processing element also contains one branch execution unit to
handle branch instructions.
Different GPU compute devices have different numbers of processing elements.
For example, the ATI Radeon™ HD 5870 GPU has 20 compute units, each with
16 processing elements, and each processing element contains five ALUs; this
yields 20 × 16 × 5 = 1600 physical ALUs.

1.4 The AMD Accelerated Parallel Processing Implementation of OpenCL
AMD Accelerated Parallel Processing harnesses the tremendous processing
power of GPUs for high-performance, data-parallel computing in a wide range of
applications. The AMD Accelerated Parallel Processing system includes a
software stack and the AMD GPUs. Figure 1.5 illustrates the relationship of the
AMD Accelerated Parallel Processing components.

Figure 1.5	AMD Accelerated Parallel Processing Software Ecosystem

The AMD Accelerated Parallel Processing software stack provides end-users and
developers with a complete, flexible suite of tools to leverage the processing
power in AMD GPUs. AMD Accelerated Parallel Processing software embraces
open-systems, open-platform standards. The AMD Accelerated Parallel
Processing open platform strategy enables AMD technology partners to develop
and provide third-party development tools.
The software includes the following components:

•	OpenCL compiler and runtime
•	Performance Profiling Tools – AMD APP Profiler and AMD APP KernelAnalyzer.
•	Performance Libraries – AMD Core Math Library (ACML) for optimized NDRange-specific algorithms.

The latest generations of AMD GPUs use unified shader architectures capable
of running different kernel types interleaved on the same hardware.
Programmable GPU compute devices execute various user-developed programs,
known to graphics programmers as shaders and to compute programmers as
kernels. These GPU compute devices can execute non-graphics functions using
a data-parallel programming model that maps executions onto compute units. In
this programming model, known as AMD Accelerated Parallel Processing, arrays
of input data elements stored in memory are accessed by a number of compute
units.
Each instance of a kernel running on a compute unit is called a work-item. A
specified rectangular region of the output buffer to which work-items are mapped
is known as the n-dimensional index space, called an NDRange.
The GPU schedules the range of work-items onto a group of processing
elements, until all work-items have been processed. Subsequent kernels then
can be executed, until the application completes. A simplified view of the AMD
Accelerated Parallel Processing programming model and the mapping of
work-items to processing elements is shown in Figure 1.6.

Figure 1.6	Simplified Mapping of OpenCL onto AMD Accelerated Parallel Processing for Evergreen and Northern Islands Devices

OpenCL maps the total number of work-items to be launched onto an
n-dimensional grid (NDRange). The developer can specify how to divide these
items into work-groups. AMD GPUs execute on wavefronts (groups of work-items
executed in lock-step in a compute unit); there are an integer number of
wavefronts in each work-group. Thus, as shown in Figure 1.7, hardware that
schedules work-items for execution in the AMD Accelerated Parallel Processing
environment includes the intermediate step of specifying wavefronts within a
work-group. This permits achieving maximum performance from AMD GPUs. For
a more detailed discussion of wavefronts, see Section 1.8.2, “Wavefronts and
Work-groups,” page 1-18.
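As a simple host-side sketch (not from the SDK; the helper name and the wavefront size of 64 are assumptions, and some lower-end devices use 32), the global work size can be rounded up so every work-group contains whole wavefronts:

    /* Round a requested global work size up to a multiple of the wavefront
       size, so each work-group holds an integer number of wavefronts. */
    #define WAVEFRONT_SIZE 64    /* assumed; query the device in real code */

    size_t round_to_wavefront( size_t requested )
    {
        size_t rem = requested % WAVEFRONT_SIZE;
        return (rem == 0) ? requested : requested + (WAVEFRONT_SIZE - rem);
    }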

Figure 1.7	Work-Item Grouping Into Work-Groups and Wavefronts

1.4.1 Work-Item Processing
All stream cores within a compute unit execute the same instruction in each
cycle. A work-item can issue one VLIW instruction per clock cycle. The block of
work-items that are executed together is called a wavefront. To hide latencies
due to memory accesses and processing element operations, up to four
work-items from the same wavefront are pipelined on the same stream core. For
example, on the AMD Radeon™ HD 6970 GPU compute device, the 16
processing elements execute the same instructions for four cycles, which
effectively appears as a 64-wide compute unit in execution width.
The size of wavefronts can differ on different GPU compute devices. For
example, the AMD Radeon™ HD 54XX series graphics cards have a wavefront
size of 32 work-items. Higher-end AMD GPUs have a wavefront size of 64
work-items.
Compute units operate independently of each other, so it is possible for different
compute units to execute different instructions.

1.4.2 Flow Control
Before discussing flow control, it is necessary to clarify the relationship of a
wavefront to a work-group. If a user defines a work-group, it consists of one or
more wavefronts. A wavefront is a hardware thread with its own program counter;
it is capable of following control flow independently of other wavefronts. On a
device with a wavefront size of 64, a work-group of 64 or fewer work-items
occupies one wavefront, 65 to 128 work-items occupy two wavefronts, and so
on. For optimum hardware usage, an integer multiple of 64 work-items is
recommended.
Flow control, such as branching, is done by combining all necessary paths as a
wavefront. If work-items within a wavefront diverge, all paths are executed
serially. For example, if a work-item contains a branch with two paths, the
wavefront first executes one path, then the second path. The total time to
execute the branch is the sum of each path time. An important point is that even
if only one work-item in a wavefront diverges, the rest of the work-items in the
wavefront execute the branch. The number of work-items that must be executed
during a branch is called the branch granularity. On AMD hardware, the branch
granularity is the same as the wavefront granularity.
Masking of wavefronts is effected by constructs such as:

if(x)
{
    //items within these braces = A
    .
    .
    .
}
else
{
    //items within these braces = B
    .
    .
    .
}

The wavefront mask is set true for lanes (elements/items) in which x is true, then
A is executed. The mask is then inverted, and B is executed.
Example 1: If two branches, A and B, take the same amount of time t to execute
over a wavefront, the total time of execution, if any work-item diverges, is 2t.
Loops execute in a similar fashion, where the wavefront occupies a compute unit
as long as there is at least one work-item in the wavefront still being processed.
Thus, the total execution time for the wavefront is determined by the work-item
with the longest execution time.
Example 2: If t is the time it takes to execute a single iteration of a loop; and
within a wavefront all work-items execute the loop one time, except for a single
work-item that executes the loop 100 times, the time it takes to execute that
entire wavefront is 100t.
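As a minimal, hypothetical OpenCL C kernel (not part of the original samples) that exhibits this behavior, odd and even lanes below take different paths; within any one wavefront, the hardware executes path A with the even lanes masked off, then path B with the odd lanes masked off, so the wavefront pays time(A) + time(B):

    __kernel void divergent_branch( __global int *data )
    {
        size_t gid = get_global_id(0);
        if( gid & 1 )
            data[gid] = data[gid] * 2;   // path A: odd lanes active
        else
            data[gid] = data[gid] + 1;   // path B: even lanes active
    }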


1.4.3 Work-Item Creation
For each work-group, the GPU compute device spawns the required number of
wavefronts on a single compute unit. If there are non-active work-items within a
wavefront, the stream cores that would have been mapped to those work-items
are idle. An example is a work-group that is a non-multiple of a wavefront size
(for example: if the work-group size is 32, the wavefront is half empty and
unused).

1.5 Memory Architecture and Access
OpenCL has four memory domains: private, local, global, and constant; the AMD
Accelerated Parallel Processing system also recognizes host (CPU) and PCI
Express® (PCIe®) memory. (The four OpenCL domains are shown in the kernel
sketch after this list.)

•	private memory - specific to a work-item; it is not visible to other work-items.
•	local memory - specific to a work-group; accessible only by work-items belonging to that work-group.
•	global memory - accessible to all work-items executing in a context, as well as to the host (read, write, and map commands).
•	constant memory - read-only region for host-allocated and -initialized objects that are not changed during kernel execution.
•	host (CPU) memory - host-accessible region for an application’s data structures and program data.
•	PCIe memory - part of host (CPU) memory accessible from, and modifiable by, the host program and the GPU compute device. Modifying this memory requires synchronization between the GPU compute device and the CPU.
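The following minimal kernel sketch (the names are illustrative, not from the SDK) shows how the first four domains appear as OpenCL C address-space qualifiers:

    __kernel void domains_demo( __global   float *out,      /* global memory   */
                                __constant float *coeff,    /* constant memory */
                                __local    float *scratch ) /* local memory    */
    {
        /* x lives in private memory (typically a register). */
        float x = coeff[0] * out[get_global_id(0)];
        scratch[get_local_id(0)] = x;
        barrier( CLK_LOCAL_MEM_FENCE );      /* work-group synchronization */
        out[get_global_id(0)] = scratch[0];
    }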

Figure 1.8 illustrates the interrelationship of the memories. (Note that the
referenced color buffer is a write-only output buffer in a pixel shader that has a
predetermined location based on the pixel location.)

Figure 1.8	Interrelationship of Memory Domains for Southern Islands Devices

Figure 1.9 illustrates the standard dataflow between host (CPU) and GPU.

Figure 1.9	Dataflow between Host and GPU

There are two ways to copy data from the host to the GPU compute device
memory, as sketched in the example after this list:

•	Implicitly, by using clEnqueueMapBuffer and clEnqueueUnmapMemObject.
•	Explicitly, through clEnqueueReadBuffer and clEnqueueWriteBuffer (clEnqueueReadImage, clEnqueueWriteImage).
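As a minimal sketch (assuming queue, buf, host_ptr, and nbytes already exist, as in the samples in Section 1.10, and that string.h is included for memcpy), the two approaches look like this:

    /* Explicit copy: transfer nbytes from host_ptr into buf. */
    clEnqueueWriteBuffer( queue, buf, CL_TRUE, 0, nbytes, host_ptr,
                          0, NULL, NULL );

    /* Implicit copy: map the buffer, fill it on the host, then unmap. */
    void *p = clEnqueueMapBuffer( queue, buf, CL_TRUE, CL_MAP_WRITE,
                                  0, nbytes, 0, NULL, NULL, NULL );
    memcpy( p, host_ptr, nbytes );
    clEnqueueUnmapMemObject( queue, buf, p, 0, NULL, NULL );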

When using these interfaces, it is important to consider the amount of copying
involved. There are two copy processes: between host and PCIe, and between
PCIe and GPU compute device. This is why there is a large performance
difference between the system GFLOPS and the kernel GFLOPS.
With proper memory transfer management and the use of system pinned
memory (host/CPU memory remapped to the PCIe memory space), copying
between host (CPU) memory and PCIe memory can be skipped. Note that this
is not an easy API call to use and comes with many constraints, such as page
boundary and memory alignment.
Double copying lowers the overall system memory bandwidth. In GPU compute
device programming, pipelining and other techniques help reduce these
bottlenecks. See Chapter 4, Chapter 5, and Chapter 6 for more specifics about
optimization techniques.

1.5.1 Memory Access
Using local memory (known as local data store, or LDS, as shown in Figure 1.8)
typically is an order of magnitude faster than accessing host memory through
global memory (VRAM), which is one order of magnitude faster again than PCIe.
However, stream cores do not directly access memory; instead, they issue
memory requests through dedicated hardware units. When a work-item tries to
access memory, the work-item is transferred to the appropriate fetch unit. The
work-item then is deactivated until the access unit finishes accessing memory.
Meanwhile, other work-items can be active within the compute unit, contributing
to better performance. The data fetch units handle three basic types of memory
operations: loads, stores, and streaming stores. GPU compute devices can store
writes to random memory locations using global buffers.

1.5.2 Global Buffer
The global buffer lets applications read from, and write to, arbitrary locations in
memory. When using a global buffer, memory-read and memory-write operations
from the stream kernel are done using regular GPU compute device instructions
with the global buffer used as the source or destination for the instruction. The
programming interface is similar to load/store operations used with CPU
programs, where the relative address in the read/write buffer is specified.

1.5.3 Image Read/Write
Image reads are done by addressing the desired location in the input memory
using the fetch unit. The fetch units can process either 1D or 2D addresses.
These addresses can be normalized or un-normalized. Normalized coordinates
are between 0.0 and 1.0 (inclusive). For the fetch units to handle 2D addresses
and normalized coordinates, pre-allocated memory segments must be bound to
the fetch unit so that the correct memory address can be computed. For a single
kernel invocation, up to 128 images can be bound at once for reading, and eight
for writing. The maximum number of 2D addresses is 8192 x 8192.


Image reads are cached through the texture system (corresponding to the L2 and
L1 caches).

1.5.4 Memory Load/Store
When using a global buffer, each work-item can write to an arbitrary location
within the global buffer. Global buffers use a linear memory layout. If consecutive
addresses are written, the compute unit issues a burst write for more efficient
memory access. Only read-only buffers, such as constants, are cached.

1.6 Communication Between Host and GPU in a Compute Device
The following subsections discuss the communication between the host (CPU)
and the GPU in a compute device. This includes an overview of the PCIe bus,
processing API calls, and DMA transfers.

1.6.1 PCI Express Bus
Communication and data transfers between the system and the GPU compute
device occur on the PCIe channel. AMD Accelerated Parallel Processing
graphics cards use PCIe 2.0 x16 (second generation, 16 lanes). Generation 1
x16 has a theoretical maximum throughput of 4 GBps in each direction.
Generation 2 x16 doubles the throughput to 8 GBps in each direction. Southern
Islands AMD GPUs support PCIe 3.0 with a theoretical peak performance of
16 GBps. Actual transfer performance is CPU and chipset dependent.
Transfers from the system to the GPU compute device are done either by the
command processor or by the DMA engine. The GPU compute device also can
read and write system memory directly from the compute unit through kernel
instructions over the PCIe bus.

1.6.2 Processing API Calls: The Command Processor
The host application does not interact with the GPU compute device directly. A
driver layer translates and issues commands to the hardware on behalf of the
application.
Most commands to the GPU compute device are buffered in a command queue
on the host side. The command queue is sent to the GPU compute device, and
the commands are processed by it. There is no guarantee as to when commands
from the command queue are executed, only that they are executed in order.
Unless the GPU compute device is busy, commands are executed immediately.
Command queue elements include:

•	Kernel execution calls
•	Kernels
•	Constants
•	Transfers between device and host


1.6.3 DMA Transfers
Direct Memory Access (DMA) memory transfers can be executed separately from
the command queue using the DMA engine on the GPU compute device. DMA
calls are executed immediately, and the order of DMA calls and command queue
flushes is guaranteed.
DMA transfers can occur asynchronously. This means that a DMA transfer is
executed concurrently with other system or GPU compute operations when there
are no dependencies. However, data is not guaranteed to be ready until the DMA
engine signals that the event or transfer is completed. The application can query
the hardware for DMA event completion. If used carefully, DMA transfers are
another source of parallelization.

1.6.4 Masking Visible Devices
By default, OpenCL applications are exposed to all GPUs installed in the system;
this allows applications to use multiple GPUs to run the compute task.
In some cases, the user might want to mask the visibility of the GPUs seen by
the OpenCL application. One example is to dedicate one GPU for regular
graphics operations and the other three (in a four-GPU system) for Compute. To
do that, set the GPU_DEVICE_ORDINAL environment parameter, which is a
comma-separated list variable:

•	Under Windows: set GPU_DEVICE_ORDINAL=1,2,3
•	Under Linux: export GPU_DEVICE_ORDINAL=1,2,3

Another example is a system with eight GPUs, where two distinct OpenCL
applications are running at the same time. The administrator might want to set
GPU_DEVICE_ORDINAL to 0,1,2,3 for the first application, and 4,5,6,7 for the
second application; thus, partitioning the available GPUs so that both
applications can run at the same time.

1.7 GPU Compute Device Scheduling
GPU compute devices are very efficient at parallelizing large numbers of
work-items in a manner transparent to the application. Each GPU compute device
uses the large number of wavefronts to hide memory access latencies by having
the resource scheduler switch the active wavefront in a given compute unit
whenever the current wavefront is waiting for a memory access to complete.
Hiding memory access latencies requires that each work-item contain a large
number of ALU operations per memory load/store.
Figure 1.10 shows the timing of a simplified execution of work-items in a single
stream core. At time 0, the work-items are queued and waiting for execution. In
this example, only four work-items (T0…T3) are scheduled for the compute unit.
The hardware limit for the number of active work-items is dependent on the
resource usage (such as the number of active registers used) of the program
being executed. An optimally programmed GPU compute device typically has
thousands of active work-items.

Figure 1.10	Simplified Execution Of Work-Items On A Single Stream Core
At runtime, work-item T0 executes until cycle 20; at this time, a stall occurs due
to a memory fetch request. The scheduler then begins execution of the next
work-item, T1. Work-item T1 executes until it stalls or completes. New work-items
execute, and the process continues until the available number of active
work-items is reached. The scheduler then returns to the first work-item, T0.
If the data work-item T0 is waiting for has returned from memory, T0 continues
execution. In the example in Figure 1.10, the data is ready, so T0 continues.
Since there were enough work-items and processing element operations to cover
the long memory latencies, the stream core does not idle. This method of
memory latency hiding helps the GPU compute device achieve maximum
performance.
If none of T0 – T3 are runnable, the stream core waits (stalls) until one of T0 –
T3 is ready to execute. In the example shown in Figure 1.11, T0 is the first to
continue execution.


Figure 1.11	Stream Core Stall Due to Data Dependency
The causes for this situation are discussed in the following sections.

1.8 Terminology
1.8.1 Compute Kernel
To define a compute kernel, it is first necessary to define a kernel. A kernel is a
small unit of execution that performs a clearly defined function and that can be
executed in parallel. Such a kernel can be executed on each element of an input
stream (called an NDRange), or simply at each point in an arbitrary index space.
A kernel is analogous to, and on some devices identical to, what graphics
programmers call a shader program. This kernel is not to be confused with an
OS kernel, which controls hardware. The most basic form of an NDRange is
simply mapped over input data and produces one output item for each input
tuple. Subsequent extensions of the basic model provide random-access
functionality, variable output counts, and reduction/accumulation operations.
Kernels are specified using the kernel keyword.
A compute kernel is a specific type of kernel that is not part of the traditional
graphics pipeline. The compute kernel type can be used for graphics, but its
strength lies in using it for non-graphics fields such as physics, AI, modeling,
HPC, and various other computationally intensive applications.


1.8.1.1 Work-Item Spawn Order
In a compute kernel, the work-item spawn order is sequential. This means that
on a chip with N work-items per wavefront, the first N work-items go to wavefront
0, the second N work-items go to wavefront 1, etc. Thus, the work-item IDs for
wavefront K are in the range (K•N) to ((K+1)•N) - 1. For example, with N = 64,
wavefront 1 contains the work-items with IDs 64 through 127.

1.8.2 Wavefronts and Work-groups
Wavefronts and work-groups are two concepts relating to compute kernels that
provide data-parallel granularity. A wavefront executes a number of work-items
in lock step relative to each other. Sixteen work-items are executed in parallel
across the vector unit, and the whole wavefront is covered over four clock cycles.
It is the lowest level at which flow control can affect execution. This means that
if two work-items inside of a wavefront take divergent paths of flow control, all
work-items in the wavefront execute both paths of flow control.
Grouping is a higher-level granularity of data parallelism that is enforced in
software, not hardware. Synchronization points in a kernel guarantee that all
work-items in a work-group reach that point (barrier) in the code before the next
statement is executed.
Work-groups are composed of wavefronts. Best performance is attained when
the group size is an integer multiple of the wavefront size.

1.8.3 Local Data Store (LDS)
The LDS is a high-speed, low-latency memory private to each compute unit. It is
a full gather/scatter model: a work-group can write anywhere in its allocated
space. This model is unchanged for the AMD Radeon™ HD 7XXX series. The
constraints of the current LDS model are:
1. The LDS size is allocated per work-group. Each work-group specifies how
much of the LDS it requires (see the host-side sketch after this list). The
hardware scheduler uses this information to determine which work-groups can
share a compute unit.
2. Data can only be shared within work-items in a work-group.
3. Memory accesses outside of the work-group result in undefined behavior.
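As a host-side sketch, the LDS requirement for a __local kernel parameter is declared through clSetKernelArg by passing a size and a NULL pointer, as the parallel min() sample in Section 1.10.2 also does (the kernel handle and the size of 64 cl_uints here are illustrative):

    /* Reserve 64 cl_uints of LDS per work-group for __local argument 2. */
    clSetKernelArg( kernel, 2, 64 * sizeof(cl_uint), NULL );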

1.9 Programming Model
The OpenCL programming model is based on the notion of a host device,
supported by an application API, and a number of devices connected through a
bus. These are programmed using OpenCL C. The host API is divided into
platform and runtime layers. OpenCL C is a C-like language with extensions for
parallel programming such as memory fence operations and barriers. Figure 1.12
illustrates this model with queues of commands, reading/writing data, and
executing kernels for specific devices.


Figure 1.12	OpenCL Programming Model
The devices are capable of running data- and task-parallel work. A kernel can be
executed as a function of multi-dimensional domains of indices. Each element is
called a work-item; the total number of indices is defined as the global work-size.
The global work-size can be divided into sub-domains, called work-groups, and
individual work-items within a group can communicate through global or locally
shared memory. Work-items are synchronized through barrier or fence
operations. Figure 1.12 is a representation of the host/device architecture with a
single platform, consisting of a GPU and a CPU.
An OpenCL application is built by first querying the runtime to determine which
platforms are present. There can be any number of different OpenCL
implementations installed on a single system. The desired OpenCL platform can
be selected by matching the platform vendor string to the desired vendor name,
such as “Advanced Micro Devices, Inc.” The next step is to create a context. As
shown in Figure 1.12, an OpenCL context has associated with it a number of
compute devices (for example, CPU or GPU devices). Within a context, OpenCL
guarantees a relaxed consistency between these devices. This means that
memory objects, such as buffers or images, are allocated per context; but
changes made by one device are only guaranteed to be visible by another device
at well-defined synchronization points. For this, OpenCL provides events, with the
ability to synchronize on a given event to enforce the correct order of execution.
Many operations are performed with respect to a given context; there also are
many operations that are specific to a device. For example, program compilation
and kernel execution are done on a per-device basis. Performing work with a
device, such as executing kernels or moving data to and from the device’s local
memory, is done using a corresponding command queue. A command queue is
associated with a single device and a given context; all work for a specific device
is done through this interface. Note that while a single command queue can be
associated with only a single device, there is no limit to the number of command
queues that can point to the same device. For example, it is possible to have
one command queue for executing kernels and a command queue for managing
data transfers between the host and the device.
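A minimal sketch of that arrangement (ctx and dev are assumed to have been created as in the samples in Section 1.10):

    /* Two queues on the same device: one for kernels, one for transfers. */
    cl_command_queue exec_queue = clCreateCommandQueue( ctx, dev, 0, NULL );
    cl_command_queue copy_queue = clCreateCommandQueue( ctx, dev, 0, NULL );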
Most OpenCL programs follow the same pattern. Given a specific platform, select
a device or devices to create a context, allocate memory, create device-specific
command queues, and perform data transfers and computations. Generally, the
platform is the gateway to accessing specific devices; given these devices and
a corresponding context, the application is independent of the platform. Given a
context, the application can:

•	Create one or more command queues.
•	Create programs to run on one or more associated devices.
•	Create kernels within those programs.
•	Allocate memory buffers or images, either on the host or on the device(s). (Memory can be copied between the host and device.)
•	Write data to the device.
•	Submit the kernel (with appropriate arguments) to the command queue for execution.
•	Read data back to the host from the device.

The relationship between context(s), device(s), buffer(s), program(s), kernel(s),
and command queue(s) is best seen by looking at sample code.

1.10 Example Programs
The following subsections provide simple programming examples with
explanatory comments.

1.10.1 First Example: Simple Buffer Write
This sample shows a minimalist OpenCL C program that sets a given buffer to
some value. It illustrates the basic programming steps with a minimum amount
of code. This sample contains no error checks and the code is not generalized.
Yet, many simple test programs might look very similar. The entire code for this
sample is provided at the end of this section.
1. The host program must select a platform, which is an abstraction for a given
OpenCL implementation. Implementations by multiple vendors can coexist on
a host, and the sample uses the first one available.
2. A device id for a GPU device is requested. A CPU device could be requested
by using CL_DEVICE_TYPE_CPU instead. The device can be a physical device,
such as a given GPU, or an abstracted device, such as the collection of all
CPU cores on the host.
3. On the selected device, an OpenCL context is created. A context ties
together a device, memory buffers related to that device, OpenCL programs,
and command queues. Note that buffers related to a device can reside on
either the host or the device. Many OpenCL programs have only a single
context, program, and command queue.
4. Before an OpenCL kernel can be launched, its program source is compiled,
and a handle to the kernel is created.
5. A memory buffer is allocated in the context.
6. The kernel is launched. While it is necessary to specify the global work size,
OpenCL determines a good local work size for this device. Since the kernel
was launched asynchronously, clFinish() is used to wait for completion.
7. The data is mapped to the host for examination. Calling
clEnqueueMapBuffer ensures the visibility of the buffer on the host, which in
this case probably includes a physical transfer. Alternatively, we could use
clEnqueueReadBuffer(), which requires a pre-allocated host-side buffer.

Example Code 1 –

//
// Copyright (c) 2010 Advanced Micro Devices, Inc. All rights reserved.
//
// A minimalist OpenCL program.

#include <CL/cl.h>
#include <stdio.h>

#define NWITEMS 512

// A simple memset kernel
const char *source =
"__kernel void memset( __global uint *dst )              \n"
"{                                                       \n"
"    dst[get_global_id(0)] = get_global_id(0);           \n"
"}                                                       \n";

int main(int argc, char ** argv)
{
    // 1. Get a platform.
    cl_platform_id platform;
    clGetPlatformIDs( 1, &platform, NULL );

    // 2. Find a gpu device.
    cl_device_id device;
    clGetDeviceIDs( platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL );

    // 3. Create a context and command queue on that device.
    cl_context context = clCreateContext( NULL, 1, &device,
                                          NULL, NULL, NULL );
    cl_command_queue queue = clCreateCommandQueue( context, device,
                                                   0, NULL );

    // 4. Perform runtime source compilation, and obtain kernel entry point.
    cl_program program = clCreateProgramWithSource( context, 1, &source,
                                                    NULL, NULL );
    clBuildProgram( program, 1, &device, NULL, NULL, NULL );
    cl_kernel kernel = clCreateKernel( program, "memset", NULL );

    // 5. Create a data buffer.
    cl_mem buffer = clCreateBuffer( context, CL_MEM_WRITE_ONLY,
                                    NWITEMS * sizeof(cl_uint),
                                    NULL, NULL );

    // 6. Launch the kernel. Let OpenCL pick the local work size.
    size_t global_work_size = NWITEMS;
    clSetKernelArg( kernel, 0, sizeof(buffer), (void*) &buffer );
    clEnqueueNDRangeKernel( queue, kernel, 1, NULL, &global_work_size,
                            NULL, 0, NULL, NULL );
    clFinish( queue );

    // 7. Look at the results via synchronous buffer map.
    cl_uint *ptr;
    ptr = (cl_uint *) clEnqueueMapBuffer( queue, buffer, CL_TRUE,
                                          CL_MAP_READ, 0,
                                          NWITEMS * sizeof(cl_uint),
                                          0, NULL, NULL, NULL );
    int i;
    for( i = 0; i < NWITEMS; i++ )
        printf("%d %d\n", i, ptr[i]);

    return 0;
}


1.10.2 Example: Parallel Min() Function
This medium-complexity sample shows how to implement an efficient parallel
min() function.
The code is written so that it performs very well on either CPU or GPU. The
number of threads launched depends on how many hardware processors are
available. Each thread walks the source buffer, using a device-optimal access
pattern selected at runtime. A multi-stage reduction using __local and __global
atomics produces the single result value.
The sample includes a number of programming techniques useful for simple
tests. Only minimal error checking and resource tear-down is used.
Runtime Code –
1. The source memory buffer is allocated, and initialized with a random pattern.
Also, the actual min() value for this data set is serially computed, in order to
later verify the parallel result.
2. The compiler is instructed to dump the intermediate IL and ISA files for
further analysis.
3. The main section of the code, including device setup, CL data buffer creation,
and code compilation, is executed for each device, in this case for CPU and
GPU. Since the source memory buffer exists on the host, it is shared. All
other resources are device-specific.
4. The global work size is computed for each device. A simple heuristic is used
to ensure an optimal number of threads on each device. For the CPU, a
given CL implementation can translate one work-item per CL compute unit
into one thread per CPU core.
On the GPU, an initial multiple of the wavefront size is used, which is
adjusted to ensure even divisibility of the input data over all threads. The
value of 7 is a minimum value to keep all independent hardware units of the
compute units busy, and to provide a minimum amount of memory latency
hiding for a kernel with little ALU activity.
5. After the kernels are built, the code prints errors that occurred during kernel
compilation and linking.
6. The main loop is set up so that the measured timing reflects the actual kernel
performance. If a sufficiently large NLOOPS is chosen, effects from kernel
launch time and delayed buffer copies to the device by the CL runtime are
minimized. Note that while only a single clFinish() is executed at the end
of the timing run, the two kernels are always linked using an event to ensure
serial execution.
The bandwidth is expressed as “number of input bytes processed.” For
high-end graphics cards, the bandwidth of this algorithm is about an order of
magnitude higher than that of the CPU, due to the parallelized memory
subsystem of the graphics card.

7. The results then are checked against the comparison value. This also
establishes that the result is the same on both CPU and GPU, which can
serve as the first verification test for newly written kernel code.
8. Note the use of the debug buffer to obtain some runtime variables. Debug
buffers also can be used to create short execution traces for each thread,
assuming the device has enough memory.
9. You can use the Timer.cpp and Timer.h files from the TransferOverlap
sample, which is in the SDK samples.
Kernel Code –
10. The code uses four-component vectors (uint4) so the compiler can identify
concurrent execution paths as often as possible. On the GPU, this can be
used to further optimize memory accesses and distribution across ALUs. On
the CPU, it can be used to enable SSE-like execution.
11. The kernel sets up a memory access pattern based on the device. For the
CPU, the source buffer is chopped into continuous buffers: one per thread.
Each CPU thread serially walks through its buffer portion, which results in
good cache and prefetch behavior for each core.
On the GPU, each thread walks the source buffer using a stride of the total
number of threads. As many threads are executed in parallel, the result is a
maximally coalesced memory pattern requested from the memory back-end.
For example, if each compute unit has 16 physical processors, 16 uint4
requests are produced in parallel, per clock, for a total of 256 bytes per clock.
12. The kernel code uses a reduction consisting of three stages: __global to
__private, __private to __local, which is flushed to __global, and finally
__global to __global. In the first loop, each thread walks __global
memory, and reduces all values into a min value in __private memory
(typically, a register). This is the bulk of the work, and is mainly bound by
__global memory bandwidth. The subsequent reduction stages are brief in
comparison.
13. Next, all per-thread minimum values inside the work-group are reduced to a
__local value, using an atomic operation. Access to the __local value is
serialized; however, the number of these operations is very small compared
to the work of the previous reduction stage. The threads within a work-group
are synchronized through a local barrier(). The reduced min value is
stored in __global memory.
14. After all work-groups are finished, a second kernel reduces all work-group
values into a single value in __global memory, using an atomic operation.
This is a minor contributor to the overall runtime.

Example Code 3 –

//
// Copyright (c) 2010 Advanced Micro Devices, Inc. All rights reserved.
//
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "Timer.h"

#define NDEVS 2

// A parallel min() kernel that works well on CPU and GPU
const char *kernel_source =
"                                                                       \n"
"#pragma OPENCL EXTENSION cl_khr_local_int32_extended_atomics : enable  \n"
"#pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable \n"
"                                                                       \n"
" // 9. The source buffer is accessed as 4-vectors.                     \n"
"                                                                       \n"
"__kernel void minp( __global uint4 *src,                               \n"
"                    __global uint  *gmin,                              \n"
"                    __local  uint  *lmin,                              \n"
"                    __global uint  *dbg,                               \n"
"                    int             nitems,                            \n"
"                    uint            dev )                              \n"
"{                                                                      \n"
"    // 10. Set up __global memory access pattern.                      \n"
"                                                                       \n"
"    uint count  = ( nitems / 4 ) / get_global_size(0);                 \n"
"    uint idx    = (dev == 0) ? get_global_id(0) * count                \n"
"                             : get_global_id(0);                       \n"
"    uint stride = (dev == 0) ? 1 : get_global_size(0);                 \n"
"    uint pmin   = (uint) -1;                                           \n"
"                                                                       \n"
"    // 11. First, compute private min, for this work-item.             \n"
"                                                                       \n"
"    for( int n=0; n < count; n++, idx += stride )                      \n"
"    {                                                                  \n"
"        pmin = min( pmin, src[idx].x );                                \n"
"        pmin = min( pmin, src[idx].y );                                \n"
"        pmin = min( pmin, src[idx].z );                                \n"
"        pmin = min( pmin, src[idx].w );                                \n"
"    }                                                                  \n"
"                                                                       \n"
"    // 12. Reduce min values inside work-group.                        \n"
"                                                                       \n"
"    if( get_local_id(0) == 0 )                                         \n"
"        lmin[0] = (uint) -1;                                           \n"
"                                                                       \n"
"    barrier( CLK_LOCAL_MEM_FENCE );                                    \n"
"                                                                       \n"
"    (void) atom_min( lmin, pmin );                                     \n"
"                                                                       \n"
"    barrier( CLK_LOCAL_MEM_FENCE );                                    \n"
"                                                                       \n"
"    // Write out to __global.                                          \n"
"                                                                       \n"
"    if( get_local_id(0) == 0 )                                         \n"
"        gmin[ get_group_id(0) ] = lmin[0];                             \n"
"                                                                       \n"
"    // Dump some debug information.                                    \n"
"                                                                       \n"
"    if( get_global_id(0) == 0 )                                        \n"
"    {                                                                  \n"
"        dbg[0] = get_num_groups(0);                                    \n"
"        dbg[1] = get_global_size(0);                                   \n"
"        dbg[2] = count;                                                \n"
"        dbg[3] = stride;                                               \n"
"    }                                                                  \n"
"}                                                                      \n"
"                                                                       \n"
"// 13. Reduce work-group min values from __global to __global.         \n"
"                                                                       \n"
"__kernel void reduce( __global uint4 *src,                             \n"
"                      __global uint  *gmin )                           \n"
"{                                                                      \n"
"    (void) atom_min( gmin, gmin[get_global_id(0)] );                   \n"
"}                                                                      \n";

int main(int argc, char ** argv)
{
    cl_platform_id platform;

    int            dev, nw;
    cl_device_type devs[NDEVS] = { CL_DEVICE_TYPE_CPU,
                                   CL_DEVICE_TYPE_GPU };

    cl_uint      *src_ptr;
    unsigned int num_src_items = 4096*4096;

    // 1. quick & dirty MWC random init of source buffer.
    // Random seed (portable).
    time_t ltime;
    time(&ltime);

    src_ptr = (cl_uint *) malloc( num_src_items * sizeof(cl_uint) );

    cl_uint a   = (cl_uint) ltime,
            b   = (cl_uint) ltime;
    cl_uint min = (cl_uint) -1;

    // Do serial computation of min() for result verification.
    for( int i=0; i < num_src_items; i++ )
    {
        src_ptr[i] = (cl_uint) (b = ( a * ( b & 65535 )) + ( b >> 16 ));
        min = src_ptr[i] < min ? src_ptr[i] : min;
    }

    // Get a platform.
    clGetPlatformIDs( 1, &platform, NULL );

    // 3. Iterate over devices.
    for(dev=0; dev < NDEVS; dev++)
    {
        cl_device_id     device;
        cl_context       context;
        cl_command_queue queue;
        cl_program       program;
        cl_kernel        minp;
        cl_kernel        reduce;

        cl_mem           src_buf;
        cl_mem           dst_buf;
        cl_mem           dbg_buf;

        cl_uint          *dst_ptr,
                         *dbg_ptr;

        printf("\n%s: ", dev == 0 ? "CPU" : "GPU");

        // Find the device.
        clGetDeviceIDs( platform, devs[dev], 1, &device, NULL );

        // 4. Compute work sizes.
        cl_uint compute_units;
        size_t  global_work_size;
        size_t  local_work_size;
        size_t  num_groups;

        clGetDeviceInfo( device, CL_DEVICE_MAX_COMPUTE_UNITS,
                         sizeof(cl_uint), &compute_units, NULL );

        if( devs[dev] == CL_DEVICE_TYPE_CPU )
        {
            global_work_size = compute_units * 1;    // 1 thread per core
            local_work_size  = 1;
        }
        else
        {
            cl_uint ws = 64;

            global_work_size = compute_units * 7 * ws; // 7 wavefronts per SIMD

            while( (num_src_items / 4) % global_work_size != 0 )
                global_work_size += ws;

            local_work_size = ws;
        }

        num_groups = global_work_size / local_work_size;

        // Create a context and command queue on that device.
        context = clCreateContext( NULL, 1, &device, NULL, NULL, NULL );
        queue   = clCreateCommandQueue( context, device, 0, NULL );

        // Minimal error check.
        if( queue == NULL )
        {
            printf("Compute device setup failed\n");
            return(-1);
        }

        // Perform runtime source compilation, and obtain kernel entry point.
        program = clCreateProgramWithSource( context, 1, &kernel_source,
                                             NULL, NULL );

        // Tell compiler to dump intermediate .il and .isa GPU files.
        cl_int ret = clBuildProgram( program, 1, &device,
                                     "-save-temps", NULL, NULL );

        // 5. Print compiler error messages
        if(ret != CL_SUCCESS)
        {
            printf("clBuildProgram failed: %d\n", ret);

            char buf[0x10000];
            clGetProgramBuildInfo( program, device, CL_PROGRAM_BUILD_LOG,
                                   0x10000, buf, NULL );
            printf("\n%s\n", buf);
            return(-1);
        }

        minp   = clCreateKernel( program, "minp",   NULL );
        reduce = clCreateKernel( program, "reduce", NULL );

        // Create input, output and debug buffers.
        src_buf = clCreateBuffer( context,
                                  CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                  num_src_items * sizeof(cl_uint),
                                  src_ptr, NULL );

        dst_buf = clCreateBuffer( context, CL_MEM_READ_WRITE,
                                  num_groups * sizeof(cl_uint),
                                  NULL, NULL );

        dbg_buf = clCreateBuffer( context, CL_MEM_WRITE_ONLY,
                                  global_work_size * sizeof(cl_uint),
                                  NULL, NULL );

        clSetKernelArg(minp, 0, sizeof(void *),        (void*) &src_buf);
        clSetKernelArg(minp, 1, sizeof(void *),        (void*) &dst_buf);
        clSetKernelArg(minp, 2, 1*sizeof(cl_uint),     (void*) NULL);
        clSetKernelArg(minp, 3, sizeof(void *),        (void*) &dbg_buf);
        clSetKernelArg(minp, 4, sizeof(num_src_items), (void*) &num_src_items);
        clSetKernelArg(minp, 5, sizeof(dev),           (void*) &dev);

        clSetKernelArg(reduce, 0, sizeof(void *), (void*) &src_buf);
        clSetKernelArg(reduce, 1, sizeof(void *), (void*) &dst_buf);

        CPerfCounter t;
        t.Reset();
        t.Start();

        // 6. Main timing loop.
        #define NLOOPS 500

        cl_event ev;
        int nloops = NLOOPS;

        while(nloops--)
        {
            clEnqueueNDRangeKernel( queue, minp, 1, NULL,
                                    &global_work_size, &local_work_size,
                                    0, NULL, &ev );

            clEnqueueNDRangeKernel( queue, reduce, 1, NULL,
                                    &num_groups, NULL, 1, &ev, NULL );
        }

        clFinish( queue );
        t.Stop();

        printf("B/W %.2f GB/sec, ", ((float) num_src_items *
                sizeof(cl_uint) * NLOOPS) / t.GetElapsedTime() / 1e9 );

        // 7. Look at the results via synchronous buffer map.
        dst_ptr = (cl_uint *) clEnqueueMapBuffer( queue, dst_buf, CL_TRUE,
                                                  CL_MAP_READ, 0,
                                                  num_groups * sizeof(cl_uint),
                                                  0, NULL, NULL, NULL );

        dbg_ptr = (cl_uint *) clEnqueueMapBuffer( queue, dbg_buf, CL_TRUE,
                                                  CL_MAP_READ, 0,
                                                  global_work_size *
                                                  sizeof(cl_uint),
                                                  0, NULL, NULL, NULL );

        // 8. Print some debug info.
        printf("%d groups, %d threads, count %d, stride %d\n", dbg_ptr[0],
                                                               dbg_ptr[1],
                                                               dbg_ptr[2],
                                                               dbg_ptr[3] );

        if( dst_ptr[0] == min )
            printf("result correct\n");
        else
            printf("result INcorrect\n");
    }

    printf("\n");
    return 0;
}


Chapter 2
Building and Running OpenCL Programs
The compiler tool-chain provides a common framework for both CPUs and
GPUs, sharing the front-end and some high-level compiler transformations. The
back-ends are optimized for the device type (CPU or GPU). Figure 2.1 is a
high-level diagram showing the general compilation path of applications using
OpenCL. Functions of an application that benefit from acceleration are re-written
in OpenCL and become the OpenCL source. The code calling these functions
is changed to use the OpenCL API. The rest of the application remains
unchanged. The kernels are compiled by the OpenCL compiler to either CPU
binaries or GPU binaries, depending on the target device.

Figure 2.1	OpenCL Compiler Toolchain

For CPU processing, the OpenCL runtime uses the LLVM AS to generate x86
binaries. The OpenCL runtime automatically determines the number of
processing elements, or cores, present in the CPU and distributes the OpenCL
kernel between them.
For GPU processing, the OpenCL runtime post-processes the incomplete AMD
IL from the OpenCL compiler and turns it into complete AMD IL. This adds
macros (from a macro database, similar to the built-in library) specific to the
GPU. The OpenCL Runtime layer then removes unneeded functions and passes
the complete IL to the CAL compiler for compilation to GPU-specific binaries.

2.1 Compiling the Program
An OpenCL application consists of a host program (C/C++) and an optional
kernel program (.cl). To compile an OpenCL application, the host program must
be compiled; this can be done using an off-the-shelf compiler such as g++ or
MSVC++. The application kernels are compiled into device-specific binaries
using the OpenCL compiler.
This compiler uses a standard C front-end, as well as the low-level virtual
machine (LLVM) framework, with extensions for OpenCL. The compiler starts
with the OpenCL source that the user program passes through the OpenCL
runtime interface (Figure 2.1). The front-end translates the OpenCL source to
LLVM IR. It keeps OpenCL-specific information as metadata structures. (For
example, to debug kernels, the front-end creates metadata structures to hold the
debug information; also, a pass is inserted to translate this into LLVM debug
nodes, which include the line numbers and source code mapping.) The
front-end supports additional data-types (int4, float8, etc.), additional keywords
(kernel, global, etc.) and built-in functions (get_global_id(), barrier(), etc.). Also, it
performs additional syntactic and semantic checks to ensure the kernels meet
the OpenCL specification. The input to the LLVM linker is the output of the
front-end and the library of built-in functions. This links in the built-in OpenCL functions
required by the source and transfers the data to the optimizer, which outputs
optimized LLVM IR.
For GPU processing, the LLVM IR-to-CAL IL module receives LLVM IR and
generates optimized IL for a specific GPU type in an incomplete format, which is
passed to the OpenCL runtime, along with some metadata for the runtime layer
to finish processing.
For CPU processing, LLVM AS generates x86 binary.

2.1.1 Compiling on Windows

Compiling OpenCL applications on Windows requires that Visual Studio 2008
Professional Edition (or later) or the Intel C (C++) compiler be installed. All C++
files must be added to the project, which must have the following settings.

•	Project Properties → C/C++ → Additional Include Directories
	These must include $(ATISTREAMSDKROOT)/include for OpenCL headers. Optionally, they can include $(ATISTREAMSDKSAMPLESROOT)/include for SDKUtil headers.
•	Project Properties → C/C++ → Preprocessor Definitions
	These must define ATI_OS_WIN.
•	Project Properties → Linker → Additional Library Directories
	These must include $(ATISTREAMSDKROOT)/lib/x86 for OpenCL libraries. Optionally, they can include $(ATISTREAMSDKSAMPLESROOT)/lib/x86 for SDKUtil libraries.
•	Project Properties → Linker → Input → Additional Dependencies
	These must include OpenCL.lib. Optionally, they can include SDKUtil.lib.

2.1.2 Compiling on Linux
Compiling OpenCL applications on Linux requires that the gcc or the Intel C
compiler be installed. There are two major steps: compiling and linking.
1. Compile all the C++ files (Template.cpp), and get the object files.
For 32-bit object files on a 32-bit system, or 64-bit object files on 64-bit
system:
g++ -o Template.o -DATI_OS_LINUX -c Template.cpp -I$ATISTREAMSDKROOT/include

For building 32-bit object files on a 64-bit system:
g++ -o Template.o -DATI_OS_LINUX -m32 -c Template.cpp -I$ATISTREAMSDKROOT/include

2. Link all the object files generated in the previous step to the OpenCL library
and create an executable.
For linking to a 64-bit library:
g++ -o Template Template.o -lOpenCL -L$ATISTREAMSDKROOT/lib/x86_64

For linking to a 32-bit library:
g++ -o Template Template.o -lOpenCL -L$ATISTREAMSDKROOT/lib/x86

The OpenCL samples in the SDK provided by AMD Accelerated Parallel
Processing depend on the SDKUtil library. In Linux, the samples use the shipped
SDKUtil.lib, whether the sample is built for release or debug. When
compiling all samples from the samples/opencl folder, the SDKUtil.lib is
created first; then, the samples use this generated library. When compiling the
SDKUtil library, the created library replaces the shipped library.
The following are linking options if the samples depend on the SDKUtil Library
(assuming the SDKUtil library is created in $ATISTREAMSDKROOT/lib/x86_64 for
64-bit libraries, or $ATISTREAMSDKROOT/lib/x86 for 32-bit libraries).
g++ -o Template Template.o -lSDKUtil -lOpenCL -L$ATISTREAMSDKROOT/lib/x86_64
g++ -o Template Template.o -lSDKUtil -lOpenCL -L$ATISTREAMSDKROOT/lib/x86


2.1.3 Supported Standard OpenCL Compiler Options
The currently supported options are:

•	-I dir — Add the directory dir to the list of directories to be searched for header files. When parsing #include directives, the OpenCL compiler resolves relative paths using the current working directory of the application.
•	-D name — Predefine name as a macro, with definition = 1. For -D name=definition, the contents of definition are tokenized and processed as if they appeared during translation phase three in a #define directive. In particular, the definition is truncated by embedded newline characters. -D options are processed in the order they are given in the options argument to clBuildProgram.

2.1.4 AMD-Developed Supplemental Compiler Options

The following supported options are not part of the OpenCL specification:

•	-g — This is an experimental feature that lets you use the GNU project debugger, GDB, to debug kernels on x86 CPUs running Linux or cygwin/minGW under Windows. For more details, see Chapter 3, “Debugging OpenCL.” This option does not affect the default optimization of the OpenCL code.
•	-O0 — Specifies to the compiler not to optimize. This is equivalent to the OpenCL standard option -cl-opt-disable.
•	-f[no-]bin-source — Does [not] generate OpenCL source in the .source section. For more information, see Appendix E, “OpenCL Binary Image Format (BIF) v2.0.”
•	-f[no-]bin-llvmir — Does [not] generate LLVM IR in the .llvmir section. For more information, see Appendix E, “OpenCL Binary Image Format (BIF) v2.0.”
•	-f[no-]bin-amdil — Does [not] generate AMD IL in the .amdil section. For more information, see Appendix E, “OpenCL Binary Image Format (BIF) v2.0.”
•	-f[no-]bin-exe — Does [not] generate the executable (ISA) in the .text section. For more information, see Appendix E, “OpenCL Binary Image Format (BIF) v2.0.”
•	-save-temps[=<prefix>] — This option dumps intermediate temporary files, such as IL and ISA code, for each OpenCL kernel. If <prefix> is not given, temporary files are saved in the default temporary directory (the current directory for Linux, C:\Users\<user>\AppData\Local for Windows). If <prefix> is given, those temporary files are saved with the given <prefix>. If <prefix> is an absolute path prefix, such as C:\your\work\dir\mydumpprefix, those temporaries are saved under C:\your\work\dir, with mydumpprefix as prefix to all temporary names. For example,

	-save-temps
		under the default directory: _temp_nn_xxx_yyy.il, _temp_nn_xxx_yyy.isa
	-save-temps=aaa
		under the default directory: aaa_nn_xxx_yyy.il, aaa_nn_xxx_yyy.isa
	-save-temps=C:\you\dir\bbb
		under C:\you\dir: bbb_nn_xxx_yyy.il, bbb_nn_xxx_yyy.isa

	where xxx and yyy are the device name and kernel name for this build, respectively, and nn is an internal number to identify a build to avoid overriding temporary files. Note that this naming convention is subject to change.
To avoid source changes, two environment variables can be used to change CL
options at run time.

• AMD_OCL_BUILD_OPTIONS — Overrides the CL options specified in
clBuildProgram().

• AMD_OCL_BUILD_OPTIONS_APPEND — Appends options to those specified in
clBuildProgram().

2.2 Running the Program
The runtime system assigns the work in the command queues to the underlying
devices. Commands are placed into the queue using the clEnqueue commands
shown in the listing below.
• clCreateCommandQueue() — Create a command queue for a specific device (CPU,
GPU).

• clCreateProgramWithSource() / clCreateProgramWithBinary() — Create a program
object using the source code of the application kernels.

• clBuildProgram() — Compile and link to create a program executable from
the program source or binary.

• clCreateKernel() — Creates a kernel object from the program object.

• clCreateBuffer() — Creates a buffer object for use via OpenCL kernels.

• clSetKernelArg() / clEnqueueNDRangeKernel() — Set the kernel arguments, and
enqueue the kernel in a command queue.

• clEnqueueReadBuffer(), clEnqueueWriteBuffer() — Enqueue a command in a
command queue to read from a buffer object to host memory, or write to the
buffer object from host memory.

• clEnqueueWaitForEvents() — Wait for the specified events to complete.
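
A minimal sketch of this sequence using the C API is shown below; error
handling is omitted, the context and device are assumed to already exist, and
the kernel name vecadd, the source string src, and the hostPtr/nBytes
variables are illustrative placeholders.

/* Sketch of the sequence in the listing above (OpenCL 1.x C API). */
cl_int err;
cl_command_queue queue = clCreateCommandQueue(context, device, 0, &err);
cl_program program = clCreateProgramWithSource(context, 1, &src, NULL, &err);
err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);
cl_kernel kernel = clCreateKernel(program, "vecadd", &err);
cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE, nBytes, NULL, &err);
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
size_t globalSize = 1024;                     /* illustrative NDRange size */
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL,
                             0, NULL, NULL);
err = clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, nBytes, hostPtr,
                          0, NULL, NULL);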


The commands can be broadly classified into three categories:

• Kernel commands (for example, clEnqueueNDRangeKernel(), etc.),

• Memory commands (for example, clEnqueueReadBuffer(), etc.), and

• Event commands (for example, clEnqueueWaitForEvents(), etc.).

As illustrated in Figure 2.2, the application can create multiple command queues
(some in libraries, for different components of the application, etc.). These
queues are muxed into one queue per device type. The figure shows command
queues 1 and 3 merged into one CPU device queue (blue arrows); command
queue 2 (and possibly others) are merged into the GPU device queue (red
arrow). The device queue then schedules work onto the multiple compute
resources present in the device. Here, K = kernel commands, M = memory
commands, and E = event commands.

Figure 2.2  Runtime Processing Structure

2.2.1 Running Code on Windows
The following steps ensure the execution of OpenCL applications on Windows.
1. The path to OpenCL.lib ($ATISTREAMSDKROOT/lib/x86) must be included in
the PATH environment variable.
2. Generally, the path to the kernel file (Template_Kernel.cl) specified in the
host program is relative to the executable. Unless an absolute path is
specified, the kernel file must be in the same directory as the executable.


2.2.2 Running Code on Linux
The following steps ensure the execution of OpenCL applications on Linux.
1. The path to libOpenCL.so ($ATISTREAMSDKROOT/lib/x86) must be included
in $LD_LIBRARY_PATH.
2. /usr/lib/OpenCL/vendors/ must have libatiocl32.so and/or
libatiocl64.so.
3. Generally, the path to the kernel file (Template_Kernel.cl) specified in the
host program is relative to the executable. Unless an absolute path is
specified, the kernel file must be in the same directory as the executable.

2.3 Calling Conventions
For all Windows platforms, the __stdcall calling convention is used. Function
names are undecorated.
For Linux, the calling convention is __cdecl.


Chapter 3
Debugging OpenCL

This chapter discusses how to debug OpenCL programs running on AMD
Accelerated Parallel Processing GPU and CPU compute devices. The first,
preferred, method is to debug with the AMD gDEBugger, as described in
Section 3.1, “AMD gDEBugger.” The second method, described in Section 3.2,
“Debugging CPU Kernels with GDB,” is to use experimental features provided by
AMD Accelerated Parallel Processing (GNU project debugger, GDB) to debug
kernels on x86 CPUs running Linux or cygwin/minGW under Windows.

3.1 AMD gDEBugger
gDEBugger 6.2 is available as an extension to Microsoft® Visual Studio®, as a
stand-alone version for Windows, and as a stand-alone version for Linux.
gDEBugger offers real-time OpenCL kernel debugging and memory analysis on
GPU devices, allowing developers to access the kernel execution directly from
the API call that issues it, debug inside the kernel, and view all variable values
across the different work-groups and work-items. For Microsoft® Visual Studio®,
it also provides OpenGL debugging and memory analysis. For information on
downloading and installing gDEBugger, see:
http://developer.amd.com/tools/gDEBugger/Pages/default.aspx
After installing gDEBugger for Visual Studio, launch Visual Studio, and open the
solution to be worked on. In the Visual Studio menu bar, note the new
gDEBugger menu, which contains all the required controls.
Select a Visual C/C++ project, and set its debugging properties as normal. To
add a breakpoint, either select New gDEBugger Breakpoint from the gDEBugger
menu, or navigate to a kernel file used in the application and set a breakpoint on
the desired source line. Then, select Launch OpenCL/OpenGL Debugging from
the gDEBugger menu to start debugging.
gDEBugger currently supports only API-level debugging and OpenCL kernel
debugging; stepping through C/C++ code is not yet possible. However, the
C/C++ call stack can be seen in the Visual Studio call stack view, which shows
what led to the API function call.
To start kernel debugging, you can choose one of several options; one of these
is to Step Into (F11) the appropriate clEnqueueNDRangeKernel function call.
Once the kernel starts executing, debug it like C/C++ code, stepping into, out of,
or over function calls in the kernel, setting source breakpoints, and inspecting the
locals, autos, watch, and call stack views.


To view OpenCL and OpenGL objects and their information, use the gDEBugger
Explorer and gDEBugger Properties view. Additional views and features provide
more detailed and advanced information on the OpenCL and OpenGL runtimes,
their states, and the objects created within them.
For further information and more detailed usage instructions, see the gDEBugger
User Guide:
http://developer.amd.com/tools/gDEBugger/webhelp/index.html
or the online help provided with gDEBugger.

3.2 Debugging CPU Kernels with GDB
This section describes an experimental feature for using the GNU project
debugger, GDB, to debug kernels on x86 CPUs running Linux or cygwin/minGW
under Windows.

3.2.1 Setting the Environment
The OpenCL program to be debugged first is compiled by passing the “-g -O0”
(or “-g -cl-opt-disable”) option to the compiler through the options string to
clBuildProgram. For example, using the C++ API:
err = program.build(devices,"-g -O0");
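
Using the C API, the equivalent call is sketched below (the program and device
variables are assumed to exist):

/* C API equivalent of the C++ build call above. */
cl_int err = clBuildProgram(program, 1, &device, "-g -O0", NULL, NULL);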
To avoid source changes, set the environment variable as follows:
AMD_OCL_BUILD_OPTIONS_APPEND="-g -O0" or
AMD_OCL_BUILD_OPTIONS="-g -O0"
Below is a sample debugging session for a program with a simple bitonic sort
kernel. The following GDB session shows how to debug this kernel. Ensure that
the program is configured to be executed on the CPU. It is important to set
CPU_MAX_COMPUTE_UNITS=1; this ensures that the program is executed
deterministically.

3.2.2 Setting the Breakpoint in an OpenCL Kernel
To set a breakpoint, use:
b [N | function | kernel_name]
where N is the line number in the source code, function is the function name,
and kernel_name is constructed as follows: if the name of the kernel is
bitonicSort, the kernel_name is __OpenCL_bitonicSort_kernel.
Note that if no breakpoint is set, the program does not stop until execution is
complete.
Also note that OpenCL kernel symbols are not visible in the debugger until the
kernel is loaded. A simple way to check for known OpenCL symbols is to set a
breakpoint in the host code at clEnqueueNDRangeKernel, and to use the GDB
info functions __OpenCL command, as shown in the example below.

3.2.3 Sample GDB Session
The following is a sample debugging session. Note that two separate breakpoints
are set. The first is set in the host code, at clEnqueueNDRangeKernel(). The
second breakpoint is set at the actual CL kernel function.
$ export AMD_OCL_BUILD_OPTIONS_APPEND="-g -O0"
$ export CPU_MAX_COMPUTE_UNITS=1
$ gdb BitonicSort
GNU gdb (GDB) 7.1-ubuntu
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later

This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
...
Reading symbols from /home/himanshu/Desktop/ati-stream-sdk-v2.3-lnx64/samples/opencl/bin/x86_64/BitonicSort...done.
(gdb) b clEnqueueNDRangeKernel
Breakpoint 1 at 0x403228
(gdb) r --device cpu
Starting program: /home/himanshu/Desktop/ati-stream-sdk-v2.3-lnx64/samples/opencl/bin/x86_64/BitonicSort --device cpu
[Thread debugging using libthread_db enabled]
Unsorted Input
53 5 199 15 120 9 71 107 71 242 84 150 134 180 26 128 196 9 98 4 102 65
206 35 224 2 52 172 160 94 2 214 99 .....
Platform Vendor : Advanced Micro Devices, Inc.
Device 0 : AMD Athlon(tm) II X4 630 Processor
[New Thread 0x7ffff7e6b700 (LWP 1894)]
[New Thread 0x7ffff2fcc700 (LWP 1895)]
Executing kernel for 1 iterations
-------------------------------------------
Breakpoint 1, 0x00007ffff77b9b20 in clEnqueueNDRangeKernel () from
/home/himanshu/Desktop/ati-stream-sdk-v2.3-lnx64/lib/x86_64/libOpenCL.so
(gdb) info functions __OpenCL
All functions matching regular expression "__OpenCL":
File OCLm2oVFr.cl:
void __OpenCL_bitonicSort_kernel(uint *, const uint, const uint, const
uint, const uint);
Non-debugging symbols:
0x00007ffff23c2dc0 __OpenCL_bitonicSort_kernel@plt
0x00007ffff23c2f40 __OpenCL_bitonicSort_stub
(gdb) b __OpenCL_bitonicSort_kernel
Breakpoint 2 at 0x7ffff23c2de9: file OCLm2oVFr.cl, line 32.
(gdb) c
Continuing.
[Switching to Thread 0x7ffff2fcc700 (LWP 1895)]
Breakpoint 2, __OpenCL_bitonicSort_kernel (theArray=0x615ba0, stage=0,
passOfStage=0, width=1024, direction=0) at OCLm2oVFr.cl:32
32          uint sortIncreasing = direction;
(gdb) p get_global_id(0)
$1 = 0
(gdb) c


Continuing.
Breakpoint 2, __OpenCL_bitonicSort_kernel (theArray=0x615ba0, stage=0,
passOfStage=0, width=1024, direction=0) at OCLm2oVFr.cl:32
32          uint sortIncreasing = direction;
(gdb) p get_global_id(0)
$2 = 1
(gdb)

3.2.4 Notes
1. To set a breakpoint in a work-item with a particular ID in dimension N, one
technique is to set a conditional breakpoint that triggers when
get_global_id(N) == ID. To do this, use:
b [ N | function | kernel_name ] if (get_global_id(N)==ID)
where N can be 0, 1, or 2.
2. For complete GDB documentation, see
http://www.gnu.org/software/gdb/documentation/ .
3. For debugging OpenCL kernels in Windows, a developer can use GDB
running in cygwin or minGW. It is done in the same way as described in
sections 3.1 and 3.2.
Notes:

– Only OpenCL kernels are visible to GDB when running cygwin or
minGW. GDB under cygwin/minGW currently does not support host code
debugging.

– It is not possible to use two debuggers attached to the same process.
Do not try to attach Visual Studio to a process, and concurrently GDB to
the kernels of that process.

– Continue to develop the application code using Visual Studio. Currently,
gcc running in cygwin or minGW is not supported.

Chapter 4
OpenCL Performance and
Optimization
This chapter discusses performance and optimization when programming for
AMD Accelerated Parallel Processing (APP) GPU compute devices, as well as
CPUs and multiple devices. Details specific to the Southern Islands series of
GPUs are at the end of the chapter.

4.1 AMD APP Profiler
The AMD APP Profiler (hereafter Profiler) is a performance analysis tool that
gathers data from the OpenCL run-time and AMD Radeon™ GPUs during the
execution of an OpenCL application. This information is used to discover
bottlenecks in the application and find ways to optimize the application’s
performance for AMD platforms. The Profiler can be installed as part of the AMD
APP SDK installation, or separately using its own installer package. It is
downloadable from:
http://developer.amd.com/tools/AMDAPPProfiler/Pages/default.aspx.
This section describes the major features of Profiler version 2.4. Because the
Profiler is still being developed, please see the documentation for the latest
features of the tool at the same URL provided above.
The Profiler supports two usage models.

• Plug-in for Microsoft Visual Studio 2008 or 2010 (recommended). This lets
you visualize and analyze the results in multiple ways.

• Command-line utility tool for both Windows and Linux platforms. This is a
way to collect data for applications without source code access. The results
can be analyzed directly in text format or visualized in the Visual Studio
plug-in.

The Profiler supports two modes of operation.

• Collecting OpenCL application traces.

• Collecting OpenCL kernel GPU performance counters.

These are described in the following subsections.


4.1.1 Collecting OpenCL Application Trace
The OpenCL application trace lists all the OpenCL API calls made by the
application. For each of the API calls, the input parameters and output results
are recorded, in addition to the CPU timestamps for the host code and device
timestamps retrieved from the OpenCL run-time. The output data is recorded in
an AMD custom application trace profile (*.atp) file format. See the Profiler
documentation for more information.
This mode is especially useful for investigating the high-level structure of a
complex application.
From the OpenCL application trace data, it is possible to:

• Reveal the high-level structure of the application with the Timeline View. This
lets you investigate the number of OpenCL contexts and command queues
created, as well as the relationships of these items in the application. The
timeline shows the host code, kernel, and data transfer execution. See
Section 4.1.1.1, “Timeline View,” page 4-2.

• Identify whether the application is bound by kernel execution or data transfer
time; find the top ten most expensive kernels and data transfers; and find the
API hot spots (most frequently called or most expensive API calls) in the
application with the Summary Pages View. See Section 4.1.1.2, “Summary
Pages View,” page 4-4.

• View and debug the input parameters and output results for all API calls in
the application with the API Trace View. See Section 4.1.1.3, “API Trace
View,” page 4-5.

• Visualize and analyze non-OpenCL host code on the Timeline with the
OpenCL Performance Marker (CLPerfMarker) library. Users can instrument
their code with calls to clBeginPerfMarkerAMD() and clEndPerfMarkerAMD();
these calls are then used by the Profiler to annotate the host-code timeline
hierarchically, as sketched below. For more information, see the
CLPerfMarkerAMD.pdf in the CLPerfMarker/Doc subdirectory under the
Profiler installation directory, typically
$AMDAPPSDKROOT/Tools/AMDAPPProfiler-vx.x/.
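
A minimal sketch of instrumenting a host-code region is shown below. The
marker and group names are illustrative, and the two-argument form of
clBeginPerfMarkerAMD() is an assumption here; the exact prototypes are defined
in the CLPerfMarkerAMD.h header shipped with the library.

#include "CLPerfMarkerAMD.h"   /* header shipped with the CLPerfMarker library */

void UpdateScene(void)
{
    /* Marker and group names are illustrative. */
    clBeginPerfMarkerAMD("UpdateScene", "MyApplication");
    /* ... non-OpenCL host work to be annotated on the timeline ... */
    clEndPerfMarkerAMD();
}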

4.1.1.1 Timeline View
The timeline view (the top half of Figure 4.1) provides a visual representation of
the execution of the application.

Figure 4.1  Timeline and API Trace View in Microsoft Visual Studio 2010

Along the top of the timeline is the time grid, which shows the total elapsed time
of the application, in milliseconds. Timing begins when the first OpenCL call is
made by the application, and ends when the final OpenCL call is made. Directly
below the time grid, each host (OS) thread that made at least one OpenCL call
is listed. For each host thread, the OpenCL API calls are plotted along the time
grid, showing the start time and duration of each call. Below the host threads, an
OpenCL tree shows all contexts and queues created by the application, along
with data transfer operations and kernel execution operations for each queue.
The Timeline View can be navigated by zooming, panning, collapsing/expanding,
and selecting a region of interest. From the Timeline View, you can also navigate
to the corresponding API call in the API Trace View, and vice versa.
The Timeline View can be useful for debugging your OpenCL application. Some
examples are:

• Easily confirm that the high-level structure of the algorithm is correct (the
number of queues and contexts created matches your expectation).

• Confirm that synchronization has been performed properly in the application.
For example, if kernel A execution depends on a buffer write or copy and/or
on outputs from kernel B execution, then, if the synchronization has been set
up correctly, kernel A execution appears after the completion of the buffer
operation and kernel B execution in the timeline grid. This type of
synchronization error can be hard to find using traditional debugging
techniques.

• Confirm that the kernel and data transfer executions from all the queues are
performed efficiently. This is easily verified by ensuring that non-dependent
kernel and data transfer executions happen concurrently in the timeline grid.

4.1.1.2 Summary Pages View
The Summary Pages View (Figure 4.2) shows the statistics of your OpenCL
application. It can provide a general idea of the location of the program's
bottlenecks. It also provides useful information such as the number of buffers and
number of images created on each context, most expensive kernel call, etc.

Figure 4.2  Context Summary Page View in Microsoft Visual Studio 2010

The Summary Pages View consists of the following pages.
1. API Summary shows useful statistics for all OpenCL API calls made in the
application, for API hot spot identification.
2. Context Summary shows the timing information for all the kernel dispatches
and data transfers for each context. This permits identifying whether the
application is bound by the kernel execution or data transfer. If the
application is bound by the data transfers, this page permits finding the most
expensive data transfer type (read, write, copy, or map) in the application.
3. Kernel Summary lists all the kernels that are created in the application. If the
application is bound by the kernel execution, it is possible to find the device
causing the bottleneck. If the kernel execution on the GPU device is the
bottleneck, use the GPU performance counters (see Section 4.1.2,
“Collecting OpenCL GPU Kernel Performance Counters,” page 4-5) to
investigate the bottleneck inside the kernel.
4. Top 10 Data Transfer Summary shows the top ten most expensive individual
data transfers.
5. Top 10 Kernel Summary shows the top ten most expensive individual kernel
executions.
6. Warning(s) and Error(s) shows potential problems in your OpenCL
application.


In order to minimize expensive data transfers, the algorithm/application may
have to be modified. With help from the Timeline View, we can investigate
whether the data transfer execution has been most efficient (that is, whether it
occurs concurrently with kernel execution).
The Warning(s) and Error(s) page (Figure 4.3) shows potential problems and
optimization hints in your OpenCL application, including unreleased OpenCL
resources, OpenCL API failures, non-optimized work size, non-optimized data
transfer, and excessive synchronization; it also provides suggestions to achieve
better performance. Clicking on a hyperlink takes you to the corresponding
OpenCL API that generated the message.

Figure 4.3  Warning(s) and Error(s) Page

4.1.1.3 API Trace View
The API Trace View (the bottom half in Figure 4.1) lists all the OpenCL API calls
made by the application. Each host thread that makes at least one OpenCL call
is listed in a separate tab. Each tab contains a list of all the API calls made by
that particular thread. For each call, it shows the index of the call (representing
execution order), the name of the API function, a semicolon-delimited list of
parameters passed to the function, and the value returned by the function.
Double-clicking an item in the API Trace view displays and zooms into that API
call in the Host Thread row in the Timeline View. If stack trace is enabled while
collecting the API trace, and the application contains debug information, it is
possible to navigate from the API trace to source code.
This lets you analyze and debug the input parameters and output results for each
API call. For example, it is easy to check that all the API calls are returning
CL_SUCCESS, all the buffers are created with the correct flags, as well as to
identify redundant API calls. The API Trace shows additional information about
data transfers using clEnqueueMapBuffer/clEnqueueMapImage; this includes the
source, destination, and transfer type of the map operation. Some devices can
take advantage of zero copy to save on data transfer time.

4.1.2 Collecting OpenCL GPU Kernel Performance Counters
The GPU kernel performance counters can be used to find the possible
bottlenecks in the kernel execution. You can find the list of performance counters
supported by AMD Radeon™ GPUs in the Profiler documentation.
After determining the most expensive kernel to be optimized using the trace data,
collect the GPU performance counters to drill down to the kernel execution on
the GPU devices. Using the performance counters, it is possible to:


• Find the number of resources (VGPRs, SGPRs [if applicable], and local
memory size) allocated for the kernel. These resources affect the possible
number of in-flight wavefronts in the GPU (a higher number is required to
hide the data latency). The Occupancy Modeler identifies the limiting factor
for achieving a higher count of in-flight wavefronts.

• Identify the number of ALU, global, and local memory instructions executed
in the GPU.

• Identify the number of bytes fetched from and written to the global memory.

• View the use of the SIMD engines and memory units in the system.

• View the efficiency of the Shader Compiler in packing ALU instructions into
the VLIW instructions used by AMD GPUs.

• View the local memory (LDS) bank conflicts.

The Session view (Figure 4.4) shows the resulting performance counters for a
Profiler session. The output data is recorded in CSV format.

Figure 4.4  Sample Session View in Microsoft Visual Studio 2010

4.1.3 OpenCL Kernel Occupancy Modeler

Figure 4.5 shows a sample screen shot of the OpenCL Kernel Occupancy
Modeler.
The top part of the page shows four graphs (only three on non-GCN devices)
that provide a visual indication of how kernel resources affect the theoretical
number of in-flight wavefronts on a compute unit. The graph representing the
limiting resource has its title displayed in red text. If there is more than one
limiting resource, more than one graph can have a red title. In each graph, the
actual usage of the particular resource being graphed is highlighted with an
orange square. Hovering the mouse over a point in the graph causes a popup
hint to be displayed; this shows the current X and Y values at that location.

Figure 4.5  Sample Kernel Occupancy Modeler Screen

The first graph, titled Number of waves limited by Work-group size, shows
how the number of active wavefronts is affected by the size of the work-group
for the dispatched kernel. In the above screen shot, note that the highest number
of wavefronts is achieved when the work-group size is in the range of 64 to 128.
The second graph, titled Number of waves limited by VGPRs, shows how the
number of active wavefronts is affected by the number of vector GPRs used by
the dispatched kernel. In the above screen shot, note that as the number of
VGPRs used increases, the number of active wavefronts decreases in steps. Note
that this graph shows that more than 62 GPRs can be allocated, although 62 is
the maximum number, since the Shader Compiler assumes the work-group size
is 256 items by default (the largest possible work-group size). For the Shader
Compiler to allocate more than 62 GPRs, the kernel source code must be
marked with the reqd_work_group_size kernel attribute. This attribute can tell
the Shader Compiler that the kernel is to be launched with a work-group size
smaller than the maximum, allowing it to allocate more GPRs. Thus, for X-axis
values greater than 62, the GPR graph shows the theoretical number of
wavefronts that can be launched if the kernel specified a smaller work-group size
using that attribute.
If running on an AMD Radeon™ HD 7XXX series (GCN) device, the third graph,
titled Number of waves limited by SGPR, shows how the number of active
wavefronts is affected by the number of scalar GPRs used by the dispatched
kernel. In the above screen shot, note that as the number of used SGPRs
increases, the number of active wavefronts decreases in steps.


The fourth graph, titled Number of waves limited by LDS, shows how the
number of active wavefronts is affected by the amount of LDS used by the
dispatched kernel. In the above screen shot, note that as the amount of LDS
usage increases, the number of active wavefronts decreases in steps.
Below the graphs is a table that provides information about the device, the
kernel, and the kernel occupancy. In the Kernel Occupancy section, note the
limits imposed by each kernel resource, as well as which resource is currently
limiting the number of waves for the kernel dispatch. This also displays the kernel
occupancy percentage.

4.2 AMD APP KernelAnalyzer
The AMD APP KernelAnalyzer is a static analysis tool to compile, analyze, and
disassemble an OpenCL kernel for AMD Radeon GPUs. It is commonly used for
rapid prototyping of OpenCL kernels since it quickly compiles the kernel for
multiple GPU device targets.
It can be used as a GUI tool for interactive tunings of an OpenCL kernel or in
command-line mode to generate detailed reports.
The KernelAnalyzer can be installed as part of the AMD APP SDK installation,
or individually using its own installer package. The KernelAnalyzer package can
be downloaded from:
http://developer.amd.com/tools/AMDAPPKernelAnalyzer/Pages/default.aspx.
To use the KernelAnalyzer, the AMD OpenCL run-time must be installed on the
system; however, no GPU is required.
To compile an OpenCL kernel in the KernelAnalyzer, drop the OpenCL kernel
source into the source code panel in the GUI (Figure 4.6). The entire OpenCL
application is not required to compile or analyze the OpenCL kernel.
With the KernelAnalyzer, it is possible to:

• Compile and disassemble the OpenCL kernel for multiple Catalyst driver
versions and GPU device targets.

• View the OpenCL kernel compilation error messages from the OpenCL
run-time.

• View the AMD Intermediate Language (IL) code generated by the OpenCL
run-time.

• View the ISA code generated by the AMD Shader Compiler. Typically,
hard-core kernel optimizations are performed by analyzing the ISA code.

• View the statistics generated by analyzing the ISA code.

• View General-Purpose Register usage and spill registers allocated for the
kernel.

Figure 4.6  AMD APP KernelAnalyzer

4.3 Analyzing Processor Kernels
4.3.1 Intermediate Language and GPU Disassembly
The AMD Accelerated Parallel Processing software exposes the Intermediate
Language (IL) and instruction set architecture (ISA) code generated for
OpenCL™ kernels through the compiler option -save-temps[=prefix].
The AMD Intermediate Language (IL) is an abstract representation for hardware
vertex, pixel, and geometry shaders, as well as compute kernels that can be
taken as input by other modules implementing the IL. An IL compiler uses an IL
shader or kernel in conjunction with driver state information to translate these
shaders into hardware instructions or a software emulation layer. For a complete
description of IL, see the AMD Intermediate Language (IL) Specification v2.


The instruction set architecture (ISA) defines the instructions and formats
accessible to programmers and compilers for the AMD GPUs. The Northern
Islands-family ISA instructions and microcode are documented in the AMD
Northern Islands-Family ISA Instructions and Microcode.

4.3.2 Generating IL and ISA Code
In Microsoft Visual Studio, the AMD APP Profiler provides an integrated tool to
view IL and ISA code. (The AMD APP KernelAnalyzer also can show the IL and
ISA code.) After running the Profiler, single-click the name of the kernel for
detailed programming and disassembly information. The associated ISA
disassembly is shown in a new tab. A drop-down menu provides the option to
view the IL, ISA, or source OpenCL for the selected kernel.
Developers also can generate IL and ISA code from their OpenCL kernel by
setting the environment variable AMD_OCL_BUILD_OPTIONS_APPEND=-save-temps
(see Section 2.1.4, “AMD-Developed Supplemental Compiler Options,” page 2-4).

4.4 Estimating Performance
4.4.1 Measuring Execution Time
The OpenCL runtime provides a built-in mechanism for timing the execution of
kernels by setting the CL_QUEUE_PROFILING_ENABLE flag when the queue is
created. Once profiling is enabled, the OpenCL runtime automatically records
timestamp information for every kernel and memory operation submitted to the
queue.
OpenCL provides four timestamps:

• CL_PROFILING_COMMAND_QUEUED — Indicates when the command is enqueued
into a command-queue on the host. This is set by the OpenCL runtime when
the user calls a clEnqueue* function.

• CL_PROFILING_COMMAND_SUBMIT — Indicates when the command is submitted
to the device. For AMD GPU devices, this time is only approximately defined
and is not detailed in this section.

• CL_PROFILING_COMMAND_START — Indicates when the command starts
execution on the requested device.

• CL_PROFILING_COMMAND_END — Indicates when the command finishes
execution on the requested device.

The sample code below shows how to compute the kernel execution time
(End - Start):

cl_event myEvent;
cl_ulong startTime, endTime;
clCreateCommandQueue(…, CL_QUEUE_PROFILING_ENABLE, NULL);
clEnqueueNDRangeKernel(…, &myEvent);
clFinish(myCommandQ);   // wait for all events to finish
clGetEventProfilingInfo(myEvent, CL_PROFILING_COMMAND_START,
    sizeof(cl_ulong), &startTime, NULL);
clGetEventProfilingInfo(myEvent, CL_PROFILING_COMMAND_END,
    sizeof(cl_ulong), &endTime, NULL);
cl_ulong kernelExecTimeNs = endTime - startTime;

The AMD APP Profiler also can record the execution time for a kernel
automatically. The Kernel Time metric reported in the Profiler output uses the
built-in OpenCL timing capability and reports the same result as the
kernelExecTimeNs calculation shown above.
Another interesting metric to track is the kernel launch time (Start – Queue). The
kernel launch time includes both the time spent in the user application (after
enqueuing the command, but before it is submitted to the device), as well as the
time spent in the runtime to launch the kernel. For CPU devices, the kernel
launch time is fast (tens of μs), but for discrete GPU devices it can be several
hundred μs. Enabling profiling on a command queue adds approximately 10 μs
to 40 μs overhead to all clEnqueue calls. Much of the profiling overhead affects
the start time; thus, it is visible in the launch time. Be careful when interpreting
this metric. To reduce the launch overhead, the AMD OpenCL runtime combines
several command submissions into a batch. Commands submitted as batch
report similar start times and the same end time.

4.4.2 Using the OpenCL Timer with Other System Timers
The resolution of the timer, given in ns, can be obtained from:
clGetDeviceInfo(…,CL_DEVICE_PROFILING_TIMER_RESOLUTION…);
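
For example, a minimal sketch of the query (the device variable is assumed to
be a valid cl_device_id):

size_t timerResolution;   /* resolution in nanoseconds */
clGetDeviceInfo(device, CL_DEVICE_PROFILING_TIMER_RESOLUTION,
                sizeof(timerResolution), &timerResolution, NULL);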

AMD CPUs and GPUs report a timer resolution of 1 ns. AMD OpenCL devices
are required to correctly track time across changes in frequency and power
states. Also, the AMD OpenCL SDK uses the same time-domain for all devices
in the platform; thus, the profiling timestamps can be directly compared across
the CPU and GPU devices.
The sample code below can be used to read the current value of the OpenCL
timer clock. The clock is the same routine used by the AMD OpenCL runtime to
generate the profiling timestamps. This function is useful for correlating other
program events with the OpenCL profiling timestamps.
#ifdef linux
#include <time.h>      /* clock_gettime */
#else
#include <windows.h>   /* QueryPerformanceCounter */
#endif
#include <stdint.h>

uint64_t
timeNanos()
{
#ifdef linux
    struct timespec tp;
    clock_gettime(CLOCK_MONOTONIC, &tp);
    return (unsigned long long)tp.tv_sec * (1000ULL * 1000ULL * 1000ULL) +
           (unsigned long long)tp.tv_nsec;
#else
    /* m_ticksPerSec is the counter frequency, obtained once with
       QueryPerformanceFrequency(). */
    LARGE_INTEGER current;
    QueryPerformanceCounter(&current);
    return (unsigned long long)((double)current.QuadPart / m_ticksPerSec * 1e9);
#endif
}


Normal CPU time-of-day routines can provide a rough measure of the elapsed
time of a GPU kernel. GPU kernel execution is non-blocking; that is, calls to
enqueue*Kernel return to the CPU before the work on the GPU is finished. For
an accurate time value, ensure that the GPU is finished. In OpenCL, you can
force the CPU to wait for the GPU to become idle by inserting calls to
clFinish() before and after the sequence you want to time; this increases the
timing accuracy of the CPU routines. The routine clFinish() blocks the CPU
until all previously enqueued OpenCL commands have finished.
For more information, see section 5.9, “Profiling Operations on Memory Objects
and Kernels,” of the OpenCL 1.0 Specification.

4.4.3 Estimating Memory Bandwidth
The memory bandwidth required by a kernel is perhaps the most important
performance consideration. To calculate this:
Effective Bandwidth = (Br + Bw)/T
where:
Br = total number of bytes read from global memory.
Bw = total number of bytes written to global memory.
T = time required to run kernel, specified in nanoseconds.
If Br and Bw are specified in bytes, and T in ns, the resulting effective bandwidth
is measured in GB/s, which is appropriate for current CPUs and GPUs for which
the peak bandwidth range is 20-260 GB/s. Computing Br and Bw requires a
thorough understanding of the kernel algorithm; it also can be a highly effective
way to optimize performance. For illustration purposes, consider a simple matrix
addition: each element in the two source arrays is read once, added together,
then stored to a third array. The effective bandwidth for a 1024x1024 matrix
addition is calculated as:
Br = 2 x (1024 x 1024 x 4 bytes) = 8388608 bytes   ;; 2 arrays, 1024x1024, each element 4-byte float
Bw = 1 x (1024 x 1024 x 4 bytes) = 4194304 bytes   ;; 1 array, 1024x1024, each element 4-byte float
If the elapsed time for this copy as reported by the profiling timers is 1000000 ns
(1 million ns, or .001 sec), the effective bandwidth is:
(Br+Bw)/T = (8388608+4194304)/1000000 = 12.6 GB/s
The AMD APP Profiler can report the number of dynamic instructions per thread
that access global memory through the FetchInsts and WriteInsts counters. The
Fetch and Write reports average the per-thread counts; these can be fractions if
the threads diverge. The Profiler also reports the dimensions of the global
NDRange for the kernel in the GlobalWorkSize field. The total number of threads
can be determined by multiplying together the three components of the range. If
all (or most) global accesses are the same size, the counts from the Profiler and
the approximate size can be used to estimate Br and Bw:


Br = Fetch * GlobalWorkitems * ElementSize
Bw = Write * GlobalWorkitems * ElementSize
where GlobalWorkitems is the dispatch size and ElementSize is the size, in
bytes, of each access.
An example Profiler output and bandwidth calculation:

Method             GlobalWorkSize   Time    Fetch   Write
runKernel_Cypress  {192; 144; 1}    0.9522  70.8    0.5

Global work items = 192 * 144 * 1 = 27648

In this example, assume we know that all accesses in the kernel are four bytes;
then, the bandwidth can be calculated as:

Br = 70.8 * 27648 * 4 = 7829914 bytes
Bw =  0.5 * 27648 * 4 =   55296 bytes

The bandwidth then can be calculated as:

(Br + Bw)/T = (7829914 bytes + 55296 bytes) / .9522 ms / 1000000
            = 8.2 GB/s

4.5 OpenCL Memory Objects
This section explains the AMD OpenCL runtime policy for memory objects. It also
recommends best practices for performance.
OpenCL uses memory objects to pass data to kernels. These can be either
buffers or images. Space for these is managed by the runtime, which uses
several types of memory, each with different performance characteristics. Each
type of memory is suitable for a different usage pattern. The following
subsections describe:

• the memory types used by the runtime;

• how to control which memory kind is used for a memory object;

• how the runtime maps memory objects for host access;

• how the runtime performs memory object reading, writing, and copying;

• how best to use command queues; and

• some recommended usage patterns.

4.5.1 Types of Memory Used by the Runtime
Memory is used to store memory objects that are accessed by kernels executing
on the device, as well as to hold memory object data when they are mapped for
access by the host application. This section describes the different memory kinds
used by the runtime. Table 4.1 lists the performance of each memory type given
a PCIe3-capable platform and a high-end AMD Radeon™ 7XXX discrete GPU. In
Table 4.1, when host memory is accessed by the GPU shader, it is of type
CL_MEM_ALLOC_HOST_PTR. When GPU memory is accessed by the CPU, it is of
type CL_MEM_USE_PERSISTENT_MEM_AMD.
Table 4.1  Memory Bandwidth in GB/s (R = read, W = write)

               CPU R     CPU W     GPU Shader R   GPU Shader W   GPU DMA R   GPU DMA W
Host Memory    10 - 20   10 - 20   9 - 10         2.5            11 - 12     11 - 12
GPU Memory     .01       9 - 10    230            120 - 150      n/a         n/a

Host memory and device memory in the above table consist of one of the
subtypes given below.
4.5.1.1 Host Memory
This regular CPU memory can be accessed by the CPU at full memory bandwidth;
however, it is not directly accessible by the GPU. For the GPU to transfer host
memory to device memory (for example, as a parameter to
clEnqueueReadBuffer or clEnqueueWriteBuffer), it first must be pinned (see
section 4.5.1.2). Pinning takes time, so avoid incurring pinning costs where CPU
overhead must be avoided.
When host memory is copied to device memory, the OpenCL runtime uses the
following transfer methods.

• <=32 kB: For transfers from the host to the device, the data is copied by the
CPU to a runtime pinned host memory buffer, and the DMA engine transfers
the data to device memory. The opposite is done for transfers from the
device to the host.

• >32 kB and <=16 MB: The host memory physical pages containing the data
are pinned, the GPU DMA engine is used, and the pages then are unpinned.

• >16 MB: Runtime pinned host memory staging buffers are used. The CPU
copies the data in pieces, which then are transferred to the device using the
GPU DMA engine. Double buffering is used to overlap the CPU copies with
the DMA.

Due to the cost of copying to staging buffers, or pinning/unpinning host memory,
host memory does not offer the best transfer performance.
4.5.1.2 Pinned Host Memory
This is host memory that the operating system has bound to a fixed physical
address and that the operating system ensures is resident. The CPU can access
pinned host memory at full memory bandwidth. The runtime limits the total
amount of pinned host memory that can be used for memory objects. (See
Section 4.5.2, “Placement,” page 4-16, for information about pinning memory.)


If the runtime knows the data is in pinned host memory, it can be transferred to,
and from, device memory without requiring staging buffers or having to perform
pinning/unpinning on each transfer. This offers improved transfer performance.
Currently, the runtime recognizes only data that is in pinned host memory for
operation arguments that are memory objects it has allocated in pinned host
memory. For example, the buffer argument of
clEnqueueReadBuffer/clEnqueueWriteBuffer and image argument of
clEnqueueReadImage/clEnqueueWriteImage. It does not detect that the ptr
arguments of these operations address pinned host memory, even if they are
the result of clEnqueueMapBuffer/clEnqueueMapImage on a memory object that
is in pinned host memory.
The runtime can make pinned host memory directly accessible from the GPU.
Like regular host memory, the CPU uses caching when accessing pinned host
memory. Thus, GPU accesses must use the CPU cache coherency protocol
when accessing. For discrete devices, the GPU access to this memory is through
the PCIe bus, which also limits bandwidth. For fusion devices that do not have
the PCIe overhead, GPU access is significantly slower than accessing devicevisible host memory (see section 4.5.1.3), which does not use the cache
coherency protocol.
4.5.1.3 Device-Visible Host Memory
The runtime allocates a limited amount of pinned host memory that is accessible
by the GPU without using the CPU cache coherency protocol. This allows the
GPU to access the memory at a higher bandwidth than regular pinned host
memory.
A portion of this memory is also configured to be accessible by the CPU as
uncached memory. Thus, reads by the CPU are significantly slower than those
from regular host memory. However, these pages are also configured to use the
memory system write combining buffers. The size, alignment, and number of
write combining buffers are chip-set dependent. Typically, there are 4 to 8 buffers
of 64 bytes, each aligned to start on a 64-byte memory address. These allow
writes to adjacent memory locations to be combined into a single memory
access. This allows CPU streaming writes to perform reasonably well. Scattered
writes that do not fill the write combining buffers before they have to be flushed
do not perform as well.
Fusion devices have no device memory and use device-visible host memory for
their global device memory.
4.5.1.4 Device Memory
Discrete GPU devices have their own dedicated memory, which provides the
highest bandwidth for GPU access. The CPU cannot directly access device
memory on a discrete GPU (except for the host-visible device memory portion
described in section 4.5.1.5).


On an APU, the system memory is shared between the GPU and the CPU; it is
visible by either the CPU or the GPU at any given time. A significant benefit of
this is that buffers can be zero copied between the devices by using map/unmap
operations to logically move the buffer between the CPU and the GPU address
space. See Section 4.5.4, “Mapping,” page 4-18, for more information on zero
copy.
4.5.1.5 Host-Visible Device Memory
A limited portion of discrete GPU device memory is configured to be directly
accessible by the CPU. It can be accessed by the GPU at full bandwidth, but
CPU access is over the PCIe bus; thus, it is much slower than host memory
bandwidth. The memory is mapped into the CPU address space as uncached,
but using the memory system write combining buffers. This results in slow CPU
reads and scattered writes, but streaming CPU writes perform much better
because they reduce PCIe overhead.

4.5.2 Placement
Every OpenCL memory object has a location that is defined by the flags passed
to clCreateBuffer/clCreateImage. A memory object can be located either on
a device, or (as of SDK 2.4) it can be located on the host and accessed directly
by all the devices. The Location column of Table 4.2 gives the memory type used
for each of the allocation flag values for different kinds of devices. When a device
kernel is executed, it accesses the contents of memory objects from this location.
The performance of these accesses is determined by the memory kind used.
An OpenCL context can have multiple devices, and a memory object that is
located on a device has a location on each device. To avoid over-allocating
device memory for memory objects that are never used on that device, space is
not allocated until first used on a device-by-device basis. For this reason, the first
use of a memory object after it is created can be slower than subsequent uses.


Table 4.2  OpenCL Memory Object Properties

The following list gives, for each clCreateBuffer/clCreateImage flags
argument and device type, the location of the memory object, the map mode,
and the map location used by clEnqueueMapBuffer / clEnqueueMapImage /
clEnqueueUnmapMemObject.

• Default (none of the following flags)
  – Discrete GPU — Location: device memory. Map mode: copy.
  – Fusion APU — Location: device-visible host memory. Map mode: copy.
  – CPU — Location: use map location directly. Map mode: zero copy.
  – Map location: for copy mode, mapped data size <=32 MiB: pinned host
    memory; >32 MiB: host memory (a different memory area can be used on
    each map). For zero copy mode, the location is used directly.

• CL_MEM_ALLOC_HOST_PTR (clCreateBuffer on Windows 7 and Vista for
  Evergreen, Northern Islands, and Southern Islands; Southern Islands devices
  have this functionality on Linux as well)
  – Discrete GPU, Fusion APU, and CPU — Location: pinned host memory
    shared by all devices in the context (unless the only device in the context
    is the CPU; then, host memory). Map mode: zero copy. Map location: the
    location is used directly (the same memory area is used on each map).

• CL_MEM_ALLOC_HOST_PTR (clCreateImage on Windows 7, Vista, and Linux;
  clCreateBuffer on Linux)
  – Discrete GPU — Location: device memory. Map mode: copy.
  – Fusion APU — Location: device-visible memory. Map mode: copy.
  – CPU — Location: device-visible host memory. Map mode: zero copy.
  – Map location: pinned host memory, unless the only device in the context
    is the CPU; then, host memory (the same memory area is used on each
    map).

• CL_MEM_USE_HOST_PTR
  – Discrete GPU — Location: device memory. Map mode: copy.
  – Fusion APU — Location: device-visible host memory. Map mode: copy.
  – CPU — Location: use map location directly. Map mode: zero copy.
  – Map location: pinned host memory, unless the only device in the context
    is the CPU; then, host memory (the same memory area is used on each
    map).

• CL_MEM_USE_PERSISTENT_MEM_AMD (on Windows 7 and Vista for Evergreen,
  Northern Islands, and Southern Islands; Southern Islands devices have this
  functionality on Linux as well)
  – Discrete GPU — Location: host-visible device memory.
  – Fusion APU — Location: device-visible host memory.
  – CPU — Location: host memory.
  – Map mode: zero copy. Map location: the location is used directly (a
    different memory area can be used on each map).

• CL_MEM_USE_PERSISTENT_MEM_AMD (Linux for Evergreen and Northern
  Islands) — Same as default.


4.5.3 Memory Allocation

4.5.3.1 Using the CPU
Create memory objects with CL_MEM_ALLOC_HOST_PTR, and use map/unmap; do
not use read/write. The reason for this is that if the object is created with
CL_MEM_USE_HOST_PTR, the CPU is running the kernel on the buffer provided by
the application (a hack that all vendors use). This results in zero copy between
the CPU and the application buffer; the kernel updates the application buffer, and
in this case a map/unmap is actually a no-op. Also, when allocating the buffer on
the host, ensure that it is created with the correct alignment. For example, a
buffer to be used as float4* must be 128-bit aligned.
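
A minimal sketch of this pattern is shown below; the context, queue, and nBytes
variables are assumed to exist, and error handling is omitted.

/* Sketch: CL_MEM_ALLOC_HOST_PTR plus map/unmap on the CPU device. */
cl_int err;
cl_mem buf = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, nBytes, NULL, &err);
float *p = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                       0, nBytes, 0, NULL, NULL, &err);
/* ... the host fills p; on the CPU device the map/unmap is a no-op ... */
err = clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);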
4.5.3.2 Using Both CPU and GPU Devices, or Using an APU Device
When creating memory objects, create them with
CL_MEM_USE_PERSISTENT_MEM_AMD. This enables the zero copy feature, as
explained in Section 4.5.3.1, “Using the CPU.”
4.5.3.3 Buffers vs Images
Unlike GPUs, CPUs do not contain dedicated hardware (samplers) for accessing
images. Instead, image access is emulated in software. Thus, a developer may
prefer using buffers instead of images if no sampling operation is needed.
4.5.3.4 Choosing Execution Dimensions
Note the following guidelines.

• Make the number of work-groups a multiple of the number of logical CPU
cores (device compute units) for maximum use; a sketch of sizing the
dispatch this way follows this list.

• When the number of work-groups exceeds the number of CPU cores, the
CPU cores execute the work-groups sequentially.
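
A sketch of sizing the dispatch from the device's compute-unit count is shown
below; the device variable, the work-group size of 64, and the multiplier are
illustrative assumptions.

/* Sketch: make the number of work-groups a multiple of the compute units. */
cl_uint computeUnits;
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof(computeUnits), &computeUnits, NULL);
size_t localSize  = 64;                  /* illustrative work-group size */
size_t numGroups  = computeUnits * 4;    /* a multiple of the CPU cores */
size_t globalSize = numGroups * localSize;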

4.5.4 Mapping
The host application can use clEnqueueMapBuffer/clEnqueueMapImage to
obtain a pointer that can be used to access the memory object data. When
finished accessing, clEnqueueUnmapMemObject must be used to make the data
available to device kernel access. When a memory object is located on a device,
the data either can be transferred to, and from, the host, or (as of SDK 2.4) be
accessed directly from the host. Memory objects that are located on the host, or
located on the device but accessed directly by the host, are termed zero copy
memory objects. The data is never transferred, but is accessed directly by both
the host and device. Memory objects that are located on the device and
transferred to, and from, the device when mapped and unmapped are termed
copy memory objects. The Map Mode column of Table 4.2 specifies the transfer
mode used for each kind of memory object, and the Map Location column
indicates the kind of memory referenced by the pointer returned by the map
operations.


4.5.4.1 Zero Copy Memory Objects
CL_MEM_USE_PERSISTENT_MEM_AMD, CL_MEM_USE_HOST_PTR, and
CL_MEM_ALLOC_HOST_PTR support zero copy memory objects. The first provides
device-resident zero copy memory objects; the other two provide host-resident
zero copy memory objects.
Zero copy memory objects can be used by an application to optimize data
movement. When clEnqueueMapBuffer / clEnqueueMapImage /
clEnqueueUnmapMemObject are used, no runtime transfers are performed, and
the operations are very fast; however, the runtime can return a different pointer
value each time a zero copy memory object is mapped.

Southern Islands devices support zero copy memory objects under Linux;
however, only images created with CL_MEM_USE_PERSISTENT_MEM_AMD can be
zero copy.
Zero copy host resident memory objects can boost performance when host
memory is accessed by the device in a sparse manner or when a large host
memory buffer is shared between multiple devices and the copies are too
expensive. When choosing this, the cost of the transfer must be greater than the
extra cost of the slower accesses.
Streaming writes by the host to zero copy device-resident memory objects are
about as fast as the transfer rates, so this can be a good choice when the host
does not read the memory object; it avoids the host having to make a copy of
the data to transfer. Memory objects requiring partial updates between kernel
executions can also benefit. If the contents of the memory object must be read
by the host, use clEnqueueCopyBuffer to transfer the data to a separate
CL_MEM_ALLOC_HOST_PTR buffer.
4.5.4.2 Copy Memory Objects
For memory objects with copy map mode, the memory object location is on the
device, and it is transferred to, and from, the host when clEnqueueMapBuffer /
clEnqueueMapImage / clEnqueueUnmapMemObject are called. Table 4.3 shows
how the map_flags argument affects transfers. The runtime transfers only the
portion of the memory object requested in the offset and cb arguments. When
accessing only a portion of a memory object, only map that portion for improved
performance.


Table 4.3  Transfer policy on clEnqueueMapBuffer / clEnqueueMapImage /
clEnqueueUnmapMemObject for Copy Memory Objects

• CL_MAP_READ — Transfer on map: device to host, if map location is not
current. Transfer on unmap: none.

• CL_MAP_WRITE — Transfer on map: device to host, if map location is not
current. Transfer on unmap: host to device.

• CL_MAP_READ | CL_MAP_WRITE — Transfer on map: device to host, if map
location is not current. Transfer on unmap: host to device.

• CL_MAP_WRITE_INVALIDATE_REGION — Transfer on map: none. Transfer on
unmap: host to device.

For default memory objects, the pointer returned by clEnqueueMapBuffer /
clEnqueueMapImage may not be to the same memory area each time because
different runtime buffers may be used.
For CL_MEM_USE_HOST_PTR and CL_MEM_ALLOC_HOST_PTR the same map location
is used for all maps; thus, the pointer returned is always in the same memory
area. For other copy memory objects, the pointer returned may not always be to
the same memory region.
For CL_MEM_USE_HOST_PTR and the CL_MEM_ALLOC_HOST_PTR cases that use
copy map mode, the runtime tracks if the map location contains an up-to-date
copy of the memory object contents and avoids doing a transfer from the device
when mapping as CL_MAP_READ. This determination is based on whether an
operation such as clEnqueueWriteBuffer/clEnqueueCopyBuffer or a kernel
execution has modified the memory object. If a memory object is created with
CL_MEM_READ_ONLY, then a kernel execution with the memory object as an
argument is not considered as modifying the memory object. Default memory
objects cannot be tracked because the map location changes between map calls;
thus, they are always transferred on the map.
For CL_MEM_USE_HOST_PTR, clCreateBuffer/clCreateImage pins the host
memory passed to the host_ptr argument. It is unpinned when the memory
object is deleted. To minimize pinning costs, align the memory to 4 KiB. This
avoids the runtime having to pin/unpin on every map/unmap transfer, but does
add to the total amount of pinned memory.
For CL_MEM_USE_HOST_PTR, the host memory passed as the ptr argument of
clCreateBuffer/clCreateImage is used as the map location. As mentioned in
section 4.5.1.1, host memory transfers incur considerable cost in
pinning/unpinning on every transfer. If used, minimize the pinning cost by
ensuring the memory is 4 KiB aligned. If the host memory is written only
once, use CL_MEM_ALLOC_HOST_PTR with the CL_MEM_COPY_HOST_PTR flag
instead. If device memory is needed, use CL_MEM_USE_PERSISTENT_MEM_AMD and
clEnqueueWriteBuffer.
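As a hedged illustration of the alignment advice, the following sketch
allocates 4 KiB-aligned host memory and wraps it in a CL_MEM_USE_HOST_PTR
buffer; the function name is illustrative and error checking is omitted.

#include <CL/cl.h>
#include <stdlib.h>
#if defined(_WIN32)
#include <malloc.h>
#endif

/* Sketch: wrap existing application memory as an OpenCL buffer. Aligning
 * the allocation to 4 KiB lets the runtime pin it once at creation time
 * rather than on every map/unmap transfer. */
cl_mem wrap_host_memory(cl_context ctx, size_t size, void **out_ptr)
{
    void *host_ptr = NULL;
#if defined(_WIN32)
    host_ptr = _aligned_malloc(size, 4096);       /* 4 KiB aligned */
#else
    if (posix_memalign(&host_ptr, 4096, size))    /* 4 KiB aligned */
        return NULL;
#endif
    *out_ptr = host_ptr;
    cl_int err;
    return clCreateBuffer(ctx, CL_MEM_USE_HOST_PTR | CL_MEM_READ_WRITE,
                          size, host_ptr, &err);
}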

If CL_MEM_COPY_HOST_PTR is specified with CL_MEM_ALLOC_HOST_PTR when
creating a memory object, the memory is allocated in pinned host memory and
initialized with the passed data. For other kinds of memory objects, the deferred
allocation means the memory is not yet allocated on a device, so the runtime has
to copy the data into a temporary runtime buffer. The memory is allocated on the
device when the device first accesses the resource. At that time, any data that
must be transferred to the resource is copied. For example, this would apply
when a buffer was allocated with the flag CL_MEM_COPY_HOST_PTR. Using
CL_MEM_COPY_HOST_PTR for these buffers is not recommended because of the
extra copy. Instead, create the buffer without CL_MEM_COPY_HOST_PTR, and
initialize with clEnqueueWriteBuffer/clEnqueueWriteImage.
When images are transferred, additional costs are involved because the image
must be converted to, and from, linear address mode for host access. The
runtime does this by executing kernels on the device.

4.5.5 Reading, Writing, and Copying
There are numerous OpenCL commands to read, write, and copy buffers and
images. The runtime performs transfers depending on the memory kind of the
source and destination. When transferring between host memory and device
memory, the methods described in Section 4.5.1.1, “Host Memory,”
page 4-14, are used. memcpy is used to transfer between the various kinds of
host memory; this may be slow if reading from device-visible host memory, as
described in Section 4.5.1.3, “Device-Visible Host Memory,” page 4-15.
Finally, device kernels are used to copy between device memory. For images,
device kernels are used to convert to and from the linear address mode when
necessary.

4.5.6 Command Queue
It is best to use non-blocking commands to allow multiple commands to be
queued before the command queue is flushed to the GPU. This sends larger
batches of commands, which amortizes the cost of preparing and submitting
work to the GPU. Use event tracking to specify the dependences between
operations. It is recommended to queue operations that do not depend on the
results of previous copy and map operations. This can help keep the GPU busy
with kernel execution and DMA transfers. Note that if a non-blocking copy or map
is queued, it does not start until the command queue is flushed. Use clFlush if
necessary, but avoid unnecessary flushes because they cause small command
batches.
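A hedged sketch of this batching pattern follows, assuming a queue, kernel,
and device buffer created elsewhere; names are illustrative and error
checking is omitted.

#include <CL/cl.h>

/* Sketch: enqueue a non-blocking write and a dependent kernel, flush
 * once, and wait only when the result is actually needed. */
void batched_submit(cl_command_queue q, cl_kernel kern, cl_mem dev_buf,
                    const void *src, size_t size, size_t gws)
{
    cl_event write_done, kernel_done;

    /* Non-blocking write: returns immediately; the transfer is batched. */
    clEnqueueWriteBuffer(q, dev_buf, CL_FALSE, 0, size, src,
                         0, NULL, &write_done);

    /* The kernel depends on the write via the event, not a host wait. */
    clEnqueueNDRangeKernel(q, kern, 1, NULL, &gws, NULL,
                           1, &write_done, &kernel_done);

    /* One flush submits the whole batch to the GPU. */
    clFlush(q);

    /* ... do other host work here ... */

    clWaitForEvents(1, &kernel_done);
    clReleaseEvent(write_done);
    clReleaseEvent(kernel_done);
}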

4.6 OpenCL Data Transfer Optimization
The AMD OpenCL implementation offers several optimized paths for data
transfer to, and from, the device. The following sections describe buffer and
image paths, as well as how they map to common application scenarios. To find
out where the application’s buffers are stored (and understand how the data
transfer behaves), use the APP Profiler’s API Trace View, and look at the tool
tips of the clEnqueueMapBuffer calls.

4.6.1 Definitions
•  Deferred allocation — The CL runtime attempts to minimize resource
   consumption by delaying buffer allocation until first use. As a side effect,
   the first accesses to a buffer may be more expensive than subsequent
   accesses.

•  Peak interconnect bandwidth — As used in the text below, this is the
   transfer bandwidth between host and device that is available under optimal
   conditions at the application level. It depends on the type of interconnect,
   the chipset, and the graphics chip. For example, a high-performance PC with
   a PCIe 3.0 x16 bus and a GCN architecture (AMD Radeon™ HD 7XXX series)
   graphics card has a nominal interconnect bandwidth of 16 GB/s.

•  Pinning — When a range of host memory is prepared for transfer to the
   GPU, its pages are locked into system memory. This operation is called
   pinning; it can impose a high cost, proportional to the size of the memory
   range. One of the goals of optimizing data transfer is to use pre-pinned
   buffers whenever possible. However, if pre-pinned buffers are used
   excessively, it can reduce the available system memory and result in
   excessive swapping. Host-side zero copy buffers provide easy access to
   pre-pinned memory.

•  WC — Write Combine is a feature of the CPU write path to a select region
   of the address space. Multiple adjacent writes are combined into cache lines
   (for example, 64 bytes) before being sent to the external bus. This path
   typically provides fast streamed writes, but slower scattered writes.
   Depending on the chipset, scattered writes across a graphics interconnect
   can be very slow. Also, some platforms require multi-core CPU writes to
   saturate the WC path over an interconnect.

•  Uncached accesses — Host memory and I/O regions can be configured as
   uncached. CPU read accesses are typically very slow; for example,
   uncached CPU reads of graphics memory over an interconnect.

•  USWC — Host memory from the Uncached Speculative Write Combine heap
   can be accessed by the GPU without causing CPU cache coherency traffic.
   Due to the uncached WC access path, CPU streamed writes are fast, while
   CPU reads are very slow. On Fusion devices, this memory provides the
   fastest possible route for CPU writes followed by GPU reads.

4.6.2 Buffers
OpenCL buffers currently offer the widest variety of specialized buffer types and
optimized paths, as well as slightly higher transfer performance than images.

4.6.2.1 Regular Device Buffers
Buffers allocated using the flags CL_MEM_READ_ONLY, CL_MEM_WRITE_ONLY, or
CL_MEM_READ_WRITE are placed on the GPU device. These buffers can be
accessed by a GPU kernel at very high bandwidths. For example, on a high-end
graphics card, the OpenCL kernel read/write performance is significantly higher
than 100 GB/s. When device buffers are accessed by the host through any of
the OpenCL read/write/copy and map/unmap API calls, the result is an explicit
transfer across the hardware interconnect.
4.6.2.2 Zero Copy Buffers
AMD APP SDK 2.4 on Windows 7 and Vista introduces a new feature called zero
copy buffers.
If a buffer is of the zero copy type, the runtime tries to leave its content in place,
unless the application explicitly triggers a transfer (for example, through
clEnqueueCopyBuffer()). Depending on its type, a zero copy buffer resides on
the host or the device. Independent of its location, it can be accessed directly by
the host CPU or a GPU device kernel, at a bandwidth determined by the
capabilities of the hardware interconnect.
Calling clEnqueueMapBuffer() and clEnqueueUnmapMemObject() on a zero
copy buffer is typically a low-cost operation.
Since not all possible read and write paths perform equally, check the application
scenarios below for recommended usage. To assess performance on a given
platform, use the BufferBandwidth sample.
If a given platform supports the zero copy feature, the following buffer types are
available:

•  CL_MEM_ALLOC_HOST_PTR and CL_MEM_USE_HOST_PTR buffers are:
   –  zero copy buffers that reside on the host.
   –  directly accessible by the host at host memory bandwidth.
   –  directly accessible by the device across the interconnect.
   –  pre-pinned sources or destinations for CL read, write, and copy
      commands into device memory at peak interconnect bandwidth.

   Note that buffers created with the flag CL_MEM_ALLOC_HOST_PTR together with
   CL_MEM_READ_ONLY may reside in uncached write-combined memory. As a
   result, the CPU can have high streamed write bandwidth, but low read and
   potentially low write scatter bandwidth, due to the uncached WC path.

•  The CL_MEM_USE_PERSISTENT_MEM_AMD buffer is:
   –  a zero copy buffer that resides on the GPU device.
   –  directly accessible by the GPU device at GPU memory bandwidth.
   –  directly accessible by the host across the interconnect (typically with
      high streamed write bandwidth, but low read and potentially low write
      scatter bandwidth, due to the uncached WC path).
   –  copyable to, and from, the device at peak interconnect bandwidth using
      CL read, write, and copy commands.

   There is a platform-dependent limit on the maximum size of each buffer of
   this type, as well as on the total size of all such buffers (a good working
   assumption is 64 MB for the per-buffer limit, and 128 MB for the total).
Zero copy buffers work well on Fusion APU devices. SDK 2.5 introduced an
optimization that is of particular benefit on Fusion APUs. The runtime uses
USWC memory for buffers allocated as CL_MEM_ALLOC_HOST_PTR |
CL_MEM_READ_ONLY. On Fusion systems, this type of zero copy buffer can be
written to by the CPU at very high data rates, then handed over to the GPU at
minimal cost for equally high GPU read-data rates over the Radeon memory bus.
This path provides the highest data transfer rate for the CPU-to-GPU path. The
use of multiple CPU cores may be necessary to achieve peak write performance:
1. buffer = clCreateBuffer(CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY)
2. address = clMapBuffer( buffer )
3. memset( address ) or memcpy( address ) (if possible, using multiple CPU
cores)
4. clEnqueueUnmapMemObject( buffer )
5. clEnqueueNDRangeKernel( buffer )
As this memory is not cacheable, CPU read operations are very slow. This type
of buffer also exists on discrete platforms, but transfer performance typically is
limited by PCIe bandwidth.
Zero copy buffers can provide low latency for small transfers, depending on the
transfer path. For small buffers, the combined latency of map/CPU memory
access/unmap can be smaller than the corresponding DMA latency.
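A hedged C version of the five numbered steps above follows; names are
illustrative, error checking is omitted, and the kernel is assumed to take the
buffer as its first argument.

#include <CL/cl.h>
#include <string.h>

/* Sketch of the CPU-write / GPU-read zero copy path on Fusion APUs. */
void cpu_fill_gpu_read(cl_context ctx, cl_command_queue q, cl_kernel kern,
                       const void *src, size_t size, size_t gws)
{
    cl_mem buf = clCreateBuffer(ctx,
                                CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY,
                                size, NULL, NULL);              /* step 1 */
    void *p = clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_WRITE,
                                 0, size, 0, NULL, NULL, NULL); /* step 2 */
    memcpy(p, src, size);    /* step 3: use multiple cores if possible */
    clEnqueueUnmapMemObject(q, buf, p, 0, NULL, NULL);          /* step 4 */
    clSetKernelArg(kern, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(q, kern, 1, NULL, &gws, NULL,
                           0, NULL, NULL);                      /* step 5 */
}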
4.6.2.3 Pre-pinned Buffers
AMD APP SDK 2.5 introduces a new feature called pre-pinned buffers. This
feature is supported on Windows 7, Windows Vista, and Linux.
Buffers of type CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_HOST_PTR are pinned at
creation time. These buffers can be used directly as a source or destination for
clEnqueueCopyBuffer to achieve peak interconnect bandwidth. Mapped buffers
also can be used as a source or destination for clEnqueueRead/WriteBuffer
calls, again achieving peak interconnect bandwidth. Note that using
CL_MEM_USE_HOST_PTR permits turning an existing user memory region into
pre-pinned memory. However, in order to stay on the fast path, that memory must
be aligned to 256 bytes. Buffers of type CL_MEM_USE_HOST_PTR remain pre-pinned
as long as they are used only for data transfer, but not as kernel arguments. If
the buffer is used in a kernel, the runtime creates a cached copy on the device,
and subsequent copies are not on the fast path. The same restriction applies to
CL_MEM_ALLOC_HOST_PTR allocations under Linux.
See usage examples described for various options below.
The pre-pinned path is supported for the following calls:

•  clEnqueueRead/WriteBuffer
•  clEnqueueRead/WriteImage
•  clEnqueueRead/WriteBufferRect (Windows only)
Offsets into mapped buffer addresses are supported, too.
Note that the CL image calls must use pre-pinned mapped buffers on the host
side, and not pre-pinned images.
4.6.2.4 Application Scenarios and Recommended OpenCL Paths
The following section describes various application scenarios, and the
corresponding paths in the OpenCL API that are known to work well on AMD
platforms. The various cases are listed, ordered from generic to more
specialized.
From an application point of view, two fundamental use cases exist, and they can
be linked to the various options, described below.

•  An application wants to transfer a buffer that was already allocated through
   malloc() or mmap(). In this case, options 2), 3), and 4) below always consist
   of a memcpy() plus a device transfer. Option 1) does not require a memcpy().

•  If an application is able to let OpenCL allocate the buffer, options 2) and 4)
   below can be used to avoid the extra memcpy(). In the case of option 5),
   memcpy() and transfer are identical.

Note that the OpenCL runtime uses deferred allocation to maximize memory
resources. This means that a complete roundtrip chain, including data transfer
and kernel compute, might take one or two iterations to reach peak performance.
A code sample named BufferBandwidth can be used to investigate and
benchmark the various transfer options in combination with different buffer types.
Option 1 - clEnqueueWriteBuffer() and clEnqueueReadBuffer()
This option is the easiest to use on the application side.
CL_MEM_USE_HOST_PTR is an ideal choice if the application wants to transfer
a buffer that has already been allocated through malloc() or mmap().
There are two ways to use this option. The first uses
clEnqueueRead/WriteBuffer on a pre-pinned, mapped host-side buffer:
a. pinnedBuffer = clCreateBuffer( CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_HOST_PTR )
b. deviceBuffer = clCreateBuffer()
c. void *pinnedMemory = clEnqueueMapBuffer( pinnedBuffer )
d. clEnqueueRead/WriteBuffer( deviceBuffer, pinnedMemory )
e. clEnqueueUnmapMemObject( pinnedBuffer, pinnedMemory )
The pinning cost is incurred at step a; step d does not incur any pinning cost.
Typically, an application performs steps a, b, c, and e once. It then
repeatedly reads or modifies the data in pinnedMemory, followed by step d.
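A hedged C sketch of steps a through e (names follow the steps above; error
checking is omitted):

#include <CL/cl.h>

/* Sketch of Option 1: read/write through a pre-pinned, mapped buffer. */
void option1_prepinned(cl_context ctx, cl_command_queue q, size_t size)
{
    cl_mem pinnedBuffer = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR,
                                         size, NULL, NULL);         /* a */
    cl_mem deviceBuffer = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                         size, NULL, NULL);         /* b */
    void *pinnedMemory = clEnqueueMapBuffer(q, pinnedBuffer, CL_TRUE,
                                            CL_MAP_READ | CL_MAP_WRITE,
                                            0, size, 0, NULL, NULL,
                                            NULL);                  /* c */
    /* ... application fills or modifies pinnedMemory ... */
    clEnqueueWriteBuffer(q, deviceBuffer, CL_TRUE, 0, size,
                         pinnedMemory, 0, NULL, NULL);              /* d */
    clEnqueueUnmapMemObject(q, pinnedBuffer, pinnedMemory,
                            0, NULL, NULL);                         /* e */
}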


For the second way to use this option, clEnqueueRead/WriteBuffer is used
directly on a user memory buffer. The standard clEnqueueRead/Write calls
require the memory pages to be pinned (locked in memory) before they can be
copied (by the DMA engine). This creates a performance penalty that is
proportional to the buffer size. The performance of this path is currently about
two-thirds of peak interconnect bandwidth.
Option 2 - clEnqueueCopyBuffer() on a pre-pinned host buffer (requires
pre-pinned buffer support)
This is analogous to Option 1. Performing a CL copy of a pre-pinned buffer
to a device buffer (or vice versa) runs at peak interconnect bandwidth:
a. pinnedBuffer = clCreateBuffer( CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_HOST_PTR )
b. deviceBuffer = clCreateBuffer()
This is followed either by:
c. void *memory = clEnqueueMapBuffer( pinnedBuffer )
d. Application writes or modifies memory.
e. clEnqueueUnmapMemObject( pinnedBuffer, memory )
f. clEnqueueCopyBuffer( pinnedBuffer, deviceBuffer )
or by:
g. clEnqueueCopyBuffer( deviceBuffer, pinnedBuffer )
h. void *memory = clEnqueueMapBuffer( pinnedBuffer )
i. Application reads memory.
j. clEnqueueUnmapMemObject( pinnedBuffer, memory )

Since the pinnedBuffer resides in host memory, the clMap() and clUnmap()
calls do not result in data transfers, and they are of very low latency. Sparse
or dense memory operations by the application take place at host memory
bandwidth.
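A hedged C sketch of the host-to-device branch (steps a through f above;
names illustrative, error checking omitted):

#include <CL/cl.h>

/* Sketch of Option 2: copy a pre-pinned host buffer to a device buffer. */
void option2_copy_to_device(cl_context ctx, cl_command_queue q, size_t size)
{
    cl_mem pinnedBuffer = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR,
                                         size, NULL, NULL);         /* a */
    cl_mem deviceBuffer = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                         size, NULL, NULL);         /* b */
    void *memory = clEnqueueMapBuffer(q, pinnedBuffer, CL_TRUE,
                                      CL_MAP_WRITE, 0, size,
                                      0, NULL, NULL, NULL);         /* c */
    /* ... application writes or modifies memory (step d) ... */
    clEnqueueUnmapMemObject(q, pinnedBuffer, memory,
                            0, NULL, NULL);                         /* e */
    clEnqueueCopyBuffer(q, pinnedBuffer, deviceBuffer,
                        0, 0, size, 0, NULL, NULL);                 /* f */
}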
Option 3 - clEnqueueMapBuffer() and clEnqueueUnmapMemObject() of a
Device Buffer
This is a good choice if the application fills in the data on the fly, or requires
a pointer for calls to other library functions (such as fread() or fwrite()).
An optimized path exists for regular device buffers; this path provides peak
interconnect bandwidth at map/unmap time.
For buffers already allocated through malloc() or mmap(), the total transfer
cost includes a memcpy() into the mapped device buffer, in addition to the
interconnect transfer. Typically, this is slower than option 1), above.
The transfer sequence is as follows:
a. Data transfer from host to device buffer.
1. ptr = clEnqueueMapBuffer( .., buf, .., CL_MAP_WRITE, .. )


Since the buffer is mapped write-only, no data is transferred from
device buffer to host. The map operation is very low cost. A pointer
to a pinned host buffer is returned.
2. The application fills in the host buffer through memset( ptr ),
memcpy ( ptr, srcptr ), fread( ptr ), or direct CPU writes.
This happens at host memory bandwidth.
3. clEnqueueUnmapMemObject( .., buf, ptr, .. )
The pre-pinned buffer is transferred to the GPU device, at peak
interconnect bandwidth.
b. Data transfer from device buffer to host.
1. ptr = clEnqueueMapBuffer(.., buf, .., CL_MAP_READ, .. )
This command triggers a transfer from the device to host memory,
into a pre-pinned temporary buffer, at peak interconnect bandwidth.
A pointer to the pinned memory is returned.
2. The application reads and processes the data, or executes a
memcpy( dstptr, ptr ), fwrite (ptr), or similar function. Since
the buffer resides in host memory, this happens at host memory
bandwidth.
3. clEnqueueUnmapMemObject( .., buf, ptr, .. )
Since the buffer was mapped as read-only, no transfer takes place,
and the unmap operation is very low cost.
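A hedged C sketch of both transfer sequences above (names illustrative,
error checking omitted):

#include <CL/cl.h>
#include <string.h>

/* Sketch of Option 3: map/unmap a regular device buffer for transfer. */
void option3_map_device_buffer(cl_command_queue q, cl_mem buf,
                               const void *srcptr, void *dstptr, size_t size)
{
    /* a. Host to device: map write-only (no device-to-host transfer). */
    void *ptr = clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_WRITE,
                                   0, size, 0, NULL, NULL, NULL);
    memcpy(ptr, srcptr, size);          /* host memory bandwidth */
    clEnqueueUnmapMemObject(q, buf, ptr, 0, NULL, NULL); /* DMA to GPU */

    /* ... kernels execute and write buf ... */

    /* b. Device to host: map read-only (transfer happens at map time). */
    ptr = clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_READ,
                             0, size, 0, NULL, NULL, NULL);
    memcpy(dstptr, ptr, size);          /* host memory bandwidth */
    clEnqueueUnmapMemObject(q, buf, ptr, 0, NULL, NULL); /* no transfer */
}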
Option 4 - Direct host access to a zero copy device buffer (requires zero
copy support)
This option allows overlapping of data transfers and GPU compute. It is also
useful for sparse write updates under certain constraints.
a. A zero copy buffer on the device is created using the following command:
buf = clCreateBuffer ( .., CL_MEM_USE_PERSISTENT_MEM_AMD, .. )

This buffer can be directly accessed by the host CPU, using the
uncached WC path. This can take place at the same time the GPU
executes a compute kernel. A common double buffering scheme has the
kernel process data from one buffer while the CPU fills a second buffer.
See the TransferOverlap code sample.
A zero copy device buffer can also be used for sparse updates, such
as assembling sub-rows of a larger matrix into a smaller, contiguous
block for GPU processing. Due to the WC path, it is a good design
choice to try to align writes to the cache line size, and to pick the write
block size as large as possible.


b. Transfer from the host to the device.
1. ptr = clEnqueueMapBuffer( .., buf, .., CL_MAP_WRITE, .. )
This operation is low cost because the zero copy device buffer is
directly mapped into the host address space.
2. The application transfers data via memset( ptr ), memcpy( ptr,
srcptr ), or direct CPU writes.
The CPU writes directly across the interconnect into the zero copy
device buffer. Depending on the chipset, the bandwidth can be of
the same order of magnitude as the interconnect bandwidth,
although it typically is lower than peak.
3. clEnqueueUnmapMemObject( .., buf, ptr, .. )
As with the preceding map, this operation is low cost because the
buffer continues to reside on the device.
c. If the buffer content must be read back later, use
clEnqueueReadBuffer( .., buf, .. ) or
clEnqueueCopyBuffer( .., buf, zero copy host buffer, .. ).

This bypasses slow host reads through the uncached path.
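A hedged C sketch of this option (names illustrative; assumes the platform
supports the CL_MEM_USE_PERSISTENT_MEM_AMD flag; error checking omitted):

#include <CL/cl.h>
#include <string.h>

/* Sketch of Option 4: direct host writes into a zero copy device buffer. */
void option4_persistent(cl_context ctx, cl_command_queue q,
                        const void *srcptr, size_t size)
{
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_USE_PERSISTENT_MEM_AMD,
                                size, NULL, NULL);                /* a */
    void *ptr = clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_WRITE,
                                   0, size, 0, NULL, NULL, NULL); /* b.1 */
    memcpy(ptr, srcptr, size);   /* b.2: CPU writes across interconnect */
    clEnqueueUnmapMemObject(q, buf, ptr, 0, NULL, NULL);          /* b.3 */
    /* ... kernels use buf; read back later with clEnqueueReadBuffer or
     * clEnqueueCopyBuffer into a zero copy host buffer (step c). */
}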
Option 5 - Direct GPU access to a zero copy host buffer (requires zero
copy support)
This option allows direct reads or writes of host memory by the GPU. A GPU
kernel can import data from the host without explicit transfer, and write data
directly back to host memory. An ideal use is to perform small I/Os straight
from the kernel, or to integrate the transfer latency directly into the kernel
execution time.
a. The application creates a zero copy host buffer.
buf = clCreateBuffer( .., CL_MEM_ALLOC_HOST_PTR, .. )
b. Next, the application modifies or reads the zero copy host buffer.
1. ptr = clEnqueueMapBuffer( .., buf, .., CL_MAP_READ |
CL_MAP_WRITE, .. )

This operation is very low cost because it is a map of a buffer
already residing in host memory.
2. The application modifies the data through memset( ptr ),
memcpy(in either direction), sparse or dense CPU reads or writes.
Since the application is modifying a host buffer, these operations
take place at host memory bandwidth.


3. clEnqueueUnmapMemObject( .., buf, ptr, .. )
As with the preceding map, this operation is very low cost because
the buffer continues to reside in host memory.
c. The application runs clEnqueueNDRangeKernel(), using buffers of this
type as input or output. GPU kernel reads and writes go across the
interconnect to host memory, and the data transfer becomes part of the
kernel execution.
The achievable bandwidth depends on the platform and chipset, but can
be of the same order of magnitude as the peak interconnect bandwidth.
For discrete graphics cards, it is important to note that resulting GPU
kernel bandwidth is an order of magnitude lower compared to a kernel
accessing a regular device buffer located on the device.

d. Following kernel execution, the application can access data in the host
buffer in the same manner as described above.
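A hedged C sketch of this option (names illustrative, error checking
omitted; the kernel is assumed to take the buffer as its first argument):

#include <CL/cl.h>
#include <string.h>

/* Sketch of Option 5: the GPU kernel reads/writes a zero copy host
 * buffer directly, so the transfer is folded into kernel execution. */
void option5_zero_copy_host(cl_context ctx, cl_command_queue q,
                            cl_kernel kern, const void *srcptr,
                            size_t size, size_t gws)
{
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR,
                                size, NULL, NULL);                /* a */
    void *ptr = clEnqueueMapBuffer(q, buf, CL_TRUE,
                                   CL_MAP_READ | CL_MAP_WRITE,
                                   0, size, 0, NULL, NULL, NULL); /* b.1 */
    memcpy(ptr, srcptr, size);          /* b.2: host memory bandwidth */
    clEnqueueUnmapMemObject(q, buf, ptr, 0, NULL, NULL);          /* b.3 */

    /* c. Kernel accesses buf across the interconnect. */
    clSetKernelArg(kern, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(q, kern, 1, NULL, &gws, NULL, 0, NULL, NULL);
}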

4.7 Using Multiple OpenCL Devices
The AMD OpenCL runtime supports both CPU and GPU devices. This section
introduces techniques for appropriately partitioning the workload and balancing it
across the devices in the system.

4.7.1 CPU and GPU Devices
Table 4.4 lists some key performance characteristics of two exemplary CPU and
GPU devices: a quad-core AMD Phenom II X4 processor running at 2.8 GHz,
and a mid-range AMD Radeon™ HD 7770 GPU running at 1 GHz. The “best”
device in each characteristic is highlighted, and the ratio of the best/other device
is shown in the final column.
The GPU excels at high throughput: the peak execution rate (measured in
FLOPS) is 14X higher than the CPU, and the memory bandwidth is 2.8X higher
than the CPU. The GPU also consumes approximately 84% of the power of the
CPU; thus, for this comparison, the power efficiency in flops/watt is 18X higher
(see Table 4.4).
While power efficiency can vary significantly with different devices, GPUs
generally provide greater power efficiency (flops/watt) than CPUs because they
optimize for throughput and eliminate hardware designed to hide latency.


Table 4.4  CPU and GPU Performance Characteristics

                                       CPU                  GPU                   Winner Ratio
Example Device                         AMD Phenom™ II X4    AMD Radeon™ HD 7770
Core Frequency                         2800 MHz             1 GHz                 3 X
Compute Units                          4                    10                    2.5 X
Approx. Power(1)                       95 W                 80 W                  1.2 X
Approx. Power/Compute Unit             19 W                 8 W                   2.4 X
Peak Single-Precision
  Billion Floating-Point Ops/Sec       90                   1280                  14 X
Approx GFLOPS/Watt                     0.9                  16                    18 X
Max In-flight HW Threads               4                    25600                 6400 X
Simultaneous Executing Threads         4                    640                   160 X
Memory Bandwidth                       26 GB/s              72 GB/s               2.8 X
Int Add Latency                        0.4 ns               4 ns                  10 X
FP Add Latency                         1.4 ns               4 ns                  2.9 X
Approx DRAM Latency                    50 ns                270 ns                5.4 X
L2+L3 (GPU only L2) Cache Capacity     8192 kB              128 kB                64 X
Approx Kernel Launch Latency           25 μs                50 μs                 2 X

1. For the power specifications of the AMD Phenom™ II X4, see http://www.amd.com/us/products/desktop/processors/phenom-ii/Pages/phenom-ii-model-number-comparison.aspx.

Conversely, CPUs excel at latency-sensitive tasks. For example, an integer add
is 10X faster on the CPU than on the GPU. This is a product of both the CPU's
higher clock rate (2800 MHz vs 1000 MHz for this comparison) and the
operation latency; the CPU is optimized to perform an integer add in just one
cycle, while the GPU requires four cycles. The CPU also has a latency-optimized
path to DRAM, while the GPU optimizes for bandwidth and relies on many
in-flight threads to hide the latency. The AMD Radeon™ HD 7770 GPU, for example,
supports more than 25,000 in-flight work-items and can switch to a new
wavefront (containing up to 64 work-items) in a single cycle. The CPU supports
only four hardware threads, and thread-switching requires saving and restoring
the CPU registers from memory. The GPU requires many active threads both to
keep the execution resources busy and to provide enough threads to hide
the long latency of cache misses.
Each GPU thread has its own register state, which enables the fast single-cycle
switching between threads. Also, GPUs can be very efficient at gather/scatter
operations: each thread can load from any arbitrary address, and the registers
are completely decoupled from the other threads. This is substantially more
flexible and higher-performing than a classic Vector ALU-style architecture (such
as SSE on the CPU), which typically requires that data be accessed from
contiguous and aligned memory locations. SSE supports instructions that write
parts of a register (for example, MOVLPS and MOVHPS, which write the lower and
upper halves, respectively, of an SSE register), but these instructions generate
additional microarchitecture dependencies and frequently require additional pack
instructions to format the data correctly.

In contrast, each GPU thread shares the same program counter with 63 other
threads in a wavefront. Divergent control-flow on a GPU can be quite expensive
and can lead to significant under-utilization of the GPU device. When control flow
substantially narrows the number of valid work-items in a wave-front, it can be
faster to use the CPU device.
CPUs also tend to provide significantly more on-chip cache than GPUs. In this
example, the CPU device contains 512 kB L2 cache/core plus a 6 MB L3 cache
that is shared among all cores, for a total of 8 MB of cache. In contrast, the GPU
device contains only 128 kB cache shared by the ten compute units. The larger
CPU cache serves both to reduce the average memory latency and to reduce
memory bandwidth in cases where data can be re-used from the caches.
Finally, note the approximate 2X difference in kernel launch latency. The GPU
launch time includes both the latency through the software stack, as well as the
time to transfer the compiled kernel and associated arguments across the
PCI Express bus to the discrete GPU. Notably, the launch time does not include
the time to compile the kernel. The CPU can be the device-of-choice for small,
quick-running problems when the overhead to launch the work on the GPU outweighs
the potential speedup. Often, the work size is data-dependent, and the choice of
device can be data-dependent as well. For example, an image-processing
algorithm may run faster on the GPU if the images are large, but faster on the
CPU when the images are small.
The differences in performance characteristics present interesting optimization
opportunities. Workloads that are large and data parallel can run orders of
magnitude faster on the GPU, and at higher power efficiency. Serial or small
parallel workloads (too small to efficiently use the GPU resources) often run
significantly faster on the CPU devices. In some cases, the same algorithm can
exhibit both types of workload. A simple example is a reduction operation such
as a sum of all the elements in a large array. The beginning phases of the
operation can be performed in parallel and run much faster on the GPU. The end
of the operation requires summing together the partial sums that were computed
in parallel; eventually, the width becomes small enough so that the overhead to
parallelize outweighs the computation cost, and it makes sense to perform a
serial add. For these serial operations, the CPU can be significantly faster than
the GPU.

4.7.2 When to Use Multiple Devices
One of the features of GPU computing is that some algorithms can run
substantially faster and at better energy efficiency compared to a CPU device.
Also, once an algorithm has been coded in the data-parallel task style for
OpenCL, the same code typically can scale to run on GPUs with increasing
compute capability (that is, more compute units) or even multiple GPUs (with a
little more work).
For some algorithms, the advantages of the GPU (high computation throughput,
latency hiding) are offset by the advantages of the CPU (low latency, caches, fast
launch time), so that the performance on either device is similar. This case is
more common for mid-range GPUs and when running more mainstream
algorithms. If the CPU and the GPU deliver similar performance, the user can
get the benefit of either improved power efficiency (by running on the GPU) or
higher peak performance (by using both devices).
Usually, when the data size is small, it is faster to use the CPU because the
start-up time is quicker than on the GPU, due to smaller driver overhead and
no need to copy buffers from the host to the device.

4.7.3 Partitioning Work for Multiple Devices
By design, each OpenCL command queue can only schedule work on a single
OpenCL device. Thus, using multiple devices requires the developer to create a
separate queue for each device, then partition the work between the available
command queues.
A simple scheme for partitioning work between devices would be to statically
determine the relative performance of each device, partition the work so that
faster devices received more work, launch all the kernels, and then wait for them
to complete. In practice, however, this rarely yields optimal performance. The
relative performance of devices can be difficult to determine, in particular for
kernels whose performance depends on the data input. Further, the device
performance can be affected by dynamic frequency scaling, OS thread
scheduling decisions, or contention for shared resources, such as shared caches
and DRAM bandwidth. Simple static partitioning algorithms which “guess wrong”
at the beginning can result in significantly lower performance, since some
devices finish and become idle while the whole system waits for the single,
unexpectedly slow device.
For these reasons, a dynamic scheduling algorithm is recommended. In this
approach, the workload is partitioned into smaller parts that are periodically
scheduled onto the hardware. As each device completes a part of the workload,
it requests a new part to execute from the pool of remaining work. Faster
devices, or devices which work on easier parts of the workload, request new
input faster, resulting in a natural workload balancing across the system. The
approach creates some additional scheduling and kernel submission overhead,
but dynamic scheduling generally helps avoid the performance cliff from a single
bad initial scheduling decision, and it generally delivers higher performance in
real-world system environments (since it can adapt to system conditions as the
algorithm runs). A hedged sketch of this approach follows.
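The following is a hedged sketch of such a dynamic scheduler for two devices;
queue[], kernel[], the chunk size, and the kernel's offset argument are
illustrative assumptions, and error checking is omitted.

#include <CL/cl.h>

#define CHUNK_SIZE 65536   /* illustrative grain size, in work-items */

/* Sketch: each idle device pulls the next chunk from a shared pool, so
 * faster devices naturally request work more often. The kernel is assumed
 * to take a ulong chunk offset as argument 1. */
void dynamic_schedule(cl_command_queue queue[2], cl_kernel kernel[2],
                      size_t num_chunks)
{
    size_t next_chunk = 0;
    int idle[2] = {1, 1};
    cl_event ev[2];

    while (next_chunk < num_chunks || !idle[0] || !idle[1]) {
        for (int d = 0; d < 2; ++d) {
            /* Hand the next chunk to any idle device. */
            if (idle[d] && next_chunk < num_chunks) {
                cl_ulong offset = (cl_ulong)(next_chunk++ * CHUNK_SIZE);
                size_t gws = CHUNK_SIZE;
                clSetKernelArg(kernel[d], 1, sizeof(offset), &offset);
                clEnqueueNDRangeKernel(queue[d], kernel[d], 1, NULL,
                                       &gws, NULL, 0, NULL, &ev[d]);
                clFlush(queue[d]);      /* submit immediately */
                idle[d] = 0;
            }
            /* Poll for completion without blocking the scheduling loop. */
            if (!idle[d]) {
                cl_int status;
                clGetEventInfo(ev[d], CL_EVENT_COMMAND_EXECUTION_STATUS,
                               sizeof(status), &status, NULL);
                if (status == CL_COMPLETE) {
                    clReleaseEvent(ev[d]);
                    idle[d] = 1;
                }
            }
        }
    }
}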
Multi-core runtimes, such as Cilk, have already introduced dynamic scheduling
algorithms for multi-core CPUs, and it is natural to consider extending these
scheduling algorithms to GPUs as well as CPUs. A GPU introduces several new
aspects to the scheduling process:

•  Heterogeneous Compute Devices
   Most existing multi-core schedulers target only homogeneous computing
   devices. When scheduling across both CPU and GPU devices, the scheduler
   must be aware that the devices can have very different performance
characteristics (10X or more) for some algorithms. To some extent, dynamic
scheduling is already designed to deal with heterogeneous workloads (based
on data input the same algorithm can have very different performance, even
when run on the same device), but a system with heterogeneous devices
makes these cases more common and more extreme. Here are some
suggestions for these situations.

   –  The scheduler should support sending different workload sizes to
      different devices. GPUs typically prefer larger grain sizes, and
      higher-performing GPUs prefer still larger grain sizes.

   –  The scheduler should be conservative about allocating work until after it
      has examined how the work is being executed. In particular, it is
      important to avoid the performance cliff that occurs when a slow device
      is assigned an important long-running task. One technique is to use small
      grain allocations at the beginning of the algorithm, then switch to larger
      grain allocations when the device characteristics are well-known.

   –  As a special case of the above rule, when the devices are substantially
      different in performance (perhaps 10X), load-balancing has only a small
      potential performance upside, and the overhead of scheduling the load
      probably eliminates the advantage. In the case where one device is far
      faster than everything else in the system, use only the fast device.

   –  The scheduler must balance small grain sizes (which increase the
      adaptiveness of the schedule and can efficiently use heterogeneous
      devices) with larger grain sizes (which reduce scheduling overhead).
      Note that the grain size must be large enough to efficiently use the GPU.

•  Asynchronous Launch
OpenCL devices are designed to be scheduled asynchronously from a
command-queue. The host application can enqueue multiple kernels, flush
the kernels so they begin executing on the device, then use the host core for
other work. The AMD OpenCL implementation uses a separate thread for
each command-queue, so work can be transparently scheduled to the GPU
in the background.
Avoid starving the high-performance GPU devices. This can occur if the
physical CPU core, which must re-fill the device queue, is itself being used
as a device. A simple approach to this problem is to dedicate a physical CPU
core for scheduling chores. The device fission extension (see Section A.7,
“cl_ext Extensions,” page A-4) can be used to reserve a core for scheduling.
For example, on a quad-core device, device fission can be used to create an
OpenCL device with only three cores.
Another approach is to schedule enough work to the device so that it can
tolerate latency in additional scheduling. Here, the scheduler maintains a
watermark of uncompleted work that has been sent to the device, and refills
the queue when it drops below the watermark. This effectively increases the
grain size, but can be very effective at reducing or eliminating device
starvation. Developers cannot directly query the list of commands in the
OpenCL command queues; however, it is possible to pass an event to each
clEnqueue call that can be queried, in order to determine the execution
status (in particular, the command completion time); developers also can
maintain their own queue of outstanding requests.
For many algorithms, this technique can be effective enough at hiding
latency so that a core does not need to be reserved for scheduling. In
particular, algorithms where the work-load is largely known up-front often
work well with a deep queue and watermark. Algorithms in which work is
dynamically created may require a dedicated thread to provide low-latency
scheduling.

•  Data Location
Discrete GPUs use dedicated high-bandwidth memory that exists in a
separate address space. Moving data between the device address space
and the host requires time-consuming transfers over a relatively slow
PCI Express bus. Schedulers should be aware of this cost and, for example,
attempt to schedule work that consumes the result on the same device that
produced it.
CPU and GPU devices share the same memory bandwidth, which results in
additional interactions of kernel executions.

4.7.4 Synchronization Caveats
The OpenCL functions that enqueue work (clEnqueueNDRangeKernel) merely
enqueue the requested work in the command queue; they do not cause it to
begin executing. Execution begins when the user executes a synchronizing
command, such as clFlush or clWaitForEvents. Enqueuing several commands
before flushing can enable the host CPU to batch together the command
submission, which can reduce launch overhead.
Command-queues that are configured to execute in-order are guaranteed to
complete execution of each command before the next command begins. This
synchronization guarantee can often be leveraged to avoid explicit
clWaitForEvents() calls between command submissions. Using
clWaitForEvents() requires intervention by the host CPU and additional
synchronization cost between the host and the GPU; by leveraging the in-order
queue property, back-to-back kernel executions can be efficiently handled
directly on the GPU hardware.
AMD Southern Islands GPUs can execute multiple kernels simultaneously when
there are no dependencies.
The AMD OpenCL implementation spawns a new thread to manage each
command queue. Thus, the OpenCL host code is free to manage multiple
devices from a single host thread. Note that clFinish is a blocking operation;
the thread that calls clFinish blocks until all commands in the specified
command-queue have been processed and completed. If the host thread is
managing multiple devices, it is important to call clFlush for each
command-queue before calling clFinish, so that the commands are flushed and
execute in parallel on the devices. Otherwise, the first call to clFinish blocks,
the commands on the other devices are not flushed, and the devices appear to
execute serially rather than in parallel.
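A minimal sketch of this pattern, assuming an array of command queues (one
per device, created elsewhere); error checking is omitted.

#include <CL/cl.h>

/* Sketch: flush every queue before blocking on any of them, so all
 * devices execute their batches in parallel. */
void finish_all(cl_command_queue queues[], int n)
{
    for (int i = 0; i < n; ++i)
        clFlush(queues[i]);     /* submit work to every device first */
    for (int i = 0; i < n; ++i)
        clFinish(queues[i]);    /* now block; devices run in parallel */
}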
For low-latency CPU response, it can be more efficient to use a dedicated spin
loop and not call clFinish(). Calling clFinish() indicates that the application
wants to wait for the GPU, putting the thread to sleep. For low latency, the
application should use clFlush(), followed by a loop to wait for the event to
complete. This is also true for blocking maps: the application should use
non-blocking maps followed by a loop waiting on the event. The following
provides sample code for this.
if (sleep)
{
    // This puts the host thread to sleep; useful if power is a
    // consideration or overhead is not a concern.
    clFinish(cmd_queue_);
}
else
{
    // This keeps the host thread awake; useful if latency is a concern.
    clFlush(cmd_queue_);
    error_ = clGetEventInfo(event, CL_EVENT_COMMAND_EXECUTION_STATUS,
                            sizeof(cl_int), &eventStatus, NULL);
    while (eventStatus > 0)
    {
        error_ = clGetEventInfo(event, CL_EVENT_COMMAND_EXECUTION_STATUS,
                                sizeof(cl_int), &eventStatus, NULL);
        // Be nice to other threads; allow the scheduler to find other
        // work if possible. Choose your favorite way to yield, for
        // example SwitchToThread() in place of Sleep(0).
        Sleep(0);
    }
}

4.7.5 GPU and CPU Kernels
While OpenCL provides functional portability so that the same kernel can run on
any device, peak performance for each device is typically obtained by tuning the
OpenCL kernel for the target device.
Code optimized for the Tahiti device (the AMD Radeon™ HD 7970 GPU) typically
runs well across other members of the Southern Islands family.
CPUs and GPUs have very different performance characteristics, and some of
these impact how one writes an optimal kernel. Notable differences include:

•  The Vector ALU floating point resources in a CPU (SSE/AVX) require the use
   of vectorized types (such as float4) to enable packed SSE code generation
   and extract good performance from the Vector ALU hardware. The GPU
   Vector ALU hardware is more flexible and can efficiently use the
   floating-point hardware; however, code that can use float4 often generates
   high-quality code for both the CPU and the AMD GPUs.

•  The AMD OpenCL CPU implementation runs work-items from the same
   work-group back-to-back on the same physical CPU core. For optimally
   coalesced memory patterns, a common access pattern for GPU-optimized
   algorithms is for work-items in the same wavefront to access memory
   locations from the same cache line. On a GPU, these work-items execute in
   parallel and generate a coalesced access pattern. On a CPU, the first
   work-item runs to completion (or until hitting a barrier) before switching to
   the next. Generally, if the working set for the data used by a work-group
   fits in the CPU caches, this access pattern can work efficiently: the first
   work-item brings a line into the cache hierarchy, which the other work-items
   later hit. For large working sets that exceed the capacity of the cache
   hierarchy, this access pattern does not work as efficiently; each work-item
   refetches cache lines that were already brought in by earlier work-items but
   were evicted from the cache hierarchy before being used. Note that AMD
   CPUs typically provide 512 kB to 2 MB of L2+L3 cache for each compute unit.

•  CPUs do not contain any hardware resources specifically designed to
   accelerate local memory accesses. On a CPU, local memory is mapped to
   the same cacheable DRAM used for global memory, so there is no
   performance benefit from using the __local qualifier. The additional memory
   operations to write to LDS, and the associated barrier operations, can reduce
   performance. One notable exception is when local memory is used to pack
   values to avoid non-coalesced memory patterns.

•  CPU devices support only a small number of hardware threads, typically two
   to eight. A small number of active work-groups reduces the CPU switching
   overhead, although for larger kernels this is a second-order effect.

For a balanced solution that runs reasonably well on both devices, developers
are encouraged to write the algorithm using float4 vectorization. The GPU is
more sensitive to algorithm tuning; it also has higher peak performance potential.
Thus, one strategy is to target optimizations to the GPU and aim for reasonable
performance on the CPU. For peak performance on all devices, developers can
choose to use conditional compilation for key code loops in the kernel, or in some
cases even provide two separate kernels. Even with device-specific kernel
optimizations, the surrounding host code for allocating memory, launching
kernels, and interfacing with the rest of the program generally only needs to be
written once.
Another approach is to leverage a CPU-targeted routine written in a standard
high-level language, such as C++. In some cases, this code path may already
exist for platforms that do not support an OpenCL device. The program uses
OpenCL for GPU devices, and the standard routine for CPU devices.
Load-balancing between devices can still leverage the techniques described in
Section 4.7.3, “Partitioning Work for Multiple Devices,” page 4-32.

4.7.6 Contexts and Devices
The AMD OpenCL program creates at least one context, and each context can
contain multiple devices. Thus, developers must choose whether to place all
devices in the same context or create a new context for each device. Generally,
it is easier to extend a context to support additional devices rather than
duplicating the context for each device: buffers are allocated at the context level

4-36

Chapter 4: OpenCL Performance and Optimization
Copyright © 2012 Advanced Micro Devices, Inc. All rights reserved.

AMD ACCELERATED PARALLEL PROCESSING

(and automatically across all devices), programs are associated with the context,
and kernel compilation (via clBuildProgram) can easily be done for all devices
in a context. However, with current OpenCL implementations, creating a separate
context for each device provides more flexibility, especially in that buffer
allocations can be targeted to occur on specific devices. Generally, placing the
devices in the same context is the preferred solution.


Chapter 5
OpenCL Performance and Optimization for Southern Islands Devices
This chapter discusses performance and optimization when programming for
AMD Accelerated Parallel Processing GPU compute devices that are part of the
Southern Islands family, as well as CPUs and multiple devices. Details specific
to the Evergreen and Northern Islands families of GPUs are provided in
Chapter 6, “OpenCL Performance and Optimization for Evergreen and Northern
Islands Devices.”

5.1 Global Memory Optimization
Figure 5.1 is a block diagram of the GPU memory system. The up arrows are
read paths, the down arrows are write paths. WC is the write combine cache.
Figure 5.1  Memory System (block diagram: compute units, each with 16 PEs,
an LDS, and an L1 cache, connect through a compute unit <> memory channel
crossbar to the L2 caches and memory channels 0 through n-1)


The GPU consists of multiple compute units. Each compute unit contains local
(on-chip) memory, an L1 cache, registers, and 16 processing elements (PEs).
Individual work-items execute on a single processing element; one or more
work-groups execute on a single compute unit. On a GPU, hardware schedules
groups of work-items, called wavefronts, onto compute units; thus, work-items
within a wavefront execute in lock-step; the same instruction is executed on
different data.
Each compute unit contains 64 kB local memory, 16 kB of read/write L1 cache,
four vector units, and one scalar unit. The maximum local memory allocation is
32 kB per work-group. Each vector unit contains 512 scalar registers (SGPRs)
for handling branching, constants, and other data constant across a wavefront.
Vector units also contain 256 vector registers (VGPRs). VGPRs actually are
scalar registers, but they are replicated across the whole wavefront. Vector units
contain 16 processing elements (PEs). Each PE is scalar.
Since the L1 cache is 16 kB per compute unit, the total L1 cache size is
16 kB * (# of compute units). For the AMD Radeon™ HD 7970, this means a total
of 512 kB of L1 cache. L1 bandwidth can be computed as:
L1 peak bandwidth = Compute Units * (4 threads/clock) * (128 bits per thread) *
(1 byte / 8 bits) * Engine Clock
For the AMD Radeon™ HD 7970 (32 compute units at 925 MHz), this works out to
32 * 4 * 16 bytes * 0.925 GHz ≈ 1.9 TB/s.
The peak memory bandwidth of your device is available in Appendix D, “Device
Parameters.”
If two memory access requests are directed to the same controller, the hardware
serializes the access. This is called a channel conflict. Similarly, if two memory
access requests go to the same memory bank, hardware serializes the access.
This is called a bank conflict. From a developer’s point of view, there is not much
difference between channel and bank conflicts. Often, a large power of two stride
results in a channel conflict. The size of the power of two stride that causes a
specific type of conflict depends on the chip. A stride that results in a channel
conflict on a machine with eight channels might result in a bank conflict on a
machine with four.
In this document, the term bank conflict is used to refer to either kind of conflict.

5.1.1 Channel Conflicts
The important concept is memory stride: the increment in memory address,
measured in elements, between successive elements fetched or stored by
consecutive work-items in a kernel. Many important kernels do not exclusively
use simple stride one accessing patterns; instead, they feature large non-unit
strides. For instance, many codes perform similar operations on each dimension
of a two- or three-dimensional array. Performing computations on the low
dimension can often be done with unit stride, but the strides of the computations
in the other dimensions are typically large values. This can result in significantly
degraded performance when the codes are ported unchanged to GPU systems.


A CPU with caches presents the same problem: large power-of-two strides force
data into only a few cache lines.
One solution is to rewrite the code to employ array transpositions between the
kernels. This allows all computations to be done at unit stride. Ensure that the
time required for the transposition is relatively small compared to the time to
perform the kernel calculation.
For many kernels, the reduction in performance is sufficiently large that it is
worthwhile to try to understand and solve this problem.
In GPU programming, it is best to have adjacent work-items read or write
adjacent memory addresses. This is one way to avoid channel conflicts.
When the application has complete control of the access pattern and address
generation, the developer must arrange the data structures to minimize bank
conflicts. Accesses that differ in the lower bits can run in parallel; those that differ
only in the upper bits can be serialized.
•  Bitfield extract on signed/unsigned integers:
   (A >> B) & C ==> [u]bit_extract
   where
   –  B and C are compile-time constants,
   –  A is an 8-, 16-, or 32-bit integer type, and
   –  C is a mask.

•  Bitfield insert on signed/unsigned integers:
   ((A & B) << C) | ((D & E) << F) ==> ubit_insert
   where
   –  B and E have no conflicting bits (B ^ E == 0),
   –  B, C, E, and F are compile-time constants,
   –  B and E are masks,
   –  the first bit set in B is greater than the number of bits in E plus the
      first bit set in E, or the first bit set in E is greater than the number
      of bits in B plus the first bit set in B, and
   –  the optimization is also supported if B, C, E, or F are equivalent to
      the value 0.

5.8 Additional Performance Guidance
This section is a collection of performance tips for GPU compute and
AMD-specific optimizations.
5.8.1 Loop Unroll pragma
The compiler directive #pragma unroll <unroll-factor> can be placed
immediately prior to a loop as a hint to the compiler to unroll the loop.
<unroll-factor> must be a positive integer, 1 or greater. When <unroll-factor>
is 1, loop unrolling is disabled. When <unroll-factor> is 2 or greater, the
compiler uses this as a hint for the number of times the loop is to be unrolled.


Examples of using this pragma follow.
No unrolling example:
#pragma unroll 1
for (int i = 0; i < n; i++) {
...
}

Partial unrolling example:
#pragma unroll 4
for (int i = 0; i < 128; i++) {
...
}

Currently, the unroll pragma requires that the loop boundaries be determined
at compile time; both loop bounds must be known. If the unroll factor is not
specified and the compiler can determine the loop count, the compiler fully
unrolls the loop. If the unroll factor is not specified and the compiler cannot
determine the loop count, the compiler does no unrolling.

5.8.2 Memory Tiling
There are many possible physical memory layouts for images. AMD Accelerated
Parallel Processing devices can access memory in a tiled or in a linear
arrangement.

•  Linear – A linear layout format arranges the data linearly in memory, such
   that element addresses are sequential. This is the layout that is familiar to
   CPU programmers. This format must be used for OpenCL buffers; it can be
   used for images.

•  Tiled – A tiled layout format has a pre-defined sequence of element blocks
   arranged in sequential memory addresses (see Figure 5.4 for a conceptual
   illustration). A microtile consists of ABIJ; a macrotile consists of the
   top-left 16 squares for which the arrows are red. Only images can use this
   format. Translating from user address space to the tiled arrangement is
   transparent to the user. Tiled memory layouts provide an optimized memory
   access pattern to make more efficient use of the RAM attached to the GPU
   compute device. This can contribute to lower latency.

Figure 5.4  One Example of a Tiled Layout Format (the figure shows the
physical arrangement of element blocks A through X alongside the
corresponding logical layout)

Memory Access Pattern
Memory access patterns in compute kernels are usually different from those in
the pixel shaders. Whereas the access pattern for pixel shaders is in a
hierarchical, space-filling curve pattern and is tuned for tiled memory
performance (generally for textures), the access pattern for a compute kernel is
linear across each row before moving to the next row in the global id space. This
has an effect on performance, since pixel shaders have implicit blocking, and
compute kernels do not. If accessing a tiled image, best performance is achieved
if the application tries to use workgroups with 16x16 (or 8x8) work-items.

5.8.3 General Tips
•  Avoid declaring global arrays on the kernel's stack frame, as these typically
   cannot be allocated in registers and require expensive global memory
   operations.

•  Use predication rather than control-flow. Predication allows the GPU to
   execute both paths of execution in parallel, which can be faster than
   attempting to minimize the work through clever control-flow. The reason for
   this is that if no memory operation exists in a ?: operator (also called a
   ternary operator), this operation is translated into a single cmov_logical
   instruction, which is executed in a single cycle. An example of this is:

   if (A > B) {
       C += D;
   } else {
       C -= D;
   }

   Replace this with:

   int factor = (A > B) ? 1 : -1;
   C += factor * D;


In the first block of code, this translates into an IF/ELSE/ENDIF sequence of
conditional code, each taking ~8 cycles. If divergent, this code executes in
~36 clocks; otherwise, in ~28 clocks. A branch not taken costs four cycles
(one instruction slot); a branch taken adds four slots of latency to fetch
instructions from the instruction cache, for a total of 16 clocks. Since the
execution mask is saved, then modified, then restored for the branch, ~12
clocks are added when divergent, ~8 clocks when not.
In the second block of code, the ?: operator executes in the vector units, so
no extra CF instructions are generated. Since the instructions are sequentially
dependent, this block of code executes in 12 cycles, for a 1.3x speed
improvement. To see this, note that the first cycle is the (A>B) comparison, the
result of which is input to the second cycle, which is the cmov_logical factor,
bool, 1, -1. The final cycle is a MAD instruction: mad C, factor, D, C. If the
ratio between conditional code and ALU instructions is low, this is a good
pattern to remove the control flow.

•  Loop Unrolling
   –  OpenCL kernels typically are high instruction-per-clock applications.
      Thus, the overhead to evaluate control-flow and execute branch
      instructions can consume a significant part of resources that otherwise
      can be used for high-throughput compute operations.
   –  The AMD Accelerated Parallel Processing OpenCL compiler performs
      simple loop unrolling optimizations; however, for more complex loop
      unrolling, it may be beneficial to do this manually.

•  If possible, create a reduced-size version of your data set for easier
   debugging and faster turn-around on performance experimentation. GPUs do
   not have automatic caching mechanisms and typically scale well as
   resources are added. In many cases, performance optimization for the
   reduced-size data implementation also benefits the full-size algorithm.

•  When tuning an algorithm, it is often beneficial to code a simple but accurate
   algorithm that is retained and used for functional comparison. GPU tuning
   can be an iterative process, so success requires frequent experimentation,
   verification, and performance measurement.

• The profiling and analysis tools report statistics on a per-kernel granularity. To narrow a problem down further, it can be useful to remove or comment out sections of code, then re-run the timing and profiling tool.

• Avoid writing code with dynamic pointer assignment on the GPU. For example:

kernel void dyn_assign(global int* a, global int* b, global int* c)
{
    global int* d;
    size_t idx = get_global_id(0);
    if (idx & 1) {
        d = b;
    } else {
        d = c;
    }
    a[idx] = d[idx];
}


This is inefficient because the GPU compiler must know the base pointer that every load comes from, and in this situation the compiler cannot determine what ‘d’ points to. So, both b and c are assigned to the same GPU resource, removing the ability to do certain optimizations.

• If the algorithm allows changing the work-group size, it is possible to get better performance by using larger work-groups (more work-items in each work-group) because the work-group creation overhead is reduced. On the other hand, the OpenCL CPU runtime uses a task-stealing algorithm at the work-group level, so when kernel execution time varies because the kernel contains conditions and/or loops with varying numbers of iterations, it might be better to increase the number of work-groups. This gives the runtime more flexibility in scheduling work-groups to idle CPU cores. Experimentation might be needed to find the optimal work-group size.

• Since the AMD OpenCL runtime supports only in-order queuing, using clFinish() on a queue after a non-blocking command gives the same result as queuing a blocking command. The latter saves the overhead of another API call.
For example:

clEnqueueWriteBuffer(myCQ, buff, CL_FALSE, 0, buffSize, input, 0, NULL, NULL);
clFinish(myCQ);

is equivalent, for the AMD OpenCL runtime, to:

clEnqueueWriteBuffer(myCQ, buff, CL_TRUE, 0, buffSize, input, 0, NULL, NULL);

5.8.4 Guidance for CUDA Programmers Using OpenCL
• Porting from CUDA to OpenCL is relatively straightforward. Multiple vendors have documents describing how to do this, including AMD:
  http://developer.amd.com/documentation/articles/pages/OpenCL-and-the-ATI-Stream-v2.0-Beta.aspx#four

• Some specific performance recommendations that differ from other GPU architectures:
  – Use a work-group size that is a multiple of 64. CUDA code can use a work-group size of 32; this uses only half the available compute resources on an AMD Radeon™ HD 7970 GPU.
  – AMD GPUs have a very high single-precision flops capability (3.788 teraflops in a single AMD Radeon™ HD 7970 GPU). Algorithms that benefit from such throughput can deliver excellent performance on AMD Accelerated Parallel Processing hardware.

5.8.5 Guidance for CPU Programmers Using OpenCL to Program GPUs
OpenCL is the industry-standard toolchain for programming GPUs and parallel devices from many vendors. It is expected that many programmers skilled in CPU programming will program GPUs for the first time using OpenCL. This section provides some guidance for experienced programmers who are programming a GPU for the first time. It specifically highlights the key differences in optimization strategy.

• Study the local memory (LDS) optimizations. These greatly affect GPU performance. Note the difference in the organization of local memory on the GPU as compared to the CPU cache. Local memory is shared by many work-items (64 on Tahiti). This contrasts with a CPU cache, which normally is dedicated to a single work-item. GPU kernels run well when they collaboratively load the shared memory.

• GPUs have a large amount of raw compute horsepower, compared to memory bandwidth and to “control flow” bandwidth. This leads to some high-level differences in GPU programming strategy.
  – A CPU-optimized algorithm may test branching conditions to minimize the workload. On a GPU, it is frequently faster simply to execute the workload.
  – A CPU-optimized version can use memory to store and later load precomputed values. On a GPU, it frequently is faster to recompute values rather than saving them in registers. Per-thread registers are a scarce resource on the CPU; in contrast, GPUs have many available per-thread register resources.
• Use float4 and the OpenCL built-ins for vector types (vload, vstore, etc.). These enable the AMD Accelerated Parallel Processing OpenCL implementation to generate efficient, packed SSE instructions when running on the CPU. Vectorization is an optimization that benefits both the AMD CPU and GPU.

5.8.6 Optimizing Kernel Code

5.8.6.1 Using Vector Data Types
The CPU contains a vector unit, which can be used efficiently if the developer writes the code using vector data types.
For architectures before Bulldozer, the instruction set is called SSE, and the vector width is 128 bits. For Bulldozer, the instruction set is called AVX, and the vector width is increased to 256 bits.
Using four-wide vector types (int4, float4, etc.) is preferred, even with Bulldozer.
5.8.6.2 Local Memory
The CPU does not benefit much from local memory; sometimes it is detrimental
to performance. As local memory is emulated on the CPU by using the caches,
accessing local memory and global memory are the same speed, assuming the
information from the global memory is in the cache.


5.8.6.3 Using Special CPU Instructions
The Bulldozer family of CPUs supports FMA4 instructions. Exchanging instructions of the form a*b + c with fma(a,b,c) or mad(a,b,c) allows the use of these special hardware instructions for multiplying and adding.
There also is hardware support for the OpenCL rotate() built-in function (a sketch follows the example below).
For example:

sum.x += tempA0.x * tempB0.x + tempA0.y * tempB1.x + tempA0.z * tempB2.x +
         tempA0.w * tempB3.x;

can be written as a composition of mad instructions that use fused multiply-add (FMA):

sum.x += mad(tempA0.x, tempB0.x, mad(tempA0.y, tempB1.x,
         mad(tempA0.z, tempB2.x, tempA0.w * tempB3.x)));
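Similarly, the OpenCL rotate() built-in maps bit rotation onto the hardware support mentioned above. A minimal hedged sketch (the variable x is hypothetical):

// (x << 8) | (x >> 24) on a 32-bit value, as a single hardware rotate:
uint y = rotate(x, 8u);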

5.8.6.4 Avoid Barriers When Possible
Using barriers in a kernel on the CPU causes a significant performance penalty compared to the same kernel without barriers. Use a barrier only if the kernel requires it for correctness, and consider changing the algorithm to reduce barrier usage.

5.8.7 Optimizing Kernels for Southern Islands GPUs

5.8.7.1 Remove Conditional Assignments
A conditional of the form “if-then-else” generates branching. Use the select() function to replace these structures with conditional assignments that do not cause branching. For example:

if (x == 1) r = 0.5;
if (x == 2) r = 1.0;

becomes

r = select(r, 0.5, x == 1);
r = select(r, 1.0, x == 2);

Note that if the body of the if statement contains an I/O, the if statement cannot be eliminated.
5.8.7.2 Bypass Short-Circuiting
A conditional expression with many terms can compile into nested conditional
code due to the C-language requirement that expressions must short circuit. To
prevent this, move the expression out of the control flow statement. For example:
if (a && b && c && d) { … }

becomes

bool cond = a && b && c && d;
if (cond) { … }

The same applies to conditional expressions used in loop constructs (do, while,
for).
5.8.7.3 Unroll Small Loops
If the loop bounds are known, and the loop is small (less than 16 or 32
instructions), unrolling the loop usually increases performance.
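For example, the #pragma unroll directive described in Section 6.10.1, “Loop Unroll pragma,” can request the unrolling explicitly (the loop body here is a hypothetical sketch):

#pragma unroll 8
for (int i = 0; i < 8; i++) {
    sum += a[gid + i];   // compile-time bounds let the compiler unroll fully
}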
5.8.7.4 Avoid Nested ifs
Because the GPU is a Vector ALU architecture, there is a cost to executing an
if-then-else block because both sides of the branch are evaluated, then one
result is retained while the other is discarded. When if blocks are nested, the
results are twice as bad; in general, if blocks are nested k levels deep, 2^k
nested conditional structures are generated. In this situation, restructure the code
to eliminate nesting.
5.8.7.5 Experiment With do/while/for Loops
for loops can generate more conditional code than equivalent do or while loops.
Experiment with these different loop types to find the one with best performance.

5.9 Specific Guidelines for Southern Islands GPUs
The AMD Southern Islands (SI) family of products is quite different from previous
generations. These are referred to as SI chips and are based on what is publicly
called Graphics Core Next.
SI compute units are much different than those of previous chips. With previous
generations, a compute unit (Vector ALU) was VLIW in nature, so four (Cayman
GPUs) or five (all other Evergreen/Northern Islands GPUs) instructions could be
packed into a single ALU instruction slot (called a bundle). It was not always easy
to schedule instructions to fill all of these slots, so achieving peak ALU utilization
was a challenge.
With SI GPUs, the compute units are now scalar; however, there now are four
Vector ALUs per compute unit. Each Vector ALU requires at least one wavefront
scheduled to it to achieve peak ALU utilization.
Along with the four Vector ALUs within a compute unit, there is also a scalar unit.
The scalar unit is used to handle branching instructions, constant cache
accesses, and other operations that occur per wavefront. The advantage to
having a scalar unit for each compute unit is that there are no longer large
penalties for branching, aside from thread divergence.
The instruction set for SI is scalar, as are GPRs. Also, the instruction set is no
longer clause-based. There are two types of GPRs: scalar GPRs (SGPRs) and
vector GPRs (VGPRs). Each Vector ALU has its own SGPR and VGPR pool.
There are 512 SGPRs and 256 VGPRs per Vector ALU. VGPRs handle all vector
instructions (any instruction that is handled per thread, such as v_add_f32, a
floating point add). SGPRs are used for scalar instructions: any instruction that
is executed once per wavefront, such as a branch, a scalar ALU instruction, and
constant cache fetches. (SGPRs are also used for constants, all buffer/texture
definitions, and sampler definitions; some kernel arguments are stored, at least
temporarily, in SGPRs.) SGPR allocation is in increments of eight, and VGPR
allocation is in increments of four. These increments also represent the minimum
allocation size of these resources.
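As a simplified, hedged estimate (ignoring other limits such as LDS and SGPR usage): a kernel that needs 84 VGPRs per work-item is allocated 84 VGPRs (already a multiple of four), so at most 256 / 84 = 3 wavefronts can be resident per Vector ALU; trimming the kernel to 64 VGPRs raises that estimate to 256 / 64 = 4 wavefronts.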
Typical vector instructions execute in four cycles; typical scalar ALU instructions
in one cycle. This allows each compute unit to execute one Vector ALU and one
scalar ALU instruction every four clocks (each compute unit is offset by one cycle
from the previous one).
All Southern Islands GPUs have double-precision support. For Tahiti (AMD Radeon™ HD 79XX series), double-precision adds run at one-half the single-precision add rate. Double-precision multiplies and MAD instructions run at one-quarter the floating-point rate.
The double-precision rate of Pitcairn (AMD Radeon™ HD 78XX series) and Cape Verde (AMD Radeon™ HD 77XX series) is one quarter that of Tahiti. This also affects the performance of single-precision fused multiply-add (FMA).
Similar to previous generations, local data share (LDS) is a shared resource within a compute unit. The maximum LDS allocation size for a work-group is still 32 kB; however, each compute unit has a total of 64 kB of LDS. On SI GPUs, LDS memory has 32 banks; thus, it is important to be aware of LDS bank conflicts on half-wavefront boundaries. The allocation granularity for LDS is 256 bytes; the minimum size is 0 bytes. It is much easier to achieve high LDS bandwidth use on SI hardware.
L1 cache is still shared within a compute unit. The size has now increased to
16 kB per compute unit for all SI GPUs. The caches now are read/write, so
sharing data between work-items in a work-group (for example, when LDS does
not suffice) is much faster.
It is possible to schedule a maximum of 10 wavefronts per compute unit, assuming there are no limitations from other resources, such as registers or local memory; but there is a limit of 16 work-groups per compute unit if the work-groups are larger than a single wavefront. If the dispatch is larger than what can fit at once on the GPU, the GPU schedules new work-groups as others finish.
Since there are no more clauses in the SI instruction set architecture (ISA), the
compiler inserts “wait” commands to indicate that the compute unit needs the
results of a memory operation before proceeding. If the scalar unit determines
that a wait is required (the data is not yet ready), the Vector ALU can switch to
another wavefront. There are different types of wait commands, depending on
the memory access.


Notes:
• Vectorization is no longer needed, nor desirable; in fact, it can affect performance because it requires a greater number of VGPRs for storage. It is recommended not to combine work-items.

• Register spilling is no greater a problem with four wavefronts per work-group than it is with one wavefront per work-group. This is because each wavefront has the same number of SGPRs and VGPRs available in either case.

• Read coalescing does not work for 64-bit data sizes. This means reads for float2, int2, and double might be slower than expected.

• Work-groups with 256 work-items can be used to ensure that each compute unit is being used. Barriers now are much faster.

• The engine is wider than previous generations; this means larger dispatches are required to keep all the compute units busy.

• A single wavefront can take twice as long to execute compared to previous generations (assuming it is ALU bound). This is because GPUs with VLIW-4 could execute the four instructions in a VLIW bundle in eight clocks (typical), while SI GPUs can execute one vector instruction in four clocks (typical).

• Execution of kernel dispatches can overlap if there are no dependencies between them and if there are resources available in the GPU. This is critical when writing benchmarks: it is important that the measurements are accurate and that “false dependencies” do not cause unnecessary slowdowns.
An example of a false dependency is:
a. Application creates a kernel “foo”.
b. Application creates input and output buffers.
c. Application binds input and output buffers to kernel “foo”.
d. Application repeatedly dispatches “foo” with the same parameters.
If the output data is the same each time, then this is a false dependency because there is no reason to stall concurrent execution of dispatches. To avoid stalls, use multiple output buffers. The number of buffers required to get peak performance depends on the kernel.
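A minimal host-side sketch of the multiple-output-buffer technique follows (the buffer, queue, and kernel names are hypothetical); alternating between two output buffers removes the write-after-write hazard between back-to-back dispatches:

cl_mem outBuf[2];        // two output buffers, created earlier with clCreateBuffer
size_t globalSize = 1024 * 1024;
int numLaunches = 100;

for (int i = 0; i < numLaunches; i++) {
    // Rebind the output argument of "foo" on every dispatch.
    clSetKernelArg(fooKernel, 1, sizeof(cl_mem), &outBuf[i % 2]);
    clEnqueueNDRangeKernel(queue, fooKernel, 1, NULL, &globalSize,
                           NULL, 0, NULL, NULL);
}
clFinish(queue);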
Table 5.4 compares the resource limits for Northern Islands and Southern Islands GPUs.

Table 5.4  Resource Limits for Northern Islands and Southern Islands

                     VLIW    VGPRs        SGPRs   LDS     LDS Max   L1$/CU   L2$/Channel
                     Width                        Size    Alloc
  Northern Islands   4       256          -       32 kB   32 kB     8 kB     64 kB
                             (128-bit)
  Southern Islands   1       256          512     64 kB   32 kB     16 kB    64 kB
                             (32-bit)

Figure 5.5 provides a simplified picture showing the Northern Islands compute unit arrangement.

Figure 5.5  Northern Islands Compute Unit Arrangement (four VLIW slots X, Y, Z, and W, a texture unit, and LDS)

Figure 5.6 provides a simplified picture showing the Southern Islands compute unit arrangement.

Figure 5.6  Southern Islands Compute Unit Arrangement (four Vector ALUs, a scalar unit, a texture unit, and LDS)


Chapter 6
OpenCL Performance and Optimization for Evergreen and Northern Islands Devices

This chapter discusses performance and optimization when programming for AMD Accelerated Parallel Processing GPU compute devices that are part of the Evergreen and Northern Islands families, as well as CPUs and multiple devices. Details specific to the Southern Islands family of GPUs are provided in Chapter 5, “OpenCL Performance and Optimization for Southern Islands Devices.”

6.1 Global Memory Optimization
Figure 6.1 is a block diagram of the GPU memory system. The up arrows are read paths; the down arrows are write paths. WC is the write-combine cache.
The GPU consists of multiple compute units. Each compute unit contains 32 kB of local (on-chip) memory, L1 cache, registers, and 16 processing elements (PEs). Each processing element contains a five-way (or four-way, depending on the GPU type) VLIW processor. Individual work-items execute on a single processing element; one or more work-groups execute on a single compute unit. On a GPU, hardware schedules the work-items. On the ATI Radeon™ HD 5000 series of GPUs, hardware schedules groups of work-items, called wavefronts, onto stream cores; thus, work-items within a wavefront execute in lock-step; the same instruction is executed on different data.
The L1 cache is 8 kB per compute unit. (For the ATI Radeon™ HD 5870 GPU, this means 160 kB for the 20 compute units.) The L1 cache bandwidth on the ATI Radeon™ HD 5870 GPU is one terabyte per second:

  L1 Bandwidth = Compute Units * Wavefront Size/Compute Unit * EngineClock

Multiple compute units share L2 caches. The L2 cache size on the ATI Radeon™ HD 5870 GPUs is 512 kB:

  L2 Cache Size = Number of channels * L2 per Channel

The bandwidth between L1 caches and the shared L2 cache is 435 GB/s:

  L2 Bandwidth = Number of channels * Wavefront Size * Engine Clock


Figure 6.1  Memory System. Each compute unit (16 PEs and LDS) has its own L1 cache and connects through a compute unit <> memory channel crossbar to the memory channels. Each channel provides a FastPath and a CompletePath (atomics) through its L2 cache and write-combine (WC) cache; a 256-byte region of address space maps to channel ((Address / 256) % n).

The ATI Radeon™ HD 5870 GPU has eight memory controllers (“Memory
Channel” in Figure 6.1). The memory controllers are connected to multiple banks
of memory. The memory is GDDR5, with a clock speed of 1200 MHz and a data
rate of 4800 Mb/pin. Each channel is 32-bits wide, so the peak bandwidth for the
ATI Radeon™ HD 5870 GPU is:
(8 memory controllers) * (4800 Mb/pin) * (32 bits) * (1 B/8b) = 154 GB/s
The peak memory bandwidth of your device is available in Appendix D, “Device
Parameters.”
If two memory access requests are directed to the same controller, the hardware
serializes the access. This is called a channel conflict. Similarly, if two memory
access requests go to the same memory bank, hardware serializes the access.
This is called a bank conflict. From a developer’s point of view, there is not much
difference between channel and bank conflicts. A large power of two stride
results in a channel conflict; a larger power of two stride results in a bank conflict.
The size of the power of two stride that causes a specific type of conflict depends
on the chip. A stride that results in a channel conflict on a machine with eight
channels might result in a bank conflict on a machine with four.
In this document, the term bank conflict is used to refer to either kind of conflict.

6.1.1 Two Memory Paths
ATI Radeon™ HD 5000 series graphics processors have two independent memory paths between the compute units and the memory:

• FastPath performs only basic operations, such as loads and stores (data sizes must be a multiple of 32 bits). This often is faster and preferred when there are no advanced operations.
• CompletePath supports additional advanced operations, including atomics and sub-32-bit (byte/short) data transfers.

6.1.1.1 Performance Impact of FastPath and CompletePath
There is a large difference in performance on ATI Radeon™ HD 5000 series
hardware between FastPath and CompletePath. Figure 6.2 shows two kernels
(one FastPath, the other CompletePath) and the delivered DRAM bandwidth for
each kernel on the ATI Radeon™ HD 5870 GPU. Note that an atomic add forces
CompletePath.
Figure 6.2  FastPath (blue) vs CompletePath (red) Using float1 (delivered DRAM bandwidth, in MB/s, versus copy size)

The kernel code follows. Note that the atomic extension must be enabled under
OpenCL 1.0.

__kernel void
CopyFastPath(__global const float *input,
             __global float *output)
{
    int gid = get_global_id(0);
    output[gid] = input[gid];
    return;
}

__kernel void
CopyComplete(__global const float *input, __global float *output)
{
    int gid = get_global_id(0);
    if (gid < 0) {
        atom_add((__global int *)output, 1);
    }
    output[gid] = input[gid];
    return;
}

Table 6.1 lists the effective bandwidth and ratio to maximum bandwidth.

Table 6.1  Bandwidths for 1D Copies

  Kernel               Effective Bandwidth   Ratio to Peak Bandwidth
  copy 32-bit 1D FP    96 GB/s               63%
  copy 32-bit 1D CP    18 GB/s               12%

The difference in performance between FastPath and CompletePath is
significant. If your kernel uses CompletePath, consider if there is another way to
approach the problem that uses FastPath. OpenCL read-only images always use
FastPath.
6.1.1.2 Determining The Used Path
Since the path selection is done automatically by the OpenCL compiler, your
kernel may be assigned to CompletePath. This section explains the strategy the
compiler uses, and how to find out what path was used.
The compiler is conservative when it selects memory paths. The compiler often maps all user data into a single unordered access view (UAV),¹ so a single atomic operation (even one that is not executed) may force all loads and stores to use CompletePath.
The effective bandwidth listing above shows two OpenCL kernels and the
associated performance. The first kernel uses the FastPath while the second
uses the CompletePath. The second kernel is forced to CompletePath because
in CopyComplete, the compiler noticed the use of an atomic.
There are two ways to find out which path is used. The first method uses the
AMD APP Profiler, which provides the following three performance counters for
this purpose:

1. UAVs allow compute shaders to store results in (or write results to) a buffer at any arbitrary location.
On DX11 hardware, UAVs can be created from buffers and textures. On DX10 hardware, UAVs cannot be created from typed resources (textures). This is the same as a random access target (RAT).

1. FastPath counter: The total bytes written through the FastPath (no atomics,
32-bit types only).
2. CompletePath counter: The total bytes read and written through the
CompletePath (supports atomics and non-32-bit types).
3. PathUtilization counter: The percentage of bytes read and written through the
FastPath or CompletePath compared to the total number of bytes transferred
over the bus.
The second method is static and lets you determine the path by looking at a
machine-level ISA listing (using the AMD APP KernelAnalyzer in OpenCL).
MEM_RAT_CACHELESS -> FastPath
MEM_RAT -> CompPath
MEM_RAT_NOP_RTN -> Comp_load

FastPath operations appear in the listing as:
...
TEX: ...
... VFETCH ...
... MEM_RAT_CACHELESS_STORE_RAW: ...
...

The vfetch instruction is a load type that, in graphics terms, is called a vertex fetch (the group control TEX indicates that the load uses the L1 cache).
The instruction MEM_RAT_CACHELESS indicates that FastPath operations are used.
Loads in CompletePath are a split-phase operation. In the first phase, hardware
copies the old value of a memory location into a special buffer. This is done by
performing atomic operations on the memory location. After the value has
reached the buffer, a normal load is used to read the value. Note that RAT stands
for random access target, which is the same as an unordered access view (UAV);
it allows, on DX11 hardware, writes to, and reads from, any arbitrary location in
a buffer.
The listing shows:

.. MEM_RAT_NOP_RTN_ACK: RAT(1)
.. WAIT_ACK: Outstanding_acks <= 0
.. TEX: ADDR(64) CNT(1)
.. VFETCH ...

The instruction sequence means the following:

  MEM_RAT    Read into a buffer using CompletePath, do no operation on the
             memory location, and send an ACK when done.
  WAIT_ACK   Suspend execution of the wavefront until the ACK is received.
             If there is other work pending, this might be free; but if there
             is no other work to be done, this could take 100’s of cycles.
  TEX        Use the L1 cache for the next instruction.
  VFETCH     Do a load instruction to (finally) get the value.


Stores appear as:
.. MEM_RAT_STORE_RAW: RAT(1)

The instruction MEM_RAT_STORE is the store along the CompletePath.
MEM_RAT means CompletePath; MEM_RAT_CACHELESS means FastPath.

6.1.2 Channel Conflicts
The important concept is memory stride: the increment in memory address,
measured in elements, between successive elements fetched or stored by
consecutive work-items in a kernel. Many important kernels do not exclusively
use simple stride one accessing patterns; instead, they feature large non-unit
strides. For instance, many codes perform similar operations on each dimension
of a two- or three-dimensional array. Performing computations on the low
dimension can often be done with unit stride, but the strides of the computations
in the other dimensions are typically large values. This can result in significantly
degraded performance when the codes are ported unchanged to GPU systems.
A CPU with caches presents the same problem: large power-of-two strides force data into only a few cache lines.
One solution is to rewrite the code to employ array transpositions between the
kernels. This allows all computations to be done at unit stride. Ensure that the
time required for the transposition is relatively small compared to the time to
perform the kernel calculation.
For many kernels, the reduction in performance is sufficiently large that it is
worthwhile to try to understand and solve this problem.
In GPU programming, it is best to have adjacent work-items read or write
adjacent memory addresses. This is one way to avoid channel conflicts.
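As a hedged sketch (the kernel names are hypothetical), the first kernel below generates unit-stride accesses, so adjacent work-items touch adjacent addresses and spread across all channels; the second uses a large power-of-two stride, so consecutive work-items land on the same channel:

__kernel void copy_unit_stride(__global const float *in, __global float *out)
{
    size_t gid = get_global_id(0);
    out[gid] = in[gid];          // adjacent work-items, adjacent addresses
}

__kernel void copy_large_stride(__global const float *in, __global float *out)
{
    size_t gid = get_global_id(0);
    size_t idx = gid * 16384;    // 64 kB apart: every access maps to one channel
    out[idx] = in[idx];
}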
When the application has complete control of the access pattern and address
generation, the developer must arrange the data structures to minimize bank
conflicts. Accesses that differ in the lower bits can run in parallel; those that differ
only in the upper bits can be serialized.
6.7.4 Synchronization Caveats
A host thread that must wait for a kernel or transfer to complete can poll the associated event instead of blocking in clWaitForEvents():

while (eventStatus > 0)
{
    error_ = clGetEventInfo(event, CL_EVENT_COMMAND_EXECUTION_STATUS,
                            sizeof(cl_int), &eventStatus, NULL);
    Sleep(0);  // be nice to other threads, allow scheduler to find other
               // work if possible; choose your favorite way to yield,
               // SwitchToThread() for example, in place of Sleep(0)
}

6.7.5 GPU and CPU Kernels
While OpenCL provides functional portability so that the same kernel can run on
any device, peak performance for each device is typically obtained by tuning the
OpenCL kernel for the target device.
Code optimized for the Cypress device (the ATI Radeon™ HD 5870 GPU)
typically runs well across other members of the Evergreen family. There are
some differences in cache size and LDS bandwidth that might impact some
kernels (see Appendix D, “Device Parameters”). The Cedar ASIC has a smaller
wavefront width and fewer registers (see Section 6.6.4, “Optimizing for Cedar,”
page 6-32, for optimization information specific to this device).
As described in Section 6.9, “Clause Boundaries,” page 6-46, CPUs and GPUs
have very different performance characteristics, and some of these impact how
one writes an optimal kernel. Notable differences include:

• The Vector ALU floating-point resources in a CPU (SSE) require the use of vectorized types (float4) to enable packed SSE code generation and extract good performance from the Vector ALU hardware. The GPU VLIW hardware is more flexible and can efficiently use the floating-point hardware even without the explicit use of float4. See Section 6.8.4, “VLIW and SSE Packing,” page 6-43, for more information and examples; however, code that can use float4 often generates high-quality code for both the CPU and the AMD GPUs.
• The AMD OpenCL CPU implementation runs work-items from the same work-group back-to-back on the same physical CPU core. For optimally coalesced memory patterns, a common access pattern for GPU-optimized algorithms is for work-items in the same wavefront to access memory locations from the same cache line. On a GPU, these work-items execute in parallel and generate a coalesced access pattern. On a CPU, the first work-item runs to completion (or until hitting a barrier) before switching to the next. Generally, if the working set for the data used by a work-group fits in the CPU caches, this access pattern can work efficiently: the first work-item brings a line into the cache hierarchy, which the other work-items later hit. For large working sets that exceed the capacity of the cache hierarchy, this access pattern does not work as efficiently; each work-item refetches cache lines that were already brought in by earlier work-items but were evicted from the cache hierarchy before being used. Note that AMD CPUs typically provide 512 kB to 2 MB of L2+L3 cache for each compute unit.

• CPUs do not contain any hardware resources specifically designed to accelerate local memory accesses. On a CPU, local memory is mapped to the same cacheable DRAM used for global memory, and there is no performance benefit from using the __local qualifier. The additional memory operations to write to LDS, and the associated barrier operations, can reduce performance. One notable exception is when local memory is used to pack values to avoid non-coalesced memory patterns.

• CPU devices support only a small number of hardware threads, typically two to eight. Small work-group sizes reduce the CPU switching overhead, although for larger kernels this is a second-order effect.

For a balanced solution that runs reasonably well on both devices, developers
are encouraged to write the algorithm using float4 vectorization. The GPU is
more sensitive to algorithm tuning; it also has higher peak performance potential.
Thus, one strategy is to target optimizations to the GPU and aim for reasonable
performance on the CPU. For peak performance on all devices, developers can
choose to use conditional compilation for key code loops in the kernel, or in some
cases even provide two separate kernels. Even with device-specific kernel
optimizations, the surrounding host code for allocating memory, launching
kernels, and interfacing with the rest of the program generally only needs to be
written once.
Another approach is to leverage a CPU-targeted routine written in a standard high-level language, such as C++. In some cases, this code path may already exist for platforms that do not support an OpenCL device. The program uses OpenCL for GPU devices, and the standard routine for CPU devices. Load-balancing between devices can still leverage the techniques described in Section 6.7.3, “Partitioning Work for Multiple Devices,” page 6-35.
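As a hedged sketch of the conditional-compilation approach mentioned above (the TARGET_GPU macro and variable names are hypothetical), the host selects a code path per device through a -D build option:

// Host side: build the same source differently for each device.
clBuildProgram(program, 1, &gpuDevice, "-DTARGET_GPU", NULL, NULL);

// Kernel side:
__kernel void process(__global float4 *data)
{
    size_t gid = get_global_id(0);
#ifdef TARGET_GPU
    data[gid] = data[gid] * 2.0f + 1.0f;                     // GPU-tuned path
#else
    data[gid] = mad(data[gid], (float4)2.0f, (float4)1.0f);  // CPU float4 path
#endif
}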

6.7.6 Contexts and Devices
The AMD OpenCL program creates at least one context, and each context can
contain multiple devices. Thus, developers must choose whether to place all
devices in the same context or create a new context for each device. Generally,
it is easier to extend a context to support additional devices rather than
duplicating the context for each device: buffers are allocated at the context level
(and automatically across all devices), programs are associated with the context,
and kernel compilation (via clBuildProgram) can easily be done for all devices
in a context. However, with current OpenCL implementations, creating a separate
context for each device provides more flexibility, especially in that buffer
allocations can be targeted to occur on specific devices. Generally, placing the
devices in the same context is the preferred solution.
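A minimal sketch of the single-context approach (error handling omitted; variable names hypothetical):

cl_device_id devices[2];
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &devices[0], NULL);
clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &devices[1], NULL);

// One context spanning both devices: buffers and programs are shared.
cl_context ctx = clCreateContext(NULL, 2, devices, NULL, NULL, NULL);
cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
clBuildProgram(prog, 2, devices, NULL, NULL, NULL);  // compiles for both

// Each device still needs its own command queue.
cl_command_queue gpuQ = clCreateCommandQueue(ctx, devices[0], 0, NULL);
cl_command_queue cpuQ = clCreateCommandQueue(ctx, devices[1], 0, NULL);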


6.8 Instruction Selection Optimizations

6.8.1 Instruction Bandwidths
Table 6.10 lists the throughput of instructions for GPUs.
Table 6.10  Instruction Throughput (Operations/Cycle for Each Stream Processor)

                                         Rate (Operations/Cycle) for each Stream Processor
                                         Double-Precision-Capable   Non-Double-Precision-
  Instruction                            Devices (Evergreen         Capable Devices
                                         and later)                 (Evergreen and later)
  Single-Precision  SPFP FMA             4                          0
  FP Rates          SPFP MAD             5                          5
                    ADD                  5                          5
                    MUL                  5                          5
                    INV                  1                          1
                    RQSRT                1                          1
                    LOG                  1                          1
  Double-Precision  FMA                  1                          0
  FP Rates          MAD                  1                          0
                    ADD                  2                          0
                    MUL                  1                          0
                    INV (approx.)        1                          0
                    RQSRT (approx.)      1                          0
  Integer           MAD                  1                          1
  Instruction       ADD                  5                          5
  Rates             MUL                  1                          1
                    Bit-shift            5                          5
                    Bitwise XOR          5                          5
  Conversion        Float-to-Int         1                          1
                    Int-to-Float         1                          1
  24-Bit Integer    MAD                  5                          5
  Inst Rates        ADD                  5                          5
                    MUL                  5                          5

Note that single-precision MAD operations have five times the throughput of the double-precision rate, and that double-precision is only supported on the AMD Radeon™ HD 69XX devices. The use of single-precision calculation is encouraged, if that precision is acceptable. Single-precision data is also half the size of double-precision, which requires less chip bandwidth and is not as demanding on the cache structures.


Generally, the throughput and latency for 32-bit integer operations is the same
as for single-precision floating point operations.
24-bit integer MULs and MADs have five times the throughput of 32-bit integer
multiplies. 24-bit unsigned integers are natively supported only on the Evergreen
family of devices and later. Signed 24-bit integers are supported only on the
Northern Island family of devices and later. The use of OpenCL built-in functions
for mul24 and mad24 is encouraged. Note that mul24 can be useful for array
indexing operations.
Packed 16-bit and 8-bit operations are not natively supported; however, in cases where it is known that no overflow will occur, some algorithms may be able to effectively pack 2 to 4 values into the 32-bit registers natively supported by the hardware.
The MAD instruction is an IEEE-compliant multiply followed by an IEEE-compliant add; it has the same accuracy as two separate MUL/ADD operations. No special compiler flags are required for the compiler to convert separate MUL/ADD operations to use the MAD instruction.
Table 6.10 shows the throughput for each stream processing core. To obtain the peak throughput for the whole device, multiply by the number of stream cores and the engine clock (see Appendix D, “Device Parameters”). For example, according to Table 6.10, a Cypress device can perform two double-precision ADD operations/cycle in each stream core. From Appendix D, “Device Parameters,” an ATI Radeon™ HD 5870 GPU has 320 stream cores and an engine clock of 850 MHz, so the entire GPU has a throughput rate of (2 * 320 * 850 MHz) = 544 GFlops for double-precision adds.

6.8.2 AMD Media Instructions
AMD provides a set of media instructions for accelerating media processing. Notably, the sum-of-absolute-differences (SAD) operation is widely used in motion estimation algorithms. For a brief listing and description of the AMD media operations, see the third bullet in Section A.8, “AMD Vendor-Specific Extensions,” page A-4.

6.8.3 Math Libraries
OpenCL supports two types of math library operation: native_function() and function(). Native functions are generally supported in hardware and can run substantially faster, although at somewhat lower accuracy. The accuracy for the non-native functions is specified in section 7.4 of the OpenCL Specification. The accuracy for the native functions is implementation-defined. Developers are encouraged to use the native functions when performance is more important than precision.
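For example, switching from the full-precision to the native version is a one-line change (a hedged sketch; the kernel is hypothetical):

__kernel void eval_sin(__global const float *x, __global float *y)
{
    size_t gid = get_global_id(0);
    // Full-precision version: y[gid] = sin(x[gid]);
    y[gid] = native_sin(x[gid]);   // faster; accuracy is implementation-defined
}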


Table 6.11 lists the native speedup factor for certain functions.

Table 6.11  Native Speedup Factor

  Function    Native Speedup Factor
  sin()       27.1x
  cos()       34.2x
  tan()       13.4x
  exp()       4.0x
  exp2()      3.4x
  exp10()     5.2x
  log()       12.3x
  log2()      11.3x
  log10()     12.8x
  sqrt()      1.8x
  rsqrt()     6.4x
  powr()      28.7x
  divide()    4.4x

6.8.4 VLIW and SSE Packing
Each stream core in the AMD GPU is programmed with a five-wide (or four-wide, depending on the GPU type) VLIW instruction. Efficient use of the GPU hardware requires that the kernel contain enough parallelism to fill all five processing elements; serial dependency chains are scheduled into separate instructions. A classic technique for exposing more parallelism to the compiler is loop unrolling.
To assist the compiler in disambiguating memory addresses so that loads can be combined, developers should cluster load and store operations. In particular, re-ordering the code to place stores in adjacent code lines can improve performance. Figure 6.7 shows an example of unrolling a loop and then clustering the stores.
__kernel void loopKernel1A(int loopCount,
                           global float *output,
                           global const float * input)
{
    uint gid = get_global_id(0);
    for (int i = 0; i < loopCount; i += 1) {
        float Velm0 = (input[i] * 6.0f + 17.0f);
        output[gid + i] = Velm0;
    }
}

6.8.5 Compiler Optimizations
The compiler recognizes certain patterns and transforms them into single instructions:
• Bitfield extract on signed/unsigned integers:
  (A >> B) & C ==> [u]bit_extract
  where
  – B and C are compile-time constants,
  – A is a 8/16/32-bit integer type, and
  – C is a mask.
• Bitfield insert on signed/unsigned integers:
  ((A & B) << C) | ((D & E) << F) ==> ubit_insert
  where
  – B and E have no conflicting bits (B^E == 0),
  – B, C, E, and F are compile-time constants, and
  – B and E are masks.
  – The first bit set in B is greater than the number of bits in E plus the first bit set in E, or the first bit set in E is greater than the number of bits in B plus the first bit set in B.
  – If B, C, E, or F are equivalent to the value 0, this optimization is also supported.
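As a hedged illustration of the extract pattern (the variable a is hypothetical): because the shift amount and the mask below are compile-time constants, the compiler can emit a single [u]bit_extract instruction:

uint field = (a >> 4) & 0x3F;   // extracts bits 4..9 of a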


6.9 Clause Boundaries
AMD GPUs group instructions into clauses. These are broken at control-flow boundaries when:
• the instruction type changes (for example, from FETCH to ALU), or
• the clause contains the maximum number of operations (the maximum size for an ALU clause is 128 operations).
ALU and LDS access instructions are placed in the same clause. FETCH, ALU/LDS, and STORE instructions are placed into separate clauses.
The GPU schedules a pair of wavefronts (referred to as the “even” and “odd”
wavefront). The even wavefront executes for four cycles (each cycle executes a
quarter-wavefront); then, the odd wavefront executes for four cycles. While the
odd wavefront is executing, the even wavefront accesses the register file and
prepares operands for execution. This fixed interleaving of two wavefronts allows
the hardware to efficiently hide the eight-cycle register-read latencies.
With the exception of the special treatment for even/odd wavefronts, the GPU
scheduler only switches wavefronts on clause boundaries. Latency within a
clause results in stalls on the hardware. For example, a wavefront that generates
an LDS bank conflict stalls on the compute unit until the LDS access completes;
the hardware does not try to hide this stall by switching to another available
wavefront.
ALU dependencies on memory operations are handled at the clause level.
Specifically, an ALU clause can be marked as dependent on a FETCH clause.
All FETCH operations in the clause must complete before the ALU clause begins
execution.
Switching to another clause in the same wavefront requires approximately 40
cycles. The hardware immediately schedules another wavefront if one is
available, so developers are encouraged to provide multiple wavefronts/compute
unit. The cost to switch clauses is far less than the memory latency; typically, if
the program is designed to hide memory latency, it hides the clause latency as
well.
The address calculations for FETCH and STORE instructions execute on the same hardware in the compute unit as do the ALU clauses. The address calculations for memory operations consume the same execution resources that are used for floating-point computations.

The ISA dump shows the clause boundaries; see the example shown below.

For more information on clauses, see the AMD Evergreen-Family ISA Microcode
And Instructions (v1.0b) and the AMD R600/R700/Evergreen Assembly
Language Format documents.
The following is an example disassembly showing clauses. There are 13 clauses
in the kernel. The first clause is an ALU clause and has 6 instructions.


00 ALU_PUSH_BEFORE: ADDR(32) CNT(13) KCACHE0(CB1:0-15) KCACHE1(CB0:0-15)
      0  x: MOV         R3.x,  KC0[0].x
         y: MOV         R2.y,  KC0[0].y
         z: MOV         R2.z,  KC0[0].z
         w: MOV         R2.w,  KC0[0].w
      1  x: MOV         R4.x,  KC0[2].x
         y: MOV         R2.y,  KC0[2].y
         z: MOV         R2.z,  KC0[2].z
         w: MOV         R2.w,  KC0[2].w
         t: SETGT_INT   R5.x,  PV0.x,  0.0f
      2  t: MULLO_INT   ____,  R1.x,  KC1[1].x
      3  y: ADD_INT     ____,  R0.x,  PS2
      4  x: ADD_INT     R0.x,  PV3.y,  KC1[6].x
      5  x: PREDNE_INT  ____,  R5.x,  0.0f  UPDATE_EXEC_MASK UPDATE_PRED
01 JUMP  POP_CNT(1) ADDR(12)
02 ALU: ADDR(45) CNT(5) KCACHE0(CB1:0-15)
      6  z: LSHL        ____,  R0.x,  (0x00000002, 2.802596929e-45f).x
      7  y: ADD_INT     ____,  KC0[1].x,  PV6.z
      8  x: LSHR        R1.x,  PV7.y,  (0x00000002, 2.802596929e-45f).x
03 LOOP_DX10 i0 FAIL_JUMP_ADDR(11)
04 ALU: ADDR(50) CNT(4)
      9  x: ADD_INT     R3.x,  -1,  R3.x
         y: LSHR        R0.y,  R4.x,  (0x00000002, 2.802596929e-45f).x
         t: ADD_INT     R4.x,  R4.x,  (0x00000004, 5.605193857e-45f).y
05 WAIT_ACK:  Outstanding_acks <= 0
06 TEX: ADDR(64) CNT(1)
     10  VFETCH R0.x___, R0.y, fc156  MEGA(4)  FETCH_TYPE(NO_INDEX_OFFSET)
07 ALU: ADDR(54) CNT(3)
     11  x: MULADD_e    R0.x,  R0.x,  (0x40C00000, 6.0f).y,  (0x41880000, 17.0f).x
         t: SETE_INT    R2.x,  R3.x,  0.0f
08 MEM_RAT_CACHELESS_STORE_RAW_ACK: RAT(1)[R1].x___, R0, ARRAY_SIZE(4)  MARK  VPM
09 ALU_BREAK: ADDR(57) CNT(1)
     12  x: PREDE_INT   ____,  R2.x,  0.0f  UPDATE_EXEC_MASK UPDATE_PRED
10 ENDLOOP i0 PASS_JUMP_ADDR(4)
11 POP (1) ADDR(12)
12 NOP NO_BARRIER
END_OF_PROGRAM


6.10 Additional Performance Guidance
This section is a collection of performance tips for GPU compute and AMD-specific optimizations.

6.10.1 Loop Unroll pragma
The compiler directive #pragma unroll <unroll-factor> can be placed immediately prior to a loop as a hint to the compiler to unroll the loop. <unroll-factor> must be a positive integer, 1 or greater. When <unroll-factor> is 1, loop unrolling is disabled. When <unroll-factor> is 2 or greater, the compiler uses this as a hint for the number of times the loop is to be unrolled.
Examples of using this pragma follow.
No unrolling example:

#pragma unroll 1
for (int i = 0; i < n; i++) {
    ...
}

Partial unrolling example:

#pragma unroll 4
for (int i = 0; i < 128; i++) {
    ...
}

Currently, the unroll pragma requires that the loop boundaries can be determined
at compile time. Both loop bounds must be known at compile time. If n is not
given, it is equivalent to the number of iterations of the loop when both loop
bounds are known. If the unroll-factor is not specified, and the compiler can
determine the loop count, the compiler fully unrolls the loop. If the unroll-factor is
not specified, and the compiler cannot determine the loop count, the compiler
does no unrolling.

6.10.2 Memory Tiling
There are many possible physical memory layouts for images. AMD Accelerated
Parallel Processing devices can access memory in a tiled or in a linear
arrangement.


• Linear – A linear layout format arranges the data linearly in memory such that element addresses are sequential. This is the layout that is familiar to CPU programmers. This format must be used for OpenCL buffers; it can be used for images.

• Tiled – A tiled layout format has a pre-defined sequence of element blocks arranged in sequential memory addresses (see Figure 6.11 for a conceptual illustration). A microtile consists of ABIJ; a macrotile consists of the top-left 16 squares for which the arrows are red. Only images can use this format. Translating from user address space to the tiled arrangement is transparent to the user. Tiled memory layouts provide an optimized memory access pattern to make more efficient use of the RAM attached to the GPU compute device. This can contribute to lower latency.
Figure 6.11  One Example of a Tiled Layout Format (physical order: A B C D E F G H / I J K L M N O P / Q R S T U V W X; logical order: A B C D I J K L / Q R S T E F G H / M N O P U V W X)
Memory Access Pattern –
Memory access patterns in compute kernels are usually different from those in
the pixel shaders. Whereas the access pattern for pixel shaders is in a
hierarchical, space-filling curve pattern and is tuned for tiled memory
performance (generally for textures), the access pattern for a compute kernel is
linear across each row before moving to the next row in the global id space. This
has an effect on performance, since pixel shaders have implicit blocking, and
compute kernels do not. If accessing a tiled image, best performance is achieved
if the application tries to use workgroups as a simple blocking strategy.

6.10.3 General Tips
• Avoid declaring global arrays on the kernel’s stack frame, as these typically cannot be allocated in registers and require expensive global memory operations.

• Use predication rather than control-flow. Predication allows the GPU to execute both paths in parallel, which can be faster than attempting to minimize the work through clever control-flow. The reason is that if no memory operation exists in a ?: operator (also called a ternary operator), this operation is translated into a single cmov_logical instruction, which is executed in a single cycle. An example of this is:

if (A > B) {
    C += D;
} else {
    C -= D;
}

Replace this with:

int factor = (A > B) ? 1 : -1;
C += factor * D;

In the first block of code, this translates into an IF/ELSE/ENDIF sequence of CF clauses, each taking ~40 cycles. The math inside the control flow adds two cycles if the control flow is divergent, and one cycle if it is not. This code executes in ~120 cycles.
In the second block of code, the ?: operator executes in an ALU clause, so no extra CF instructions are generated. Since the instructions are sequentially dependent, this block of code executes in three cycles, for a ~40x speed improvement. To see this: the first cycle is the (A>B) comparison, whose result is input to the second cycle, which is the cmov_logical (factor = bool ? 1 : -1). The final cycle is a MAD instruction: mad C, factor, D, C. If the ratio between CF clauses and ALU instructions is low, this is a good pattern for removing the control flow.

• Loop Unrolling
  – OpenCL kernels typically are high instruction-per-clock applications. Thus, the overhead to evaluate control-flow and execute branch instructions can consume a significant part of the resources that otherwise can be used for high-throughput compute operations.
  – The AMD Accelerated Parallel Processing OpenCL compiler performs simple loop unrolling optimizations; however, for more complex loop unrolling, it may be beneficial to do this manually (see also Section 6.10.1, “Loop Unroll pragma”).

• If possible, create a reduced-size version of your data set for easier debugging and faster turn-around on performance experimentation. GPUs do not have automatic caching mechanisms and typically scale well as resources are added. In many cases, performance optimization for the reduced-size data implementation also benefits the full-size algorithm.

• When tuning an algorithm, it is often beneficial to code a simple but accurate algorithm that is retained and used for functional comparison. GPU tuning can be an iterative process, so success requires frequent experimentation, verification, and performance measurement.

• The profiler and analysis tools report statistics on a per-kernel granularity. To narrow a problem down further, it can be useful to remove or comment out sections of code, then re-run the timing and profiling tool.

• Writing code with dynamic pointer assignment should be avoided on the GPU. For example:

kernel void dyn_assign(global int* a, global int* b, global int* c)
{
    global int* d;
    size_t idx = get_global_id(0);
    if (idx & 1) {
        d = b;
    } else {
        d = c;
    }
    a[idx] = d[idx];
}


This is inefficient because the GPU compiler must know the base pointer that every load comes from, and in this situation the compiler cannot determine what ‘d’ points to. So, both b and c are assigned to the same GPU resource, removing the ability to do certain optimizations.

• If the algorithm allows changing the work-group size, it is possible to get better performance by using larger work-groups (more work-items in each work-group) because the work-group creation overhead is reduced. On the other hand, the OpenCL CPU runtime uses a task-stealing algorithm at the work-group level, so when kernel execution time varies because the kernel contains conditions and/or loops with varying numbers of iterations, it might be better to increase the number of work-groups. This gives the runtime more flexibility in scheduling work-groups to idle CPU cores. Experimentation might be needed to find the optimal work-group size.

• Since the AMD OpenCL runtime supports only in-order queuing, using clFinish() on a queue after a non-blocking command gives the same result as queuing a blocking command. The latter saves the overhead of another API call.
For example:

clEnqueueWriteBuffer(myCQ, buff, CL_FALSE, 0, buffSize, input, 0, NULL, NULL);
clFinish(myCQ);

is equivalent, for the AMD OpenCL runtime, to:

clEnqueueWriteBuffer(myCQ, buff, CL_TRUE, 0, buffSize, input, 0, NULL, NULL);

6.10.4 Guidance for CUDA Programmers Using OpenCL
• Porting from CUDA to OpenCL is relatively straightforward. Multiple vendors have documents describing how to do this, including AMD:
  http://developer.amd.com/documentation/articles/pages/OpenCL-and-the-ATI-Stream-v2.0-Beta.aspx#four

• Some specific performance recommendations that differ from other GPU architectures:
  – Use a work-group size that is a multiple of 64. CUDA code can use a work-group size of 32; this uses only half the available compute resources on an ATI Radeon™ HD 5870 GPU.
  – Vectorization can lead to substantially greater efficiency. The ALUPacking counter provided by the Profiler can track how well the kernel code is using the five-wide (or four-wide, depending on the GPU type) VLIW unit. Values below 70 percent may indicate that dependencies are preventing the full use of the processor. For some kernels, vectorization can be used to increase efficiency and improve kernel performance.
  – AMD GPUs have a very high single-precision flops capability (2.72 teraflops in a single ATI Radeon™ HD 5870 GPU). Algorithms that benefit from such throughput can deliver excellent performance on AMD Accelerated Parallel Processing hardware.


6.10.5 Guidance for CPU Programmers Using OpenCL to Program GPUs
OpenCL is the industry-standard toolchain for programming GPUs and parallel
devices from many vendors. It is expected that many programmers skilled in
CPU programming will program GPUs for the first time using OpenCL. This
section provides some guidance for experienced programmers who are
programming a GPU for the first time. It specifically highlights the key differences
in optimization strategy.

• Study the local memory (LDS) optimizations. These greatly affect GPU performance. Note the difference in the organization of local memory on the GPU as compared to the CPU cache. Local memory is shared by many work-items (64 on Cypress). This contrasts with a CPU cache, which normally is dedicated to a single work-item. GPU kernels run well when they collaboratively load the shared memory.

• GPUs have a large amount of raw compute horsepower, compared to memory bandwidth and to “control flow” bandwidth. This leads to some high-level differences in GPU programming strategy.
  – A CPU-optimized algorithm may test branching conditions to minimize the workload. On a GPU, it is frequently faster simply to execute the workload.
  – A CPU-optimized version can use memory to store and later load precomputed values. On a GPU, it frequently is faster to recompute values rather than saving them in registers. Per-thread registers are a scarce resource on the CPU; in contrast, GPUs have many available per-thread register resources.
• Use float4 and the OpenCL built-ins for vector types (vload, vstore, etc.). These enable the AMD Accelerated Parallel Processing OpenCL implementation to generate efficient, packed SSE instructions when running on the CPU. Vectorization is an optimization that benefits both the AMD CPU and GPU.

6.10.6 Optimizing Kernel Code

6.10.6.1 Using Vector Data Types
The CPU contains a vector unit, which can be used efficiently if the developer writes the code using vector data types.
For architectures before Bulldozer, the instruction set is called SSE, and the vector width is 128 bits. For Bulldozer, the instruction set is called AVX, and the vector width is increased to 256 bits.
Using four-wide vector types (int4, float4, etc.) is preferred, even with Bulldozer.
6.10.6.2 Local Memory
The CPU does not benefit much from local memory; sometimes it is detrimental to performance. As local memory is emulated on the CPU by using the caches, accessing local memory and global memory are the same speed, assuming the information from the global memory is in the cache.
6.10.6.3 Using Special CPU Instructions
The Bulldozer family of CPUs supports FMA4 instructions. Exchanging instructions of the form a*b + c with fma(a,b,c) or mad(a,b,c) allows the use of these special hardware instructions for multiplying and adding.
There also is hardware support for the OpenCL rotate() built-in function.
For example:

sum.x += tempA0.x * tempB0.x + tempA0.y * tempB1.x + tempA0.z * tempB2.x +
         tempA0.w * tempB3.x;

can be written as a composition of mad instructions that use fused multiply-add (FMA):

sum.x += mad(tempA0.x, tempB0.x, mad(tempA0.y, tempB1.x,
         mad(tempA0.z, tempB2.x, tempA0.w * tempB3.x)));

6.10.6.4 Avoid Barriers When Possible
Using barriers in a kernel on the CPU causes a significant performance penalty compared to the same kernel without barriers. Use a barrier only if the kernel requires it for correctness, and consider changing the algorithm to reduce barrier usage.

6.10.7  Optimizing Kernels for Evergreen and 69XX-Series GPUs

6.10.7.1 Clauses
The architecture for the 69XX series of GPUs is clause-based. A clause is similar
to a basic block, a sequence of instructions that execute without flow control or
I/O. Processor efficiency is determined in large part by the number of instructions
in a clause, which is determined by the frequency of branching and I/O at the
source-code level. An efficient kernel averages at least 16 or 32 instructions per
clause.
The AMD APP KernelAnalyzer assembler listing lets you view clauses. Try the
optimizations listed here from inside the AMD APP KernelAnalyzer to see the
improvements in performance.
6.10.7.2 Remove Conditional Assignments
A conditional of the form “if-then-else” generates branching and thus generates
one or more clauses. Use the select() function to replace these structures with
conditional assignments that do not cause branching. For example:
if(x==1) r=0.5;
if(x==2) r=1.0;


becomes
r = select(r, 0.5, x==1);
r = select(r, 1.0, x==2);

Note that if the body of the if statement contains an I/O, the if statement cannot
be eliminated.
6.10.7.3 Bypass Short-Circuiting
A conditional expression with many terms can compile into a number of clauses
due to the C-language requirement that expressions must short circuit. To
prevent this, move the expression out of the control flow statement. For example:
if(a&&b&&c&&d){…}

becomes
bool cond = a&&b&&c&&d;
if(cond){…}

The same applies to conditional expressions used in loop constructs (do, while,
for).
6.10.7.4 Unroll Small Loops
If the loop bounds are known, and the loop is small (less than 16 or 32
instructions), unrolling the loop usually increases performance.
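For example (a hypothetical sketch; sum, a, and gid stand in for whatever the kernel actually uses), a small loop with known bounds:

for (int i = 0; i < 4; i++)
    sum += a[gid * 4 + i];

can be unrolled by hand (or with #pragma unroll, where the compiler supports it) into straight-line code that forms a single larger clause:

sum += a[gid * 4 + 0];
sum += a[gid * 4 + 1];
sum += a[gid * 4 + 2];
sum += a[gid * 4 + 3];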
6.10.7.5 Avoid Nested ifs
Because the GPU is a Vector ALU architecture, there is a cost to executing an
if-then-else block because both sides of the branch are evaluated, then one
result is retained while the other is discarded. When if blocks are nested, the results are twice as bad; in general, if blocks are nested k levels deep, 2^k clauses are generated. In this situation, restructure the code to eliminate nesting.
6.10.7.6 Experiment With do/while/for Loops
for loops can generate more clauses than equivalent do or while loops.
Experiment with these different loop types to find the one with best performance.
6.10.7.7 Do I/O With 4-Word Data
The native hardware I/O transaction size is four words (float4, int4 types). Avoid
I/Os with smaller data, and rewrite the kernel to use the native size data. Kernel
performance increases, and only 25% as many work items need to be
dispatched.
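As an illustrative sketch (the kernel and argument names are hypothetical), compare a one-word copy kernel with its four-word equivalent; the float4 version moves the same data with one quarter of the work-items:

__kernel void copy1(__global const float* in, __global float* out)
{
    size_t i = get_global_id(0);
    out[i] = in[i];                // one word per work-item
}

__kernel void copy4(__global const float4* in, __global float4* out)
{
    size_t i = get_global_id(0);
    out[i] = in[i];                // four words per work-item
}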


Chapter 7
OpenCL Static C++ Programming Language
7.1 Overview
This extension defines the OpenCL Static C++ kernel language, which is a form
of the ISO/IEC Programming languages C++ specification1. This language
supports overloading and templates that can be resolved at compile time (hence
static), while restricting the use of language features that require dynamic/runtime
resolving. The language also is extended to support most of the features
described in Section 6 of OpenCL spec: new data types (vectors, images,
samples, etc.), OpenCL Built-in functions, and more.

7.1.1   Supported Features
The following list contains the major C++ features supported by this extension.

•   Kernel and function overloading.
•   Inheritance:
    –   Strict inheritance.
    –   Friend classes.
    –   Multiple inheritance.
•   Templates:
    –   Kernel templates.
    –   Member templates.
    –   Template default argument.
    –   Limited class templates (the virtual keyword is not exposed).
    –   Partial template specialization.
•   Namespaces.
•   References.
•   this operator.

Note that support for templates and overloading greatly improves the efficiency of writing code: it allows developers to avoid replication of code when not necessary.

1. Programming languages C++. International Standard ISO/IEC 14882, 1998.

Using kernel template and kernel overloading requires support from the runtime
API as well. AMD provides a simple extension to clCreateKernel, which
enables the user to specify the desired kernel.

7.1.2   Unsupported Features
C++ features not supported by this extension are:

•   Virtual functions (methods marked with the virtual keyword).
•   Abstract classes (a class defined only with pure virtual functions).
•   Dynamic memory allocation (non-placement new/delete support is not provided).
•   Exceptions (no support for throw/catch).
•   The :: operator.
•   STL and other standard C++ libraries.

The language specified in this extension can be easily expanded to support these features.

7.1.3   Relations with ISO/IEC C++
This extension focuses on documenting the differences between the OpenCL
Static C++ kernel language and the ISO/IEC Programming languages C++
specification. Where possible, this extension leaves technical definitions to the
ISO/IEC specification.

7.2 Additions and Changes to Section 5 - The OpenCL C Runtime
7.2.1   Additions and Changes to Section 5.7.1 - Creating Kernel Objects
In the static C++ kernel language, a kernel can be overloaded, templated, or
both. The syntax explaining how to do it is defined in Sections 7.3.4 and 7.3.5,
below.
To support these cases, the following error codes were added; these can be returned by clCreateKernel.

•   CL_INVALID_KERNEL_TEMPLATE_TYPE_ARGUMENT_AMD if a kernel template argument is not a valid type (is neither a valid OpenCL C type nor a user-defined type in the same source file).
•   CL_INVALID_KERNEL_TYPE_ARGUMENT_AMD if a kernel type argument, used for overloading resolution, is not a valid type (is neither a valid OpenCL C type nor a user-defined type in the same source program).


7.2.2   Passing Classes between Host and Device
This extension allows a developer to pass classes between the host and the device. The mechanism used to pass the class to the device and back is the existing buffer object API. The class that is passed maintains its state (public and private members), and the compiler implicitly changes the class to use either the host-side or device-side methods.
On the host side, the application creates the class and an equivalent memory object of the same size (using sizeof). It then can use the class methods to set or change values of the class members. When the class is ready, the application uses a standard buffer API to move the class to the device (either Unmap or Write), then sets the buffer object as the appropriate kernel argument and enqueues the kernel for execution. When the kernel finishes execution, the application can map back (or read) the buffer object into the class and continue working on it.

7.3 Additions and Changes to Section 6 - The OpenCL C Programming Language
7.3.1   Building C++ Kernels
To compile a program that contains C++ kernels and functions, the application must add the following compile option to the options string of clBuildProgram:
-x language
where language is defined as one of the following:

•   clc – the source language is considered to be OpenCL C, as defined in The OpenCL Programming Language version 1.2¹.
•   clc++ – the source language is considered to be OpenCL C++, as defined in the following sections of this document.
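For illustration, a minimal host-side sketch (the variable names are hypothetical; a valid context, device, and source string are assumed to exist already) that builds a C++ program might look like this:

const char* source = "...";   /* program source containing C++ kernels */
cl_int err;
cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, &err);
/* Select the OpenCL C++ front end with the -x clc++ option. */
err = clBuildProgram(program, 1, &device, "-x clc++", NULL, NULL);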

7.3.2   Classes and Derived Classes
OpenCL C is extended to support classes and derived classes as per Sections 9 and 10 of the C++ language specification, with the limitation that virtual functions and abstract classes are not supported. The virtual keyword is reserved, and the OpenCL C++ compiler is required to report a compile-time error if it is used in the input program.
This limitation restricts class definitions to be fully statically defined. There is nothing prohibiting a future version of OpenCL C++ from relaxing this restriction, pending performance implications.
A class definition cannot contain any address space qualifier, either for members or for methods:

1. The OpenCL Programming Language 1.2. Rev15, Khronos 2011.

class myClass {
public:
    int myMethod1() { return x; }
    void __local myMethod2() { x = 0; }
private:
    int x;
    __local y; // illegal
};

The class invocation inside a kernel, however, can be either in private or local address space:

__kernel void myKernel()
{
    myClass c1;
    __local myClass c2;
    ...
}
Classes can be passed as arguments to kernels, by defining a buffer object at the size of the class, and using it. The device invokes the adequate device-specific methods, and accesses the class members passed from the host.
OpenCL C kernels (defined with __kernel) may not be applied to a class constructor, destructor, or method, except in the case that the class method is defined static and thus does not require object construction to be invoked.

7.3.3   Namespaces

Namespaces are supported without change as per [1].

7.3.4   Overloading
As defined in the C++ language specification, when two or more different declarations are specified for a single name in the same scope, that name is said to be overloaded. By extension, two declarations in the same scope that declare the same name but with different types are called overloaded declarations. Only kernel and function declarations can be overloaded, not object and type declarations.
As per the C++ language specification, there are a number of restrictions as to how functions can be overloaded; these are defined formally in Section 13 of the C++ language specification. Note that kernels and functions cannot be overloaded by return type.
Also, the rules for well-formed programs as defined by Section 13 of the C++
language specification are lifted to apply to both kernel and function declarations.
The overloading resolution is per Section 13.1 of the C++ language specification,
but extended to account for vector types. The algorithm for “best viable function”,
Section 13.3.3 of the C++ language specification, is extended for vector types by
inducing a partial-ordering as a function of the partial-ordering of its elements.
Following the existing rules for vector types in the OpenCL 1.2 specification,
explicit conversion between vectors is not allowed. (This reduces the number of possible overloaded functions with respect to vectors, but this is not expected to be a particular burden to developers because explicit conversion can always be applied at the point of function invocation.)
For overloaded kernels, the following syntax is used as part of the kernel name:
foo(type1,...,typen)
where type1,...,typen must be either an OpenCL scalar or vector type, or can
be a user-defined type that is allocated in the same source file as the kernel foo.
To allow overloaded kernels, use the following syntax:
__attribute__((mangled_name(myMangledName)))
The kernel mangled_name is used as a parameter to pass to the
clCreateKernel() API. This mechanism is needed to allow overloaded kernels
without changing the existing OpenCL kernel creation API.

7.3.5   Templates
OpenCL C++ provides unrestricted support for C++ templates, as defined in
Section 14 of the C++ language specification. The arguments to templates are
extended to allow for all OpenCL base types, including vectors and pointers
qualified with OpenCL C address spaces (i.e. __global, __local, __private,
and __constant).
OpenCL C++ kernels (defined with __kernel) can be templated and can be
called from within an OpenCL C (C++) program or as an external entry point
(from the host).
For kernel templates, the following syntax is used as part of the kernel name (assuming a kernel called foo):
foo<type1,...,typen>
where type1,...,typen must be either OpenCL scalar or vector types, or can be user-defined types that are allocated in the same source file as the kernel foo. In this case, a kernel is both overloaded and templated:
foo<type1,...,typen>(typen+1,...,typem)
Note that here overloading resolution is done by first matching non-templated
arguments in order of appearance in the definition, then substituting template
parameters. This allows intermixing of template and non-template arguments in
the signature.
To support template kernels, the same mechanism for kernel overloading is used. Use the following syntax:
__attribute__((mangled_name(myMangledName)))
The kernel mangled_name is used as a parameter to pass to the clCreateKernel() API. This mechanism is needed to allow template kernels without changing the existing OpenCL kernel creation API. An implementation is not required to detect name collision with the user-specified kernel_mangled names involved.

7.3.6   Exceptions
Exceptions, as per Section 15 of the C++ language specification, are not
supported. The keywords try, catch, and throw are reserved, and the OpenCL
C++ compiler must produce a static compile time error if they are used in the
input program.

7.3.7   Libraries
Support for the general utilities library, as defined in Sections 20-21 of the C++
language specification, is not provided. The standard C++ libraries and STL
library are not supported.

7.3.8   Dynamic Operation

Features related to dynamic operation are not supported:

•   the virtual modifier. OpenCL C++ prohibits the use of the virtual modifier; thus, virtual member functions and virtual inheritance are not supported.
•   Dynamic cast that requires runtime check.
•   Dynamic storage allocation and deallocation.

7.4 Examples

7.4.1   Passing a Class from the Host to the Device and Back
The class definition must be the same in the host code and the device code, apart from the members' type in the case of vectors. If the class includes vector data types, the definition must conform to the table that appears in Section 6.1.2 of the OpenCL Programming Specification 1.2, Corresponding API type for OpenCL Language types.
Example Kernel Code

class Test
{
public:
    void setX(int value);
private:
    int x;
};

__kernel void foo(__global Test* InClass, ...)
{
    if (get_global_id(0) == 0)
        InClass->setX(5);
}


Example Host Code

class Test
{
public:
    void setX(int value);
private:
    int x;
};

void MyFunc()
{
    Test* tempClass = new Test;
    ... // Some OpenCL startup code – create context, queue, etc.
    cl_mem classObj = clCreateBuffer(context, CL_MEM_USE_HOST_PTR,
                                     sizeof(Test), tempClass, &status);
    clEnqueueMapBuffer(..., classObj, ...);
    tempClass->setX(10);
    clEnqueueUnmapMemObject(..., classObj, ...); // class is passed to the device
    clEnqueueNDRangeKernel(..., fooKernel, ...);
    clEnqueueMapBuffer(..., classObj, ...);      // class is passed back to the host
}

7.4.2   Kernel Overloading
This example shows how to define and use mangled_name for kernel overloading, and how to choose the right kernel from the host code. Assume the following kernels are defined:

__attribute__((mangled_name(testAddFloat4))) kernel void
testAdd(global float4 * src1, global float4 * src2, global float4 * dst)
{
    int tid = get_global_id(0);
    dst[tid] = src1[tid] + src2[tid];
}

__attribute__((mangled_name(testAddInt8))) kernel void
testAdd(global int8 * src1, global int8 * src2, global int8 * dst)
{
    int tid = get_global_id(0);
    dst[tid] = src1[tid] + src2[tid];
}
The names testAddFloat4 and testAddInt8 are the external names for the two kernel instances. When calling clCreateKernel, passing one of these kernel names leads to the correct overloaded kernel.


7.4.3   Kernel Template
This example defines a kernel template, testAdd. It also defines two explicit instances of the kernel template, testAddFloat4 and testAddInt8. The names testAddFloat4 and testAddInt8 are the external names for the two kernel template instances that must be used as parameters when calling the clCreateKernel API.

template <class T>
kernel void testAdd(global T * src1, global T * src2, global T * dst)
{
    int tid = get_global_id(0);
    dst[tid] = src1[tid] + src2[tid];
}

template __attribute__((mangled_name(testAddFloat4))) kernel void
testAdd(global float4 * src1, global float4 * src2, global float4 * dst);

template __attribute__((mangled_name(testAddInt8))) kernel void
testAdd(global int8 * src1, global int8 * src2, global int8 * dst);


Appendix A
OpenCL Optional Extensions

The OpenCL extensions are associated with the devices and can be queried for
a specific device. Extensions can be queried for platforms also, but that means
that all devices in the platform support those extensions.
Table A.1 lists the supported extensions for the Evergreen-family of devices, as well as for the RV770 and x86 CPUs.

A.1 Extension Name Convention
The name of an extension is standardized and must contain the following elements without spaces in the name (in lower case):

•   cl_khr_<extension_name> – for extensions approved by the Khronos Group. For example: cl_khr_fp64.
•   cl_ext_<extension_name> – for extensions provided collectively by multiple vendors. For example: cl_ext_device_fission.
•   cl_<vendor_name>_<extension_name> – for extensions provided by a specific vendor. For example: cl_amd_media_ops.

The OpenCL Specification states that all API functions of the extension must have names in the form of cl<FunctionName>KHR, cl<FunctionName>EXT, or cl<FunctionName><VendorName>. All enumerated values must be in the form of CL_<enum_name>_KHR, CL_<enum_name>_EXT, or CL_<enum_name>_<VendorName>.

A.2 Querying Extensions for a Platform
To query supported extensions for the OpenCL platform, use the
clGetPlatformInfo() function, with the param_name parameter set to the
enumerated value CL_PLATFORM_EXTENSIONS. This returns the extensions as a
character string with extension names separated by spaces. To find out if a
specific extension is supported by this platform, search the returned string for the
required substring.
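A minimal sketch of such a query (assuming a valid platform ID is already in hand) follows.

size_t size;
clGetPlatformInfo(platform, CL_PLATFORM_EXTENSIONS, 0, NULL, &size);
char* extensions = (char*)malloc(size);
clGetPlatformInfo(platform, CL_PLATFORM_EXTENSIONS, size, extensions, NULL);
if (strstr(extensions, "cl_khr_icd") != NULL)
{
    /* the platform supports cl_khr_icd */
}
free(extensions);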


A.3 Querying Extensions for a Device
To get the list of devices to be queried for supported extensions, use one of the following:

•   Query for available platforms using clGetPlatformIDs(). Select one, and query for a list of available devices with clGetDeviceIDs().
•   For a specific device type, call clCreateContextFromType(), and query a list of devices by calling clGetContextInfo() with the param_name parameter set to the enumerated value CL_CONTEXT_DEVICES.

After the device list is retrieved, the extensions supported by each device can be queried with the function clGetDeviceInfo(), with the parameter param_name set to the enumerated value CL_DEVICE_EXTENSIONS.
The extensions are returned in a char string, with extension names separated by a space. To see if an extension is present, search the string for a specified substring.
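A minimal sketch of the per-device query (assuming a device ID obtained as described above) follows.

size_t size;
clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, 0, NULL, &size);
char* extensions = (char*)malloc(size);
clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, size, extensions, NULL);
if (strstr(extensions, "cl_amd_media_ops") != NULL)
{
    /* this device supports cl_amd_media_ops */
}
free(extensions);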

A.4 Using Extensions in Kernel Programs
There are special directives for the OpenCL compiler to enable or disable available extensions supported by the OpenCL implementation, and, specifically, by the OpenCL compiler. The directive is defined as follows.

#pragma OPENCL EXTENSION <extension_name> : <behavior>
#pragma OPENCL EXTENSION all : <behavior>

The <extension_name> is described in Section A.1, "Extension Name Convention." The second form addresses all extensions at once.
The <behavior> token can be either:

•   enable – the extension is enabled if it is supported, or an error is reported if the specified extension is not supported or the token "all" is used.
•   disable – the OpenCL implementation/compiler behaves as if the specified extension does not exist.
•   all – only core functionality of OpenCL is used and supported, all extensions are ignored. If the specified extension is not supported, then a warning is issued by the compiler.

The order of directives in #pragma OPENCL EXTENSION is important: a later
directive with the same extension name overrides any previous one.
The initial state of the compiler is set to ignore all extensions as if it was explicitly
set with the following directive:
#pragma OPENCL EXTENSION all : disable


This means that the extensions must be explicitly enabled to be used in kernel
programs.
Each extension that affects kernel code compilation must add a defined macro
with the name of the extension. This allows the kernel code to be compiled
differently, depending on whether the extension is supported and enabled, or not.
For example, for extension cl_khr_fp64 there should be a #define directive for
macro cl_khr_fp64, so that the following code can be preprocessed:
#ifdef cl_khr_fp64
// some code
#else
// some code
#endif

A.5 Getting Extension Function Pointers
Use the following function to get an extension function pointer.
void* clGetExtensionFunctionAddress(const char* FunctionName).

This returns the address of the extension function specified by the FunctionName
string. The returned value must be appropriately cast to a function pointer type,
specified in the extension spec and header file.
A return value of NULL means that the specified function does not exist in the CL implementation. A non-NULL return value does not guarantee that the extension function actually exists – the queries described in Sections A.2 or A.3 must be done to ensure the extension is supported.
The clGetExtensionFunctionAddress() function cannot be used to get core
API function addresses.
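As a sketch, a pointer to the cl_khr_icd entry point clIcdGetPlatformIDsKHR could be obtained and used as follows (the typedef name is chosen here only for illustration):

typedef cl_int (*clIcdGetPlatformIDsKHR_fn)(cl_uint num_entries,
                                            cl_platform_id* platforms,
                                            cl_uint* num_platforms);
clIcdGetPlatformIDsKHR_fn pfn = (clIcdGetPlatformIDsKHR_fn)
    clGetExtensionFunctionAddress("clIcdGetPlatformIDsKHR");
if (pfn != NULL)
{
    cl_uint numPlatforms;
    pfn(0, NULL, &numPlatforms);   /* call only after verifying the extension */
}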

A.6 List of Supported Extensions that are Khronos-Approved
Supported extensions approved by the Khronos Group are:

•   cl_khr_global_int32_base_atomics – basic atomic operations on 32-bit integers in global memory.
•   cl_khr_global_int32_extended_atomics – extended atomic operations on 32-bit integers in global memory.
•   cl_khr_local_int32_base_atomics – basic atomic operations on 32-bit integers in local memory.
•   cl_khr_local_int32_extended_atomics – extended atomic operations on 32-bit integers in local memory.
•   cl_khr_int64_base_atomics – basic atomic operations on 64-bit integers in both global and local memory.
•   cl_khr_int64_extended_atomics – extended atomic operations on 64-bit integers in both global and local memory.
•   cl_khr_3d_image_writes – supports kernel writes to 3D images.
•   cl_khr_byte_addressable_store – eliminates the restriction of not allowing writes to a pointer (or array elements) of types less than 32 bits wide in a kernel program.
•   cl_khr_gl_sharing – allows association of an OpenGL context or share group with a CL context for interoperability.
•   cl_khr_icd – the OpenCL Installable Client Driver (ICD), which lets developers select from multiple OpenCL runtimes that may be installed on a system. This extension is automatically enabled as of SDK v2 for AMD Accelerated Parallel Processing.
•   cl_khr_d3d10_sharing – allows association of a D3D10 context or share group with a CL context for interoperability.

A.7 cl_ext Extensions
•   cl_ext_device_fission – support for device fission in OpenCL™. For more information about this extension, see:
    http://www.khronos.org/registry/cl/extensions/ext/cl_ext_device_fission.txt

A.8 AMD Vendor-Specific Extensions
This section describes the AMD vendor-specific extensions.

A.8.1   cl_amd_fp64
Before using double data types, double-precision floating point operators, and/or
double-precision floating point routines in OpenCL™ C kernels, include the
#pragma OPENCL EXTENSION cl_amd_fp64 : enable directive. See Table A.1
for a list of supported routines.

A.8.2   cl_amd_vec3
This extension adds support for vectors with three elements: float3, short3,
char3, etc. This data type was added to OpenCL 1.1 as a core feature. For more
details, see section 6.1.2 in the OpenCL 1.1 or OpenCL 1.2 spec.

A.8.3   cl_amd_device_persistent_memory
This extension adds support for the new buffer and image creation flag
CL_MEM_USE_PERSISTENT_MEM_AMD. Buffers and images allocated with this flag
reside in host-visible device memory. This flag is mutually exclusive with the flags
CL_MEM_ALLOC_HOST_PTR and CL_MEM_USE_HOST_PTR.


A.8.4   cl_amd_device_attribute_query

This extension provides a means to query AMD-specific device attributes. To enable this extension, include the #pragma OPENCL EXTENSION cl_amd_device_attribute_query : enable directive. Once the extension is enabled, and the clGetDeviceInfo parameter <param_name> is set to CL_DEVICE_PROFILING_TIMER_OFFSET_AMD, the offset in nanoseconds between an event timestamp and Epoch is returned.

A.8.5   cl_amd_device_profiling_timer_offset

This query enables the developer to get the offset between event timestamps in nanoseconds. To use it, compile the kernels with the #pragma OPENCL EXTENSION cl_amd_device_attribute_query : enable directive. For kernels compiled with this pragma, calling clGetDeviceInfo with <param_name> set to CL_DEVICE_PROFILING_TIMER_OFFSET_AMD returns the offset in nanoseconds between event timestamps.
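A minimal sketch of this query (assuming device is a valid device ID and that the returned value is a cl_ulong):

cl_ulong offset;
clGetDeviceInfo(device, CL_DEVICE_PROFILING_TIMER_OFFSET_AMD,
                sizeof(offset), &offset, NULL);
/* offset is the difference, in nanoseconds, between an event
   timestamp and Epoch */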

A.8.6   cl_amd_device_topology

This query enables the developer to get a description of the topology used to connect the device to the host. Currently, this query works only in Linux. Calling clGetDeviceInfo with <param_name> set to CL_DEVICE_TOPOLOGY_AMD returns the following 32-byte union of structures.

typedef union
{
    struct { cl_uint type; cl_uint data[5]; } raw;
    struct { cl_uint type; cl_char unused[17];
             cl_char bus; cl_char device; cl_char function; } pcie;
} cl_device_topology_amd;

The type of the structure returned can be queried by reading the first unsigned
int of the returned data. The developer can use this type to cast the returned
union into the right structure type.
Currently, the only supported type in the structure above is PCIe (type value =
1). The information returned contains the PCI Bus/Device/Function of the device,
and is similar to the result of the lspci command in Linux. It enables the
developer to match between the OpenCL device ID and the physical PCI
connection of the card.
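A minimal sketch of this query (assuming device is a valid GPU device ID; the PCIe type value of 1 is taken from the description above):

cl_device_topology_amd topology;
cl_int status = clGetDeviceInfo(device, CL_DEVICE_TOPOLOGY_AMD,
                                sizeof(topology), &topology, NULL);
if (status == CL_SUCCESS && topology.raw.type == 1)   /* PCIe */
{
    printf("PCI bus %d, device %d, function %d\n",
           topology.pcie.bus, topology.pcie.device,
           topology.pcie.function);
}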

A.8.7   cl_amd_device_board_name

This query enables the developer to get the name of the GPU board and model of the specific device. Currently, this is only for GPU devices.
Calling clGetDeviceInfo with <param_name> set to CL_DEVICE_BOARD_NAME_AMD returns a 128-character value.


A.8.8   cl_amd_compile_options

This extension adds the following options, which are not part of the OpenCL specification.

•   -g — This is an experimental feature that lets you use the GNU project debugger, GDB, to debug kernels on x86 CPUs running Linux or cygwin/minGW under Windows. For more details, see Chapter 3, "Debugging OpenCL." This option does not affect the default optimization of the OpenCL code.
•   -O0 — Specifies to the compiler not to optimize. This is equivalent to the OpenCL standard option -cl-opt-disable.
•   -f[no-]bin-source — Does [not] generate OpenCL source in the .source section. For more information, see Appendix E, "OpenCL Binary Image Format (BIF) v2.0."
•   -f[no-]bin-llvmir — Does [not] generate LLVM IR in the .llvmir section. For more information, see Appendix E, "OpenCL Binary Image Format (BIF) v2.0."
•   -f[no-]bin-amdil — Does [not] generate AMD IL in the .amdil section. For more information, see Appendix E, "OpenCL Binary Image Format (BIF) v2.0."
•   -f[no-]bin-exe — Does [not] generate the executable (ISA) in the .text section. For more information, see Appendix E, "OpenCL Binary Image Format (BIF) v2.0."

To avoid source changes, there are two environment variables that can be used to change CL options during the runtime.

•   AMD_OCL_BUILD_OPTIONS — Overrides the CL options specified in clBuildProgram().
•   AMD_OCL_BUILD_OPTIONS_APPEND — Appends options to the options specified in clBuildProgram().

A.8.9 cl_amd_offline_devices
To generate binary images offline, it is necessary to access the compiler for
every device that the runtime supports, even if the device is currently not installed
on the system. When, during context creation,
CL_CONTEXT_OFFLINE_DEVICES_AMD is passed in the context properties, all
supported devices, whether online or offline, are reported and can be used to
create OpenCL binary images.
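A sketch of such a context creation follows (assuming a valid platform ID; the property value of 1 used to enable offline devices is an assumption for illustration, not a documented constant):

cl_int status;
cl_context_properties props[] = {
    CL_CONTEXT_PLATFORM,            (cl_context_properties)platform,
    CL_CONTEXT_OFFLINE_DEVICES_AMD, (cl_context_properties)1,
    0
};
cl_context context = clCreateContextFromType(props, CL_DEVICE_TYPE_ALL,
                                             NULL, NULL, &status);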

A.8.10  cl_amd_event_callback

This extension provides the ability to register event callbacks for states other than CL_COMPLETE. The full set of event states is allowed: CL_QUEUED, CL_SUBMITTED, and CL_RUNNING. This extension is enabled automatically and does not need to be explicitly enabled through #pragma when using the SDK v2 of AMD Accelerated Parallel Processing.

A.8.11  cl_amd_popcnt
This extension introduces a “population count” function called popcnt. This
extension was taken into core OpenCL 1.2, and the function was renamed
popcount. The core 1.2 popcount function (documented in section 6.12.3 of the
OpenCL Specification) is identical to the AMD extension popcnt function.

A.8.12  cl_amd_media_ops

This extension adds the following built-in functions to the OpenCL language.

Note: For OpenCL scalar types, n = 1; for vector types, it is {2, 4, 8, or 16}.
Note: In the following, n denotes the size, which can be 1, 2, 4, 8, or 16; [i] denotes the indexed element of a vector, designated 0 to n-1.

Built-in function: amd_pack
uint amd_pack(float4 src)

Return value
((((uint)src[0]) & 0xFF) << 0) +
((((uint)src[1]) & 0xFF) << 8) +
((((uint)src[2]) & 0xFF) << 16) +
((((uint)src[3]) & 0xFF) << 24)

Built-in function: amd_unpack0
floatn amd_unpack0(uintn src)

Return value for each vector component
(float)(src[i] & 0xFF)

Built-in function: amd_unpack1
floatn amd_unpack1(uintn src)

Return value for each vector component
(float)((src[i] >> 8) & 0xFF)

Built-in function: amd_unpack2
floatn amd_unpack2(uintn src)

Return value for each vector component
(float)((src[i] >> 16) & 0xFF)


Built-in function: amd_unpack3
floatn amd_unpack3(uintn src)

Return value for each vector component
(float)((src[i] >> 24) & 0xFF)

Built-in function: amd_bitalign
uintn amd_bitalign(uintn src0, uintn src1, uintn src2)

Return value for each vector component
(uint) (((((long)src0[i]) << 32) | (long)src1[i]) >> (src2[i] & 31))

Built-in function: amd_bytealign
uintn amd_bytealign(uintn src0, uintn src1, uintn src2)

Return value for each vector component
(uint) (((((long)src0[i]) << 32) | (long)src1[i]) >> ((src2[i] & 3)*8))

Built-in function: amd_lerp
uintn amd_lerp(uintn src0, uintn src1, uintn src2)

Return value for each vector component
(((((src0[i] >> 0) & 0xFF) + ((src1[i] >> 0) & 0xFF) + ((src2[i] >> 0) & 1)) >> 1) << 0) +
(((((src0[i] >> 8) & 0xFF) + ((src1[i] >> 8) & 0xFF) + ((src2[i] >> 8) & 1)) >> 1) << 8) +
(((((src0[i] >> 16) & 0xFF) + ((src1[i] >> 16) & 0xFF) + ((src2[i] >> 16) & 1)) >> 1) << 16) +
(((((src0[i] >> 24) & 0xFF) + ((src1[i] >> 24) & 0xFF) + ((src2[i] >> 24) & 1)) >> 1) << 24) ;

Built-in function: amd_sad
uintn amd_sad(uintn src0, uintn src1, uintn src2)

Return value for each vector component
src2[i] +
abs(((src0[i] >>  0) & 0xFF) - ((src1[i] >>  0) & 0xFF)) +
abs(((src0[i] >>  8) & 0xFF) - ((src1[i] >>  8) & 0xFF)) +
abs(((src0[i] >> 16) & 0xFF) - ((src1[i] >> 16) & 0xFF)) +
abs(((src0[i] >> 24) & 0xFF) - ((src1[i] >> 24) & 0xFF));

Built-in function: amd_sad4
uint amd_sad4(uint4 a, uint4 b, uint c)

Return value for each vector component
src2[i] +
abs(((src0[i] >>  0) & 0xFF) - ((src1[i] >>  0) & 0xFF)) +
abs(((src0[i] >>  8) & 0xFF) - ((src1[i] >>  8) & 0xFF)) +
abs(((src0[i] >> 16) & 0xFF) - ((src1[i] >> 16) & 0xFF)) +
abs(((src0[i] >> 24) & 0xFF) - ((src1[i] >> 24) & 0xFF));

Built-in function: amd_sadhi
uintn amd_sadhi(uintn src0, uintn src1, uintn src2)

Return value for each vector component
src2[i] +
(abs(((src0[i] >>  0) & 0xFF) - ((src1[i] >>  0) & 0xFF)) << 16) +
(abs(((src0[i] >>  8) & 0xFF) - ((src1[i] >>  8) & 0xFF)) << 16) +
(abs(((src0[i] >> 16) & 0xFF) - ((src1[i] >> 16) & 0xFF)) << 16) +
(abs(((src0[i] >> 24) & 0xFF) - ((src1[i] >> 24) & 0xFF)) << 16);

For more information, see:
http://www.khronos.org/registry/cl/extensions/amd/cl_amd_media_ops.txt
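As a usage sketch (the kernel name is hypothetical), these packed-byte built-ins operate on pixels stored as one 32-bit uint per RGBA8 value; for example, amd_lerp can average two such pixels per element:

#pragma OPENCL EXTENSION cl_amd_media_ops : enable

__kernel void average_pixels(__global const uint* a,
                             __global const uint* b,
                             __global uint* out)
{
    size_t i = get_global_id(0);
    /* per byte: (a + b + round-bit) >> 1; src2 = 0 selects truncation */
    out[i] = amd_lerp(a[i], b[i], 0u);
}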

A.8.13  cl_amd_media_ops2

This extension adds further built-in functions to those of cl_amd_media_ops. When enabled, it adds the following built-in functions to the OpenCL language.

Note: typen denotes an OpenCL scalar type (n = 1) or vector type (n = 2, 4, 8, 16).

Built-in Function: uintn amd_msad (uintn src0, uintn src1, uintn src2)
Description:
uchar4 src0u8 = as_uchar4(src0.s0);
uchar4 src1u8 = as_uchar4(src1.s0);
dst.s0 = src2.s0 +
    ((src1u8.s0 == 0) ? 0 : abs(src0u8.s0 - src1u8.s0)) +
    ((src1u8.s1 == 0) ? 0 : abs(src0u8.s1 - src1u8.s1)) +
    ((src1u8.s2 == 0) ? 0 : abs(src0u8.s2 - src1u8.s2)) +
    ((src1u8.s3 == 0) ? 0 : abs(src0u8.s3 - src1u8.s3));

A similar operation is applied to other components of the vectors.
Built-in Function: ulongn amd_qsad (ulongn src0, uintn src1, ulongn src2)
Description:
uchar8 src0u8 = as_uchar8(src0.s0);
ushort4 src2u16 = as_ushort4(src2.s0);
ushort4 dstu16;
dstu16.s0 = amd_sad(as_uint(src0u8.s0123), src1.s0, src2u16.s0);
dstu16.s1 = amd_sad(as_uint(src0u8.s1234), src1.s0, src2u16.s1);
dstu16.s2 = amd_sad(as_uint(src0u8.s2345), src1.s0, src2u16.s2);
dstu16.s3 = amd_sad(as_uint(src0u8.s3456), src1.s0, src2u16.s3);
dst.s0 = as_uint2(dstu16);

A similar operation is applied to other components of the vectors.


Built-in Function: ulongn amd_mqsad (ulongn src0, uintn src1, ulongn src2)
Description:
uchar8 src0u8 = as_uchar8(src0.s0);
ushort4 src2u16 = as_ushort4(src2.s0);
ushort4 dstu16;
dstu16.s0 = amd_msad(as_uint(src0u8.s0123), src1.s0, src2u16.s0);
dstu16.s1 = amd_msad(as_uint(src0u8.s1234), src1.s0, src2u16.s1);
dstu16.s2 = amd_msad(as_uint(src0u8.s2345), src1.s0, src2u16.s2);
dstu16.s3 = amd_msad(as_uint(src0u8.s3456), src1.s0, src2u16.s3);
dst.s0 = as_uint2(dstu16);

A similar operation is applied to other components of the vectors.
Built-in Function: uintn amd_sadw (uintn src0, uintn src1, uintn src2)
Description:
ushort2 src0u16 = as_ushort2(src0.s0);
ushort2 src1u16 = as_ushort2(src1.s0);
dst.s0 = src2.s0 +
abs(src0u16.s0 - src1u16.s0) +
abs(src0u16.s1 - src1u16.s1);

A similar operation is applied to other components of the vectors.
Built-in Function: uintn amd_sadd (uintn src0, uintn src1, uintn src2)
Description:
dst.s0 = src2.s0 + abs(src0.s0 - src1.s0);

A similar operation is applied to other components of the vectors.
Built-in Function: uintn amd_bfm (uintn src0, uintn src1)
Description:
dst.s0 = ((1 << (src0.s0 & 0x1f)) - 1) << (src1.s0 & 0x1f);

A similar operation is applied to other components of the vectors.
Built-in Function: uintn amd_bfe (uintn src0, uintn src1, uintn src2)
Description:
NOTE: The >> operator represents a logical right shift.
offset = src1.s0 & 31;
width = src2.s0 & 31;
if (width == 0)
    dst.s0 = 0;
else if ((offset + width) < 32)
    dst.s0 = (src0.s0 << (32 - offset - width)) >> (32 - width);
else
    dst.s0 = src0.s0 >> offset;

A similar operation is applied to other components of the vectors.

Built-in Function: intn amd_bfe (intn src0, uintn src1, uintn src2)
Description:
NOTE: The >> operator represents an arithmetic right shift.
offset = src1.s0 & 31;
width = src2.s0 & 31;
if (width == 0)
    dst.s0 = 0;
else if ((offset + width) < 32)
    dst.s0 = (src0.s0 << (32 - offset - width)) >> (32 - width);
else
    dst.s0 = src0.s0 >> offset;

A similar operation is applied to other components of the vectors.
Built-in Function:
intn amd_median3 (intn src0, intn src1, intn src2)
uintn amd_median3 (uintn src0, uintn src1, uintn src2)
floatn amd_median3 (floatn src0, floatn src1, floatn src2)

Description:
Returns median of src0, src1, and src2.
Built-in Function:
intn amd_min3 (intn src0, intn src1, intn src2)
uintn amd_min3 (uintn src0, uintn src1, uintn src2)
floatn amd_min3 (floatn src0, floatn src1, floatn src2)

Description:
Returns min of src0, src1, and src2.
Built-in Function:
intn amd_max3 (intn src0, intn src1, intn src2)
uintn amd_max3 (uintn src0, uintn src1, uintn src2)
floatn amd_max3 (floatn src0, floatn src1, floatn src2)

Description:
Returns max of src0, src1, and src2.
For more information, see:
http://www.khronos.org/registry/cl/extensions/amd/cl_amd_media_ops2.txt


A.8.14  cl_amd_printf
The OpenCL™ Specification 1.1 adds support for the optional AMD extension
cl_amd_printf, which provides printf capabilities to OpenCL C programs. To
use this extension, an application first must include
#pragma OPENCL EXTENSION cl_amd_printf : enable.
Built-in function:
printf(__constant char * restrict format, …);

This function writes output to the stdout stream associated with the host application. The format string is a character sequence that:

–   is null-terminated and composed of zero or more directives,
–   ordinary characters (i.e. not %), which are copied directly to the output stream unchanged, and
–   conversion specifications, each of which can result in fetching zero or more arguments, converting them, and then writing the final result to the output stream.

The format string must be resolvable at compile time; thus, it cannot be dynamically created by the executing program. (Note that the use of variadic arguments in the built-in printf does not imply their use in other built-ins; more importantly, it is not valid to use variadic arguments in user-defined functions or kernels.)

The OpenCL C printf closely matches the definition found as part of the C99 standard. Note that conversions introduced in the format string with % are supported with the following guidelines:

•   A 32-bit floating point argument is not converted to a 64-bit double, unless the extension cl_khr_fp64 is supported and enabled, as defined in section 9.3 of the OpenCL Specification 1.1. This includes the double variants if cl_khr_fp64 is supported and defined in the corresponding compilation unit.
•   64-bit integer types can be printed using %ld / %lx / %lu.
•   %lld / %llx / %llu are not supported and reserved for 128-bit integer types (long long).
•   All OpenCL vector types (section 6.1.2 of the OpenCL Specification 1.1) can be explicitly passed and printed using the modifier vn, where n can be 2, 3, 4, 8, or 16. This modifier appears before the original conversion specifier for the vector's component type (for example, to print a float4, use %v4f). Since vn is a conversion specifier, it is valid to apply optional flags, such as field width and precision, just as it is when printing the component types. Since a vector is an aggregate type, the comma separator is used between the components: 0:1, … , n-2:n-1.

A.9 cl_amd_predefined_macros
The following macros are predefined when compiling OpenCL™ C kernels.
These macros are defined automatically based on the device for which the code
is being compiled.
GPU devices:
__WinterPark__
__BeaverCreek__
__Turks__
__Caicos__
__Tahiti__
__Pitcairn__
__Capeverde__
__Cayman__
__Barts__
__Cypress__
__Juniper__
__Redwood__
__Cedar__
__ATI_RV770__
__ATI_RV730__
__ATI_RV710__
__Loveland__
__GPU__

CPU devices:
__CPU__
__X86__
__X86_64__

Note that __GPU__ or __CPU__ are predefined whenever a GPU or CPU device
is the compilation target.
An example kernel is provided below.
#pragma OPENCL EXTENSION cl_amd_printf : enable

const char* getDeviceName() {
#ifdef __Cayman__
    return "Cayman";
#elif defined(__Barts__)
    return "Barts";
#elif defined(__Cypress__)
    return "Cypress";
#elif defined(__Juniper__)
    return "Juniper";
#elif defined(__Redwood__)
    return "Redwood";
#elif defined(__Cedar__)
    return "Cedar";
#elif defined(__ATI_RV770__)
    return "RV770";
#elif defined(__ATI_RV730__)
    return "RV730";
#elif defined(__ATI_RV710__)
    return "RV710";
#elif defined(__Loveland__)
    return "Loveland";
#elif defined(__GPU__)
    return "GenericGPU";
#elif defined(__X86__)
    return "X86CPU";
#elif defined(__X86_64__)
    return "X86-64CPU";
#elif defined(__CPU__)
    return "GenericCPU";
#else
    return "UnknownDevice";
#endif
}

kernel void test_pf(global int* a)
{
    printf("Device Name: %s\n", getDeviceName());
}


A.10 Supported Functions for cl_amd_fp64 / cl_khr_fp64
AMD OpenCL is now cl_khr_fp64-compliant on devices compliant with OpenCL
1.1 and greater (every GPU later than 7xx, and all CPUs). Thus, cl_amd_fp64
is now a synonym for cl_khr_fp64 on all supported devices.

A.11 Extension Support by Device
Table A.1 and Table A.2 list the extension support for selected devices.
Table A.1   Extension Support for AMD GPU Devices
(Brazos, Llano, and Trinity are AMD APUs; Tahiti(1), Pitcairn(2), Cape Verde(3), Turks(4), Cayman(5), Barts(6), and Cypress(7) are AMD Radeon™ HD GPUs. The Tahiti column also covers Pitcairn and Cape Verde.)

Extension                       Brazos    Llano     Trinity   Tahiti    Turks     Cayman    Barts     Cypress
cl_khr_*_atomics (32-bit)       Yes       Yes       Yes       Yes       Yes       Yes       Yes       Yes
cl_ext_atomic_counters_32       Yes       Yes       Yes       Yes       Yes       Yes       Yes       Yes
cl_khr_gl_sharing               Yes       Yes       Yes       Yes       Yes       Yes       Yes       Yes
cl_khr_byte_addressable_store   Yes       Yes       Yes       Yes       Yes       Yes       Yes       Yes
cl_ext_device_fission           CPU only  CPU only  CPU only  No        No        No        No        No
cl_amd_device_attribute_query   Yes       Yes       Yes       Yes       Yes       Yes       Yes       Yes
cl_khr_fp64                     CPU only  CPU only  CPU only  Yes       Yes       Yes       No        Yes
cl_amd_fp64                     CPU only  CPU only  CPU only  Yes       Yes       Yes       No        Yes
cl_amd_vec3                     Yes       Yes       Yes       Yes       Yes       Yes       Yes       Yes
Images                          Yes       Yes       Yes       Yes       Yes       Yes       Yes       Yes
cl_khr_d3d10_sharing            Yes       Yes       Yes       Yes       Yes       Yes       Yes       Yes
cl_amd_media_ops                Yes       Yes       Yes       Yes       Yes       Yes       Yes       Yes
cl_amd_printf                   Yes       Yes       Yes       Yes       Yes       Yes       Yes       Yes
cl_amd_popcnt                   Yes       Yes       Yes       Yes       Yes       Yes       Yes       Yes
cl_khr_3d_image_writes          Yes       Yes       Yes       Yes       Yes       Yes       Yes       Yes
Platform Extensions
cl_khr_icd                      Yes       Yes       Yes       Yes       Yes       Yes       Yes       Yes
cl_amd_event_callback           Yes       Yes       Yes       Yes       Yes       Yes       Yes       Yes
cl_amd_offline_devices          Yes       Yes       Yes       Yes       Yes       Yes       Yes       Yes

1. AMD Radeon™ HD 79XX series.
2. AMD Radeon™ HD 78XX series.
3. AMD Radeon™ HD 77XX series.
4. AMD Radeon™ HD 75XX series and AMD Radeon™ HD 76XX series.
5. AMD Radeon™ HD 69XX series.
6. AMD Radeon™ HD 68XX series.
7. ATI Radeon™ HD 59XX series and 58XX series, AMD FirePro™ V88XX series and V87XX series.

Note that an atomic counter is a device-level counter that can be incremented/decremented by different work-items, where the atomicity of the operation is guaranteed. The access to the counter is done only through add/dec built-in functions; thus, no two work-items have the same value returned in the case that a given kernel only increments or decrements the counter. (Also see http://www.khronos.org/registry/cl/extensions/ext/cl_ext_atomic_counters_32.txt.)
Table A.2   Extension Support for Older AMD GPUs and CPUs

Extension                       Juniper(1)  Redwood(2)  Cedar(3)  x86 CPU with SSE2 or later
cl_khr_*_atomics                Yes         Yes         Yes       Yes
cl_ext_atomic_counters_32       Yes         Yes         Yes       No
cl_khr_gl_sharing               Yes         Yes         Yes       Yes
cl_khr_byte_addressable_store   Yes         Yes         Yes       Yes
cl_ext_device_fission           No          No          No        Yes
cl_amd_device_attribute_query   Yes         Yes         Yes       Yes
cl_khr_fp64                     No          No          No        Yes
cl_amd_fp64(4)                  No          No          No        Yes
cl_amd_vec3                     Yes         Yes         Yes       Yes
Images                          Yes         Yes         Yes       Yes(5)
cl_khr_d3d10_sharing            Yes         Yes         Yes       Yes
cl_amd_media_ops                Yes         Yes         Yes       Yes
cl_amd_media_ops2               Yes         Yes         Yes       Yes
cl_amd_printf                   Yes         Yes         Yes       Yes
cl_amd_popcnt                   Yes         Yes         Yes       Yes
cl_khr_3d_image_writes          Yes         Yes         Yes       No
Platform Extensions
cl_khr_icd                      Yes         Yes         Yes       Yes
cl_amd_event_callback           Yes         Yes         Yes       Yes
cl_amd_offline_devices          Yes         Yes         Yes       Yes

1. ATI Radeon™ HD 5700 series, AMD Mobility Radeon™ HD 5800 series, AMD FirePro™ V5800 series, AMD Mobility FirePro™ M7820.
2. ATI Radeon™ HD 5600 Series, ATI Radeon™ HD 5500 Series, AMD Mobility Radeon™ HD 5700 Series, AMD Mobility Radeon™ HD 5600 Series, AMD FirePro™ V4800 Series, AMD FirePro™ V3800 Series, AMD Mobility FirePro™ M5800.
3. ATI Radeon™ HD 5400 Series, AMD Mobility Radeon™ HD 5400 Series.
4. Available on all devices that have double-precision, including all Southern Island devices.
5. Environment variable CPU_IMAGE_SUPPORT must be set.


Appendix B
The OpenCL Installable Client Driver (ICD)
The OpenCL Installable Client Driver (ICD) is part of the AMD Accelerated
Parallel Processing software stack. Code written prior to SDK v2.0 must be
changed to comply with OpenCL ICD requirements.

B.1 Overview
The ICD allows multiple OpenCL implementations to co-exist; also, it allows
applications to select between these implementations at runtime.
In releases prior to SDK v2.0, functions such as clGetDeviceIDs() and
clCreateContext() accepted a NULL value for the platform parameter.
Releases from SDK v2.0 no longer allow this; the platform must be a valid one,
obtained by using the platform API. The application now must select which of the
OpenCL platforms present on a system to use.
Use the clGetPlatformIDs() and clGetPlatformInfo() functions to see the
list of available OpenCL implementations, and select the one that is best for your
requirements. It is recommended that developers offer their users a choice on
first run of the program or whenever the list of available platforms changes.
A properly implemented ICD and OpenCL library is transparent to the end-user.

B.2 Using ICD
Sample code that is part of the SDK contains examples showing how to query
the platform API and call the functions that require a valid platform parameter.
This is a pre-ICD code snippet.

context = clCreateContextFromType(
              0,
              dType,
              NULL,
              NULL,
              &status);

The ICD-compliant version of this code follows.

/*
 * Have a look at the available platforms and pick either
 * the AMD one if available or a reasonable default.
 */
cl_uint numPlatforms;
cl_platform_id platform = NULL;
status = clGetPlatformIDs(0, NULL, &numPlatforms);
if (!sampleCommon->checkVal(status,
                            CL_SUCCESS,
                            "clGetPlatformIDs failed."))
{
    return SDK_FAILURE;
}
if (0 < numPlatforms)
{
    cl_platform_id* platforms = new cl_platform_id[numPlatforms];
    status = clGetPlatformIDs(numPlatforms, platforms, NULL);
    if (!sampleCommon->checkVal(status,
                                CL_SUCCESS,
                                "clGetPlatformIDs failed."))
    {
        return SDK_FAILURE;
    }
    for (unsigned i = 0; i < numPlatforms; ++i)
    {
        char pbuf[100];
        status = clGetPlatformInfo(platforms[i],
                                   CL_PLATFORM_VENDOR,
                                   sizeof(pbuf),
                                   pbuf,
                                   NULL);
        if (!sampleCommon->checkVal(status,
                                    CL_SUCCESS,
                                    "clGetPlatformInfo failed."))
        {
            return SDK_FAILURE;
        }
        platform = platforms[i];
        if (!strcmp(pbuf, "Advanced Micro Devices, Inc."))
        {
            break;
        }
    }
    delete[] platforms;
}
/*
 * If we could find our platform, use it. Otherwise pass a NULL and
 * get whatever the implementation thinks we should be using.
 */
cl_context_properties cps[3] =
{
    CL_CONTEXT_PLATFORM,
    (cl_context_properties)platform,
    0
};
/* Use NULL for backward compatibility */
cl_context_properties* cprops = (NULL == platform) ? NULL : cps;
context = clCreateContextFromType(
              cprops,
              dType,
              NULL,
              NULL,
              &status);


Another example of a pre-ICD code snippet follows.

status = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_DEFAULT, 0, NULL,
                        &numDevices);

The ICD-compliant version of the code snippet is:

status = clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 0, NULL,
                        &numDevices);
NOTE: It is recommended that the host code look at the platform vendor string
when searching for the desired OpenCL platform, instead of using the platform
name string. The platform name string might change, whereas the platform
vendor string remains constant for a particular vendor’s implementation.



Appendix C
Compute Kernel

C.1 Differences from a Pixel Shader
Differences between a pixel shader and a compute kernel include: location indexing, amount of resources used on the GPU compute device, memory access patterns, cache behavior, work-item spawn rate, creation of wavefronts and groups, and newly exposed hardware features such as Local Data Store and Shared Registers. Many of these changes are based on the spawn/dispatch pattern of a compute kernel. This pattern is linear; for a pixel shader, it is a hierarchical-Z pattern. The following sections discuss the effects of this change at the IL level.

C.2 Indexing
A primary difference between a compute kernel and a pixel shader is the
indexing mode. In a pixel shader, indexing is done through the vWinCoord
register and is directly related to the output domain (frame buffer size and
geometry) specified by the user space program. This domain is usually in the
Euclidean space and specified by a pair of coordinates. In a compute kernel,
however, this changes: the indexing method is switched to a linear index between
one and three dimensions, as specified by the user. This gives the programmer
more flexibility when writing kernels.
Indexing is done through the vaTid register, which stands for absolute work-item
id. This value is linear: from 0 to N-1, where N is the number of work-items
requested by the user space program to be executed on the GPU compute
device. Two other indexing variables, vTid and vTgroupid, are derived from
settings in the kernel and vaTid.
In SDK 1.4 and later, new indexing variables are introduced for either 3D spawn
or 1D spawn. The 1D indexing variables are still valid, but replaced with
vAbsTidFlat, vThreadGrpIdFlat, and vTidInGrpFlat, respectively. The 3D versions
are vAbsTid, vThreadGrpId, and vTidInGrp. The 3D versions have their
respective positions in each dimension in the x, y, and z components. The w
component is not used. If the group size for a dimension is not specified, it is an
implicit 1. The 1D version has the dimension replicated over all components.


C.3 Performance Comparison
To show the performance differences between a compute kernel and a pixel shader, the following subsections show a matrix transpose program written in three ways:
1. A naïve pixel shader of matrix transpose.
2. The compute kernel equivalent.
3. An optimized matrix transpose using LDS to further improve performance.

C.4 Pixel Shader
The traditional naïve matrix transpose reads in data points from the (j,i)th element
of input matrix in sampler and writes out at the current (i,j)th location, which is
implicit in the output write. The kernel is structured as follows:
il_ps_2_0
dcl_input_position_interp(linear_noperspective) vWinCoord0.xy__
dcl_output_generic o0
dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
sample_resource(0)_sampler(0) o0, vWinCoord0.yx
end

Figure C.1 shows the performance results of using a pixel shader for this matrix
transpose.
Figure C.1    Pixel Shader Matrix Transpose (memory bandwidth in GB/s vs. matrix size)

C.5 Compute Kernel
For the compute kernel, the kernel is structured as follows:
il_cs_2_0
dcl_num_threads_per_group 64
dcl_cb cb0[1]
dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
umod r0.x, vAbsTidFlat.x, cb0[0].x
udiv r0.y, vAbsTidFlat.x, cb0[0].x
sample_resource(0)_sampler(0) r1, r0.yx
mov g[vAbsTidFlat.x], r1
end

Figure C.2 shows the performance results using a compute kernel for this matrix
transpose.
Figure C.2    Compute Kernel Matrix Transpose (memory bandwidth in GB/s vs. matrix size)

C.6 LDS Matrix Transpose
Figure C.3 shows the performance results using the LDS for this matrix
transpose.
Figure C.3    LDS Matrix Transpose (memory bandwidth in GB/s vs. matrix size)

C.7 Results Comparison
Based on the graphs above, it can be seen that in most cases using the LDS to
do on-chip transpose outperforms the similar pixel shader and compute kernel
versions; however, a direct porting of the transpose algorithm from a pixel shader
to a compute kernel does not immediately increase performance. This is because
of the differences mentioned above between the pixel shader and the compute
kernel. Taking advantage of the compute kernel features such as LDS can lead
to a large performance gain in certain programs.


Appendix D
Device Parameters

On the following pages, Table D.1 through Table D.7 provide device-specific information for AMD GPUs.


Table D.1  Parameters for 7xxx Devices

| | Verde PRO | Verde XT | Pitcairn PRO | Pitcairn XT | Tahiti PRO | Tahiti XT |
|---|---|---|---|---|---|---|
| Product Name (AMD Radeon™ HD) | 7750 | 7770 | 7850 | 7870 | 7950 | 7970 |
| Engine Speed (MHz) | 800 | 1000 | 860 | 1000 | 800 | 925 |
| Compute Resources | | | | | | |
| Compute Units | 8 | 10 | 16 | 20 | 28 | 32 |
| Processing Elements | 512 | 640 | 1024 | 1280 | 1792 | 2048 |
| Peak Gflops | 819 | 1280 | 1761 | 2560 | 2867 | 3789 |
| Cache and Register Sizes | | | | | | |
| # of 32b Vector Registers/CU | 65536 | 65536 | 65536 | 65536 | 65536 | 65536 |
| Size of Vector Registers/CU | 256 kB | 256 kB | 256 kB | 256 kB | 256 kB | 256 kB |
| LDS Size/CU | 64 kB | 64 kB | 64 kB | 64 kB | 64 kB | 64 kB |
| LDS Banks/CU | 32 | 32 | 32 | 32 | 32 | 32 |
| Constant Cache/GPU | 64 kB | 64 kB | 128 kB | 128 kB | 128 kB | 128 kB |
| Max Constants/4 CUs | 16 kB | 16 kB | 16 kB | 16 kB | 16 kB | 16 kB |
| L1 Cache Size/CU | 16 kB | 16 kB | 16 kB | 16 kB | 16 kB | 16 kB |
| L2 Cache Size/GPU | 512 kB | 512 kB | 512 kB | 512 kB | 768 kB | 768 kB |
| Peak GPU Bandwidths | | | | | | |
| Register Read (GB/s) | 4915 | 7680 | 10568 | 15360 | 17203 | 22733 |
| LDS Read (GB/s) | 819 | 1280 | 1761 | 2560 | 2867 | 3789 |
| Constant Cache Read (GB/s) | 102 | 160 | 220 | 320 | 358 | 474 |
| L1 Read (GB/s) | 410 | 640 | 881 | 1280 | 1434 | 1894 |
| L2 Read (GB/s) | 205 | 256 | 440 | 512 | 614 | 710 |
| Global Memory (GB/s) | 72 | 72 | 154 | 154 | 240 | 264 |
| Global Limits | | | | | | |
| Max Wavefronts/GPU | 320 | 400 | 640 | 800 | 1120 | 1280 |
| Max Wavefronts/CU (avg) | 40 | 40 | 40 | 40 | 40 | 40 |
| Max Work-Items/GPU | 20480 | 25600 | 40960 | 51200 | 71680 | 81920 |
| Memory | | | | | | |
| Memory Channels | 4 | 4 | 8 | 8 | 12 | 12 |
| Memory Bus Width (bits) | 128 | 128 | 256 | 256 | 384 | 384 |
| Memory Type and Speed (MHz) | GDDR5, 1125 | GDDR5, 1125 | GDDR5, 1200 | GDDR5, 1200 | GDDR5, 1250 | GDDR5, 1375 |
| Frame Buffer | 1 GB | 1 GB | 2 GB | 1 GB or 2 GB | 3 GB | 3 GB |


Table D.2  Parameters for 68xx and 69xx Devices

| | Barts PRO | Barts XT | Blackcomb PRO | Cayman PRO | Cayman XT | Cayman Gemini |
|---|---|---|---|---|---|---|
| Product Name (AMD Radeon™ HD) | 6850 | 6870 | 6950M | 6950 | 6970 | 6990 |
| Engine Speed (MHz) | 775 | 900 | 580 | 800 | 880 | 830 (BIOS 1, default); 880 (BIOS 2) |
| Compute Resources | | | | | | |
| Compute Units | 12 | 14 | 12 | 22 | 24 | 48 |
| Stream Cores | 192 | 224 | 192 | 352 | 384 | 768 |
| Processing Elements | 960 | 1120 | 960 | 1408 | 1536 | 3072 |
| Peak Gflops | 1488 | 2016 | 1113.6 | 2252.8 | 2703.36 | 5100-5407 |
| Cache and Register Sizes | | | | | | |
| # of Vector Registers/CU | 16384 | 16384 | 16384 | 16384 | 16384 | 16384 |
| Size of Vector Registers/CU | 256 kB | 256 kB | 256 kB | 256 kB | 256 kB | 256 kB |
| LDS Size/CU | 32 kB | 32 kB | 32 kB | 32 kB | 32 kB | 32 kB |
| LDS Banks/CU | 32 | 32 | 32 | 32 | 32 | 32 |
| Constant Cache/GPU | 32 kB | 32 kB | 32 kB | 48 kB | 48 kB | 48 kB |
| Max Constants/CU | 8 kB | 8 kB | 8 kB | 8 kB | 8 kB | 8 kB |
| L1 Cache Size/CU | 8 kB | 8 kB | 8 kB | 8 kB | 8 kB | 8 kB |
| L2 Cache Size/GPU | 512 kB | 512 kB | 512 kB | 512 kB | 512 kB | 512 kB |
| Peak GPU Bandwidths | | | | | | |
| Register Read (GB/s) | 7142 | 9677 | 5345 | 13517 | 16220 | 30597-32440 |
| LDS Read (GB/s) | 1190 | 1613 | 891 | 2253 | 2703 | 5100-5407 |
| Constant Cache Read (GB/s) | 2381 | 3226 | 1782 | 4506 | 5407 | 10199-10813 |
| L1 Read (GB/s) | 595 | 806 | 445 | 1126 | 1352 | 2550-2703 |
| L2 Read (GB/s) | 397 | 461 | 297 | 410 | 451 | 850-901 |
| Global Memory (GB/s) | 128 | 134 | 115 | 160 | 176 | 320 |
| Global Limits | | | | | | |
| Max Wavefronts/GPU | 496 | 496 | 496 | 512 | 512 | 512 |
| Max Wavefronts/CU (avg) | 41.3 | 35.4 | 41.3 | 23.3 | 21.3 | 21.3 |
| Max Work-Items/GPU | 31744 | 31744 | 31744 | 32768 | 32768 | 32768 |
| Memory | | | | | | |
| Memory Channels | 8 | 8 | 8 | 8 | 8 | 8 |
| Memory Bus Width (bits) | 256 | 256 | 256 | 256 | 256 | 512 |
| Memory Type and Speed (MHz) | GDDR5, 1000 | GDDR5, 1050 | GDDR5, 900 | GDDR5, 1250 | GDDR5, 1375 | GDDR5, 1250 |
| Frame Buffer | 1 GB | 1 GB | 1 GB | 1 GB or 2 GB | 2 GB | 4 GB total |


Table D.3  Parameters for 65xx, 66xx, and 67xx Devices

| | Turks PRO | Turks XT | Whistler LP | Whistler PRO | Whistler XT | Barts LE |
|---|---|---|---|---|---|---|
| Product Name (AMD Radeon™ HD) | 6570 | 6670 | 6730M | 6750M | 6770M | 6790 |
| Engine Speed (MHz) | 650 | 800 | 485 | 600 | 725 | 840 |
| Compute Resources | | | | | | |
| Compute Units | 6 | 6 | 6 | 6 | 6 | 10 |
| Stream Cores | 96 | 96 | 96 | 96 | 96 | 160 |
| Processing Elements | 480 | 480 | 480 | 480 | 480 | 800 |
| Peak Gflops | 624 | 768 | 465.6 | 576 | 696 | 1344 |
| Cache and Register Sizes | | | | | | |
| # of Vector Registers/CU | 16384 | 16384 | 16384 | 16384 | 16384 | 16384 |
| Size of Vector Registers/CU | 256 kB | 256 kB | 256 kB | 256 kB | 256 kB | 256 kB |
| LDS Size/CU | 32 kB | 32 kB | 32 kB | 32 kB | 32 kB | 32 kB |
| LDS Banks/CU | 32 | 32 | 32 | 32 | 32 | 32 |
| Constant Cache/GPU | 16 kB | 16 kB | 16 kB | 16 kB | 16 kB | 32 kB |
| Max Constants/CU | 8 kB | 8 kB | 8 kB | 8 kB | 8 kB | 8 kB |
| L1 Cache Size/CU | 8 kB | 8 kB | 8 kB | 8 kB | 8 kB | 8 kB |
| L2 Cache Size/GPU | 256 kB | 256 kB | 256 kB | 256 kB | 256 kB | 512 kB |
| Peak GPU Bandwidths | | | | | | |
| Register Read (GB/s) | 2995 | 3686 | 2235 | 2765 | 3341 | 6451 |
| LDS Read (GB/s) | 499 | 614 | 372 | 461 | 557 | 1075 |
| Constant Cache Read (GB/s) | 998 | 1229 | 745 | 922 | 1114 | 2150 |
| L1 Read (GB/s) | 250 | 307 | 186 | 230 | 278 | 538 |
| L2 Read (GB/s) | 166 | 205 | 124 | 154 | 186 | 430 |
| Global Memory (GB/s) | 29 to 64 | 64 | 26 to 38 | 29 to 58 | 29 to 58 | 134 |
| Global Limits | | | | | | |
| Max Wavefronts/GPU | 248 | 248 | 248 | 248 | 248 | 496 |
| Max Wavefronts/CU (avg) | 41.3 | 41.3 | 41.3 | 41.3 | 41.3 | 49.6 |
| Max Work-Items/GPU | 15872 | 15872 | 15872 | 15872 | 15872 | 31744 |
| Memory | | | | | | |
| Memory Channels | 4 | 4 | 4 | 4 | 4 | 8 |
| Memory Bus Width (bits) | 128 | 128 | 128 | 128 | 128 | 256 |
| Memory Type and Speed (MHz) | GDDR5, 1000; or DDR3, 900 | GDDR5, 1000 | GDDR5, 600; or DDR3, 800 | GDDR5, 800; or DDR3, 900 | GDDR5, 900; or DDR3, 900 | GDDR5, 1050 |
| Frame Buffer | 512 MB or 1 GB for GDDR5; 1 or 2 GB for DDR3 | 512 MB or 1 GB | 256 MB | 1 GB | 1 GB or 2 GB | 1 GB |


Table D.4  Parameters for 64xx Devices

| | Seymour LP | Seymour PRO | Caicos | Seymour XT | Seymour XTX |
|---|---|---|---|---|---|
| Product Name (AMD Radeon™ HD) | 6430M | 6450M | 6450 | 6470M | 6490M |
| Engine Speed (MHz) | 480 | 600 | 625 to 750 | 700 to 750 | 800 |
| Compute Resources | | | | | |
| Compute Units | 2 | 2 | 2 | 2 | 2 |
| Stream Cores | 32 | 32 | 32 | 32 | 32 |
| Processing Elements | 160 | 160 | 160 | 160 | 160 |
| Peak Gflops | 153.6 | 192 | 200 to 240 | 224 to 240 | 256 |
| Cache and Register Sizes | | | | | |
| # of Vector Registers/CU | 16384 | 16384 | 16384 | 16384 | 16384 |
| Size of Vector Registers/CU | 256 kB | 256 kB | 256 kB | 256 kB | 256 kB |
| LDS Size/CU | 32 kB | 32 kB | 32 kB | 32 kB | 32 kB |
| LDS Banks/CU | 32 | 32 | 32 | 32 | 32 |
| Constant Cache/GPU | 4 kB | 4 kB | 4 kB | 4 kB | 4 kB |
| Max Constants/CU | 8 kB | 8 kB | 8 kB | 8 kB | 8 kB |
| L1 Cache Size/CU | 8 kB | 8 kB | 8 kB | 8 kB | 8 kB |
| L2 Cache Size/GPU | 128 kB | 128 kB | 128 kB | 128 kB | 128 kB |
| Peak GPU Bandwidths | | | | | |
| Register Read (GB/s) | 737 | 922 | 960 to 1152 | 1075 to 1152 | 1229 |
| LDS Read (GB/s) | 123 | 154 | 160 to 192 | 179 to 192 | 205 |
| Constant Cache Read (GB/s) | 246 | 307 | 320 to 384 | 358 to 384 | 410 |
| L1 Read (GB/s) | 61 | 77 | 80 to 96 | 90 to 96 | 102 |
| L2 Read (GB/s) | 61 | 77 | 80 to 96 | 90 to 96 | 102 |
| Global Memory (GB/s) | 13 | 13 | 13 to 29 | 14 to 26 | 14 to 26 |
| Global Limits | | | | | |
| Max Wavefronts/GPU | 192 | 192 | 192 | 192 | 192 |
| Max Wavefronts/CU (avg) | 96.0 | 96.0 | 96.0 | 96.0 | 96.0 |
| Max Work-Items/GPU | 12288 | 12288 | 12288 | 12288 | 12288 |
| Memory | | | | | |
| Memory Channels | 2 | 2 | 2 | 2 | 2 |
| Memory Bus Width (bits) | 64 | 64 | 64 | 64 | 64 |
| Memory Type and Speed (MHz) | DDR3, 800 | DDR3, 800 | GDDR5, 800 - 900; or DDR3, 800 | GDDR5, 800; or DDR3, 900 | GDDR5, 800; or DDR3, 900 |
| Frame Buffer | 512 MB | 1 GB | 512 MB or 1 GB | 512 MB or 1 GB | 512 MB or 1 GB |


Table D.5  Parameters for Zacate and Ontario Devices

| | Ontario | Ontario | Zacate | Zacate |
|---|---|---|---|---|
| Product Name (AMD Radeon™ HD) | C-30 | C-50 | E-240 | E-350 |
| Engine Speed (MHz) | 277 | 276 | 500 | 492 |
| Compute Resources | | | | |
| Compute Units | 2 | 2 | 2 | 2 |
| Stream Cores | 16 | 16 | 16 | 16 |
| Processing Elements | 80 | 80 | 80 | 80 |
| Peak Gflops | 44.32 | 44.16 | 80 | 78.72 |
| Cache and Register Sizes | | | | |
| # of Vector Registers/CU | 8192 | 8192 | 8192 | 8192 |
| Size of Vector Registers/CU | 128 kB | 128 kB | 128 kB | 128 kB |
| LDS Size/CU | 32 kB | 32 kB | 32 kB | 32 kB |
| LDS Banks/CU | 16 | 16 | 16 | 16 |
| Constant Cache/GPU | 4 kB | 4 kB | 4 kB | 4 kB |
| Max Constants/CU | 8 kB | 8 kB | 8 kB | 8 kB |
| L1 Cache Size/CU | 8 kB | 8 kB | 8 kB | 8 kB |
| L2 Cache Size/GPU | 64 kB | 64 kB | 64 kB | 64 kB |
| Peak GPU Bandwidths | | | | |
| Register Read (GB/s) | 213 | 212 | 384 | 378 |
| LDS Read (GB/s) | 35 | 35 | 64 | 63 |
| Constant Cache Read (GB/s) | 71 | 71 | 128 | 126 |
| L1 Read (GB/s) | 35 | 35 | 64 | 63 |
| L2 Read (GB/s) | 35 | 35 | 64 | 63 |
| Global Memory (GB/s) | 9 | 9 | 9 | 9 |
| Global Limits | | | | |
| Max Wavefronts/GPU | 192 | 192 | 192 | 192 |
| Max Wavefronts/CU (avg) | 96.0 | 96.0 | 96.0 | 96.0 |
| Max Work-Items/GPU | 6144 | 6144 | 6144 | 6144 |
| Memory | | | | |
| Memory Channels | 2 | 2 | 2 | 2 |
| Memory Bus Width (bits) | 64 | 64 | 64 | 64 |
| Memory Type and Speed (MHz) | DDR3, 533 | DDR3, 533 | DDR3, 533 | DDR3, 533 |
| Frame Buffer | Shared Memory | Shared Memory | Shared Memory | Shared Memory |


Table D.6  Parameters for 56xx, 57xx, 58xx, Eyefinity6, and 59xx Devices

| | Redwood XT | Juniper LE | Juniper XT | Cypress LE | Cypress PRO | Cypress XT | Hemlock |
|---|---|---|---|---|---|---|---|
| Product Name (ATI Radeon™ HD) | 5670 | 5750 | 5770 | 5830 | 5850 | 5870 | 5970 |
| Engine Speed (MHz) | 775 | 700 | 850 | 800 | 725 | 850 | 725 |
| Compute Resources | | | | | | | |
| Compute Units | 5 | 9 | 10 | 14 | 18 | 20 | 40 |
| Stream Cores | 80 | 144 | 160 | 224 | 288 | 320 | 640 |
| Processing Elements | 400 | 720 | 800 | 1120 | 1440 | 1600 | 3200 |
| Peak Gflops | 620 | 1008 | 1360 | 1792 | 2088 | 2720 | 4640 |
| Cache and Register Sizes | | | | | | | |
| # of Vector Registers/CU | 16384 | 16384 | 16384 | 16384 | 16384 | 16384 | 16384 |
| Size of Vector Registers/CU | 256 kB | 256 kB | 256 kB | 256 kB | 256 kB | 256 kB | 256 kB |
| LDS Size/CU | 32 kB | 32 kB | 32 kB | 32 kB | 32 kB | 32 kB | 32 kB |
| LDS Banks/CU | 16 | 32 | 32 | 32 | 32 | 32 | 32 |
| Constant Cache/GPU | 16 kB | 24 kB | 24 kB | 32 kB | 40 kB | 48 kB | 96 kB |
| Max Constants/CU | 8 kB | 8 kB | 8 kB | 8 kB | 8 kB | 8 kB | 8 kB |
| L1 Cache Size/CU | 8 kB | 8 kB | 8 kB | 8 kB | 8 kB | 8 kB | 8 kB |
| L2 Cache Size/GPU | 128 kB | 256 kB | 256 kB | 512 kB | 512 kB | 512 kB | 2 x 512 kB |
| Peak GPU Bandwidths | | | | | | | |
| Register Read (GB/s) | 2976 | 4838 | 6528 | 8602 | 10022 | 13056 | 22272 |
| LDS Read (GB/s) | 248 | 806 | 1088 | 1434 | 1670 | 2176 | 3712 |
| Constant Cache Read (GB/s) | 992 | 1613 | 2176 | 2867 | 3341 | 4352 | 7424 |
| L1 Read (GB/s) | 248 | 403 | 544 | 717 | 835 | 1088 | 1856 |
| L2 Read (GB/s) | 198 | 179 | 218 | 410 | 371 | 435 | 742 |
| Global Memory (GB/s) | 64 | 74 | 77 | 128 | 128 | 154 | 256 |
| Global Limits | | | | | | | |
| Max Wavefronts/GPU | 248 | 248 | 248 | 496 | 496 | 496 | 992 |
| Max Wavefronts/CU (avg) | 49.6 | 27.6 | 24.8 | 35.4 | 27.6 | 24.8 | 24.8 |
| Max Work-Items/GPU | 15872 | 15872 | 15872 | 31744 | 31744 | 31744 | 63488 |
| Memory | | | | | | | |
| Memory Channels | 4 | 4 | 4 | 8 | 8 | 8 | 2 x 8 |
| Memory Bus Width (bits) | 128 | 128 | 128 | 256 | 256 | 256 | 2 x 256 |
| Memory Type and Speed (MHz) | GDDR5, 1000 | GDDR5, 1150 | GDDR5, 1200 | GDDR5, 1000 | GDDR5, 1000 | GDDR5, 1200 | GDDR5, 1000 |
| Frame Buffer | 1 GB / 512 MB | 1 GB / 512 MB | 1 GB | 1 GB | 1 GB | 1 GB | 2 x 1 GB |


Table D.7  Parameters for Exxx, Cxx, 54xx, and 55xx Devices

| | Zacate | Zacate | Ontario | Ontario | Cedar | Redwood PRO2 | Redwood PRO |
|---|---|---|---|---|---|---|---|
| Product Name (ATI Radeon™ HD) | E-350 | E-240 | C-50 | C-30 | 5450 | 5550 | 5570 |
| Engine Speed (MHz) | 492 | 500 | 276 | 277 | 650 | 550 | 650 |
| Compute Resources | | | | | | | |
| Compute Units | 2 | 2 | 2 | 2 | 2 | 4 | 5 |
| Stream Cores | 16 | 16 | 16 | 16 | 16 | 64 | 80 |
| Processing Elements | 80 | 80 | 80 | 80 | 80 | 320 | 400 |
| Peak Gflops | 78.72 | 80 | 44.16 | 44.32 | 104 | 352 | 520 |
| Cache and Register Sizes | | | | | | | |
| # of Vector Registers/CU | 8192 | 8192 | 8192 | 8192 | 8192 | 16384 | 16384 |
| Size of Vector Registers/CU | 128 kB | 128 kB | 128 kB | 128 kB | 128 kB | 256 kB | 256 kB |
| LDS Size/CU | 32 kB | 32 kB | 32 kB | 32 kB | 32 kB | 32 kB | 32 kB |
| LDS Banks/CU | 16 | 16 | 16 | 16 | 16 | 16 | 16 |
| Constant Cache/GPU | 4 kB | 4 kB | 4 kB | 4 kB | 4 kB | 16 kB | 16 kB |
| Max Constants/CU | 4 kB | 4 kB | 4 kB | 4 kB | 4 kB | 8 kB | 8 kB |
| L1 Cache Size/CU | 8 kB | 8 kB | 8 kB | 8 kB | 8 kB | 8 kB | 8 kB |
| L2 Cache Size/GPU | 64 kB | 64 kB | 64 kB | 64 kB | 64 kB | 128 kB | 128 kB |
| Peak GPU Bandwidths | | | | | | | |
| Register Read (GB/s) | 378 | 384 | 212 | 213 | 499 | 1690 | 2496 |
| LDS Read (GB/s) | 63 | 64 | 35 | 35 | 83 | 141 | 208 |
| Constant Cache Read (GB/s) | 126 | 128 | 71 | 71 | 166 | 563 | 832 |
| L1 Read (GB/s) | 63 | 64 | 35 | 35 | 83 | 141 | 208 |
| L2 Read (GB/s) | 63 | 64 | 35 | 35 | 83 | 141 | 166 |
| Global Memory (GB/s) | 9 | 9 | 9 | 9 | 13 | 26 | 29 |
| Global Limits | | | | | | | |
| Max Wavefronts/GPU | 192 | 192 | 192 | 192 | 192 | 248 | 248 |
| Max Wavefronts/CU (avg) | 96.0 | 96.0 | 96.0 | 96.0 | 96.0 | 62.0 | 49.6 |
| Max Work-Items/GPU | 6144 | 6144 | 6144 | 6144 | 6144 | 15872 | 15872 |
| Memory | | | | | | | |
| Memory Channels | 2 | 2 | 2 | 2 | 2 | 4 | 4 |
| Memory Bus Width (bits) | 64 | 64 | 64 | 64 | 64 | 128 | 128 |
| Memory Type and Speed (MHz) | DDR3, 533 | DDR3, 533 | DDR3, 533 | DDR3, 533 | DDR3, 800 | DDR3, 800 | DDR3, 900 |
| Frame Buffer | Shared Memory | Shared Memory | Shared Memory | Shared Memory | 1 GB / 512 MB | 1 GB / 512 MB | 1 GB / 512 MB |


Appendix E
OpenCL Binary Image Format (BIF) v2.0
E.1 Overview
OpenCL Binary Image Format (BIF) 2.0 is based on the ELF format. BIF 2.0
allows the OpenCL binary to contain the OpenCL source program, the LLVM IR,
the AMD IL, and the executable. The BIF defines the following special sections:

•  .source: for storing the OpenCL source program.
•  .llvmir: for storing the OpenCL intermediate representation (LLVM IR).
•  .amdil: for storing the AMD IL that is generated from the OpenCL source program.
•  .comment: for storing the OpenCL version and the driver version that created the binary.

The BIF can have other special sections for debugging, etc. It also contains
several ELF special sections, such as .text for storing the executable, .rodata
for storing the OpenCL runtime control data, and other ELF special sections
required for forming an ELF (for example: .strtab, .symtab, .shstrtab).
By default, OpenCL generates a binary that has the LLVM IR, the AMD IL, and
the executable for the GPU (.llvmir, .amdil, and .text sections), as well as the
LLVM IR and the executable for the CPU (.llvmir and .text sections). The BIF
binary always contains a .comment section, which is a readable C string. The
default behavior can be changed with the BIF options described in Section E.2,
“BIF Options,” page E-3.
The LLVM IR enables recompilation from LLVM IR to the target. When a binary
is used to run on a device for which the original program was not generated,
and the original device is feature-compatible with the current device, OpenCL
recompiles the LLVM IR to generate new code for the device. Note that the
LLVM IR is only universal within devices that are feature-compatible in the same
device type, not across different device types: the LLVM IR for the CPU is not
compatible with the LLVM IR for the GPU, and the LLVM IR for a GPU works
only for GPU devices that have equivalent feature sets.
BIF 2.0 has been supported since Stream SDK 2.2.


E.1.1 Executable and Linkable Format (ELF) Header
For the ELF binary to be considered valid, the AMD OpenCL runtime expects
certain values to be specified. The following header fields must be set for all
binaries that are created outside of the OpenCL framework.
Table E.1  ELF Header Fields

| Field | Value | Description |
|---|---|---|
| e_ident[EI_CLASS] | ELFCLASS32, ELFCLASS64 | BIF can be either 32-bit ELF or 64-bit ELF. |
| e_ident[EI_DATA] | ELFDATA2LSB | BIF is stored in little-endian order. |
| e_ident[EI_OSABI] | ELFOSABI_NONE | Not used. |
| e_ident[EI_ABIVERSION] | 0 | Not used. |
| e_type | ET_NONE | Not used. |
| e_machine | oclElfTargets enum | CPU/GPU machine ID. |
| e_version | EV_CURRENT | Must be EV_CURRENT. |
| e_entry | 0 | Not used. |
| e_phoff | 0 | Not used. |
| e_flags | 0 | Not used. |
| e_phentsize | 0 | Not used. |
| e_phnum | 0 | Not used. |

The fields not shown in Table E.1 are given values according to the ELF
Specification. The e_machine value is defined as one of the oclElfTargets
enumerants; the values for these are:
e_machine =
{
    1001 + CALtargetEnum : GPU
    2002                 : CPU generic without SSE3
    2003                 : CPU generic with SSE3
}

typedef enum CALtargetEnum {
    CAL_TARGET_600 = 0,         /**< R600 GPU ISA */
    CAL_TARGET_610 = 1,         /**< RV610 GPU ISA */
    CAL_TARGET_630 = 2,         /**< RV630 GPU ISA */
    CAL_TARGET_670 = 3,         /**< RV670 GPU ISA */
    CAL_TARGET_7XX = 4,         /**< R700 class GPU ISA */
    CAL_TARGET_770 = 5,         /**< RV770 GPU ISA */
    CAL_TARGET_710 = 6,         /**< RV710 GPU ISA */
    CAL_TARGET_730 = 7,         /**< RV730 GPU ISA */
    CAL_TARGET_CYPRESS = 8,     /**< CYPRESS GPU ISA */
    CAL_TARGET_JUNIPER = 9,     /**< JUNIPER GPU ISA */
    CAL_TARGET_REDWOOD = 10,    /**< REDWOOD GPU ISA */
    CAL_TARGET_CEDAR = 11,      /**< CEDAR GPU ISA */
    CAL_TARGET_SUMO = 12,       /**< SUMO GPU ISA */
    CAL_TARGET_SUPERSUMO = 13,  /**< SUPERSUMO GPU ISA */
    CAL_TARGET_WRESTLER = 14,   /**< WRESTLER GPU ISA */
    CAL_TARGET_CAYMAN = 15,     /**< CAYMAN GPU ISA */
    CAL_TARGET_KAUAI = 16,      /**< KAUAI GPU ISA */
    CAL_TARGET_BARTS = 17,      /**< BARTS GPU ISA */
    CAL_TARGET_TURKS = 18,      /**< TURKS GPU ISA */
    CAL_TARGET_CAICOS = 19,     /**< CAICOS GPU ISA */
};
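
For example, applying the rule above, a BIF binary targeting a Cypress GPU
carries e_machine = 1001 + CAL_TARGET_CYPRESS = 1001 + 8 = 1009, while a
generic SSE3-capable CPU binary carries e_machine = 2003.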


E.1.2 Bitness
The BIF can be in either 32-bit or 64-bit ELF format. For the GPU, OpenCL
generates a 32-bit BIF binary; it can read either a 32-bit or a 64-bit BIF binary.
For the CPU, OpenCL generates and reads only 32-bit BIF binaries if the host
application is 32-bit (on either a 32-bit or a 64-bit OS). It generates and reads
only 64-bit BIF binaries if the host application is 64-bit (on a 64-bit OS).

E.2 BIF Options
OpenCL provides the following options to control what is contained in the binary.
-f[no-]bin-source — [not] generate OpenCL source in .source section.
-f[no-]bin-llvmir — [not] generate LLVM IR in .llvmir section.
-f[no-]bin-amdil — [not] generate AMD IL in .amdil section.
-f[no-]bin-exe — [not] generate the executable (ISA) in .text section.
The option syntax follows the GCC option syntax. Note that AMD IL is only valid
for the GPU. It is ignored if the device type is CPU.
By default, OpenCL generates the .llvmir section, .amdil section, and .text
section. The following are examples for using these options:
Example 1: Generate executable for execution:
clBuildProgram(program, 0, NULL, "-fno-bin-llvmir -fno-bin-amdil", NULL, NULL);

Example 2: Generate only LLVM IR:
clBuildProgram(program, 0, NULL, "-fno-bin-exe -fno-bin-amdil", NULL, NULL);

This binary can be recompiled for any other device of the same device type.
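
Once built, the binary itself can be retrieved with the standard program-query
interface and saved for later use with clCreateProgramWithBinary. The following
minimal sketch assumes a program built for a single device (variable names are
illustrative):

size_t binarySize = 0;
clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(binarySize), &binarySize, NULL);
unsigned char *binary = (unsigned char *)malloc(binarySize);
// CL_PROGRAM_BINARIES expects an array of pointers, one per device.
clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(binary), &binary, NULL);
// 'binary' now holds an ELF image whose sections reflect the -f[no-]bin-* options used.
free(binary);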


Appendix F
Open Decode API Tutorial

F.1 Overview
This section provides a basic tutorial for using the sample program for Open
Decode. The Open Decode API provides the ability to access the hardware for
fixed-function decoding using the AMD Unified Video Decoder (UVD) block on
the GPU for decoding H.264 video.
The AMD sample Open Video Decode, provided at
http://developer.amd.com/zones/OpenCLZone/pages/openclappexamples.aspx
shows how to read in compressed H.264 video elementary stream frames and
supporting parameters, then call the hardware decoder to decompress the video.
The following is an introduction for the OpenCL programmer to start using UVD
hardware; it shows how to perform a decode using the Open Video Decode API.
Open Decode allows the decompression to take place on the GPU, where the
OpenCL buffers reside. This lets applications perform post-processing
operations on the decompressed data on the GPU prior to rendering the frames.
Figure F.1 diagrams an example of an optional Gaussian blur operation on raw
decoded data. The video is operated on using the GPU compute shader units.
The finished video then is displayed using the OpenGL rendering pipeline. Note
that OpenCL post-processing takes place on buffers in GPU memory; these
buffers do not have to be copied to the CPU address space.

Figure F.1  Open Decode with Optional Post-Processing
[Pipeline diagram: Decode (Open Decode API driving the UVD hardware) ->
Post Processing (OpenCL compute shaders) -> Render (OpenGL display). Raw
YUV video after decode stays in OpenCL buffers that the compute shaders use
without copying; the post-processed YUV surface is used as a texture for
display.]


The five-step process consists of initialization, context creation, session creation,
decode execution, and finally session and context destruction.

F.2 Initializing
The first step in using Open Decode is to get the device info and capabilities
through OVDecodeGetDeviceInfo and OVDecodeGetDeviceCap.
OVDecodeGetDeviceInfo obtains the information about the device(s) and
initializes the UVD hardware and firmware. As a result of the call to
OVDecodeGetDeviceCap, the ovdecode_cap structures report the supported
output format and the compression profile. The application then can verify that
these values support the requested decode. The following code snippet shows
the use of OVDecodeGetDeviceInfo and OVDecodeGetDeviceCap.
ovdecode_device_info *deviceInfo = new ovdecode_device_info[numDevices];
status = OVDecodeGetDeviceInfo(&numDevices, deviceInfo);
unsigned int ovDeviceID = 0;
for (unsigned int i = 0; i < numDevices; i++)
{
    ovdecode_cap *caps = new ovdecode_cap[deviceInfo[i].decode_cap_size];
    status = OVDecodeGetDeviceCap(deviceInfo[i].device_id,
                                  deviceInfo[i].decode_cap_size,
                                  caps);
    for (unsigned int j = 0; j < deviceInfo[i].decode_cap_size; j++)
    {
        if (caps[j].output_format == OVD_NV12_INTERLEAVED_AMD &&
            caps[j].profile == OVD_H264_HIGH_41)
        {
            ovDeviceID = deviceInfo[i].device_id;
            break;
        }
    }
}

F.3 Creating the Context
The second step is to create the context for the decode session within OpenCL
using clCreateContext (see the following code snippet). The context creation
(tied to OpenCL queue creation) initializes the Open Decode function pointers,
callbacks, etc. Context creation also allocates buffers, such as timestamps and/or
synchronization buffers needed for the decode. (See Example Code 1 on
page 1-21 for clCreateContext usage.)
intptr_t properties[] =
{
    CL_CONTEXT_PLATFORM, (cl_context_properties)platform,
    0
};
ovdContext = clCreateContext(properties, 1, &clDeviceID, 0, 0, &err);

F.4 Creating the Session
The third step is to create the decode session using OVDecodeCreateSession.
This function, as shown in the following code snippet, creates the decode driver
session and performs the internal resource allocation.
From the capabilities confirmed in step 1, above, you now specify the decode
profile (H.264) and the output format (NV12 interleaved). The height and width
also are specified; these can be obtained by parsing the data from the input
stream.
ovdecode_profile profile = OVD_H264_HIGH_41;
ovdecode_format oFormat = OVD_NV12_INTERLEAVED_AMD;
oWidth = video_width;
oHeight = video_height;
session = OVDecodeCreateSession(ovdContext,
                                ovDeviceID,
                                profile,
                                oFormat,
                                oWidth,
                                oHeight);

F.5 Decoding
Decode execution goes through OpenCL and starts the UVD decode function.
For each OpenCL command sent to the Command Queue, a unique
Event_Object is returned. These Event_Objects are useful for forming
synchronization points and can be placed in Event Wait Lists by the programmer.
If the programmer creates an Event Wait List, each Event_Object in the list must
be completed, in list order, before this command is executed.
The Create Command Queue call, clCreateCommandQueue, sets up a queue into
which, later in the program, OpenCL commands are sent (for example, to run the
various OpenCL kernels).
cl_cmd_queue = clCreateCommandQueue((cl_context)ovdContext, clDeviceID, 0, &err);
The following code shows how the frame-info setup can be done with the
information read in from the video frame parameters.
slice_data_control_size = sliceNum * sizeof(ovd_slice_data_control);
slice_data_control = (ovd_slice_data_control *)malloc(slice_data_control_size);
pic_parameter_2_size = sizeof(H264_picture_parameter_2);
num_event_in_wait_list = 0;
bitstream_data_max_size = video_width * video_height * 3 / 2;
bitstream_data = (ovd_bitstream_data)malloc(bitstream_data_max_size);


// Size of NV12 format
int host_ptr_size = oHeight * video_pitch * 3 / 2;
host_ptr = malloc(host_ptr_size);

Create the output buffer:
output_surface = clCreateBuffer((cl_context)ovdContext,
                                CL_MEM_READ_WRITE,
                                host_ptr_size,
                                NULL,
                                &err);
The sample demonstrates how data can be read to provide Open Decode with
the information needed. Details can be obtained by reviewing the sample routine
ReadPictureData, which fills in the values needed to send into OVDecodePicture.
ReadPictureData(iFramesDecoded,
                &picture_parameter,
                &pic_parameter_2,
                pic_parameter_2_size,
                bitstream_data,
                &bitstream_data_read_size,
                bitstream_data_max_size,
                slice_data_control,
                slice_data_control_size);
The OVDecodePicture call decodes the frame and places the output in the
output surface buffer. OVDecodePicture is called in a loop until the end of the
input stream is reached.
OPEventHandle eventRunVideoProgram;
OVresult res = OVDecodePicture(session,
                               &picture_parameter,
                               &pic_parameter_2,
                               pic_parameter_2_size,
                               &bitstream_data,
                               bitstream_data_read_size,
                               slice_data_control,
                               slice_data_control_size,
                               output_surface,
                               num_event_in_wait_list,
                               NULL,
                               &eventRunVideoProgram,
                               0);
Wait until the Decode session completes:
err = clWaitForEvents(1, (cl_event *)&(eventRunVideoProgram));
if (err != CL_SUCCESS)
{
    std::cout << "clWaitForEvents failed" << std::endl;  // message text assumed; original string lost
    return false;
}

F.6 Destroying Session and Context
The final step is to release the resources and close the session. This is done by
releasing all the allocated memory and structures, then calling
OVDecodeDestroySession and clReleaseContext. These calls free the decode
driver session and all internal resources, and set the UVD clock to the idle state.
The following code snippet shows how this is done.
err = clReleaseMemObject((cl_mem)output_surface);
bool ovdErr = OVDecodeDestroySession(session);
err = clReleaseContext((cl_context)ovdContext);


Appendix G
OpenCL-OpenGL Interoperability

This appendix explains how to establish an association between a GL context
and a CL context.
Please note the following guidelines.
1. All devices used to create the OpenCL context associated with
command_queue must support acquiring shared CL/GL objects. This
constraint is enforced at context-creation time.
2. clCreateContext and clCreateContextFromType fail context creation if the
device list passed in cannot interoperate with the GL context.
clCreateContext only permits GL-friendly device(s).
clCreateContextFromType can only include GL-friendly device(s).
3. Use clGetGLContextInfoKHR to determine GL-friendly device(s) from the
following parameters (a sketch of this query follows the list):
a. CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR only returns the device that
can interoperate with the GL context.
b. CL_DEVICES_FOR_GL_CONTEXT_KHR includes all GL-context interoperable
devices.
4. While it is possible to create any number of GL contexts on a GPU, do not
concurrently create two GL contexts for two GPUs from the same process.
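
The following minimal sketch illustrates guideline 3b; it assumes properties[]
already holds the platform and GL context entries shown in the code later in this
appendix:

// List every CL device that can interoperate with the current GL context.
size_t bytes = 0;
status = clGetGLContextInfoKHR(properties, CL_DEVICES_FOR_GL_CONTEXT_KHR,
                               0, NULL, &bytes);
cl_device_id *glDevices = (cl_device_id *)malloc(bytes);
status = clGetGLContextInfoKHR(properties, CL_DEVICES_FOR_GL_CONTEXT_KHR,
                               bytes, glDevices, NULL);
size_t numGLDevices = bytes / sizeof(cl_device_id);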

G.1 Under Windows
This section discusses CL-GL interoperability for single and multiple GPU
systems running under Windows. The guidelines listed at the start of this
appendix apply here as well.

G.1.1 Single GPU Environment

G.1.1.1 Creating CL Context from a GL Context
Use the GLUT windowing system or the Win32 API for event handling.
Using GLUT
1. Use glutInit to initialize the GLUT library and negotiate a session with the
windowing system. This function also processes the command-line options,
depending on the windowing system.
2. Use wglGetCurrentContext to get the current rendering GL context
(HGLRC) of the calling thread.
3. Use wglGetCurrentDC to get the device context (HDC) that is associated
with the current OpenGL rendering context of the calling thread.
4. Use the clGetGLContextInfoKHR function (see Section 9.7 of the OpenCL
Specification 1.1) and the CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR
parameter to get the device ID of the CL device associated with the
OpenGL context.
5. Use clCreateContext (see Section 4.3 of the OpenCL Specification 1.1) to
create the CL context (of type cl_context).
The following code snippet shows how to create an interoperability context
using GLUT on a single-GPU system.
glutInit(&argc, argv);
glutInitDisplayMode(GLUT_RGBA | GLUT_DOUBLE);
glutInitWindowSize(WINDOW_WIDTH, WINDOW_HEIGHT);
glutCreateWindow("OpenCL SimpleGL");
HGLRC glCtx = wglGetCurrentContext();
cl_context_properties cpsGL[] =
{
    CL_CONTEXT_PLATFORM, (cl_context_properties)platform,
    CL_WGL_HDC_KHR,      (intptr_t)wglGetCurrentDC(),
    CL_GL_CONTEXT_KHR,   (intptr_t)glCtx,
    0
};
status = clGetGLContextInfoKHR(cpsGL,
                               CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
                               sizeof(cl_device_id),
                               &interopDevice,
                               NULL);
// Create OpenCL context from the device ID
context = clCreateContext(cpsGL, 1, &interopDevice, 0, 0, &status);


Using Win32 API
1. Use CreateWindow for window creation and get the window handle (HWND).
2. Use GetDC to get a handle to the device context for the client area of a
specific window, or for the entire screen. Alternatively, use the CreateDC
function to create a device context (HDC) for the specified device.
3. Use ChoosePixelFormat to match an appropriate pixel format supported by
a device context to a given pixel format specification.
4. Use SetPixelFormat to set the pixel format of the specified device context
to the format specified.
5. Use wglCreateContext to create a new OpenGL rendering context from the
device context (HDC).
6. Use wglMakeCurrent to bind the GL context created in the above step as
the current rendering context.
7. Use the clGetGLContextInfoKHR function (see Section 9.7 of the OpenCL
Specification 1.1) and the CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR parameter
to get the device ID of the CL device associated with the OpenGL context.
8. Use the clCreateContext function (see Section 4.3 of the OpenCL Specification
1.1) to create the CL context (of type cl_context).
The following code snippet shows how to create an interoperability context using
WIN32 API for windowing. (Users also can refer to the SimpleGL sample in the
AMD APP SDK samples.)
int pfmt;
PIXELFORMATDESCRIPTOR pfd;
ZeroMemory(&pfd, sizeof(PIXELFORMATDESCRIPTOR));  // clear first, then fill in the fields
pfd.nSize           = sizeof(PIXELFORMATDESCRIPTOR);
pfd.nVersion        = 1;
pfd.dwFlags         = PFD_DRAW_TO_WINDOW | PFD_SUPPORT_OPENGL | PFD_DOUBLEBUFFER;
pfd.iPixelType      = PFD_TYPE_RGBA;
pfd.cColorBits      = 24;
pfd.cRedBits        = 8;
pfd.cRedShift       = 0;
pfd.cGreenBits      = 8;
pfd.cGreenShift     = 0;
pfd.cBlueBits       = 8;
pfd.cBlueShift      = 0;
pfd.cAlphaBits      = 8;
pfd.cAlphaShift     = 0;
pfd.cAccumBits      = 0;
pfd.cAccumRedBits   = 0;
pfd.cAccumGreenBits = 0;
pfd.cAccumBlueBits  = 0;
pfd.cAccumAlphaBits = 0;
pfd.cDepthBits      = 24;
pfd.cStencilBits    = 8;
pfd.cAuxBuffers     = 0;
pfd.iLayerType      = PFD_MAIN_PLANE;
pfd.bReserved       = 0;
pfd.dwLayerMask     = 0;
pfd.dwVisibleMask   = 0;
pfd.dwDamageMask    = 0;
WNDCLASS windowclass;


windowclass.style = CS_OWNDC;
windowclass.lpfnWndProc = WndProc;
windowclass.cbClsExtra = 0;
windowclass.cbWndExtra = 0;
windowclass.hInstance = NULL;
windowclass.hIcon = LoadIcon(NULL, IDI_APPLICATION);
windowclass.hCursor = LoadCursor(NULL, IDC_ARROW);
windowclass.hbrBackground = (HBRUSH)GetStockObject(BLACK_BRUSH);
windowclass.lpszMenuName = NULL;
windowclass.lpszClassName = reinterpret_cast<LPCSTR>("SimpleGL");
RegisterClass(&windowclass);
gHwnd = CreateWindow(reinterpret_cast<LPCSTR>("SimpleGL"),
                     reinterpret_cast<LPCSTR>("SimpleGL"),
                     WS_CAPTION | WS_POPUPWINDOW | WS_VISIBLE,
                     0,
                     0,
                     screenWidth,
                     screenHeight,
                     NULL,
                     NULL,
                     windowclass.hInstance,
                     NULL);
hDC = GetDC(gHwnd);
pfmt = ChoosePixelFormat(hDC, &pfd);
ret = SetPixelFormat(hDC, pfmt, &pfd);
hRC = wglCreateContext(hDC);
ret = wglMakeCurrent(hDC, hRC);
cl_context_properties properties[] =
{
    CL_CONTEXT_PLATFORM, (cl_context_properties)platform,
    CL_GL_CONTEXT_KHR,   (cl_context_properties)hRC,
    CL_WGL_HDC_KHR,      (cl_context_properties)hDC,
    0
};
status = clGetGLContextInfoKHR(properties,
                               CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
                               sizeof(cl_device_id),
                               &interopDevice,
                               NULL);
// Create OpenCL context from the device ID
context = clCreateContext(properties, 1, &interopDevice, 0, 0, &status);

G.1.2 Multi-GPU Environment

G.1.2.1 Creating CL Context from a GL Context
Do not use the GLUT windowing system in a multi-GPU environment: it always
creates the GL context on the primary display, and it is not possible to specify
which display device to select for the GL context.


To use the Win32 API for windowing in a multi-GPU environment:
1. Detect each display by using the EnumDisplayDevices function. This function
lets you obtain information about the display devices in the current session.
2. To query all display devices in the current session, call this function in a loop,
starting with DevNum set to 0, and incrementing DevNum until the function fails.
To select all display devices in the desktop, use only the display devices that
have the DISPLAY_DEVICE_ATTACHED_TO_DESKTOP flag in the
DISPLAY_DEVICE structure.
3. To get information on the display adapter, call EnumDisplayDevices with
lpDevice set to NULL. For example, DISPLAY_DEVICE.DeviceString
contains the adapter name.
4. Use EnumDisplaySettings to get DEVMODE. dmPosition.x and
dmPosition.y are used to get the x and y coordinates of the current display.
5. Try to find the first OpenCL device (winner) associated with the OpenGL
rendering context by using the loop technique described in step 2, above.
6. Inside the loop:
a. Create a window on a specific display by using the CreateWindow
function. This function returns the window handle (HWND).
b. Use GetDC to get a handle to the device context for the client area of a
specific window, or for the entire screen. Alternatively, use the CreateDC
function to create a device context (HDC) for the specified device.
c. Use ChoosePixelFormat to match an appropriate pixel format supported
by a device context to a given pixel format specification.
d. Use SetPixelFormat to set the pixel format of the specified device
context to the format specified.
e. Use wglCreateContext to create a new OpenGL rendering context from
the device context (HDC).
f. Use wglMakeCurrent to bind the GL context created in the above step
as the current rendering context.
g. Use clGetGLContextInfoKHR (see Section 9.7 of the OpenCL
Specification 1.1) and the CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR
parameter to get the number of GL-associated devices for CL context
creation. If the number of devices is zero, go to the next display in the
loop. Otherwise, use clGetGLContextInfoKHR and the
CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR parameter to get the device
ID of the CL device associated with the OpenGL context.
h. Use clCreateContext (see Section 4.3 of the OpenCL Specification
1.1) to create the CL context (of type cl_context).

The following code demonstrates how to use the Win32 windowing API for
CL-GL interoperability in a multi-GPU environment.


int xCoordinate = 0;
int yCoordinate = 0;
for (deviceNum = 0; EnumDisplayDevices(NULL, deviceNum, &dispDevice, 0); deviceNum++)
{
    if (dispDevice.StateFlags & DISPLAY_DEVICE_MIRRORING_DRIVER)
    {
        continue;
    }
    DEVMODE deviceMode;
    EnumDisplaySettings(dispDevice.DeviceName, ENUM_CURRENT_SETTINGS, &deviceMode);
    xCoordinate = deviceMode.dmPosition.x;
    yCoordinate = deviceMode.dmPosition.y;
    WNDCLASS windowclass;
    windowclass.style = CS_OWNDC;
    windowclass.lpfnWndProc = WndProc;
    windowclass.cbClsExtra = 0;
    windowclass.cbWndExtra = 0;
    windowclass.hInstance = NULL;
    windowclass.hIcon = LoadIcon(NULL, IDI_APPLICATION);
    windowclass.hCursor = LoadCursor(NULL, IDC_ARROW);
    windowclass.hbrBackground = (HBRUSH)GetStockObject(BLACK_BRUSH);
    windowclass.lpszMenuName = NULL;
    windowclass.lpszClassName = reinterpret_cast<LPCSTR>("SimpleGL");
    RegisterClass(&windowclass);
    gHwnd = CreateWindow(reinterpret_cast<LPCSTR>("SimpleGL"),
                         reinterpret_cast<LPCSTR>("OpenGL Texture Renderer"),
                         WS_CAPTION | WS_POPUPWINDOW,
                         xCoordinate,
                         yCoordinate,
                         screenWidth,
                         screenHeight,
                         NULL,
                         NULL,
                         windowclass.hInstance,
                         NULL);
    hDC = GetDC(gHwnd);
    pfmt = ChoosePixelFormat(hDC, &pfd);
    ret = SetPixelFormat(hDC, pfmt, &pfd);
    hRC = wglCreateContext(hDC);
    ret = wglMakeCurrent(hDC, hRC);
    cl_context_properties properties[] =
    {
        CL_CONTEXT_PLATFORM, (cl_context_properties)platform,
        CL_GL_CONTEXT_KHR,   (cl_context_properties)hRC,
        CL_WGL_HDC_KHR,      (cl_context_properties)hDC,
        0
    };
    if (!clGetGLContextInfoKHR)
    {
        clGetGLContextInfoKHR = (clGetGLContextInfoKHR_fn)
            clGetExtensionFunctionAddress("clGetGLContextInfoKHR");
    }
    size_t deviceSize = 0;
    status = clGetGLContextInfoKHR(properties,
                                   CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
                                   0,
                                   NULL,
                                   &deviceSize);
    if (deviceSize == 0)
    {
        // no interoperable CL device found, cleanup
        wglMakeCurrent(NULL, NULL);
        wglDeleteContext(hRC);
        DeleteDC(hDC);
        hDC = NULL;
        hRC = NULL;
        DestroyWindow(gHwnd);
        // try the next display
        continue;
    }
    ShowWindow(gHwnd, SW_SHOW);
    // Found a winner
    break;
}
cl_context_properties properties[] =
{
    CL_CONTEXT_PLATFORM, (cl_context_properties)platform,
    CL_GL_CONTEXT_KHR,   (cl_context_properties)hRC,
    CL_WGL_HDC_KHR,      (cl_context_properties)hDC,
    0
};
status = clGetGLContextInfoKHR(properties,
                               CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
                               sizeof(cl_device_id),
                               &interopDevice,
                               NULL);
// Create OpenCL context from the device ID
context = clCreateContext(properties, 1, &interopDevice, 0, 0, &status);

G.1.3 Limitations
•  It is recommended not to use GLUT in a multi-GPU environment.
•  AMD currently supports CL-GL interoperability only in a single-GPU environment.


G.2 Linux Operating System

G.2.1 Single GPU Environment

G.2.1.1 Creating CL Context from a GL Context
Using GLUT
1. Use glutInit to initialize the GLUT library and to negotiate a session with
the windowing system. This function also processes the command-line
options depending on the windowing system.
2. Use glXGetCurrentContext to get the current rendering context
(GLXContext).
3. Use glXGetCurrentDisplay to get the display (Display *) that is associated
with the current OpenGL rendering context of the calling thread.
4. Use clGetGLContextInfoKHR (see Section 9.7 of the OpenCL Specification
1.1) and the CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR parameter to get the
device ID of the CL device associated with the OpenGL context.
5. Use clCreateContext (see Section 4.3 of the OpenCL Specification 1.1) to
create the CL context (of type cl_context).
The following code snippet shows how to create an interoperability context using
GLUT in Linux.
glutInit(&argc, argv);
glutInitDisplayMode(GLUT_RGBA | GLUT_DOUBLE);
glutInitWindowSize(WINDOW_WIDTH, WINDOW_HEIGHT);
glutCreateWindow("OpenCL SimpleGL");
GLXContext glCtx = glXGetCurrentContext();
cl_context_properties cpsGL[] =
{
    CL_CONTEXT_PLATFORM, (cl_context_properties)platform,
    CL_GLX_DISPLAY_KHR,  (intptr_t)glXGetCurrentDisplay(),
    CL_GL_CONTEXT_KHR,   (intptr_t)glCtx,
    0
};
status = clGetGLContextInfoKHR(cpsGL,
                               CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
                               sizeof(cl_device_id),
                               &interopDevice,
                               NULL);
// Create OpenCL context from the device ID
context = clCreateContext(cpsGL, 1, &interopDevice, 0, 0, &status);


Using X Window System
1. Use XOpenDisplay to open a connection to the server that controls a display.
2. Use glXChooseFBConfig to get a list of GLX frame buffer configurations that
match the specified attributes.
3. Use glXChooseVisual to get a visual that matches specified attributes.
4. Use XCreateColormap to create a color map of the specified visual type for
the screen on which the specified window resides and return the colormap
ID associated with it. Note that the specified window is only used to
determine the screen.
5. Use XCreateWindow to create an unmapped sub-window for a specified
parent window; this returns the window ID of the created window and causes
the X server to generate a CreateNotify event. The created window is placed
on top in the stacking order with respect to siblings.
6. Use XMapWindow to map the window and all of its sub-windows that have had
map requests. Mapping a window that has an unmapped ancestor does not
display the window, but marks it as eligible for display when the ancestor
becomes mapped. Such a window is called unviewable. When all its
ancestors are mapped, the window becomes viewable and is visible on the
screen if it is not obscured by another window.
7. Use glXCreateContextAttribsARB to initialize the context to the initial state
defined by the OpenGL specification and return a handle to it. This handle
can be used to render to any GLX surface.
8. Use glXMakeCurrent to make argument 3 (GLXContext) the current GLX
rendering context of the calling thread, replacing the previously current
context if there was one, and to attach argument 3 (GLXContext) to a GLX
drawable, either a window or a GLX pixmap.
9. Use clGetGLContextInfoKHR to get the OpenCL-OpenGL interoperability
device corresponding to the window created in step 5.
10. Use clCreateContext to create the context on the interoperable device
obtained in step 9.
The following code snippet shows how to create a CL-GL interoperability context
using the X Window system in Linux.
Display *displayName = XOpenDisplay(0);
int nelements;
GLXFBConfig *fbc = glXChooseFBConfig(displayName,
                                     DefaultScreen(displayName),
                                     0,
                                     &nelements);
static int attributeList[] = {
    GLX_RGBA,
    GLX_DOUBLEBUFFER,
    GLX_RED_SIZE,   1,
    GLX_GREEN_SIZE, 1,
    GLX_BLUE_SIZE,  1,
    None
};
XVisualInfo *vi = glXChooseVisual(displayName,
                                  DefaultScreen(displayName),
                                  attributeList);
XSetWindowAttributes swa;
swa.colormap = XCreateColormap(displayName,
                               RootWindow(displayName, vi->screen),
                               vi->visual,
                               AllocNone);
swa.border_pixel = 0;
swa.event_mask = StructureNotifyMask;
Window win = XCreateWindow(displayName,
                           RootWindow(displayName, vi->screen),
                           10, 10,
                           WINDOW_WIDTH, WINDOW_HEIGHT,
                           0,
                           vi->depth,
                           InputOutput,
                           vi->visual,
                           CWBorderPixel | CWColormap | CWEventMask,
                           &swa);
XMapWindow(displayName, win);
std::cout << "glXCreateContextAttribsARB "
          << (void *)glXGetProcAddress((const GLubyte *)"glXCreateContextAttribsARB")
          << std::endl;
GLXCREATECONTEXTATTRIBSARBPROC glXCreateContextAttribsARB =
    (GLXCREATECONTEXTATTRIBSARBPROC)glXGetProcAddress(
        (const GLubyte *)"glXCreateContextAttribsARB");
int attribs[] = {
    GLX_CONTEXT_MAJOR_VERSION_ARB, 3,
    GLX_CONTEXT_MINOR_VERSION_ARB, 0,
    0
};
GLXContext ctx = glXCreateContextAttribsARB(displayName, *fbc, 0, true, attribs);
glXMakeCurrent(displayName, win, ctx);
cl_context_properties cpsGL[] = {
    CL_CONTEXT_PLATFORM, (cl_context_properties)platform,
    CL_GLX_DISPLAY_KHR,  (intptr_t)glXGetCurrentDisplay(),
    CL_GL_CONTEXT_KHR,   (intptr_t)gGlCtx,
    0
};
status = clGetGLContextInfoKHR(cpsGL,
                               CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
                               sizeof(cl_device_id),
                               &interopDeviceId,
                               NULL);
// Create OpenCL context from the device ID
context = clCreateContext(cpsGL, 1, &interopDeviceId, 0, 0, &status);

G.2.2 Multi-GPU Configuration

G.2.2.1 Creating CL Context from a GL Context
Using X Window System
1. Use XOpenDisplay to open a connection to the server that controls a display.
2. Use ScreenCount to get the number of available screens.
3. Use XCloseDisplay to close the connection to the X server for the display
specified in the Display structure and destroy all windows, resource IDs
(Window, Font, Pixmap, Colormap, Cursor, and GContext), or other
resources that the client created on this display.
4. Use a FOR loop to enumerate the displays. To change the display, change
the value of the environment variable DISPLAY.
5. Inside the loop:
a. Use putenv to set the environment variable DISPLAY with respect to the
display number.
b. Use XOpenDisplay to open a connection to the server that controls a
display.
c. Use glXChooseFBConfig to get a list of GLX frame buffer configurations
that match the specified attributes.
d. Use glXChooseVisual to get a visual that matches specified attributes.
e. Use XCreateColormap to create a color map of the specified visual type
for the screen on which the specified window resides and return the
colormap ID associated with it. Note that the specified window is only
used to determine the screen.
f. Use XCreateWindow to create an unmapped sub-window for a specified
parent window; this returns the window ID of the created window and
causes the X server to generate a CreateNotify event. The created
window is placed on top in the stacking order with respect to siblings.
g. Use XMapWindow to map the window and all of its sub-windows that have
had map requests. Mapping a window that has an unmapped ancestor
does not display the window, but marks it as eligible for display when the
ancestor becomes mapped. Such a window is called unviewable. When
all its ancestors are mapped, the window becomes viewable and is
visible on the screen if it is not obscured by another window.
h. Use glXCreateContextAttribsARB to initialize the context to the initial
state defined by the OpenGL specification and return a handle to it. This
handle can be used to render to any GLX surface.
i. Use glXMakeCurrent to make argument 3 (GLXContext) the current
GLX rendering context of the calling thread, replacing the previously
current context if there was one, and to attach argument 3 (GLXContext)
to a GLX drawable, either a window or a GLX pixmap.
j. Use clGetGLContextInfoKHR to get the number of OpenCL-OpenGL
interoperability devices corresponding to the window created in step f,
above.
k. If the number of interoperable devices is zero, use glXDestroyContext
to destroy the context created at step h, and go to step a; otherwise, exit
from the loop (an OpenCL-OpenGL interoperable device has been
found).
6. Use clGetGLContextInfoKHR to get the OpenCL-OpenGL interoperable
device ID.
7. Use clCreateContext to create the context on the interoperable device
obtained in the previous step.
The following code segment shows how to create an OpenCL-OpenGL
interoperability context on a system with multiple GPUs.
displayName = XOpenDisplay(NULL);
int screenNumber = ScreenCount(displayName);
XCloseDisplay(displayName);
for (int i = 0; i < screenNumber; i++)
{
    if (isDeviceIdEnabled())
    {
        if (i < deviceId)
        {
            continue;
        }
    }
    char disp[100];
    sprintf(disp, "DISPLAY=:0.%d", i);
    putenv(disp);
    displayName = XOpenDisplay(0);
    int nelements;
    GLXFBConfig *fbc = glXChooseFBConfig(displayName,
                                         DefaultScreen(displayName),
                                         0,
                                         &nelements);
    static int attributeList[] = {
        GLX_RGBA,
        GLX_DOUBLEBUFFER,
        GLX_RED_SIZE,   1,
        GLX_GREEN_SIZE, 1,
        GLX_BLUE_SIZE,  1,
        None
    };
    XVisualInfo *vi = glXChooseVisual(displayName,
                                      DefaultScreen(displayName),
                                      attributeList);
    XSetWindowAttributes swa;
    swa.colormap = XCreateColormap(displayName,
                                   RootWindow(displayName, vi->screen),
                                   vi->visual,
                                   AllocNone);
    swa.border_pixel = 0;
    swa.event_mask = StructureNotifyMask;
    win = XCreateWindow(displayName,
                        RootWindow(displayName, vi->screen),
                        10, 10,
                        width, height,
                        0,
                        vi->depth,
                        InputOutput,
                        vi->visual,
                        CWBorderPixel | CWColormap | CWEventMask,
                        &swa);
    XMapWindow(displayName, win);
    int attribs[] = {
        GLX_CONTEXT_MAJOR_VERSION_ARB, 3,
        GLX_CONTEXT_MINOR_VERSION_ARB, 0,
        0
    };
    GLXContext ctx = glXCreateContextAttribsARB(displayName, *fbc, 0, true, attribs);
    glXMakeCurrent(displayName, win, ctx);
    gGlCtx = glXGetCurrentContext();
    cl_context_properties cpsGL[] = {
        CL_CONTEXT_PLATFORM, (cl_context_properties)platform,
        CL_GLX_DISPLAY_KHR,  (intptr_t)glXGetCurrentDisplay(),
        CL_GL_CONTEXT_KHR,   (intptr_t)gGlCtx,
        0
    };
    size_t deviceSize = 0;
    status = clGetGLContextInfoKHR(cpsGL,
                                   CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
                                   0,
                                   NULL,
                                   &deviceSize);
    int numDevices = (deviceSize / sizeof(cl_device_id));
    if (numDevices == 0)
    {
        glXDestroyContext(glXGetCurrentDisplay(), gGlCtx);
        continue;
    }
    else
    {
        // Interoperable device found
        std::cout << "Interoperable device found" << std::endl;
        break;
    }
}