CUDA Programming
A Developer’s Guide to Parallel
Computing with GPUs


CUDA Programming
A Developer’s Guide to Parallel
Computing with GPUs

Shane Cook

AMSTERDAM • BOSTON • HEIDELBERG • LONDON
NEW YORK • OXFORD • PARIS • SAN DIEGO
SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO
Morgan Kaufmann is an Imprint of Elsevier

Acquiring Editor: Todd Green
Development Editor: Robyn Day
Project Manager: Andre Cuello
Designer: Kristen Davis
Morgan Kaufmann is an imprint of Elsevier
225 Wyman Street, Waltham, MA 02451, USA
© 2013 Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or
mechanical, including photocopying, recording, or any information storage and retrieval system, without
permission in writing from the publisher. Details on how to seek permission, further information about the
Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance
Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.
This book and the individual contributions contained in it are protected under copyright by the Publisher
(other than as may be noted herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our
understanding, changes in research methods or professional practices may become necessary. Practitioners
and researchers must always rely on their own experience and knowledge in evaluating and using any information
or methods described herein. In using such information or methods they should be mindful of their own safety and
the safety of others, including parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any
liability for any injury and/or damage to persons or property as a matter of products liability, negligence or
otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the
material herein.
Library of Congress Cataloging-in-Publication Data
Application submitted
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.
ISBN: 978-0-12-415933-4
For information on all MK publications
visit our website at http://store.elsevier.com
Printed in the United States of America
13 14 10 9 8 7 6 5 4 3 2 1

Contents
Preface ................................................................................................................................................ xiii

CHAPTER 1 A Short History of Supercomputing................................................ 1

Introduction ................................................................................................................ 1
Von Neumann Architecture........................................................................................ 2
Cray............................................................................................................................. 5
Connection Machine................................................................................................... 6
Cell Processor............................................................................................................. 7
Multinode Computing ................................................................................................ 9
The Early Days of GPGPU Coding ......................................................................... 11
The Death of the Single-Core Solution ................................................................... 12
NVIDIA and CUDA................................................................................................. 13
GPU Hardware ......................................................................................................... 15
Alternatives to CUDA .............................................................................................. 16
OpenCL ............................................................................................................... 16
DirectCompute .................................................................................................... 17
CPU alternatives.................................................................................................. 17
Directives and libraries ....................................................................................... 18
Conclusion ................................................................................................................ 19

CHAPTER 2 Understanding Parallelism with GPUs ......................................... 21

Introduction .............................................................................................................. 21
Traditional Serial Code ............................................................................................ 21
Serial/Parallel Problems ........................................................................................... 23
Concurrency.............................................................................................................. 24
Locality................................................................................................................ 25
Types of Parallelism ................................................................................................. 27
Task-based parallelism ........................................................................................ 27
Data-based parallelism ........................................................................................ 28
Flynn’s Taxonomy .................................................................................................... 30
Some Common Parallel Patterns.............................................................................. 31
Loop-based patterns ............................................................................................ 31
Fork/join pattern.................................................................................................. 33
Tiling/grids .......................................................................................................... 35
Divide and conquer ............................................................................................. 35
Conclusion ................................................................................................................ 36

CHAPTER 3 CUDA Hardware Overview........................................................... 37
PC Architecture ........................................................................................................ 37
GPU Hardware ......................................................................................................... 42


CPUs and GPUs ....................................................................................................... 46
Compute Levels........................................................................................................ 46
Compute 1.0 ........................................................................................................ 47
Compute 1.1 ........................................................................................................ 47
Compute 1.2 ........................................................................................................ 49
Compute 1.3 ........................................................................................................ 49
Compute 2.0 ........................................................................................................ 49
Compute 2.1 ........................................................................................................ 51

CHAPTER 4 Setting Up CUDA ........................................................................ 53

Introduction .............................................................................................................. 53
Installing the SDK under Windows ......................................................................... 53
Visual Studio ............................................................................................................ 54
Projects ................................................................................................................ 55
64-bit users .......................................................................................................... 55
Creating projects ................................................................................................. 57
Linux......................................................................................................................... 58
Kernel base driver installation (CentOS, Ubuntu 10.4) ..................................... 59
Mac ........................................................................................................................... 62
Installing a Debugger ............................................................................................... 62
Compilation Model................................................................................................... 66
Error Handling.......................................................................................................... 67
Conclusion ................................................................................ 68

CHAPTER 5 Grids, Blocks, and Threads ......................................................... 69

What it all Means ..................................................................................................... 69
Threads ..................................................................................................................... 69
Problem decomposition....................................................................................... 69
How CPUs and GPUs are different .................................................................... 71
Task execution model.......................................................................................... 72
Threading on GPUs............................................................................................. 73
A peek at hardware ............................................................................................. 74
CUDA kernels ..................................................................................................... 77
Blocks ....................................................................................................................... 78
Block arrangement .............................................................................................. 80
Grids ......................................................................................................................... 83
Stride and offset .................................................................................................. 84
X and Y thread indexes........................................................................................ 85
Warps ........................................................................................................................ 91
Branching ............................................................................................................ 92
GPU utilization.................................................................................................... 93
Block Scheduling ..................................................................................................... 95


A Practical Example: Histograms .......................................................... 97
Conclusion .............................................................................................................. 103
Questions ........................................................................................................... 104
Answers ............................................................................................................. 104

CHAPTER 6 Memory Handling with CUDA .................................................... 107

Introduction ............................................................................................................ 107
Caches..................................................................................................................... 108
Types of data storage ........................................................................................ 110
Register Usage........................................................................................................ 111
Shared Memory ...................................................................................................... 120
Sorting using shared memory ........................................................................... 121
Radix sort .......................................................................................................... 125
Merging lists...................................................................................................... 131
Parallel merging ................................................................................................ 137
Parallel reduction............................................................................................... 140
A hybrid approach............................................................................................. 144
Shared memory on different GPUs................................................................... 148
Shared memory summary ................................................................................. 148
Questions on shared memory............................................................................ 149
Answers for shared memory ............................................................................. 149
Constant Memory ................................................................................................... 150
Constant memory caching................................................................................. 150
Constant memory broadcast.............................................................................. 152
Constant memory updates at runtime ............................................................... 162
Constant question .............................................................................................. 166
Constant answer ................................................................................................ 167
Global Memory ...................................................................................................... 167
Score boarding................................................................................................... 176
Global memory sorting ..................................................................................... 176
Sample sort........................................................................................................ 179
Questions on global memory ............................................................................ 198
Answers on global memory .............................................................................. 199
Texture Memory ..................................................................................................... 200
Texture caching ................................................................................................. 200
Hardware manipulation of memory fetches ..................................................... 200
Restrictions using textures ................................................................................ 201
Conclusion .............................................................................................................. 202

CHAPTER 7 Using CUDA in Practice............................................................ 203

Introduction ............................................................................................................ 203
Serial and Parallel Code......................................................................................... 203
Design goals of CPUs and GPUs ..................................................................... 203


Algorithms that work best on the CPU versus the GPU.................................. 206
Processing Datasets ................................................................................................ 209
Using ballot and other intrinsic operations....................................................... 211
Profiling .................................................................................................................. 219
An Example Using AES ........................................................................................ 231
The algorithm .................................................................................................... 232
Serial implementations of AES ........................................................................ 236
An initial kernel ................................................................................................ 239
Kernel performance........................................................................................... 244
Transfer performance ........................................................................................ 248
A single streaming version ............................................................................... 249
How do we compare with the CPU .................................................................. 250
Considerations for running on other GPUs ...................................................... 260
Using multiple streams...................................................................................... 263
AES summary ................................................................................................... 264
Conclusion .............................................................................................................. 265
Questions ........................................................................................................... 265
Answers ............................................................................................................. 265
References .............................................................................................................. 266

CHAPTER 8 Multi-CPU and Multi-GPU Solutions .......................................... 267

Introduction ............................................................................................................ 267
Locality ................................................................................................................... 267
Multi-CPU Systems................................................................................................ 267
Multi-GPU Systems................................................................................................ 268
Algorithms on Multiple GPUs ............................................................................... 269
Which GPU?........................................................................................................... 270
Single-Node Systems.............................................................................................. 274
Streams ................................................................................................................... 275
Multiple-Node Systems .......................................................................................... 290
Conclusion .............................................................................................................. 301
Questions ........................................................................................................... 302
Answers ............................................................................. 302

CHAPTER 9 Optimizing Your Application ...................................................... 305

Strategy 1: Parallel/Serial GPU/CPU Problem Breakdown .................................. 305
Analyzing the problem...................................................................................... 305
Time................................................................................................................... 305
Problem decomposition..................................................................................... 307
Dependencies..................................................................................................... 308
Dataset size........................................................................................................ 311


Resolution.......................................................................................................... 312
Identifying the bottlenecks................................................................................ 313
Grouping the tasks for CPU and GPU.............................................................. 317
Section summary ............................................................................................... 320
Strategy 2: Memory Considerations ...................................................................... 320
Memory bandwidth ........................................................................................... 320
Source of limit................................................................................................... 321
Memory organization ........................................................................................ 323
Memory accesses to computation ratio ............................................................ 325
Loop and kernel fusion ..................................................................................... 331
Use of shared memory and cache..................................................................... 332
Section summary ............................................................................................... 333
Strategy 3: Transfers .............................................................................................. 334
Pinned memory ................................................................................................. 334
Zero-copy memory............................................................................................ 338
Bandwidth limitations ....................................................................................... 347
GPU timing ....................................................................................................... 351
Overlapping GPU transfers ............................................................................... 356
Section summary ............................................................................................... 360
Strategy 4: Thread Usage, Calculations, and Divergence ..................................... 361
Thread memory patterns ................................................................................... 361
Inactive threads.................................................................................................. 364
Arithmetic density ............................................................................................. 365
Some common compiler optimizations ............................................................ 369
Divergence......................................................................................................... 374
Understanding the low-level assembly code .................................................... 379
Register usage ................................................................................................... 383
Section summary ............................................................................................... 385
Strategy 5: Algorithms ........................................................................................... 386
Sorting ............................................................................................................... 386
Reduction........................................................................................................... 392
Section summary ............................................................................................... 414
Strategy 6: Resource Contentions .......................................................................... 414
Identifying bottlenecks...................................................................................... 414
Resolving bottlenecks ....................................................................................... 427
Section summary ............................................................................................... 434
Strategy 7: Self-Tuning Applications..................................................................... 435
Identifying the hardware ................................................................................... 436
Device utilization .............................................................................................. 437
Sampling performance ...................................................................................... 438
Section summary ............................................................................................... 439
Conclusion .............................................................................................................. 439
Questions on Optimization................................................................................ 439
Answers ............................................................................................................. 440


CHAPTER 10 Libraries and SDK .................................................................. 441

Introduction.......................................................................................................... 441
Libraries ............................................................................................................... 441
General library conventions ........................................................................... 442
NPP (Nvidia Performance Primitives) ........................................................... 442
Thrust .............................................................................................................. 451
CuRAND......................................................................................................... 467
CuBLAS (CUDA basic linear algebra) library.............................................. 471
CUDA Computing SDK ...................................................................................... 475
Device Query .................................................................................................. 476
Bandwidth test ................................................................................................ 478
SimpleP2P....................................................................................................... 479
asyncAPI and cudaOpenMP........................................................................... 482
Aligned types .................................................................................................. 489
Directive-Based Programming ............................................................................ 491
OpenACC........................................................................................................ 492
Writing Your Own Kernels.................................................................................. 499
Conclusion ........................................................................... 502

CHAPTER 11 Designing GPU-Based Systems ................................................ 503

Introduction.......................................................................................................... 503
CPU Processor ..................................................................................................... 505
GPU Device ......................................................................................................... 507
Large memory support ................................................................................... 507
ECC memory support ..................................................................................... 508
Tesla compute cluster driver (TCC)............................................................... 508
Higher double-precision math ........................................................................ 508
Larger memory bus width .............................................................................. 508
SMI ................................................................................................................. 509
Status LEDs .................................................................................................... 509
PCI-E Bus ............................................................................................................ 509
GeForce cards ...................................................................................................... 510
CPU Memory ....................................................................................................... 510
Air Cooling .......................................................................................................... 512
Liquid Cooling..................................................................................................... 513
Desktop Cases and Motherboards ....................................................................... 517
Mass Storage........................................................................................................ 518
Motherboard-based I/O................................................................................... 518
Dedicated RAID controllers........................................................................... 519
HDSL .............................................................................................................. 520
Mass storage requirements ............................................................................. 521
Networking ..................................................................................................... 521
Power Considerations .......................................................................................... 522


Operating Systems ............................................................................................... 525
Windows ......................................................................................................... 525
Linux ............................................................................................................... 525
Conclusion ........................................................................................................... 526

CHAPTER 12 Common Problems, Causes, and Solutions ............................... 527

Introduction.......................................................................................................... 527
Errors With CUDA Directives............................................................................. 527
CUDA error handling ..................................................................................... 527
Kernel launching and bounds checking ......................................................... 528
Invalid device handles .................................................................................... 529
Volatile qualifiers ............................................................................................ 530
Compute level–dependent functions .............................................................. 532
Device, global, and host functions ................................................................. 534
Kernels within streams ................................................................................... 535
Parallel Programming Issues ............................................................................... 536
Race hazards ................................................................................................... 536
Synchronization .............................................................................................. 537
Atomic operations........................................................................................... 541
Algorithmic Issues ............................................................................................... 544
Back-to-back testing ....................................................................................... 544
Memory leaks ................................................................................................. 546
Long kernels ................................................................................................... 546
Finding and Avoiding Errors ............................................................................... 547
How many errors does your GPU program have?......................................... 547
Divide and conquer......................................................................................... 548
Assertions and defensive programming ......................................................... 549
Debug level and printing ................................................................................ 551
Version control................................................................................................ 555
Developing for Future GPUs............................................................................... 555
Kepler.............................................................................................................. 555
What to think about........................................................................................ 558
Further Resources ................................................................................................ 560
Introduction..................................................................................................... 560
Online courses ................................................................................................ 560
Taught courses ................................................................................................ 561
Books .............................................................................................................. 562
NVIDIA CUDA certification.......................................................................... 562
Conclusion ........................................................................................................... 562
References............................................................................................................ 563

Index ................................................................................................................................................. 565


Preface

Over the past five years there has been a revolution in computing brought about by a company that for
successive years has emerged as one of the premier gaming hardware manufacturers: NVIDIA. With
the introduction of the CUDA (Compute Unified Device Architecture) programming language, for the
first time these hugely powerful graphics coprocessors could be used by everyday C programmers to
offload computationally expensive work. From the embedded device industry, to home users, to
supercomputers, everything has changed as a result of this.
One of the major changes in the computer software industry has been the move from serial
programming to parallel programming. Here, CUDA has produced great advances. The graphics
processor unit (GPU) by its very nature is designed for high-speed graphics, which are inherently
parallel. CUDA takes a simple model of data parallelism and incorporates it into a programming
model without the need for graphics primitives.
In fact, CUDA, unlike its predecessors, does not require any understanding or knowledge of
graphics or graphics primitives. You do not have to be a games programmer either. The CUDA
language makes the GPU look just like another programmable device.
Throughout this book I will assume readers have no prior knowledge of CUDA, or of parallel
programming. I assume they have only an existing knowledge of the C/C++ programming language.
As we progress and you become more competent with CUDA, we’ll cover more advanced topics,
taking you from a parallel unaware programmer to one who can exploit the full potential of CUDA.
For programmers already familiar with parallel programming concepts and CUDA, we’ll be
discussing in detail the architecture of the GPUs and how to get the most from each, including the latest
Fermi and Kepler hardware. Literally anyone who can program in C or C++ can program with CUDA
in a few hours given a little training. Getting from a novice CUDA programmer with a several-times
speedup to a 10-times-plus speedup is what you should be capable of by the end of this book.
The book is very much aimed at learning CUDA, but with a focus on performance, having first
achieved correctness. Your level of skill and understanding of writing high-performance code, especially for GPUs, will hugely benefit from this text.
This book is a practical guide to using CUDA in real applications, by real practitioners. At the same
time, however, we cover the necessary theory and background so everyone, no matter what their
background, can follow along and learn how to program in CUDA, making this book ideal for both
professionals and those studying GPUs or parallel programming.
The book is set out as follows:
Chapter 1: A Short History of Supercomputing. This chapter is a broad introduction to the
evolution of streaming processors covering some key developments that brought us to GPU
processing today.
Chapter 2: Understanding Parallelism with GPUs. This chapter is an introduction to the
concepts of parallel programming, such as how serial and parallel programs are different and
how to approach solving problems in different ways. This chapter is primarily aimed at existing
serial programmers to give a basis of understanding for concepts covered later in the book.


Chapter 3: CUDA Hardware Overview. This chapter provides a fairly detailed explanation of the
hardware and architecture found around and within CUDA devices. To achieve the best
performance from CUDA programming, a reasonable understanding of the hardware both
within and outside the device is required.
Chapter 4: Setting Up CUDA. Installation and setup of the CUDA SDK under Windows, Mac,
and the Linux variants. We also look at the main debugging environments available for CUDA.
Chapter 5: Grids, Blocks, and Threads. A detailed explanation of the CUDA threading model,
including some examples of how the choices here impact performance.
Chapter 6: Memory Handling with CUDA. Understanding the different memory types and how
they are used within CUDA is the single largest factor influencing performance. Here we take
a detailed explanation, with examples, of how the various memory types work and the pitfalls
of getting it wrong.
Chapter 7: Using CUDA in Practice. Detailed examination as to how central processing units
(CPUs) and GPUs best cooperate with a number of problems and the issues involved in CPU/
GPU programming.
Chapter 8: Multi-CPU and Multi-GPU Solutions. We look at how to program and use multiple
GPUs within an application.
Chapter 9: Optimizing Your Application. A detailed breakdown of the main areas that limit
performance in CUDA. We look at the tools and techniques that are available for analysis of
CUDA code.
Chapter 10: Libraries and SDK. A look at some of the CUDA SDK samples and the libraries
supplied with CUDA, and how you can use these within your applications.
Chapter 11: Designing GPU-Based Systems. This chapter takes a look at some of the issues
involved with building your own GPU server or cluster.
Chapter 12: Common Problems, Causes, and Solutions. A look at the type of mistakes most
programmers make when developing applications in CUDA and how these can be detected and
avoided.

CHAPTER 1

A Short History of Supercomputing

INTRODUCTION
So why in a book about CUDA are we looking at supercomputers? Supercomputers are typically at the
leading edge of the technology curve. What we see here is what will be commonplace on the desktop in
5 to 10 years. In 2010, the annual International Supercomputer Conference in Hamburg, Germany,
announced that an NVIDIA GPU-based machine had been listed as the second most powerful computer
in the world, according to the top 500 list (http://www.top500.org). Theoretically, it had more peak
performance than the mighty IBM Roadrunner, or the then-leader, the Cray Jaguar, peaking at nearly 3
petaflops of performance. In 2011, NVIDIA CUDA-powered GPUs went on to claim the title of the
fastest supercomputer in the world. It was suddenly clear to everyone that GPUs had arrived in a very
big way on the high-performance computing landscape, as well as the humble desktop PC.
Supercomputing is the driver of many of the technologies we see in modern-day processors.
Thanks to the need for ever-faster processors to process ever-larger datasets, the industry produces
ever-faster computers. It is through some of these evolutions that GPU CUDA technology has come
about today.
Both supercomputers and desktop computing are moving toward a heterogeneous computing
route; that is, they are trying to achieve performance with a mix of CPU (Central Processor Unit) and
GPU (Graphics Processor Unit) technology. Two of the largest worldwide projects using GPUs are
BOINC and Folding@Home, both of which are distributed computing projects. They allow ordinary
people to make a real contribution to specific scientific projects. Contributions from CPU/GPU hosts
on projects supporting GPU accelerators hugely outweigh contributions from CPU-only hosts. As of
November 2011, there were some 5.5 million hosts contributing a total of around 5.3 petaflops, around
half that of the world’s fastest supercomputer, in 2011, the Fujitsu “K computer” in Japan.
The replacement for Jaguar, currently the fastest U.S. supercomputer, code-named Titan, is
planned for 2013. It will use almost 300,000 CPU cores and up to 18,000 GPU boards to achieve
between 10 and 20 petaflops of performance. With support like this from around the world,
GPU programming is set to jump into the mainstream, both in the HPC industry and also on the
desktop.
You can now put together or purchase a desktop supercomputer with several teraflops of performance. At the beginning of 2000, some 12 years ago, this would have given you first place in the top
500 list, beating IBM ASCI Red with its 9632 Pentium processors. This just shows how much a little
over a decade of computing progress has achieved and opens up the question about where we will be
a decade from now. You can be fairly certain GPUs will be at the forefront of this trend for some time
to come. Thus, learning how to program GPUs effectively is a key skill any good developer needs
to acquire.

VON NEUMANN ARCHITECTURE
Almost all processors work on the basis of the process developed by Von Neumann, considered one of
the fathers of computing. In this approach, the processor fetches instructions from memory, decodes,
and then executes that instruction.
A modern processor typically runs at anything up to 4 GHz in speed. Modern DDR-3 memory, when
paired with say a standard Intel I7 device, can run at anything up to 2 GHz. However, the I7 has at least four
processors or cores in one device, or double that if you count its hyperthreading ability as a real processor.
A DDR-3 triple-channel memory setup on an I7 Nehalem system would produce the theoretical
bandwidth figures shown in Table 1.1. Depending on the motherboard, and exact memory pattern, the
actual bandwidth could be considerably less.
Table 1.1 Bandwidth on I7 Nehalem Processor

QPI Clock                        Theoretical Bandwidth    Per Core
4.8 GT/s (standard part)         19.2 GB/s                4.8 GB/s
6.4 GT/s (extreme edition)       25.6 GB/s                6.4 GB/s

Note: QPI = Quick Path Interconnect.

You run into the first problem with memory bandwidth when you consider the processor clock
speed. If you take a processor running at 4 GHz, you need to potentially fetch, every cycle, an
instruction (an operator) plus some data (an operand).
Each instruction is typically 32 bits (4 bytes), so if you execute nothing but a set of linear instructions, with no
data, on every core, you get 4.8 GB/s ÷ 4 bytes = 1.2 billion instructions per second. This assumes the processor
can dispatch one instruction per clock on average*. However, you typically also need to fetch and write
back data, which if we say is on a 1:1 ratio with instructions, means we effectively halve our throughput.

*The actual achieved dispatch rate can be higher or lower than one, which we use here for simplicity.
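To make this back-of-the-envelope arithmetic concrete, here is a minimal C sketch that reproduces it, assuming the per-core figure of 4.8 GB/s from Table 1.1 and 32-bit (4-byte) instructions; the numbers are illustrative rather than measured.

#include <stdio.h>

int main(void)
{
    /* Assumptions taken from the text: 4.8 GB/s of bandwidth per core (Table 1.1)
       and 32-bit (4-byte) instructions. */
    const double bytes_per_sec_per_core = 4.8e9;
    const double bytes_per_instruction  = 4.0;

    /* Best case: every byte fetched is an instruction. */
    const double instr_per_sec = bytes_per_sec_per_core / bytes_per_instruction;

    /* With a 1:1 data-to-instruction ratio, half the bandwidth carries operands. */
    const double instr_per_sec_with_data = instr_per_sec / 2.0;

    printf("Instruction-only rate : %.1f billion instructions/s\n", instr_per_sec / 1e9);
    printf("With 1:1 data traffic : %.1f billion instructions/s\n", instr_per_sec_with_data / 1e9);
    return 0;
}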
The ratio of clock speed to memory is an important limiter for both CPU and GPU throughput and
something we’ll look at later. We find when you look into it, most applications, with a few exceptions on
both CPU and GPU, are often memory bound and not processor cycle or processor clock/load bound.
CPU vendors try to solve this problem by using cache memory and burst memory access. This
exploits the principle of locality. If you look at a typical C program, you might see the following type of
operation in a function:
void some_function(void)
{
  int array[100];
  int i = 0;

  for (i = 0; i < 100; i++)
  {
    array[i] = i * 10;
  }
}

If you look at how the processor would typically implement this, you would see the address of array
loaded into some memory access register. The parameter i would be loaded into another
register. The loop exit condition, 100, is loaded into another register or possibly encoded into the
instruction stream as a literal value. The computer would then iterate around the same instructions,
over and over again 100 times. For each value calculated, we have control, memory, and calculation
instructions, fetched and executed.
This is clearly inefficient, as the computer is executing the same instructions, but with
different data values. Thus, the hardware designers implement into just about all processors
a small amount of cache, and in more complex processors, many levels of cache (Figure 1.1).
When the processor fetches something from memory, it first queries the cache,
and if the data or instructions are present there, the high-speed cache provides them to the
processor.
FIGURE 1.1
Typical modern CPU cache organization: each processor core has its own L1 instruction and L1 data caches and an L2 cache; all cores share a single L3 cache, which connects to DRAM.

If the data is not in the first level (L1) cache, then a fetch from the second or third level (L2 or L3)
cache is required, or from the main memory if no cache line has this data already. The first level cache
typically runs at or near the processor clock speed, so for the execution of our loop, potentially we do
get near the full processor speed, assuming we write cache as well as read cache. However, there is
a cost for this: the L1 cache is typically only 16 K or 32 K in size. The L2 cache is
somewhat slower, but much larger, typically around 256 K. The L3 cache is much larger, usually
several megabytes in size, but again much slower than the L2 cache.
With real-life examples, the loop iterations are much, much larger, maybe many megabytes in size.
Even if the program can remain in cache memory, the dataset usually cannot, so the processor, despite
all this cache trickery, is quite often limited by the memory throughput or bandwidth.
When the processor fetches an instruction or data item from the cache instead of the main memory,
it’s called a cache hit. The incremental benefit of using progressively larger caches drops off quite
rapidly. This in turn means the ever-larger caches we see on modern processors are a less and less
useful means to improve performance, unless they manage to encompass the entire dataset of the
problem.
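A short calculation shows why a cache hit matters so much, and why squeezing the last few points of hit rate out of an ever-larger cache yields shrinking gains. The latencies and hit rates below are illustrative assumptions only, not figures for any particular processor.

#include <stdio.h>

int main(void)
{
    /* Illustrative, assumed figures: cache access in a few cycles, DRAM in hundreds. */
    const double cache_latency = 4.0;    /* cycles */
    const double dram_latency  = 200.0;  /* cycles */
    const double hit_rates[]   = { 0.80, 0.90, 0.95, 0.99 };

    for (int i = 0; i < 4; i++)
    {
        /* Average access time = hit_rate * cache_latency + miss_rate * dram_latency. */
        const double h   = hit_rates[i];
        const double avg = h * cache_latency + (1.0 - h) * dram_latency;
        printf("hit rate %.2f -> average access time %.1f cycles\n", h, avg);
    }
    return 0;
}

Each step in hit rate helps, but pushing the rate from 95% toward 99% typically requires a disproportionately larger cache, which is the diminishing return described above.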
The Intel I7-920 processor has some 8 MB of internal L3 cache. This cache memory is not free, and
if we look at the die for the Intel I7 processor, we see around 30% of the size of the chip is dedicated to
the L3 cache memory (Figure 1.2).
As cache sizes grow, so does the physical size of the silicon used to make the processors. The
larger the chip, the more expensive it is to manufacture and the higher the likelihood that it will
contain an error and be discarded during the manufacturing process. Sometimes these faulty devices
are sold cheaply as either triple- or dual-core devices, with the faulty cores disabled. However,
the effect of larger, progressively more inefficient caches ultimately results in higher costs to the
end user.

FIGURE 1.2
Layout of I7 Nehalem processor on processor die: four cores sharing a single L3 cache, surrounded by queue, QPI links, and miscellaneous I/O logic.

CRAY
The computing revolution that we all know today started back in the 1950s with the advent of the first
microprocessors. These devices, by today’s standards, are slow and you most likely have a far more
powerful processor in your smartphone. However, these led to the evolution of supercomputers, which are
machines usually owned by governments, large academic institutions, or corporations. They are thousands of times more powerful than the computers in general use today. They cost millions of dollars to
produce, occupy huge amounts of space, usually have special cooling requirements, and require a team of
engineers to look after them. They consume huge amounts of power, to the extent they are often as
expensive to run each year as they cost to build. In fact, power is one of the key considerations when
planning such an installation and one of the main limiting factors in the growth of today’s supercomputers.
One of the founders of modern supercomputers was Seymour Cray with his Cray-1, produced by
Cray Research back in 1976. It had many thousands of individual cables required to connect everything together; so much so that they used to employ women because their hands were smaller than those
of most men and they could therefore more easily wire up all the thousands of individual cables.
FIGURE 1.3
Wiring inside the Cray-2 supercomputer.

These machines would typically have an uptime (the actual running time between breakdowns)
measured in hours. Keeping them running for a whole day at a time would be considered a huge
achievement. This seems quite backward by today’s standards. However, we owe a lot of what we have
today to research carried out by Seymour Cray and other individuals of this era.
Cray went on to produce some of the most groundbreaking supercomputers of his time under various
Cray names. The original Cray-1 cost some $8.8 million USD and achieved a massive 160 MFLOPS
(million floating-point operations per second). Computing speed today is measured in TFLOPS
(tera floating-point operations per second), a million times larger than the old MFLOPS measurement
(10^12 vs. 10^6). A single Fermi GPU card today has a theoretical peak in excess of 1 teraflop of
performance.
The Cray-2 was a significant improvement on the Cray-1. It used a shared memory architecture,
split into banks. These were connected to one, two, or four processors. It led the way for the creation of
today’s server-based symmetrical multiprocessor (SMP) systems in which multiple CPUs shared the
same memory space. Like many machines of its era, it was a vector-based machine. In a vector
machine the same operation acts on many operands. These still exist today, in part as processor
extensions such as MMX, SSE, and AVX. GPU devices are, at their heart, vector processors that share
many similarities with the older supercomputer designs.
The Cray also had hardware support for scatter- and gather-type primitives, something we’ll see is
quite important in parallel computing and something we look at in subsequent chapters.
Cray still exists today in the supercomputer market, and as of 2010 held the top 500 position with
their Jaguar supercomputer at the Oak Ridge National Laboratory (http://www.nccs.gov/computingresources/jaguar/). I encourage you to read about the history of this great company, which you can
find on Cray’s website (http://www.cray.com), as it gives some insight into the evolution of computers
and as to where we are today.

CONNECTION MACHINE
Back in 1982 a corporation called Thinking Machines came up with a very interesting design, that of
the Connection Machine.
It was a relatively simple concept that led to a revolution in today’s parallel computers. They used
a few simple parts over and over again. They created a 16-core CPU, and then installed some 4096 of
these devices in one machine. The concept was different. Instead of one fast processor churning
through a dataset, there were 64 K processors doing this task.
Let’s take the simple example of manipulating the color of an RGB (red, green, blue) image. Each
color is made up of a single byte, with 3 bytes representing the color of a single pixel. Let’s suppose we
want to reduce the blue level to zero.
Let’s assume the memory is configured in three banks of red, blue, and green, rather than being
interleaved. With a conventional processor, we would have a loop running through the blue memory
and decrementing every pixel color level by one. The operation is the same on each item of data, yet each
time we fetch, decode, and execute the instruction stream on each loop iteration.
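As a sketch of the conventional, serial approach just described (assuming the blue plane is stored as a contiguous array of bytes), the loop looks something like this:

#include <stddef.h>

/* Serial version: one fetch/decode/execute pass per byte of the blue plane. */
void decrement_blue_plane(unsigned char *blue, size_t num_pixels)
{
    for (size_t i = 0; i < num_pixels; i++)
    {
        if (blue[i] > 0)            /* clamp at zero rather than wrapping around */
        {
            blue[i] = blue[i] - 1;
        }
    }
}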
The Connection Machine used something called SIMD (single instruction, multiple data), which is
used today in modern processors and known by names such as SSE (Streaming SIMD Extensions),
MMX (Multi-Media eXtension), and AVX (Advanced Vector eXtensions). The concept is to define
a data range and then have the processor apply that operation to the data range. However, SSE and
MMX are based on having one processor core. The Connection Machine had 64 K processor cores,
each executing SIMD instructions on its dataset.


Processors such as the Intel I7 are 64-bit processors, meaning they can process up to 64 bits at
a time (8 bytes). The SSE SIMD instruction set extends this to 128 bits. With SIMD instructions on
such a processor, we eliminate all redundant instruction memory fetches, and generate one sixteenth of
the memory read and write cycles compared with fetching and writing 1 byte at a time. AVX extends
this to 256 bits, making it even more effective.
For a high-definition (HD) video image of 1920 × 1080 resolution, the data size is 2,073,600 bytes,
or around 2 MB per color plane. Thus, we generate around 260,000 SIMD cycles for a single
conventional processor using SSE/MMX. By SIMD cycle, we mean one read, compute, and write
cycle. The actual number of processor clocks may be considerably different than this, depending on the
particular processor architecture.
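For comparison, here is a minimal SSE2 sketch of the same blue-plane operation, processing 16 bytes per instruction with a saturating subtract. It assumes an x86 compiler with SSE2 support and, for brevity, that num_pixels is a multiple of 16.

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

/* SIMD version: each iteration reads, decrements, and writes 16 bytes at once. */
void decrement_blue_plane_sse2(unsigned char *blue, size_t num_pixels)
{
    const __m128i one = _mm_set1_epi8(1);

    for (size_t i = 0; i < num_pixels; i += 16)
    {
        __m128i v = _mm_loadu_si128((const __m128i *)&blue[i]);
        v = _mm_subs_epu8(v, one);                  /* saturating: values stop at zero */
        _mm_storeu_si128((__m128i *)&blue[i], v);
    }
}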
The Connection Machine used 64 K processors. Thus, the 2 MB frame would have resulted in about
32 SIMD cycles for each processor. Clearly, this type of approach is vastly superior to the modern
processor SIMD approach. However, there is of course a caveat. Synchronizing and communication
between processors becomes the major issue when moving from a rather coarse-threaded approach of
today’s CPUs to a hugely parallel approach used by such machines.

CELL PROCESSOR
Another interesting development in supercomputers stemmed from IBM’s invention of the Cell
processor (Figure 1.4). This worked on the idea of having a regular processor act as a supervisory
processor, connected to a number of high-speed stream processors. The regular PowerPC (PPC)
processor in the Cell acts as an interface to the stream processors and the outside world. The
stream SIMD processors, or SPEs as IBM called them, would process datasets managed by the
regular processor.

FIGURE 1.4
IBM cell processor die layout (8 SPE version): a Power PC core and 512K L2 cache connected over an interconnect bus to eight SPEs, each with its own local memory, plus a Rambus interface, memory controller, and I/O controller.
The Cell is a particularly interesting processor for us, as it’s a similar design to what NVIDIA later
used in the G80 and subsequent GPUs. Sony also used it in its PS3 games console, an industry very close to the main market for GPUs.
To program the Cell, you write a program to execute on the PowerPC core processor. It then
invokes a program, using an entirely different binary, on each of the stream processing elements
(SPEs). Each SPE is actually a core in itself. It can execute an independent program from its own local
memory, which is different from the SPE next to it. In addition, the SPEs can communicate with one
another and the PowerPC core over a shared interconnect. However, this type of hybrid architecture is
not easy to program. The programmer must explicitly manage the eight SPEs, both in terms of
programs and data, as well as the serial program running on the PowerPC core.
With the ability to talk directly to the coordinating processor, a series of simple steps can be
achieved. With our RGB example earlier, the PPC core fetches a chunk of data to work on and allocates pieces of it to the eight SPEs. As we do the same thing in each SPE, each SPE fetches a byte, decrements it, and writes it back to its local memory. When all SPEs are done, the PPC core fetches the data from each SPE. It then writes each chunk of data (or tile) to the memory area where the whole image is being
assembled. The Cell processor is designed to be used in groups, thus repeating the design of the
Connection Machine we covered earlier.
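As a rough outline only, the PPC-side flow just described might look something like the sketch below. This is C-style pseudocode, not the real Cell SDK interface: spe_start() and spe_wait() are hypothetical stand-ins for the actual SPE runtime calls.

/* Hypothetical sketch of the PPC-side control flow described above.     */
/* spe_start() and spe_wait() are placeholders, not real Cell SDK calls. */
#include <stddef.h>

#define NUM_SPES 8

extern void spe_start(int spe, unsigned char *tile, size_t bytes);  /* hypothetical */
extern void spe_wait(int spe);                                      /* hypothetical */

void process_blue_plane(unsigned char *blue, size_t bytes)
{
    size_t tile = bytes / NUM_SPES;

    /* Hand each SPE its tile of the blue plane to decrement. */
    for (int i = 0; i < NUM_SPES; i++)
        spe_start(i, blue + (size_t)i * tile, tile);

    /* Collect the results: block until every SPE has finished its tile. */
    for (int i = 0; i < NUM_SPES; i++)
        spe_wait(i);
}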
The SPEs could also be ordered to perform a stream operation, involving multiple steps, as each
SPE is connected to a high-speed ring (Figure 1.5).
The problem with this sort of streaming or pipelining approach is it runs only as fast as the slowest
node. It mirrors a production line in a factory. The whole line can only run as fast as the slowest point.
Each SPE (worker) only has a small set of tasks to perform, so just like the assembly line worker, it can
do this very quickly and efficiently. However, just like any processor, there is a bandwidth limit and
overhead of passing data to the next stage. Thus, while you gain efficiencies from executing
a consistent program on each SPE, you lose on interprocessor communication and are ultimately limited by the slowest process step. This is a common problem with any pipeline-based model of execution.

FIGURE 1.5
Example stream processor routing on the Cell: the PowerPC core coordinates a pipeline of SPE 0 (Clamp), SPE 1 (DCT), SPE 2 (Filter 1), SPE 3 (Filter 2), SPE 4 (IDCT), and SPE 5 (Restore).
The alternative approach of putting the entire program on each SPE and having each SPE process a small chunk of data is often more efficient. This is the equivalent of training all assembly line
workers to assemble a complete widget. For simple tasks, this is easy, but each SPE has limits on available
program and data memory. The PowerPC core must now also deliver and collect data from eight SPEs,
instead of just two, so the management overhead and communication between host and SPEs increases.
IBM used a high-powered version of the Cell processor in their Roadrunner supercomputer,
which as of 2010 was the third fastest computer on the top 500 list. It consists of 12,960 PowerPC
cores, plus a total of 103,680 stream processors. Each PowerPC board is supervised by a dual-core
AMD (Advanced Micro Devices) Opteron processor, of which there are 6912 in total. The Opteron
processors act as coordinators among the nodes. Roadrunner has a theoretical throughput of 1.71
petaflops, cost $125 million USD to build, occupies 560 square meters, and consumes 2.35 MW of
electricity when operating!

MULTINODE COMPUTING
As you increase the requirements (CPU, memory, storage space) needed on a single machine, costs
rapidly increase. While a 2.6 GHz processor may cost you $250 USD, the same processor at 3.4 GHz
may be $1400 for less than a 1 GHz increase in clock speed. A similar relationship is seen for both memory speed and size, and for storage capacity.
Not only do costs scale as computing requirements scale, but so do the power requirements and the
consequential heat dissipation issues. Processors can hit 4–5 GHz, given sufficient supply of power and
cooling.
In computing you often find the law of diminishing returns. There is only so much you can put into
a single case. You are limited by cost, space, power, and heat. The solution is to select a reasonable
balance of each and to replicate this many times.
Cluster computing became popular in the 1990s along with ever-increasing clock rates. The concept
was a very simple one. Take a number of commodity PCs bought or made from off-the-shelf parts and
connect them to an off-the-shelf 8-, 16-, 24-, or 32-port Ethernet switch and you had up to 32 times the
performance of a single box. Instead of paying $1600 for a high performance processor, you paid $250
and bought six medium performance processors. If your application needed huge memory capacity, the
chances were that maxing out the DIMMs on many machines and adding them together was more than
sufficient. Used together, the combined power of many machines hugely outperformed any single
machine you could possibly buy with a similar budget.
All of a sudden universities, schools, offices, and computer departments could build machines
much more powerful than before and were not locked out of the high-speed computing market due to
lack of funds. Cluster computing back then was like GPU computing today: a disruptive technology that changed the face of computing. Combined with the ever-increasing single-core clock speeds, it provided a cheap way to achieve parallel processing using commodity single-core CPUs.
Clusters of PCs typically ran a variant of Linux, with each node usually fetching its boot instructions and operating system (OS) from a central master node. For example, at CudaDeveloper we have a tiny cluster of low-powered, Atom-based PCs with embedded CUDA GPUs. It's very cheap to buy and set up a cluster. Sometimes they can simply be made from a number of old PCs that are being
replaced, so the hardware is effectively free.
However, the problem with cluster computing is that its speed is limited by the amount of internode communication the problem requires. If you have 32 nodes and the problem breaks down into
32 nice chunks and requires no internode communication, you have an application that is ideal for
a cluster. If every data point takes data from every node, you have a terrible problem to put into a cluster.
Clusters are seen inside modern CPUs and GPUs. Look back at Figure 1.1, the CPU cache hierarchy. If we consider each CPU core as a node, the L2 cache as DRAM (Dynamic Random Access
Memory), the L3 cache as the network switch, and the DRAM as mass storage, we have a cluster in
miniature (Figure 1.6).
The architecture inside a modern GPU is really no different. You have a number of streaming
multiprocessors (SMs) that are akin to CPU cores. These are connected to a shared memory/L1
cache. This is connected to an L2 cache that acts as an inter-SM switch. Data can be held in global
memory storage where it’s then extracted and used by the host, or sent via the PCI-E switch directly
to the memory on another GPU. The PCI-E switch is many times faster than a typical network interconnect.
The node may itself be replicated many times, as shown in Figure 1.7. This replication within a controlled environment forms a cluster.

FIGURE 1.6
Typical cluster layout: processor nodes, each with its own DRAM, local storage, and network interface, connected through a network switch to shared network storage.

FIGURE 1.7
GPUs compared to a cluster: within a GPU, SMs with their L1 caches share an L2 cache and global memory (GMEM); multiple GPUs connect through the PCI-E interface and PCI-E switch to the host memory and CPU.

One evolution of the cluster design is distributed applications. Distributed applications run on many nodes, each of which may contain many processing
elements including GPUs. Distributed applications may, but do not need to, run in a controlled
environment of a managed cluster. They can connect arbitrary machines together to work on some
common problem, BOINC and Folding@Home being two of the largest examples of such applications
that connect machines together over the Internet.

THE EARLY DAYS OF GPGPU CODING
Graphics processing units (GPUs) are devices present in most modern PCs. They provide a number of
basic operations to the CPU, such as rendering an image in memory and then displaying that image
onto the screen. A GPU will typically process a complex set of polygons, a map of the scene to be
rendered. It then applies textures to the polygons and then performs shading and lighting calculations.
The NVIDIA GeForce FX 5000 series cards brought photorealistic effects for the first time, as shown in the Dawn Fairy demo from 2003.
Have a look at http://www.nvidia.com/object/cool_stuff.html#/demos and download some of
the older demos and you’ll see just how much GPUs have evolved over the past decade. See
Table 1.2.
One of the important steps was the development of programmable shaders. These were effectively little programs that the GPU ran to calculate different effects. No longer was the rendering fixed in the GPU; through downloadable shaders, it could be manipulated. This was the first evolution of general-purpose graphical processor unit (GPGPU) programming, in that the design had taken its first steps in moving away from fixed-function units.

Table 1.2 GPU Technology Demonstrated over the Years

Demo               Card           Year
Dawn               GeForce FX     2003
Dusk Ultra         GeForce FX     2003
Nalu               GeForce 6      2004
Luna               GeForce 7      2005
Froggy             GeForce 8      2006
Human Head         GeForce 8      2007
Medusa             GeForce 200    2008
Supersonic Sled    GeForce 400    2010
A New Dawn         GeForce 600    2012
However, these shaders were operations that by their very nature took a set of 3D points that
represented a polygon map. The shaders applied the same operation to many such datasets, in a hugely
parallel manner, giving huge throughput of computing power.
Now although polygons are sets of three points, and some other datasets such as RGB photos can be
represented by sets of three points, a lot of datasets are not. A few brave researchers made use of GPU
technology to try and speed up general-purpose computing. This led to the development of a number of
initiatives (e.g., BrookGPU, Cg, CTM, etc.), all of which were aimed at making the GPU a real
programmable device in the same way as the CPU. Unfortunately, each had its own advantages and
problems. None were particularly easy to learn or program in and were never taught to people in large
numbers. In short, there was never a critical mass of programmers or a critical mass of interest from
programmers in this hard-to-learn technology. They never succeeded in hitting the mass market,
something CUDA has for the first time managed to do, and at the same time provided programmers
with a truly general-purpose language for GPUs.

THE DEATH OF THE SINGLE-CORE SOLUTION
One of the problems with today’s modern processors is they have hit a clock rate limit at around 4 GHz.
At this point they just generate too much heat for the current technology and require special and
expensive cooling solutions. This is because as we increase the clock rate, the power consumption
rises. In fact, because the supply voltage generally has to rise along with the clock rate, the power consumption of a CPU grows approximately with the cube of its clock rate. To make this worse, as you increase the heat generated by the CPU, for the same clock rate,
the power consumption also increases due to the properties of the silicon. This conversion of power
into heat is a complete waste of energy. This increasingly inefficient use of power eventually means
you are unable to either power or cool the processor sufficiently and you reach the thermal limits of the
device or its housing, the so-called power wall.
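As a rough rule of thumb, dynamic power is proportional to the capacitance being switched, the square of the supply voltage, and the clock frequency; because the voltage must usually rise with the frequency, power ends up growing roughly with the cube of the clock rate:

P_dynamic ≈ C · V^2 · f,   and with V rising roughly in proportion to f,   P grows roughly as f^3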
Faced with being unable to keep increasing the clock rate to make forever-faster processors, the processor
manufacturers had to come up with another game plan. The two main PC processor manufacturers, Intel

NVIDIA and CUDA

13

and AMD, have had to adopt a different approach. They have been forced down the route of adding more
cores to processors, rather than continuously trying to increase CPU clock rates and/or extract more
instructions per clock through instruction-level parallelism. We have dual, tri, quad, hex, 8, 12, and soon
even 16 and 32 cores and so on. This is where computing is now heading for everyone, both the
GPU and CPU communities. The Fermi GPU is effectively already a 16-core device in CPU terms.
There is a big problem with this approachdit requires programmers to switch from their traditional
serial, single-thread approach, to dealing with multiple threads all executing at once. Now the
programmer has to think about two, four, six, or eight program threads and how they interact and
communicate with one another. When dual-core CPUs arrived, it was fairly easy, in that there were
usually some background tasks being done that could be offloaded onto a second core. When quad-core CPUs arrived, not many programs were changed to support them. They just carried on being sold as
single-thread applications. Even the games industry didn’t really move to quad-core programming
very quickly, which is the one industry you’d expect to want to get the absolute most out of today’s
technology.
In some ways the processor manufacturers are to blame for this, because the single-core application
runs just fine on one-quarter of the quad-core device. Some devices even increase the clock rate
dynamically when only one core is active, encouraging programmers to be lazy and not make use of
the available hardware.
There are economic reasons too. The software development companies need to get the product to
market as soon as possible. Developing a better quad-core solution is all well and good, but not if the
market is being grabbed by a competitor who got there first. As manufacturers still continue to make
single- and dual-core devices, the market naturally settles on the lowest configuration, with the widest
scope for sales. Until the time that quad-core CPUs are the minimum produced, market forces work
against the move to multicore programming in the CPU market.

NVIDIA AND CUDA
If you look at the relative computational power in GPUs and CPUs, we get an interesting graph
(Figure 1.8). We start to see a divergence of CPU and GPU computational power until 2009 when we
see the GPU finally break the 1000 gigaflops or 1 teraflop barrier. At this point we were moving from
the G80 hardware to the G200 and then in 2010 to the Fermi evolution. This was driven by the introduction of massively parallel hardware. The G80 is a 128 CUDA core device, the G200 is a 240 CUDA core device, and the Fermi is a 512 CUDA core device.
We see NVIDIA GPUs make a leap of 300 gigaflops from the G200 architecture to the Fermi
architecture, nearly a 30% improvement in throughput. By comparison, Intel's leap from its Core 2
architecture to the Nehalem architecture sees only a minor improvement. Only with the change to
Sandy Bridge architecture do we see significant leaps in CPU performance. This is not to say one is
better than the other, for the traditional CPUs are aimed at serial code execution and are extremely
good at it. They contain special hardware such as branch prediction units, multiple caches, etc., all of
which target serial code execution. The GPUs are not designed for this serial execution flow and only
achieve their peak performance when fully utilized in a parallel manner.
In 2007, NVIDIA saw an opportunity to bring GPUs into the mainstream by adding an easy-to-use
programming interface, which it dubbed CUDA, or Compute Unified Device Architecture. This opened up the possibility to program GPUs without having to learn complex shader languages, or to think only in terms of graphics primitives.

FIGURE 1.8
CPU and GPU peak performance in gigaflops, 2006-2012. GPU peak performance rises from roughly 518 gigaflops in 2006 through 576, 648, 1062, 1581, and 1581 to 3090 gigaflops in 2012; CPU peak performance rises from roughly 42.6 through 51.2, 55, 58, 86, and 187 to 243 gigaflops over the same period.
CUDA is an extension to the C language that allows GPU code to be written in regular C. The code
is either targeted for the host processor (the CPU) or targeted at the device processor (the GPU). The
host processor spawns multithread tasks (or kernels as they are known in CUDA) onto the GPU device.
The GPU has its own internal scheduler that will then allocate the kernels to whatever GPU hardware is
present. We’ll cover scheduling in detail later. Provided there is enough parallelism in the task, as the
number of SMs in the GPU grows, so should the speed of the program.
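To give a flavour of what this looks like, here is a minimal sketch of the blue-plane example from earlier written as a CUDA kernel, with one thread per byte; the names and sizes are illustrative rather than taken from the text.

#include <cuda_runtime.h>

// Device code: each thread decrements a single blue value.
__global__ void dim_blue(unsigned char *blue, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's global index
    if (i < n && blue[i] > 0)
        blue[i]--;
}

// Host code: copy the data to the GPU, launch the kernel, copy the result back.
void dim_blue_on_gpu(unsigned char *host_blue, int n)
{
    unsigned char *dev_blue;
    cudaMalloc((void **)&dev_blue, n);
    cudaMemcpy(dev_blue, host_blue, n, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;  // enough blocks to cover n bytes
    dim_blue<<<blocks, threads>>>(dev_blue, n);

    cudaMemcpy(host_blue, dev_blue, n, cudaMemcpyDeviceToHost);
    cudaFree(dev_blue);
}

The GPU's scheduler spreads the blocks over however many SMs the device has, which is why the same code can scale across different cards.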
However, herein hides a big problem. You have to ask what percentage of the code can be run in
parallel. The maximum speedup possible is limited by the amount of serial code. If you have an infinite
amount of processing power and could do the parallel tasks in zero time, you would still be left with the
time from the serial code part. Therefore, we have to consider at the outset if we can indeed parallelize
a significant amount of the workload.
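This limit is usually stated as Amdahl's law: if a fraction P of the run time can be parallelized and the work is spread over N processors, the overall speedup is

S(N) = 1 / ((1 - P) + P / N)

so even with N effectively infinite the speedup can never exceed 1 / (1 - P); a program that is 90% parallel tops out at a factor of 10.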
NVIDIA is committed to providing support to CUDA. Considerable information, examples, and
tools to help with development are available from its website at http://www.nvidia.com under CudaZone.
CUDA, unlike its predecessors, has now actually started to gain momentum and for the first time it
looks like there will be a programming language that will emerge as the one of choice for GPU
programming. Given that CUDA-enabled GPUs now number in the millions, there is
a huge market out there waiting for CUDA-enabled applications.
There are currently many CUDA-enabled applications and the list grows monthly. NVIDIA showcases
many of these on its community website at http://www.nvidia.com/object/cuda_apps_flash_new.html.
In areas where programs have to do a lot of computational work, for example, making a DVD from your home movies (video transcoding), we see most mainstream video packages now supporting CUDA. The average speedup is 5 to 10 times in this domain.


Along with the introduction of CUDA came the Tesla series of cards. These cards are not graphics
cards, and in fact they have no DVI or VGA connectors on them. They are dedicated compute cards
aimed at scientific computing. Here we see huge speedups in scientific calculations. These cards can
either be installed in a regular desktop PC or in dedicated server racks. NVIDIA provides such
a system at http://www.nvidia.com/object/preconfigured_clusters.html, which claims to provide up to
30 times the power of a conventional cluster. CUDA and GPUs are reshaping the world of high-performance computing.

GPU HARDWARE
The NVIDIA G80 series processor and beyond implemented a design that is similar to both the
Connection Machine and IBM's Cell processor. Each graphics card consists of a number of SMs. To each SM is attached eight or more SPs (Stream Processors). The original 9800 GTX card has 16 SMs of eight SPs each, giving a total of 128 SPs. However, unlike the Roadrunner, each GPU board can be purchased
for a few hundred USD and it doesn’t take 2.35 MW to power it. Power considerations are not to be
overlooked, as we’ll discuss later when we talk about building GPU servers.
The GPU cards can broadly be considered as accelerator or coprocessor cards. A GPU card, currently,
must operate in conjunction with a CPU-based host. In this regard it follows very much the approach of
the Cell processor with the regular serial core and N SIMD SPE cores. Each GPU device contains a set of
SMs, each of which contains a set of SPs or CUDA cores. The SPs execute work in parallel groups of up to 32 threads (warps). They eliminate a lot of the complex circuitry needed on CPUs to achieve high-speed serial
execution through instruction-level parallelism. They replace this with a programmer-specified explicit
parallelism model, allowing more compute capacity to be squeezed onto the same area of silicon.
The overall throughput of GPUs is largely determined by the number of SPs present, the bandwidth
to the global memory, and how well the programmer makes use of the parallel architecture he or she is
working with. See Table 1.3 for a listing of current NVIDIA GPU cards.
Which board is correct for a given application is a balance between the memory and the GPU processing power that application needs. Note the 9800 GX2, 295, 590, 690, and K10 cards are actually dual cards, so to make full use of these they need to be programmed as two devices, not one. The one caveat here is that the figures quoted are for single-precision (32-bit) floating-point performance, not double-precision (64-bit) performance. Also be careful with the GF100 (Fermi) series, as the Tesla variant has double the number of double-precision units found in the standard desktop units, so it achieves significantly better double-precision throughput. The Kepler K20, yet to be released, will also have significantly higher double-precision performance than its already released K10 cousin.
Note also, although not shown here, that as the generations have evolved, the power consumption per SM, clock for clock, has come down. However, the overall power consumption has increased considerably, and this is one of the key considerations in any multi-GPU-based solution. Typically, we see
dual-GPU-based cards (9800 GX2, 295, 590, 690) having marginally lower power consumption
figures than the equivalent two single cards due to the use of shared circuitry and/or reduced clock
frequencies.
NVIDIA provides various racks (the M series computing modules) containing two to four Tesla cards
connected on a shared PCI-E bus for high-density computing. It’s quite possible to build your own GPU
cluster or microsupercomputer from standard PC parts, and we show you how to do this later in the book.


Table 1.3 Current Series of NVIDIA GPU Cards

GPU Series     Device   Number of SPs   Max Memory   GFlops (FMAD)   Bandwidth (GB/s)   Power (Watts)
9800 GT        G92      96              2GB          504             57                 125
9800 GTX       G92      128             2GB          648             70                 140
9800 GX2       G92      256             1GB          1152            2 x 64             197
260            G200     216             2GB          804             110                182
285            G200     240             2GB          1062            159                204
295            G200     480             1.8GB        1788            2 x 110            289
470            GF100    448             1.2GB        1088            134                215
480            GF100    480             1.5GB        1344            177                250
580            GF110    512             1.5GB        1581            152                244
590            GF110    1024            3GB          2488            2 x 164            365
680            GK104    1536            2GB          3090            192                195
690            GK104    3072            4GB          5620            2 x 192            300
Tesla C870     G80      128             1.5GB        518             77                 171
Tesla C1060    G200     240             4GB          933             102                188
Tesla C2070    GF100    448             6GB          1288            144                247
Tesla K10      GK104    3072            8GB          5184            2 x 160            250

The great thing about CUDA is that, despite all the variability in hardware, programs written for the
original CUDA devices can run on today’s CUDA devices. The CUDA compilation model applies the
same principle as used in Java: runtime compilation of a virtual instruction set. This allows modern
GPUs to execute code written for even the oldest generation of GPUs. In many cases such programs benefit significantly from the original programmer reworking them for the features of the newer GPUs. In fact, there
is considerable scope for tuning for the various hardware revisions, which we’ll cover toward the end
of the book.

ALTERNATIVES TO CUDA
OpenCL
So what of the other GPU manufacturers, ATI (now AMD) being the prime example? AMD’s product
range is as impressive as the NVIDIA range in terms of raw compute power. However, AMD brought its stream computing technology to the marketplace a long time after NVIDIA brought out CUDA. As a consequence, NVIDIA has far more applications available for CUDA than AMD/ATI does for its competing stream technology.
OpenCL and DirectCompute are not something we'll cover in this book, but they deserve a mention
in terms of alternatives to CUDA. CUDA is currently only officially executable on NVIDIA hardware.
While NVIDIA has a sizeable chunk of the GPU market, its competitors also hold a sizeable chunk. As
developers, we want to develop products for as large a market as possible, especially if we’re talking
about the consumer market. As such, people should be aware there are alternatives to CUDA, which
support both NVIDIA’s and others’ hardware.
OpenCL is an open and royalty-free standard supported by NVIDIA, AMD, and others. The
OpenCL trademark is owned by Apple. It sets out an open standard that allows the use of compute
devices. A compute device can be a GPU, CPU, or other specialist device for which an OpenCL driver
exists. As of 2012, OpenCL supports all major brands of GPU devices, as well as CPUs with at least SSE3 support.
Anyone who is familiar with CUDA can pick up OpenCL relatively easily, as the fundamental
concepts are quite similar. However, OpenCL is somewhat more complex to use than CUDA, in that much
of the work the CUDA runtime API does for the programmer needs to be explicitly performed in OpenCL.
You can read more about OpenCL at http://www.khronos.org/opencl/. There are also now a number
of books written on OpenCL. I’d personally recommend learning CUDA prior to OpenCL as CUDA is
somewhat of a higher-level language extension than OpenCL.

DirectCompute
DirectCompute is Microsoft’s alternative to CUDA and OpenCL. It is a proprietary product linked to
the Windows operating system, and in particular, the DirectX 11 API. The DirectX API was a huge
leap forward for any of those who remember programming video cards before it. It meant the
developers had to learn only one library API to program all graphics cards, rather than write or license
drivers for each major video card manufacturer.
DirectX 11 is the latest standard and supported under Windows 7. With Microsoft’s name behind
the standard, you might expect to see some quite rapid adoption among the developer community. This
is especially the case with developers already familiar with DirectX APIs. If you are familiar with
CUDA and DirectCompute, then it is quite an easy task to port a CUDA application over to DirectCompute. According to Microsoft, this is something you can typically do in an afternoon’s work if you
are familiar with both systems. However, being Windows centric, DirectCompute is excluded from many high-end systems, where the various flavors of UNIX dominate.
Microsoft is also set to launch C++ AMP, an additional set of standard template libraries (STLs), which may appeal more to programmers already familiar with C++-style STLs.

CPU alternatives
The main parallel processing language extensions are MPI, OpenMP, and pthreads if you are developing for Linux. For Windows there is the Windows threading model and OpenMP. MPI and pthreads are also supported under Windows as various ports from the Unix world.


MPI (Message Passing Interface) is perhaps the most widely known messaging interface. It is
process-based and generally found in large computing labs. It requires an administrator to
configure the installation correctly and is best suited to controlled environments. Parallelism is
expressed by spawning hundreds of processes over a cluster of nodes and explicitly exchanging
messages, typically over high-speed network-based communication links (Ethernet or
InfiniBand). MPI is widely used and taught. It’s a good solution within a controlled cluster
environment.
OpenMP (Open Multi-Processing) is a system designed for parallelism within a node or
computer system. It works entirely differently, in that the programmer specifies various
parallel directives through compiler pragmas. The compiler then attempts to automatically
split the problem into N parts, according to the number of available processor cores. OpenMP
support is built into many compilers, including the NVCC compiler used for CUDA. OpenMP
tends to hit problems with scaling due to the underlying CPU architecture. Often the memory
bandwidth in the CPU is just not large enough for all the cores continuously streaming data to
or from memory.
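As a flavour of the directive style, here is a minimal sketch in C, assuming a compiler with OpenMP support (for example, building with -fopenmp under GCC); a single pragma is enough to split a loop across the available cores.

#include <omp.h>

void scale(float *data, int n, float factor)
{
    /* The compiler divides the loop iterations among the available CPU cores. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        data[i] *= factor;
}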
Pthreads is a library that is used significantly for multithread applications on Linux. As with
OpenMP, pthreads uses threads and not processes as it is designed for parallelism within
a single node. However, unlike OpenMP, the programmer is responsible for thread management
and synchronization. This provides more flexibility and consequently better performance for
well-written programs.
ZeroMQ (0MQ) is also something that deserves a mention. This is a simple library that you link to,
and we will use it later in the book for developing a multinode, multi-GPU example. ZeroMQ
supports thread-, process-, and network-based communication models with a single cross-platform API. It is also available on both Linux and Windows platforms. It's designed for
distributed computing, so the connections are dynamic and nodes fail gracefully.
Hadoop is also something that you may consider. Hadoop is an open-source version of Google’s
MapReduce framework. It’s aimed primarily at the Linux platform. The concept is that you take
a huge dataset and break (or map) it into a number of chunks. However, instead of sending the
data to the node, the dataset is already split over hundreds or thousands of nodes using a parallel
file system. Thus, the program, the reduce step, is instead sent to the node that contains the data.
The output is written to the local node and remains there. Subsequent MapReduce programs take
the previous output and again transform it in some way. As data is in fact mirrored to multiple
nodes, this allows for a highly fault-tolerant as well as high-throughput system.

Directives and libraries
There are a number of compiler vendors, PGI, CAPS, and Cray being the most well-known, that
support the recently announced OpenACC set of compiler directives for GPUs. These, in essence,
replicate the approach of OpenMP, in that the programmer inserts a number of compiler directives
marking regions as “to be executed on the GPU.” The compiler then does the grunt work of moving
data to or from the GPU, invoking kernels, etc.
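The directive style looks much like OpenMP; the following is a minimal sketch, assuming one of the OpenACC-capable compilers mentioned above, in which the pragma asks the compiler to move the data to the GPU, run the loop there, and copy the result back.

void scale(float *data, int n, float factor)
{
    /* copy(data[0:n]) transfers n elements to the GPU and back around the region. */
    #pragma acc kernels copy(data[0:n])
    for (int i = 0; i < n; i++)
        data[i] *= factor;
}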
As with pthreads versus OpenMP, the lower level of control pthreads provides allows you to achieve higher performance. The same is true of CUDA versus OpenACC. This extra level of control comes with a much higher level of required programming knowledge, a higher risk of errors, and the consequential time impact that may have on a development schedule. Currently, OpenACC requires directives to specify not only which areas of code should be run on the GPU, but also in which type of memory the data should reside. NVIDIA claims you can get on the order of a 5× or greater speedup using such
directives. It’s a good solution for those programmers who need to get something working quickly. It’s
also great for those people for whom programming is a secondary consideration who just want the
answer to their problem in a reasonable timeframe.
The use of libraries is another key area where you can obtain some serious productivity gains, as well as execution time speedups. The CUDA SDK provides Thrust, a library of common functions implemented in a very efficient way. Libraries like CUBLAS are some of the best around for
linear algebra. Libraries exist for many well-known applications such as Matlab and Mathematica.
Language bindings exist for Python, Perl, Java, and many others. CUDA can even be integrated with
Excel.
As with many aspects of software development in the modern age, the chances are that someone
has done what you are about to develop already. Search the Internet and see what is already there
before you spend weeks developing a library that, unless you are a CUDA expert, is unlikely to be
faster than one that is already available.

CONCLUSION
So maybe you’re thinking, why develop in CUDA? The answer is that CUDA is currently the easiest
language to develop in, in terms of support, debugging tools, and drivers. CUDA has a head start on
everything else and has a huge lead in terms of maturity. If your application needs to support hardware
other than NVIDIA’s, then the best route currently is to develop under CUDA and then port the
application to one of the other APIs. As such, we’ll concentrate on CUDA, for if you become an expert
with CUDA, it’s easy to pick up alternative APIs should you need to. Understanding how CUDA works
will allow you to better exploit and understand the limitations of any higher-level API.
The journey from a single-thread CPU programmer to a fully fledged parallel programmer on
GPUs is one that I hope you will find interesting. Even if you never program a GPU in the future, the
insight you gain will be of tremendous help in allowing you to design multithread programs. If you,
like us, see the world changing to a parallel programming model, you’ll want to be at the forefront of
that wave of innovation and technological challenge. The single-thread industry is one that is slowly
moving to obsolescence. To be a valuable asset and an employable individual, you need to have skills
that reflect where the computing world is headed, not those that are becoming progressively obsolete.
GPUs are changing the face of computing. All of a sudden the computing power of supercomputers
from a decade ago can be slotted under your desk. No longer must you wait in a queue to submit work
batches and wait months for a committee to approve your request to use limited computer resources at
overstretched computing installations. You can go out, spend 5,000-10,000 USD, and have
a supercomputer on your desk, or a development machine that runs CUDA for a fraction of that. GPUs
are a disruptive technological change that will make supercomputer-like levels of performance
available for everyone.


CHAPTER 2

Understanding Parallelism with GPUs

INTRODUCTION
This chapter aims to provide a broad introduction to the concepts of parallel programming and
how these relate to GPU technology. It’s primarily aimed at those people reading this text with
a background in serial programming, but a lack of familiarity with parallel processing concepts. We
look at these concepts in the primary context of GPUs.

TRADITIONAL SERIAL CODE
A significant number of programmers graduated when serial programs dominated the landscape
and parallel programming attracted just a handful of enthusiasts. Most people who go to
university get a degree related to IT because they are interested in technology. However, they also
appreciate they need to have a job or career that pays a reasonable salary. Thus, in specializing, at
least some consideration is given to the likely availability of positions after university. With the
exception of research or academic posts, the number of commercial roles in parallel programming
has always been, at best, small. Most programmers developed applications in a simple serial
fashion based broadly on how universities taught them to program, which in turn was driven by
market demand.
The landscape of parallel programming is scattered, with many technologies and languages that
never quite made it to the mainstream. There was never really the large-scale market need for parallel
hardware and, as a consequence, significant numbers of parallel programmers. Every year or two the
various CPU vendors would bring out a new processor generation that executed code faster than the
previous generation, thereby perpetuating serial code.
Parallel programs by comparison were often linked closely to the hardware. Their goal was to
achieve faster performance and often that was at the cost of portability. Feature X was implemented
differently, or was not available in the next generation of parallel hardware. Periodically a revolutionary new architecture would appear that required a complete rewrite of all code. If your
knowledge as a programmer was centered around processor X, it was valuable in the marketplace
only so long as processor X was in use. Therefore, it made a lot more commercial sense to learn to
program x86-type architecture than some exotic parallel architecture that would only be around for
a few years.

However, over this time, a couple of standards did evolve that we still have today. The OpenMP
standard addresses parallelism within a single node and is designed for shared memory machines that
contain multicore processors. It does not have any concept of anything outside a single node or box.
Thus, you are limited to problems that fit within a single box in terms of processing power, memory
capacity, and storage space. Programming, however, is relatively easy as most of the low-level
threading code (otherwise written using Windows threads or POSIX threads) is taken care of for you by
OpenMP.
The MPI (Message Passing Interface) standard addresses parallelism between nodes and is aimed
at clusters of machines within well-defined networks. It is often used in supercomputer installations
where there may be many thousands of individual nodes. Each node holds a small section of the
problem. Thus, common resources (CPU, cache, memory, storage, etc.) are multiplied by the number
of nodes in the network. The Achilles’ heel of any network is the various interconnects, the parts that
connect the networked machines together. Internode communication is usually the dominating factor
determining the maximum speed in any cluster-based solution.
Both OpenMP and MPI can be used together to exploit parallelism within nodes as well as across
a network of machines. However, the APIs and the approaches used are entirely different, meaning
they are often not used together. The OpenMP directives allow the programmer to take a high-level
view of parallelism via specifying parallel regions. MPI by contrast uses an explicit interprocess
communication model making the programmer do a lot more work.
Having invested the time to become familiar with one API, programmers are often loathe to
learn another. Thus, problems that fit within one computer are often implemented with OpenMP
solutions, whereas really large problems are implemented with cluster-based solutions such as
MPI.
CUDA, the GPU programming language we’ll explore in this text, can be used in conjunction
with both OpenMP and MPI. There is also an OpenMP-like directive version of CUDA (OpenACC)
that may be somewhat easier for those familiar with OpenMP to pick up. OpenMP, MPI, and
CUDA are increasingly taught at undergraduate and graduate levels in many university computer
courses.
However, the first experience most serial programmers had with parallel programming was the
introduction of multicore CPUs. These, like the parallel environments before them, were largely
ignored by all but a few enthusiasts. The primary use of multicore CPUs was for OS-based parallelism.
This is a model based on task parallelism that we’ll look at a little later.
As it became obvious that technology was marching toward the multicore route, more and more
programmers started to take notice of the multicore era. Almost all desktops ship today with either
a dual- or quad-core processor. Thus, programmers started using threads to allow the multiple cores on
the CPU to be exploited.
A thread is a separate execution flow within a program that may diverge and converge as and when
required with the main execution flow. Typically, CPU programs will have no more than twice as many active threads as there are physical processor cores. As with single-core processors,
typically each OS task is time-sliced, given a small amount of time in turn, to give the illusion of
running more tasks than there are physical CPU cores.
However, as the number of threads grows, this becomes more obvious to the end user. In the
background the OS is having to context switch (swap in and out a set of registers) every time it
needs to switch between tasks. As context switching is an expensive operation, typically thousands of cycles, CPU applications tend to have a fairly low number of threads compared
with GPUs.

SERIAL/PARALLEL PROBLEMS
Threads brought with them many of the issues of parallel programming, such as sharing resources.
Typically, this is done with a semaphore, which is simply a lock or token. Whoever has the token can
use the resource and everyone else has to wait for the user of the token to release it. As long as there is
only a single token, everything works fine.
Problems occur when there are two or more tokens that must be shared by the same threads. In such
situations, thread 0 grabs token 0, while thread 1 grabs token 1. Thread 0 now tries to grab token 1,
while thread 1 tries to grab token 0. As the tokens are unavailable, both thread 0 and thread 1 sleep until
the token becomes available. As neither thread ever releases the one token they already own, all
threads wait forever. This is known as a deadlock, and it is something that can and will happen without
proper design.
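To make the two-token case concrete, here is a minimal sketch using POSIX threads (the names are illustrative): each thread takes the locks in the opposite order, and with the wrong timing both block forever.

#include <pthread.h>

pthread_mutex_t token0 = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t token1 = PTHREAD_MUTEX_INITIALIZER;

void *thread0_work(void *arg)
{
    pthread_mutex_lock(&token0);    /* grabs token 0 ...           */
    pthread_mutex_lock(&token1);    /* ... then waits for token 1  */
    /* ... use both shared resources ... */
    pthread_mutex_unlock(&token1);
    pthread_mutex_unlock(&token0);
    return arg;
}

void *thread1_work(void *arg)
{
    pthread_mutex_lock(&token1);    /* grabs token 1 ...           */
    pthread_mutex_lock(&token0);    /* ... then waits for token 0  */
    /* ... use both shared resources ... */
    pthread_mutex_unlock(&token0);
    pthread_mutex_unlock(&token1);
    return arg;
}

If both threads get past their first lock before either reaches its second, neither can ever proceed. The usual fix is to agree on a single locking order that every thread follows.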
The opposite also happens: sharing of resources by chance. With any sort of locking system, all
parties to a resource must behave correctly. That is, they must request the token, wait if necessary, and,
only when they have the token, perform the operation. This relies on the programmer to identify shared
resources and specifically put in place mechanisms to coordinate updates by multiple threads.
However, there are usually several programmers in any given team. If just one of them doesn’t follow
this convention, or simply does not know this is a shared resource, you may appear to have a working
program, but only by chance.
One of the projects I worked on for a large company had exactly this problem. All threads requested
a lock, waited, and updated the shared resource. Everything worked fine and the particular code passed
quality assurance and all tests. However, in the field occasionally users would report the value of
a certain field being reset to 0, seemingly randomly. Random bugs are always terrible to track down,
because being able to consistently reproduce a problem is often the starting point of tracking down the
error.
An intern who happened to be working for the company actually found the issue. In a completely
unrelated section of the code a pointer was not initialized under certain conditions. Due to the way the
program ran, some of the time, depending on the thread execution order, the pointer would point to our
protected data. The other code would then initialize “its variable” by writing 0 to the pointer, thus
eliminating the contents of our “protected” and thread-shared parameter.
This is one of the unfortunate areas of thread-based operations; they operate with a shared memory
space. This can be both an advantage in terms of not having to formally exchange data via messages,
and a disadvantage in the lack of protection of shared data.
The alternative to threads is processes. These are somewhat heavier in terms of OS load in that both
code and data contexts must be maintained by the OS. A thread by contrast needs to only maintain
a code context (the program/instruction counter plus a set of registers) and shares the same data space.
Both threads and processes may be executing entirely different sections of a program at any point
in time.
Processes by default operate in an independent memory area. This usually is enough to ensure one
process is unable to affect the data of other processes. Thus, the stray pointer issue should result in an exception for out-of-bounds memory access, or at the very least localize the bug to the particular
process. Data consequently has to be transferred by formally passing messages to or from processes.
In many respects the threading model sits well with OpenMP, while the process model sits well
with MPI. In terms of GPUs, they map to a hybrid of both approaches. CUDA uses a grid of blocks.
This can be thought of as a queue (or a grid) of processes (blocks) with no interprocess communication. Within each block there are many threads which operate cooperatively in batches called warps.
We will look at this further in the coming chapters.

CONCURRENCY
The first aspect of concurrency is to think about the particular problem, without regard for any
implementation, and consider what aspects of it could run in parallel.
If possible, try to think of a formula that represents each output point as some function of the input data.
This may be too cumbersome for some algorithms, for example, those that iterate over a large number of
steps. For these, consider each step or iteration individually. Can the data points for the step be represented
as a transformation of the input dataset? If so, then you simply have a set of kernels (steps) that run in
sequence. These can simply be pushed into a queue (or stream) that the hardware will schedule sequentially.
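In CUDA that queue is a stream; the following is a minimal sketch in which step1 and step2 are placeholder kernels, showing that work issued into the same stream executes in the order it was pushed.

#include <cuda_runtime.h>

__global__ void step1(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;          // first transformation of the dataset
}

__global__ void step2(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;          // second transformation, uses step1's output
}

void run_pipeline(float *dev_data, int n)
{
    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Kernels in the same stream run in issue order, so step2 sees step1's results.
    step1<<<blocks, threads, 0, stream>>>(dev_data, n);
    step2<<<blocks, threads, 0, stream>>>(dev_data, n);

    cudaStreamSynchronize(stream);       // wait for the whole sequence to finish
    cudaStreamDestroy(stream);
}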
A significant number of problems are known as “embarrassingly parallel,” a term that rather
underplays what is being achieved. If you can construct a formula where the output data points can be
represented without relation to each other (for example, a matrix multiplication), be very happy.
These types of problems can be implemented extremely well on GPUs and are easy to code.
If one or more steps of the algorithm can be represented in this way, but maybe one stage cannot,
also be very happy. This single stage may turn out to be a bottleneck and may require a little thought,
but the rest of the problem will usually be quite easy to code on a GPU.
If the problem requires every data point to know about the value of its surrounding neighbors then
the speedup will ultimately be limited. In such cases, throwing more processors at the problem works
up to a point. At this point the computation slows down due to the processors (or threads) spending
more time sharing data than doing any useful work. The point at which you hit this will depend largely
on the amount and cost of the communication overhead.
CUDA is ideal for an embarrassingly parallel problem, where little or no interthread or interblock
communication is required. It supports interthread communication with explicit primitives using on-chip resources. Interblock communication is, however, only supported by invoking multiple kernels in
series, communicating between kernel runs using off-chip global memory. It can also be performed in
a somewhat restricted way through atomic operations to or from global memory.
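For example, in this minimal sketch (with names of my choosing), threads from every block safely accumulate into a single counter in global memory via an atomic operation.

// Each thread tests one element; the atomic increment is serialized by the
// hardware, so many blocks can update the shared counter without a race.
__global__ void count_nonzero(const unsigned char *data, int n, unsigned int *count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] != 0)
        atomicAdd(count, 1u);
}

Because atomics serialize the conflicting updates, they are best kept to a small fraction of the total work.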
CUDA splits problems into grids of blocks, each containing multiple threads. The blocks may run
in any order. Only a subset of the blocks will ever execute at any one point in time. A block must
execute from start to completion and may be run on one of N SMs (streaming multiprocessors).
Blocks are allocated from the grid of blocks to any SM that has free slots. Initially this is done on
a round-robin basis so each SM gets an equal distribution of blocks. For most kernels, the number of
blocks needs to be in the order of eight or more times the number of physical SMs on the GPU.
To use a military analogy, we have an army (a grid) of soldiers (threads). The army is split into
a number of units (blocks), each commanded by a lieutenant. The unit is split into squads of 32 soldiers
(a warp), each commanded by a sergeant (See Figure 2.1).

FIGURE 2.1
GPU-based view of threads: a grid is made up of blocks, and each block is made up of a number of warps of threads.

To perform some action, central command (the kernel/host program) must provide some action
plus some data. Each soldier (thread) works on his or her individual part of the problem. Threads may
from time to time swap data with one another under the coordination of either the sergeant (the warp)
or the lieutenant (the block). However, any coordination with other units (blocks) has to be performed
by central command (the kernel/host program).
Thus, it’s necessary to think of orchestrating thousands of threads in this very hierarchical manner
when you think about how a CUDA program will implement concurrency. This may sound quite
complex at first. However, for most embarrassingly parallel programs it’s just a case of thinking of one
thread generating a single output data point. A typical GPU has on the order of 24 K active threads. On
Fermi GPUs you can define 65,535 × 65,535 × 1536 threads in total, 24 K of which are active at any
time. This is usually enough to cover most problems within a single node.

Locality
Computing has, over the last decade or so, moved from being limited by the computational throughput of the processor to being limited primarily by the cost of moving data. When designing a processor in
terms of processor real estate, compute units (or ALUs, arithmetic logic units) are cheap. They can
run at high speed, and consume little power and physical die space. However, ALUs are of little use
without operands. Considerable amounts of power and time are consumed in moving the operands to
and from these functional units.
In modern computer designs this is addressed by the use of multilevel caches. Caches work on the
principle of either spatial (close in the address space) or temporal (close in time) locality. Thus, data
that has been accessed before, will likely be accessed again (temporal locality), and data that is close to
the last accessed data will likely be accessed in the future (spatial locality).
Caches work well where the task is repeated many times. Consider for the moment a tradesperson,
a plumber with a toolbox (a cache) that can hold four tools. A number of the jobs he will attend are
similar, so the same four tools are repeatedly used (a cache hit).
However, a significant number of jobs require additional tools. If the tradesperson does not know in
advance what the job will entail, he arrives and starts work. Partway through the job he needs an
additional tool. As it’s not in his toolbox (L1 cache), he retrieves the item from the van (L2 cache).


Occasionally he needs a special tool or part and must leave the job, drive down to the local
hardware store (global memory), fetch the needed item, and return. Neither the tradesperson nor the
client knows how long (the latency) this operation will actually take. There may be congestion on the
freeway and/or queues at the hardware store (other processes competing for main memory access).
Clearly, this is not a very efficient use of the tradesperson’s time. Each time a different tool or part is
needed, it needs to be fetched by the tradesperson from either the van or the hardware store. While
fetching new tools the tradesperson is not working on the problem at hand.
While this might seem bad, fetching data from a hard drive or SSD (solid-state drive) is akin to
ordering an item at the hardware store. In comparative form, data from a hard drive arrives by regular
courier several days later. Data from the SSD may arrive by overnight courier, but it’s still very slow
compared to accessing data in global memory.
In some more modern processor designs we have hardware threads. Some Intel processors feature
hyperthreading, with two hardware threads per CPU core. To keep with the same analogy, this is
equivalent to the tradesperson having an assistant and starting two jobs. Every time a new tool/part is
required, the assistant is sent to fetch the new tool/part and the tradesperson switches to the alternate
job. Providing the assistant is able to return with the necessary tool/part before the alternate job also
needs an additional tool/part, the tradesperson continues to work.
Although an improvement, this has not solved the latency issue: how long it takes to fetch new
tools/parts from the hardware store (global memory). Typical latencies to global memory are in the
order of hundreds of clocks. Increasingly, the answer to this problem from traditional processor design
has been to increase the size of the cache. In effect, arrive with a bigger van so fewer trips to the
hardware store are necessary.
There is, however, an increasing cost to this approach, both in terms of capital outlay for a larger
van and the time it takes to search a bigger van for the tool/part. Thus, the approach taken by most
designs today is to arrive with a van (L2 cache) and a truck (L3 cache). In the extreme case of the server
processors, a huge 18-wheeler is brought in to try to ensure the tradesperson is kept busy for just that
little bit longer.
All of this work is necessary because of one fundamental reason. The CPUs are designed to run
software where the programmer does not have to care about locality. Locality is an issue, regardless of
whether the processor tries to hide it from the programmer or not. The denial that this is an issue is
what leads to the huge amount of hardware necessary to deal with memory latency.
The design of GPUs takes a different approach. It places the GPU programmer in charge of dealing
with locality and instead of an 18-wheeler truck gives him or her a number of small vans and a very
large number of tradespeople.
Thus, in the first instance the programmer must deal with locality. He or she needs to think in
advance about what tools/parts (memory locations/data structures) will be needed for a given job.
These then need to be collected in a single trip to the hardware store (global memory) and placed in the
correct van (on chip memory) for a given job at the outset. Given that this data has been collected, as
much work as possible needs to be performed with the data to avoid having to fetch and return it only
to fetch it again later for another purpose.
Thus, the continual cycle of work-stall-fetch from global memory, work-stall-fetch from global
memory, etc. is broken. We can see the same analogy on a production line. Workers are supplied with
baskets of parts to process, rather than each worker individually fetching widgets one at a time from the
store manager’s desk. To do otherwise is simply a hugely inefficient use of the available workers’ time.


This simple process of planning ahead allows the programmer to schedule memory loads into the
on-chip memory before they are needed. This works well with both an explicit local memory model
such as the GPU’s shared memory as well as a CPU-based cache. In the shared memory case you tell
the memory management unit to request this data and then go off and perform useful work on another
piece of data. In the cache case you can use special cache instructions that allow prefilling of the cache
with data you expect the program to use later.
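In CUDA terms, this "collect everything in one trip" pattern is a thread block staging its data into shared memory before working on it. Here is a minimal sketch (sizes and names are illustrative; launch with TILE threads per block).

#define TILE 256

__global__ void process_tile(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];              // the block's on-chip "toolbox"
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)
        tile[threadIdx.x] = in[i];            // one planned trip to global memory
    __syncthreads();                          // wait until the whole tile is loaded

    if (i < n)
        out[i] = tile[threadIdx.x] * 2.0f;    // then work out of fast on-chip memory
}

In a real kernel the payoff comes when each staged value is reused several times; a single use, as here, only shows the shape of the pattern.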
The downside of the cache approach over the shared memory approach is eviction and dirty data.
Data in a cache is said to be dirty if it has been written by the program. To free up the space in the cache
for new useful data, the dirty data has to be written back to global memory before the cache space can
be used again. This means instead of one trip to global memory of an unknown latency, we now have
two: one to write the old data and one to get the new data.
The big advantage of the programmer-controlled on-chip memory is that the programmer is in control of
when the writes happen. If you are performing some local transformation of the data, there may be no need
to write the intermediate transformation back to global memory. With a cache, the cache controller does not
know what needs to be written and what can be discarded. Thus, it writes everything, potentially creating
lots of useless memory traffic that may in turn cause unnecessary congestion on the memory interface.
Although many do, not every algorithm lends itself to this type of “known in advance” memory
pattern that the programmer can optimize for. At the same time, not every programmer wants to deal
with locality issues, either initially or sometimes at all. It’s a perfectly valid approach to develop
a program, prove the concept, and then deal with locality issues.
To facilitate such an approach and to deal with the issues of algorithms that did not have
a well-defined data/execution pattern, later generations of GPUs (compute 2.x onward) have both L1
and L2 caches. These can be configured with a preference toward cache or shared memory, allowing
the programmer flexibility to configure the hardware for a given problem.

TYPES OF PARALLELISM
Task-based parallelism
If we look at a typical operating system, we see it exploit a type of parallelism called task parallelism.
The processes are diverse and unrelated. A user might be reading an article on a website while playing
music from his or her music library in the background. More than one CPU core can be exploited by
running each application on a different core.
In terms of parallel programming, this can be exploited by writing a program as a number of
sections that “pipe” (send via messages) the information from one application to another. The Linux
pipe operator (the | symbol) does just this, via the operating system. The output of one program, such
as grep, is the input of the next, such as sort. Thus, a set of input files can be easily scanned for
a certain set of characters (the grep program) and that output set then sorted (the sort program). Each
program can be scheduled to a separate CPU core.
This pattern of parallelism is known as pipeline parallelism. The output on one program provides
the input for the next. With a diverse set of components, such as the various text-based tools in Linux,
a huge variety of useful functions can be performed by the user. As the programmer cannot know at the
outset everyone’s needs, by providing components that operate together and can be connected easily,
the programmer can target a very wide and diverse user base.

This type of parallelism is very much geared toward coarse-grained parallelism. That is, there are
a number of powerful processors, each of which can perform a significant chunk of work.
In terms of GPUs we see coarse-grained parallelism only in terms of a GPU card and the execution
of GPU kernels. GPUs support the pipeline parallelism pattern in two ways. First, kernels can be
pushed into a single stream and separate streams executed concurrently. Second, multiple GPUs can
work together directly through either passing data via the host or passing data via messages directly to
one another over the PCI-E bus. This latter approach, the peer-to-peer (P2P) mechanism, was introduced in the CUDA 4.x SDK and requires certain OS/hardware/driver-level support.
One of the issues with a pipeline-based pattern is, like any production line, it can only run as fast as
the slowest component. Thus, if the pipeline consists of five elements, each of which takes one second,
we can produce one output per second. However, if just one of these elements takes two seconds, the
throughput of the entire pipeline is reduced to one output every two seconds.
The approach to solving this is twofold. Let’s consider the production line analogy for a moment.
Fred’s station takes two seconds because his task is complex. If we provide Fred with an assistant, Tim,
and split his task in half with Tim, we’re back to one second per stage. We now have six stages instead
of five, but the throughput of the pipeline is now again one widget per second.
You can put up to four GPUs into a desktop PC with some thought and care about the design (see
Chapter 11 on designing GPU systems). Thus, if we have a single GPU and it’s taking too long to
process a particular workflow, we can simply add another one and increase the overall processing
power of the node. However, we then have to think about the division of work between the two GPUs.
There may not be an easy 50/50 split. If we can only extract a 70/30 split, clearly the maximum benefit
will be 7/10 (70%) of the existing runtime. If we could introduce another GPU and then maybe move
another task, which occupied say 20% of the time, we’d end up with a 50/30/20 split. Again the
runtime compared to one GPU would be 1/2, or 50%, of the original time. We're still left with the worst-case time dominating the overall execution time.
The same issue applies to providing a speedup when using a single CPU/GPU combination. If we
move 80% of the work off the CPU and onto the GPU, with the GPU computing this in just 10% of the
time, what is the speedup? Well the CPU now takes 20% of the original time and the GPU 10% of the
original time, but in parallel. Thus, the dominating factor is still the CPU. As the GPU is running in
parallel and consumes less time than the CPU fraction, we can discount this time entirely. Thus, the
maximum speedup is one divided by the fraction of the program that takes the longest time to execute.
This is known as Amdahl’s law and is often quoted as the limiting factor in any speedup. It allows you to
know at the outset what the maximum speedup achievable is, without writing a single line of code. Ultimately, you will have serial operations. Even if you move everything onto the GPU, you will still have to use
the CPU to load and store data to and from storage devices. You will also have to transfer data to and from
the GPU to facilitate input and output (I/O). Thus, maximum theoretical speedup is determined by the
fraction of the program that performs the computation/algorithmic part, plus the remaining serial fraction.
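Written out as a brief formula (a sketch of the reasoning above, assuming the CPU and GPU portions overlap as in the example), if the CPU keeps a fraction c of the original runtime T and the GPU portion takes a fraction g of T, then

T_{\text{new}} = \max(c, g)\,T, \qquad S_{\max} = \frac{T}{T_{\text{new}}} = \frac{1}{\max(c, g)}

With c = 0.2 and g = 0.1 from the example, the bound is 1/0.2, a maximum speedup of five times. If the two portions cannot be overlapped, the familiar serial form of Amdahl's law, S_{\max} = 1/(c + g), applies instead and the bound drops to roughly 3.3 times.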

Data-based parallelism
Computation power has been greatly increasing over the past couple of decades. We now have teraflop-capable GPUs. However, what has not kept pace with this evolution of compute power is the access
time for data. The idea of data-based parallelism is that instead of concentrating on what tasks have to
be performed, we look first to the data and how it needs to be transformed.

Task-based parallelism tends to fit more with coarse-grained parallelism approaches. Let’s use an
example of performing four different transformations on four separate, unrelated, and similarly sized
arrays. We have four CPU cores, and a GPU with four SMs. In a task-based decomposition of the
problem, we would assign one array to each of the CPU cores or SMs in the GPU. The parallel
decomposition of the problem is driven by thinking about the tasks or transformations, not the data.
On the CPU side we could create four threads or processes to achieve this. On the GPU side we
would need to use four blocks and pass the address of every array to every block. On the newer Fermi
and Kepler devices, we could also create four separate kernels, one to process each array and run it
concurrently.
A data-based decomposition would instead split the first array into four blocks and assign one CPU
core or one GPU SM to each section of the array. Once completed, the remaining three arrays would be
processed in a similar way. In terms of the GPU implementation, this would be four kernels, each of
which contained four or more blocks. The parallel decomposition here is driven by thinking about the
data first and the transformations second.
As our CPU has only four cores, it makes a lot of sense to decompose the data into four blocks. We
could have thread 0 process element 0, thread 1 process element 1, thread 2 process element 2, thread
3 process element 3, and so on. Alternatively, the array could be split into four parts and each thread
could start processing its section of the array.
In the first case, thread 0 fetches element 0. As CPUs contain multiple levels of cache, this brings the
data into the device. Typically the L3 cache is shared by all cores. Thus, the memory access from the
first fetch is distributed to all cores in the CPU. By contrast in the second case, four separate memory
fetches are needed and four separate L3 cache lines are utilized. The latter approach is often better
where the CPU cores need to write data back to memory. Interleaving the data elements by core means
the cache has to coordinate and combine the writes from different cores, which is usually a bad idea.
If the algorithm permits, we can exploit a certain type of data parallelism, the SIMD (single
instruction, multiple data) model. This would make use of special SIMD instructions such as MMX,
SSE, AVX, etc. present in many x86-based CPUs. Thus, thread 0 could actually fetch multiple adjacent
elements and process them with a single SIMD instruction.
If we consider the same problem on the GPU, each array needs to have a separate transformation
performed on it. This naturally maps such that one transformation equates to a single GPU kernel (or
program). Each SM, unlike a CPU core, is designed to run multiple blocks of data with each block split
into multiple threads. Thus, we need a further level of decomposition to use the GPU efficiently. We’d
typically allocate, at least initially, a combination of blocks and threads such that a single thread
processed a single element of data. As with the CPU, there are benefits from processing multiple
elements per thread. This is somewhat limited on GPUs as only load/store/move explicit SIMD
primitives are supported, but this in turn allows for enhanced levels of instruction-level parallelism
(ILP), which we’ll see later is actually quite beneficial.
With Fermi and Kepler GPUs, we have a shared L2 cache that replicates the L3 cache function on
the CPU. Thus, as with the CPU, a memory fetch from one thread can be distributed to other threads
directly from the cache. On older hardware, there is no cache. However, on GPUs adjacent memory
locations are coalesced (combined) together by the hardware, resulting in a single and more efficient
memory fetch. We look at this in detail in Chapter 6 on memory.
One important distinction between the caches found in GPUs and CPUs is cache coherency. In
a cache-coherent system a write to a memory location needs to be communicated to all levels of cache

in all cores. Thus, all processor cores see the same view of memory at any point in time. This is one of
the key factors that limits the number of cores in a processor. Communication becomes increasingly
more expensive in terms of time as the processor core count increases. The worst case in a cache-coherent system is where each core writes adjacent memory locations as each write forces a global
update to every core’s cache.
A non cache-coherent system by comparison does not automatically update the other core’s caches.
It relies on the programmer to write the output of each processor core to separate areas/addresses. This
supports the view of a program where a single core is responsible for a single or small set of outputs.
CPUs follow the cache-coherent approach whereas the GPU does not and thus is able to scale to a far
larger number of cores (SMs) per device.
Let’s assume for simplicity that we implement a kernel as four blocks. Thus, we have four kernels
on the GPU and four processes or threads on the CPU. The CPU may support mechanisms such as
hyperthreading to enable processing of additional threads/processes due to a stall event, a cache miss,
for example. Thus, we could increase this number to eight and we might see an increase in performance. However, at some point, sometimes even at less than the number of cores, the CPU hits a point
where there are just too many threads.
At this point the memory bandwidth becomes flooded and cache utilization drops off, resulting in
less performance, not more.
On the GPU side, four blocks is nowhere near enough to satisfy four SMs. Each SM can actually
schedule up to eight blocks (16 on Kepler). Thus, we'd need 8 × 4 = 32 blocks to load the four SMs
correctly. As we have four independent operations, we can launch four simultaneous kernels on Fermi
hardware via the streams feature (see Chapter 8 on using multiple GPUs). Consequently, we can
launch 16 blocks in total and work on the four arrays in parallel. As with the CPU, however, it would be
more efficient to work on one array at a time as this would likely result in better cache utilization. Thus,
on the GPU we need to ensure we always have enough blocks (typically a minimum of 8 to 16 times
the number of SMs on the GPU device).

FLYNN’S TAXONOMY
We mentioned the term SIMD earlier. This classification comes from Flynn’s taxonomy, a classification of different computer architectures. The various types are as follows:
• SIMD: single instruction, multiple data
• MIMD: multiple instructions, multiple data
• SISD: single instruction, single data
• MISD: multiple instructions, single data

The standard serial programming most people will be familiar with follows the SISD model. That is,
there is a single instruction stream working on a single data item at any one point in time. This equates
to a single-core CPU able to perform one task at a time. Of course it’s quite possible to provide the
illusion of being able to perform more than a single task by simply switching between tasks very
quickly, so-called time-slicing.
MIMD systems are what we see today in dual- or quad-core desktop machines. They have a work
pool of threads/processes that the OS will allocate to one of N CPU cores. Each thread/process has an

independent stream of instructions, and thus the hardware contains all the control logic for decoding
many separate instruction streams.
SIMD systems try to simplify this approach, in particular with the data parallelism model. They
follow a single instruction stream at any one point in time. Thus, they require a single set of logic inside
the device to decode and execute the instruction stream, rather than multiple-instruction decode paths.
By removing this silicon real estate from the device, they can be smaller, cheaper, consume less power,
and run at higher clock rates than their MIMD cousins.
Many algorithms make use of a small number of data points in one way or another. The data points
can often be arranged as a SIMD instruction. Thus, all data points may have some fixed offset added,
followed by a multiplication, a gain factor for example. This can be easily implemented as SIMD
instructions. In effect, you are programming “for this range of data, perform this operation” instead of
“for this data point, perform this operation.” As the data operation or transformation is constant for all
elements in the range, it can be fetched and decoded from the program memory only once. As the range is
defined and contiguous, the data can be loaded en masse from the memory, rather than one word at a time.
However, algorithms where one element has transformation A applied while another element has
transformation B applied, and all others have transformation C applied, are difficult to implement using
SIMD. The exception is where this algorithm is hard-coded into the hardware because it’s very common.
Such examples include AES (Advanced Encryption Standard) and H.264 (a video compression standard).
The GPU takes a slightly different approach to SIMD. It implements a model NVIDIA calls SIMT
(single instruction, multiple thread). In this model the instruction side of the SIMD instruction is not
a fixed function as it is within the CPU hardware. The programmer instead defines, through a kernel,
what each thread will do. Thus, the kernel will read the data uniformly and the kernel code will execute
transformation A, B, or C as necessary. In practice, what happens is that A, B, and C are executed in
sequence by repeating the instruction stream and masking out the nonparticipating threads. However,
conceptually this is a much easier model to work with than one that only supports SIMD.
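A minimal sketch of what this looks like in a kernel (the tag values and the three transformations are invented for illustration): each thread selects its own path, and the hardware serializes the paths taken within a warp by masking out the nonparticipating threads.

// Sketch of SIMT execution: tag[] selects which transformation each element gets.
__global__ void simt_transform(const int *tag, float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    // Threads in one warp may take different branches; the hardware runs each
    // taken path in turn, masking out the threads that did not take it.
    if (tag[i] == 0)
        data[i] += 1.0f;          // transformation A
    else if (tag[i] == 1)
        data[i] *= 2.0f;          // transformation B
    else
        data[i] = 0.0f;           // transformation C
}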

SOME COMMON PARALLEL PATTERNS
A number of parallel problems can be thought of as patterns. We see patterns in many software
programs, although not everyone is aware of them. Thinking in terms of patterns allows us to broadly
deconstruct or abstract a problem, and therefore more easily think about how to solve it.

Loop-based patterns
Almost anyone who has done any programming is familiar with loops. They vary primarily in terms of
entry and exit conditions (for, do-while, while), and whether they create dependencies between loop
iterations or not.
A loop-based iteration dependency is where one iteration of the loop depends on one or more
previous iterations. We want to remove these if at all possible as they make implementing parallel
algorithms more difficult. If in fact this can’t be done, the loop is typically broken into a number of
blocks that are executed in parallel. The result from block 0 is then retrospectively applied to block 1,
then to block 2, and so on. There is an example later in this text where we adopt just such an approach
when handling the prefix-sum algorithm.

Loop-based iteration is one of the easiest patterns to parallelize. With inter-loop dependencies
removed, it’s then simply a matter of deciding how to split, or partition, the work between the available
processors. This should be done with a view to minimizing communication between processors and
maximizing the use of on-chip resources (registers and shared memory on a GPU; L1/L2/L3 cache on
a CPU). Communication overhead typically scales badly and is often the bottleneck in poorly designed
systems.
The macro-level decomposition should be based on the number of logical processing units
available. For the CPU, this is simply the number of logical hardware threads available. For the GPU,
this is the number of SMs multiplied by the maximum load we can give to each SM, 1 to 16 blocks
depending on resource usage and GPU model. Notice we use the term logical and not physical
hardware thread. Some Intel CPUs in particular support more than one logical thread per physical CPU
core, so-called hyperthreading. GPUs run multiple blocks on a single SM, so we have to at least
multiply the number of SMs by the maximum number of blocks each SM can support.
Using more than one thread per physical device maximizes the throughput of such devices, in terms
of giving them something to do while they may be waiting on either a memory fetch or I/O-type
operation. Selecting some multiple of this minimum number can also be useful in terms of load
balancing on the GPU and allows for improvements when new GPUs are released. This is particularly
the case when the partition of the data would generate an uneven workload, where some blocks take
much longer than others. In this case, using many times the number of SMs as the basis of the partitioning of the data allows slack SMs to take work from a pool of available blocks.
However, on the CPU side, oversubscribing the number of threads tends to lead to poor performance. This is largely due to context switching being performed in software by the OS. Increased
contention for the cache and memory bandwidth also contributes significantly should you try to run too
many threads. Thus, an existing multicore CPU solution, taken as is, typically has far too large
a granularity for a GPU. You will almost always have to repartition the data into many smaller blocks
to solve the same problem on the GPU.
When considering loop parallelism and porting an existing serial implementation, be critically
aware of hidden dependencies. Look carefully at the loop to ensure one iteration does not calculate
a value used later. Be wary of loops that count down as opposed to the standard zero to max value
construct, which is the most common type of loop found. Why did the original programmer count
backwards? It is likely because there is some dependency in the loop, and parallelizing it
without understanding the dependencies will likely break it.
We also have to consider loops where we have an inner loop and one or more outer loops. How
should these be parallelized? On a CPU the approach would be to parallelize only the outer loop as you
have only a limited number of threads. This works well, but as before it depends on there being no loop
iteration dependencies.
On the GPU the inner loop, provided it is small, is typically implemented by threads within
a single block. As the loop iterations are grouped, adjacent threads usually access adjacent memory
locations. This often allows us to exploit locality, something very important in CUDA programming.
Any outer loop(s) are then implemented as blocks of threads. These are concepts we cover in
detail in Chapter 5.
Consider also that most loops can be flattened, thus reducing an inner and outer loop to a single
loop. Think about an image processing algorithm that iterates along the X pixel axis in the inner loop
and the Y pixel axis in the outer loop. It’s possible to flatten this loop by considering all pixels as

a single-dimensional array and iterating over pixels as opposed to image coordinates. This requires
a little more thought on the programming side, but it may be useful if one or more loops contain a very
small number of iterations. Such small loops present considerable loop overhead compared to the work
done per iteration. They are, thus, typically not efficient.
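As a sketch of the flattening itself (the function names and the per-pixel operation are invented), the two nested pixel loops collapse into a single loop over width × height, and on the GPU that single flattened index maps directly onto one thread:

// Serial, flattened form: one loop over all width * height pixels.
void halve_cpu(float *image, int width, int height)
{
    for (int idx = 0; idx < width * height; idx++)
        image[idx] *= 0.5f;       // illustrative per-pixel operation
}

// The same flattening on the GPU: one thread per flattened pixel index.
__global__ void halve_gpu(float *image, int width, int height)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < width * height)
        image[idx] *= 0.5f;
}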

Fork/join pattern
The fork/join pattern is a common pattern in serial programming where there are synchronization
points and only certain aspects of the program are parallel. The serial code runs and at some point hits
a section where the work can be distributed to P processors in some manner. It then “forks” or spawns
N threads/processes that perform the calculation in parallel. These then execute independently and
finally converge or join once all the calculations are complete. This is typically the approach found in
OpenMP, where you define a parallel region with pragma statements. The code then splits into
N threads and later converges to a single thread again.
In Figure 2.2, we see a queue of data items. As we have three processing elements (e.g., CPU
cores), these are split into three queues of data, one per processing element. Each is processed
independently and then written to the appropriate place in the destination queue.
The fork/join pattern is typically implemented with static partitioning of the data. That is, the serial
code will launch N threads and divide the dataset equally between the N threads. If each packet of data
takes the same time to process, then this works well. However, as the overall time to execute is the time
of the slowest thread, giving one thread too much work means it becomes the single factor determining
the total time.
FIGURE 2.2
A queue of data processed by N threads.


Systems such as OpenMP also have dynamic scheduling allocation, which mirrors the approach taken
by GPUs. Here a thread pool is created (a block pool for GPUs) and only once one task is completed is
more work allocated. Thus, if 1 task takes 10x time and 20 tasks take just 1x time each, they are allocated
only to free cores. With a dual-core CPU, core 1 gets the big 10x task and five of the smaller 1x tasks. Core
2 gets 15 of the smaller 1x tasks, and therefore both CPU core 1 and 2 complete around the same time.
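In OpenMP terms, the difference between the two policies is a single scheduling clause. The sketch below is illustrative only; do_task() is a placeholder for a unit of work of varying cost:

#include <omp.h>

void do_task(int i);    // placeholder for a unit of work of varying cost

void run_static(int num_tasks)
{
    // Static partitioning: the iterations are divided equally among the threads up front.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < num_tasks; i++)
        do_task(i);
}

void run_dynamic(int num_tasks)
{
    // Dynamic scheduling: a free thread fetches the next task from the pool,
    // so one long-running task does not leave the other cores idle.
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < num_tasks; i++)
        do_task(i);
}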
In this particular example, we’ve chosen to fork three threads, yet there are six data items in the
queue. Why not fork six threads? The reality is that in most problems there can actually be millions of
data items and attempting to fork a million threads will cause almost all OSs to fail in one way or another.
Typically an OS will apply a “fair” scheduling policy. Thus, each of the million threads would need
to be processed in turn by one of perhaps four available processor cores. Each thread also requires its
own memory space. In Windows a thread can come with a 1 MB stack allocation, meaning we’d
rapidly run out of memory prior to being able to fork enough threads.
Therefore on CPUs, typically programmers and many multithreaded libraries will use the number
of logical processor threads available as the number of processes to fork. As CPU threads are typically
also expensive to create and destroy, and also to limit maximum utilization, often a thread pool of
workers is used, with the workers then fetching work from a queue of possible tasks.
On GPUs we have the opposite problem, in that we in fact need thousands or tens of thousands of
threads. We have exactly the thread pool concept we find on more advanced CPU schedulers, except
it’s more like a block pool than a thread pool. The GPU has an upper limit on the number of concurrent
blocks it can execute. Each block contains a number of threads. Both the number of threads per block
and the overall number of concurrently running blocks vary by GPU generation.
The fork/join pattern is often used when there is an unknown amount of concurrency in a problem.
Traversing a tree structure or a path exploration type algorithm may spawn (fork) additional threads
when it encounters another node or path. When the path has been fully explored these threads may then
join back into the pool of threads or simply complete to be respawned later.
This pattern is not natively supported on a GPU, as it uses a fixed number of blocks/threads at
kernel launch time. Additional blocks cannot be launched by the kernel, only the host program. Thus,
such algorithms on the GPU side are typically implemented as a series of GPU kernel launches, each of
which needs to generate the next state. An alternative is to coordinate or signal the host and have it
launch additional, concurrent kernels. Neither solution works particularly well, as GPUs are designed
for a static amount of concurrency. Kepler introduces a concept, dynamic parallelism, which addresses
this issue. See Chapter 12 for more information on this.
Within a block of threads on a GPU there are a number of methods to communicate between
threads and to coordinate a certain amount of problem growth or varying levels of concurrency within
a kernel. For example, if you have an 8 × 8 matrix you may have many places where just 64 threads are
active. However, there may be others where 256 threads can be used. You can launch 256 threads and
leave most of them idle until such time as needed. Such idle threads occupy resources and may limit
the overall throughput, but do not consume any execution time on the GPU whilst idle. This allows the
use of shared memory, fast memory close to the processor, rather than creating a number of distinct
steps that need to be synchronized by using the much slower global memory and multiple kernel
launches. We look at memory types in Chapter 6.
Finally, the later-generation GPUs support fast atomic operations and synchronization primitives
that communicate data between threads in addition to simply synchronizing. We look at some
examples of this later in the text.

Tiling/grids
The approach CUDA uses with all problems is to require the programmer to break the problem into
smaller parts. Most parallel approaches make use of this concept in one way or another. Even in huge
supercomputers, problems such as climate models must be broken down into hundreds of thousands of
blocks, each of which is then allocated to one of the thousands of processing elements present in the
machine. This type of parallel decomposition has the huge advantage that it scales really well.
A GPU is in many ways similar to a symmetrical multiprocessor system on a single processor. Each
SM is a processor in its own right, capable of running multiple blocks of threads, typically 256 or
512 threads per block. A number of SMs exist on a single GPU and share a common global memory
space. Together, as a single GPU, they can operate at peak speeds of up to 3 teraflops (GTX 680).
While peak performance may be impressive, achieving anything like this is not possible without
specially crafted programs, as this peak performance does not include things such as memory access,
which is somewhat key to any real program. To achieve good performance on any platform requires
a good knowledge of the hardware and the understanding of two key concepts: concurrency and locality.
There is concurrency in many problems. It’s just that as someone who may come from a serial
background, you may not immediately see the concurrency in a problem. The tiling model is thus an
easy model to conceptualize. Imagine the problem in two dimensions, a flat arrangement of
data, and simply overlay a grid onto the problem space. For a three-dimensional problem, imagine the
problem as a Rubik's Cube: a set of blocks that map onto the problem space.
CUDA provides the simple two-dimensional grid model. For a significant number of problems this
is entirely sufficient. If you have a linear distribution of work within a single block, you have an ideal
decomposition into CUDA blocks. As we can assign up to sixteen blocks per SM and we can have up to
16 SMs (32 on some GPUs), any number of blocks of 256 or larger is fine. In practice, we’d like to limit
the number of elements within the block to 128, 256, or 512, so this in itself may drive much larger
numbers of blocks with a typical dataset.
When considering concurrency, consider also if there is ILP that can be exploited. Conceptually it’s
easier to think about a single thread being associated with a single output data item. If, however, we can
fill the GPU with threads on this basis and there is still more data that could be processed, can we still
improve the throughput? The answer is yes, but only through the use of ILP.
ILP exploits the fact that instruction streams can be pipelined within the processor. Thus, it is more
efficient to push four add operations into the queue, wait, and then collect them one at a time
(push-push-push-push-wait), rather than perform them one at a time (push-wait-push-wait-push-wait-push-wait). For most GPUs, you'll find an ILP level of four operations per thread works best. There are
some detailed studies and examples of this in Chapter 9. Thus, if possible we’d like to process
N elements per thread, but not to the extent that it reduces the overall number of active threads.
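A sketch of this (the kernel and parameter names are illustrative): each thread handles four elements spaced one grid-width apart, so the four independent operations can sit in the pipeline together. The kernel would be launched with roughly a quarter as many threads as there are elements:

// Each thread handles four elements spaced one grid-width apart, giving the
// pipeline four independent operations to overlap per thread.
__global__ void scale4(const float *in, float *out, float k, int n)
{
    int i      = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    if (i              < n) out[i]              = in[i]              * k;
    if (i + stride     < n) out[i + stride]     = in[i + stride]     * k;
    if (i + 2 * stride < n) out[i + 2 * stride] = in[i + 2 * stride] * k;
    if (i + 3 * stride < n) out[i + 3 * stride] = in[i + 3 * stride] * k;
}

// Host launch covering n elements with a quarter as many threads (illustrative):
// scale4<<<(n / 4 + 255) / 256, 256>>>(dev_in, dev_out, 2.0f, n);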

Divide and conquer
The divide-and-conquer pattern is also a pattern for breaking down large problems into smaller
sections, each of which can be conquered. Taken together these individual computations allow a much
larger problem to be solved.
Typically you see divide-and-conquer algorithms used with recursion. Quick sort is a classic
example of this. It recursively partitions the data into two sets, those above a pivot point and those below
the pivot point. When the partition finally consists of just two items, they are compared and swapped.

Most recursive algorithms can also be represented as an iterative model, which is usually somewhat
easier to map onto the GPU as it fits better into the primary tile-based decomposition model of
the GPU.
Recursive algorithms are also supported on Fermi-class GPUs, although as with the CPU you have
to be aware of the maximum call depth and translate this into stack usage. The available stack can be
queried with the API call cudaDeviceGetLimit(). It can also be set with the API call
cudaDeviceSetLimit(). Failure to allocate enough stack space, as with CPUs, will result in the
program failing. Some debugging tools such as Parallel Nsight and CUDA-GDB can detect such stack
overflow issues.
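A host-side check and adjustment might look like the following sketch; the 16 KB figure is purely illustrative and not a recommendation from this text:

#include <stdio.h>
#include <cuda_runtime.h>

void adjust_device_stack(void)
{
    size_t stack_size = 0;

    // Query the current per-thread stack allocation for device code.
    cudaDeviceGetLimit(&stack_size, cudaLimitStackSize);
    printf("Current device stack size: %zu bytes\n", stack_size);

    // Raise it if a deeply recursive kernel needs more; 16 KB here is arbitrary.
    cudaDeviceSetLimit(cudaLimitStackSize, 16 * 1024);
}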
In selecting a recursive algorithm be aware that you are making a tradeoff of development time
versus performance. It may be easier to conceptualize and therefore code a recursive algorithm than to
try to convert such an approach to an iterative one. However, each recursive call causes any formal
parameters to be pushed onto the stack along with any local variables. GPUs and CPUs implement
a stack in the same way, simply an area of memory from the global memory space. Although CPUs and
the Fermi-class GPUs cache this area, compared to passing values using registers, this is slow. Use
iterative solutions where possible as they will generally perform much better and run on a wider range
of GPU hardware.

CONCLUSION
We’ve looked here at a broad overview of some parallel processing concepts and how these are applied
to the GPU industry in particular. It’s not the purpose of this text to write a volume on parallel processing, for there are entire books devoted to just this subject. We want readers to have some feeling for
the issues that parallel programming bring to the table that would not otherwise be thought about in
a serial programming environment.
In subsequent chapters we cover some of these concepts in detail in terms of practical examples.
We also look at parallel prefix-sum, an algorithm that allows multiple writers of data to share
a common array without writing over one another's data. Such algorithms are never needed in serial
programming.
With parallelism comes a certain amount of complexity and the need for a programmer to think and
plan ahead to consider the key issues of concurrency and locality. Always keep these two key concepts
in mind when designing any software for the GPU.

CHAPTER 3

CUDA Hardware Overview

PC ARCHITECTURE
Let’s start by looking at the typical Core 2 architecture we still find today in many PCs and how it
impacts our usage of GPU accelerators (Figure 3.1).
Notice that all GPU devices are connected to the processor via the PCI-E bus. In this case we’ve
assumed a PCI-E 2.0 specification bus, which is currently the fastest bus available, giving a transfer
rate of 5 GB/s. PCI-E 3.0 has become available at the time of this writing and should significantly
improve the bandwidth available.
However, to get data from the processor, we need to go through the Northbridge device over the
slow FSB (front-side bus). The FSB can run anything up to 1600 MHz clock rate, although in many
designs it is much slower. This is typically only one-third of the clock rate of a fast processor.
Memory is also accessed through the Northbridge, and peripherals through the Northbridge and
Southbridge chipset. The Northbridge deals with all the high-speed components like memory, CPU,
PCI-E bus connections, etc. The Southbridge chip deals with the slower devices such as hard disks,
USB, keyboard, network connections, etc. Of course, it’s quite possible to connect a hard-disk
controller to the PCI-E connection, and in practice, this is the only true way of getting RAID high-speed data access on such a system.
PCI-E (Peripheral Component Interconnect Express) is an interesting bus as, unlike its
predecessor, PCI (Peripheral Component Interconnect), it’s based on guaranteed bandwidth. In the
old PCI system each component could use the full bandwidth of the bus, but only one device at
a time. Thus, the more cards you added, the less available bandwidth each card would receive. PCI-E
solved this problem by the introduction of PCI-E lanes. These are high-speed serial links that can be
combined together to form X1, X2, X4, X8, or X16 links. Most GPUs now use at least the PCI-E
2.0, X16 specification, as shown in Figure 3.1. With this setup, we have a 5 GB/s full-duplex bus,
meaning we get the same upload and download speed, at the same time. Thus, we can transfer 5 GB/
s to the card, while at the same time receiving 5 GB/s from the card. However, this does not mean
we can transfer 10 GB/s to the card if we’re not receiving any data (i.e., the bandwidth is not
cumulative).
In a typical supercomputer environment, or even in a desktop application, we are dealing with
a large dataset. A supercomputer may deal with petabytes of data. A desktop PC may be dealing with
as little as several gigabytes of high-definition video. In both cases, there is considerable data to fetch from the
attached peripherals. A single 100 MB/s hard disk will load 6 GB of data in one minute. At this rate it
takes over two and a half hours to read the entire contents of a standard 1 TB disk.

FIGURE 3.1
Typical Core 2 series layout.

If using MPI (Message Passing Interface), commonly used in clusters, the latency for this
arrangement can be considerable if the Ethernet connections are attached to the Southbridge instead
of the PCI-E bus. Consequently, dedicated high-speed interconnects like InfiniBand or 10 Gigabit
Ethernet cards are often used on the PCI-E bus. This removes slots otherwise available for GPUs.
Previously, as there was no direct GPU MPI interface, all communications in such a system were
routed over the PCI-E bus to the CPU and back again. The GPU-Direct technology, available in the
CUDA 4.0 SDK, solved this issue and it’s now possible for certain InfiniBand cards to talk directly to
the GPU without having to go through the CPU first. This update to the SDK also allows direct GPU
to GPU communication.
We saw a number of major changes with the advent of the Nehalem architecture. The main change
was to replace the Northbridge and the Southbridge chipset with the X58 chipset. The Nehalem
architecture brought us QPI (Quick Path Interconnect), which was actually a huge advance over the
FSB (Front Side Bus) approach and is similar to AMD’s HyperTransport. QPI is a high-speed interconnect that can be used to talk to other devices or CPUs. In a typical Nehalem system it will connect
to the memory subsystem, and through an X58 chipset, the PCI-E subsystem (Figure 3.2). The QPI
runs at either 4.8 GT/s or 6.4 GT/s in the Extreme/Xeon processor versions.
With the X58 and 1366 processor socket, a total of 36 PCI-E lanes are available, which means up to
two cards are supported at X16, or four cards at X8. Prior to the introduction of the LGA2011 socket,
this provided the best bandwidth solution for a GPU machine to date.
The X58 design is also available in a lesser P55 chipset where you get only 16 lanes. This means
one GPU card at X16, or two cards at X8.
From the I7/X58 chipset design, Intel moved onto the Sandybridge design, shown in Figure 3.3.
One of the most noticeable improvements was the support for the SATA-3 standard, which supports
600 MB/s transfer rates. This, combined with SSDs, allows for considerable input/output (I/O)
performance with loading and saving data.
The other major advance with the Sandybridge design was the introduction of the AVX (Advanced
Vector Extensions) instruction set, also supported by AMD processors. AVX allows for vector
instructions that provide up to four double-precision (256 bit/32 byte) wide vector operations. It’s
a very interesting development and something that can be used to considerably speed up compute-bound applications on the CPU.
Notice, however, the big downside of the socket 1155 Sandybridge design: it supports only 16 PCI-E
lanes, limiting the PCI-E bandwidth to 16 GB/s theoretical, 10 GB/s actual bandwidth. Intel has gone
down the route of integrating more and more into the CPU with their desktop processors. Only the
socket 2011 Sandybridge-E, the server offering, has a reasonable number of PCI-E lanes (40).
So how does AMD compare with the Intel designs? Unlike Intel, which has gradually moved away
from large numbers of PCI-E lanes, in all but their server line, AMD have remained fairly constant.
Their FX chipset provides for either two X16 devices or four X8 PCI-E devices. The AM3+ socket
paired with the 990FX chipset makes for a good workhorse, as it provides SATA 6 Gb/s ports paired
with up to four X16 PCI-E slots (usually running at X8 speed).
One major difference between Intel and AMD is the price point for the number of cores. If you
count only real processor cores and ignore logical (hyperthreaded) ones, for the same price point, you
typically get more cores on the AMD device. However, the cores on the Intel device tend to perform
better. Therefore, it depends a lot on the number of GPUs you need to support and the level of loading
of the given cores.

FIGURE 3.2
Nehalem/X58 system.

FIGURE 3.3
Sandybridge design.

As with the Intel design, you see similar levels of bandwidth around the system, with the exception
of bandwidth to main memory. Intel uses triple or quad channel memory on their top-end systems and
dual-channel memory on the lower-end systems. AMD uses only dual-channel memory, leading to
significantly less CPU host-memory bandwidth being available (Figure 3.4).
One significant advantage of the AMD chipsets over the Intel ones is the support for up to six SATA
(Serial ATA) 6 Gb/s ports. If you consider that the slowest component in any system usually limits the
overall throughput, this is something that needs some consideration. However, SATA3 can very
quickly overload the bandwidth of the Southbridge when using multiple SSDs (solid state drives). A PCI-E
bus solution may be a better one, but it obviously requires additional costs.

GPU HARDWARE
GPU hardware is radically different than CPU hardware. Figure 3.5 shows how a multi-GPU system
looks conceptually from the other side of the PCI-E bus.
Notice the GPU hardware consists of a number of key blocks:
• Memory (global, constant, shared)
• Streaming multiprocessors (SMs)
• Streaming processors (SPs)
The main thing to notice here is that a GPU is really an array of SMs, each of which has N cores (8
in G80 and GT200, 32–48 in Fermi, 192 in Kepler; see Figure 3.6). This is the key aspect that allows
scaling of the processor. A GPU device consists of one or more SMs. Add more SMs to the device and
you make the GPU able to process more tasks at the same time, or the same task quicker, if you have
enough parallelism in the task.
As with CPUs, if the programmer writes code that limits the processor usage to N cores, let's say dual-core, then when the CPU manufacturers bring out a quad-core device, the user sees no benefit. This is
exactly what happened in the transition from dual- to quad-core CPUs, and lots of software then had to
be rewritten to take advantage of the additional cores. NVIDIA hardware will increase in performance
by growing a combination of the number of SMs and number of cores per SM. When designing
software, be aware that the next generation may double the number of either.
Now let’s take a closer look at the SMs themselves. There are number of key components making
up each SM, however, not all are shown here for reasons of simplicity. The most significant part is that
there are multiple SPs in each SM. There are 8 SPs shown here; in Fermi this grows to 32–48 SPs and
in Kepler to 192. There is no reason to think the next hardware revision will not continue to increase
the number of SPs/SMs.
Each SM has access to something called a register file, which is much like a chunk of memory that
runs at the same speed as the SP units, so there is effectively zero wait time on this memory. The size of
this memory varies from generation to generation. It is used for storing the registers in use within the
threads running on an SP. There is also a shared memory block accessible only to the individual SM;
this can be used as a program-managed cache. Unlike a CPU cache, there is no hardware evicting
cache data behind your back; it's entirely under programmer control.
Each SM has a separate bus into the texture memory, constant memory, and global memory
spaces. Texture memory is a special view onto the global memory, which is useful for data where

FIGURE 3.4
AMD.

FIGURE 3.5
Block diagram of a GPU (G80/GT200) card.

FIGURE 3.6
Inside an SM.

there is interpolation, for example, with 2D or 3D lookup tables. It has a special feature of
hardware-based interpolation. Constant memory is used for read-only data and is cached on all
hardware revisions. Like texture memory, constant memory is simply a view into the main global
memory.
Global memory is supplied via GDDR (Graphic Double Data Rate) on the graphics card. This is
a high-performance version of DDR (Double Data Rate) memory. Memory bus width can be up to 512 bits
wide, giving a bandwidth of 5 to 10 times more than found on CPUs, up to 190 GB/s with the Fermi
hardware.
Each SM also has two or more special-purpose units (SPUs), which perform special hardware
instructions, such as the high-speed 24-bit sin/cosine/exponent operations. Double-precision units are
also present on GT200 and Fermi hardware.


CPUS AND GPUS
Now that you have some idea what the GPU hardware looks like, you might say that this is all very
interesting, but what does it mean for us in terms of programming?
Anyone who has ever worked on a large project will know it’s typically partitioned into sections
and allocated to specific groups. There may be a specification group, a design group, a coding group,
and a testing group. There are absolutely huge benefits to having people in each team who understand
completely the job of the person before and after them in the chain of development.
Take, for example, testing. If the designer did not consider testing, he or she would not
have included any means to test, in software, specific hardware failures. If the test team could only test
hardware failure by having the hardware fail, it would have to physically modify hardware to cause
such failures. This is hard. It’s much easier for the software people to design a flag that inverts the
hardware-based error flag in software, thus allowing the failure functionality to be tested easily.
Working on the testing team you might see how hard it is to do it any other way, but with a blinkered
view of your discipline, you might say that testing is not your role.
Some of the best engineers are those with a view of the processes before and after them. As
software people, it’s always good to know how the hardware actually works. For serial code execution,
it may be interesting to know how things work, but usually not essential. The vast majority of
developers have never taken a computer architecture course or read a book on it, which is a great
shame. It’s one of the main reasons we see such inefficient software written these days. I grew up
learning BASIC at age 11, and was programming Z80 assembly language at 14, but it was only during
my university days that I really started to understand computer architecture to any great depth.
Working in an embedded field gives you a very hands-on approach to hardware. There is no nice
Windows operating system to set up the processor for you. Programming is a very low-level affair. With
embedded applications, there are typically millions of boxes shipped. Sloppy code means poor use of the
CPU and available memory, which could translate into needing a faster CPU or more memory. An
additional 50 cent cost on a million boxes is half a million dollars. This translates into a lot of design and
programming hours, so clearly it’s more cost effective to write better code than buy additional hardware.
Parallel programming, even today, is very much tied to the hardware. If you just want to write code
and don’t care about performance, parallel programming is actually quite easy. To really get performance out of the hardware, you need to understand how it works. Most people can drive a car safely
and slowly in first gear, but if you are unaware that there are other gears, or do not have the knowledge
to engage them, you will never get from point A to point B very quickly. Learning about the hardware
is a little like learning to change gear in a car with a manual gearbox: a little tricky at first, but
something that comes naturally after awhile. By the same analogy, you can also buy a car with an
automatic gearbox, akin to using a library already coded by someone who understands the low-level
mechanics of the hardware. However, doing this without understanding the basics of how it works will
often lead to a suboptimal implementation.

COMPUTE LEVELS
CUDA supports a number of compute levels. The original G80 series graphics cards shipped with the
first version of CUDA. The compute capability is fixed into the hardware. To upgrade to a newer


version, users had to upgrade their hardware. Although this might sound like NVIDIA trying to force
users to buy more cards, it in fact brings many benefits. When upgrading a compute level, you can
often move from an older platform to a newer one, usually doubling the compute capacity of the card
for a similar price to the original card. Given that NVIDIA typically brings out a new platform at least
every couple of years, we have seen to date a huge increase in available compute power over the few
years CUDA has been available.
A full list of the differences between each compute level can be found in the NVIDIA CUDA
Programming Guide, Appendix G, which is shipped as part of the CUDA SDK. Therefore, we will
only cover the major differences found at each compute level, that is, what you need to know as
a developer.

Compute 1.0
Compute level 1.0 is found on the older graphics cards, for example, the original 8800 Ultras and many
of the 8000 series cards as well as the Tesla C/D/S870s. The main features lacking in compute 1.0
cards are those for atomic operations. Atomic operations are those where we can guarantee a complete
operation without any other thread interrupting. In effect, the hardware implements a barrier point at
the entry of the atomic function and guarantees the completion of the operation (add, sub, min, max,
logical and, or, xor, etc.) as one operation. Compute 1.0 cards are effectively now obsolete, so this
restriction, for all intents and purposes, can be ignored.
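As an illustration of what an atomic operation provides (the kernel below is an invented example, and global-memory integer atomics of this kind require compute 1.1 or later), many threads can update a single counter without corrupting it:

// Many threads update one counter without interfering with one another; the
// read-modify-write is performed as a single, uninterruptible operation.
__global__ void count_positive(const int *data, int *counter, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0)
        atomicAdd(counter, 1);
}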

Compute 1.1
Compute level 1.1 is found in many of the later shipping 9000 series cards, such as the 9800 GTX,
which were extremely popular. These are based on the G92 hardware as opposed to the G80 hardware
of compute 1.0 devices.
One major change brought in with compute 1.1 devices was support, on many but not all devices,
for overlapped data transfer and kernel execution. The SDK call to cudaGetDeviceProperties()
returns the deviceOverlap property, which defines if this functionality is available. This allows
for a very nice and important optimization called double buffering, which works as shown in
Figure 3.7.
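A simple host-side check for this capability might look like the following sketch:

#include <stdio.h>
#include <cuda_runtime.h>

int supports_overlap(int device)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    // deviceOverlap is nonzero when the card can copy data across PCI-E while
    // a kernel is running, which is what the double-buffering scheme relies on.
    printf("%s: copy/kernel overlap %s supported\n", prop.name,
           prop.deviceOverlap ? "is" : "is NOT");
    return prop.deviceOverlap;
}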
To use this method we require double the memory space we’d normally use, which may well be an
issue if your target market only had a 512 MB card. However, with Tesla cards, used mainly for
scientific computing, you can have up to 6 GB of GPU memory, which makes such techniques very
useful. Let’s look at what happens:
Cycle 0: Having allocated two areas of memory in the GPU memory space, the CPU fills the first
buffer.
Cycle 1: The CPU then invokes a CUDA kernel (a GPU task) on the GPU, which returns
immediately to the CPU (a nonblocking call). The CPU then fetches the next data packet, from
a disk, the network, or wherever. Meanwhile, the GPU is processing away in the background on
the data packet provided. When the CPU is ready, it starts filling the other buffer.
Cycle 2: When the CPU is done filling the buffer, it invokes a kernel to process buffer 1. It then
checks if the kernel from cycle 1, which was processing buffer 0, has completed. If not, it waits

FIGURE 3.7
Double buffering with a single GPU.

until this kernel has finished, then fetches the data from buffer 0 and loads the next data
block into the same buffer. During this time the kernel kicked off at the start of the cycle is
processing data on the GPU in buffer 1.
Cycle N: We then repeat cycle 2, alternating between which buffer we read and write to on the CPU
with the buffer being processed on the GPU.
GPU-to-CPU and CPU-to-GPU transfers are made over the relatively slow (5 GB/s) PCI-E bus and
this dual-buffering method largely hides this latency and keeps both the CPU and GPU busy.
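The sketch below shows one way the cycle might be coded with the CUDA runtime; the packet size, the kernel, and the fill/consume functions are placeholders for the application's real work:

#include <cuda_runtime.h>

#define PACKET_ELEMS (1 << 20)

__global__ void process_kernel(float *data, int n);    // placeholder kernel
void fill_host_buffer(float *buf, int n);               // placeholder: CPU fetches the next packet
void consume_results(const float *buf, int n);          // placeholder: CPU uses returned data

void double_buffer_loop(int num_packets)
{
    float *h_buf[2], *d_buf[2];
    cudaStream_t stream[2];

    for (int b = 0; b < 2; b++) {
        cudaMallocHost(&h_buf[b], PACKET_ELEMS * sizeof(float));   // pinned host memory
        cudaMalloc(&d_buf[b], PACKET_ELEMS * sizeof(float));
        cudaStreamCreate(&stream[b]);
    }

    for (int packet = 0; packet < num_packets; packet++) {
        int b = packet & 1;                        // alternate between the two buffers

        cudaStreamSynchronize(stream[b]);          // wait until this buffer's previous work is done
        if (packet >= 2)
            consume_results(h_buf[b], PACKET_ELEMS);
        fill_host_buffer(h_buf[b], PACKET_ELEMS);  // CPU prepares the next packet

        cudaMemcpyAsync(d_buf[b], h_buf[b], PACKET_ELEMS * sizeof(float),
                        cudaMemcpyHostToDevice, stream[b]);
        process_kernel<<<PACKET_ELEMS / 256, 256, 0, stream[b]>>>(d_buf[b], PACKET_ELEMS);
        cudaMemcpyAsync(h_buf[b], d_buf[b], PACKET_ELEMS * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[b]);
        // The host immediately loops round to fill the other buffer while this
        // packet is copied and processed in the background.
    }
    cudaDeviceSynchronize();                       // drain both streams at the end
}

Notice that the host only waits on the buffer it is about to reuse, so filling one buffer proceeds while the other is being copied and processed.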

Compute 1.2
Compute 1.2 devices appeared with the low-end GT200 series hardware. These were the initial
GTX260 and GTX280 cards. With the GT200 series hardware, NVIDIA approximately doubled the
number of CUDA core processors on a single card, through doubling the number of multiprocessors
present on the card. We’ll cover CUDA cores and multiprocessors later. In effect, this doubled the
performance of the cards compared to the G80/G92 range before them.
Along with doubling the number of multiprocessors, NVIDIA increased the number of concurrent
warps a multiprocessor could execute from 24 to 32. Warps are blocks of code that execute within
a multiprocessor, and increasing the amount of available warps per multiprocessor gives us more scope
to get better performance, which we’ll look at later.
Issues with restrictions on coalesced access to the global memory and bank conflicts in the shared
memory found in compute 1.0 and compute 1.1 devices were greatly reduced. This made the GT200
series hardware far easier to program and it greatly improved the performance of many previous,
poorly written CUDA programs.

Compute 1.3
The compute 1.3 devices were introduced with the move from GT200 to the GT200 a/b revisions of the
hardware. This followed shortly after the initial release of the GT200 series. Almost all higher-end
cards from this era were compute 1.3 compatible.
The major change that occurs with compute 1.3 hardware is the introduction of support for
limited double-precision calculations. GPUs are primarily aimed at graphics and here there is
a huge need for fast single-precision calculations, but limited need for double-precision ones.
Typically, you see an order of magnitude drop in performance using double-precision as opposed
to single-precision floating-point operations, so time should be taken to see if there is any way
single-precision arithmetic can be used to get the most out of this hardware. In many cases,
a mixture of single and double-precision operations can be used, which is ideal since it exploits
both the dedicated single-precision and double-precision hardware present.

Compute 2.0
Compute 2.0 devices saw the switch to Fermi hardware. The original guide for tuning applications for
the Fermi architecture can be found on the NVIDIA website at http://developer.nvidia.com/cuda/
nvidia-gpu-computing-documentation.


Some of the main changes in compute 2.x hardware are as follows:
• Introduction of 16 K to 48 K of L1 cache memory on each SM.
• Introduction of a shared L2 cache for all SMs.
• Support in Tesla-based devices for ECC (Error Correcting Code)-based memory checking and error
correction.
• Support in Tesla-based devices for dual-copy engines.
• Extension in size of the shared memory from 16 K per SM up to 48 K per SM.
• For optimum coalescing of data, it must be 128-byte aligned.
• The number of shared memory banks increased from 16 to 32.
Let’s look at the implications of some of these changes in detail. First, let’s pick up on the
introduction of the L1 cache and what this means. An L1 (level one) cache is a cache present on
a device and is the fastest cache type available. Compute 1.x hardware has no cache, except for the
texture and constant memory caches. The introduction of a cache makes it much easier for many
programmers to write programs that work well on GPU hardware. It also allows for applications that
do not follow a known memory pattern at compile time. However, to exploit the cache, the application
either needs to have a sequential memory pattern or have at least some data reuse.
The L2 cache is up to 768 K in size on Fermi and, importantly, is a unified cache, meaning it is
shared and provides a consistent view for all the SMs. This allows for much faster interblock
communication through global atomic operations. Compared to having to go out to the global memory
on the GPU, using the shared cache is an order of magnitude faster.
Support for ECC memory is a must for data centers. ECC memory provides for automatic error
detection and correction. Electrical devices emit small amounts of radiation. When in close proximity
to other devices, this radiation can change the contents of memory cells in the other device. Although
the probability of this happening is tiny, as you increase the exposure of the equipment by densely
packing it into data centers, the probability of something going wrong rises to an unacceptable level.
ECC, therefore, detects and corrects single-bit upset conditions that you may find in large data centers.
This reduces the amount of available RAM and negatively impacts memory bandwidth. Because this is
a major drawback on graphics cards, ECC is only available on Tesla products.
Dual-copy engines allow you to extend the dual-buffer example we looked at earlier to use multiple
streams. Streams are a concept we’ll look at in detail later, but basically, they allow for N independent
kernels to be executed in a pipeline fashion as shown in Figure 3.8.

FIGURE 3.8
Stream pipelining (copy to device, kernel, copy from device repeated across streams 0, 1, and 2).

Notice how the kernel sections run one after another in the figure. The copy operations are hidden
by the execution of a kernel on another stream. The kernels and the copy engines execute concurrently,
thus making the most use of the relevant units.
Note that the dual-copy engines are physically available on almost all the top-end Fermi GPUs,
such as the GTX480 or GTX580 device. However, only the Tesla cards make both engines visible to
the CUDA driver.
Shared memory also changed drastically, in that it was transformed into a combined shared memory/L1
cache of 64 K per SM. However, to preserve backward compatibility, a minimum of 16 K must be
allocated to the shared memory, meaning the L1 cache is really only 48 K in size. Using a switch,
the shared memory and L1 cache allocations can be swapped, giving 48 K of shared memory and 16 K of L1
cache. Going from 16 K of shared memory to 48 K of shared memory is a huge benefit for certain
programs.
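As a hedged illustration of how this switch might be made in code (a sketch only; the kernel here is a placeholder for one of your own, and the calls express a preference that the runtime may or may not honor for a given kernel), the CUDA runtime lets you request the split either for the whole device or per kernel:

#include <cuda_runtime.h>

__global__ void my_kernel(float *data)
{
  data[threadIdx.x] += 1.0f;               /* placeholder workload */
}

void configure_cache(void)
{
  /* Prefer 48 K of shared memory and 16 K of L1 for the whole device... */
  cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);

  /* ...or express the opposite preference for one kernel only. */
  cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);
}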
Alignment requirements for optimal use became more strict than in previous generations, due to the
introduction of the L1 and L2 cache. Both use a cache line size of 128 bytes. A cache line is the
minimum amount of data the memory can fetch. Thus, if your program fetches subsequent elements of
the data, this works really well. This is typically what most CUDA programs do, with groups of threads
fetching adjacent memory addresses. The one requirement that comes out of this change is to have
128-byte alignment of the dataset.
However, if your program has a sparse and distributed memory pattern per thread, you need to
disable this feature and switch to the 32-byte mode of cache operation.
Finally, one of the last major changes we'll pick up on is the increase in the number of shared memory banks from
16 to 32. This is a major benefit over the previous generations. It allows each thread of the current
warp (32 threads) to write to exactly one 32-bit-wide bank in the shared memory without causing a shared
memory bank conflict.
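As a hedged illustration (a contrived kernel, not taken from the text, intended to be launched with a single warp of 32 threads), consecutive 32-bit words in shared memory fall into consecutive banks, so a warp in which each thread touches element threadIdx.x generates no bank conflicts:

__global__ void bank_friendly(const int *in, int *out)
{
  __shared__ int tile[32];

  /* Thread i writes word i: 32 threads, 32 different banks, no conflict. */
  tile[threadIdx.x] = in[threadIdx.x];
  __syncthreads();

  /* The reversed read also maps each thread to a distinct bank. */
  out[threadIdx.x] = tile[31 - threadIdx.x];
}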

Compute 2.1
Compute 2.1 is seen on certain devices aimed specifically at the games market, such as the GTX460
and GTX560. These devices change the architecture of the device as follows:
• 48 CUDA cores per SM instead of the usual 32 per SM.
• Eight single-precision, special-function units for transcendental operations per SM instead of the usual four.
• Dual-warp dispatcher instead of the usual single-warp dispatcher.
The x60 series cards have always had a very high penetration into the midrange games market, so if
your application is targeted at the consumer market, it is important to be aware of the implication of
these changes.
Noticeably different on the compute 2.1 hardware is the sacrifice of double-precision hardware to
increase the number of CUDA cores. For single-precision and integer calculation–dominated kernels,
this is a good tradeoff. Most games make little use of double-precision floating-point data, but
significant use of single-precision floating-point and integer math.
Warps, which we will cover in detail later, are groups of threads. On compute 2.0 hardware, the
single-warp dispatcher takes two clock cycles to dispatch instructions of an entire warp. On compute
2.1 hardware, instead of the usual two instruction dispatchers per two clock cycles, we now have four.
In the hardware, there are three banks of 16 CUDA cores, 48 CUDA cores in total, instead of the usual
two banks of 16 CUDA cores. If NVIDIA could have just squeezed in another set of 16 CUDA cores,
you'd have an ideal solution. Maybe we'll see this in future hardware.
The compute 2.1 hardware is actually a superscalar approach, similar to what is found on CPUs
from the original Pentium CPU onwards. To make use of all the cores, the hardware needs to identify
instruction-level parallelism (ILP) within a single thread. This is a significant divergence from the
universal thread-level parallelism (TLP) approach recommended in the past. For ILP to be present
there need to be instructions that are independent of one another. One of the easiest ways to do this is
via the special vector class covered later in the book.
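As a hedged illustration of the ILP idea (a hand-unrolled sketch rather than the vector class the text refers to; the kernel and its names are illustrative only), a kernel can give the dispatcher independent instructions simply by having each thread work on several unrelated elements:

__global__ void scale_ilp4(float *data, const float k, const int n)
{
  /* Each thread owns four consecutive elements. */
  const int idx = (blockIdx.x * blockDim.x + threadIdx.x) * 4;

  if (idx + 3 < n)
  {
    /* These four multiplies have no dependencies on one another, so the
       hardware is free to issue them to the additional cores. */
    data[idx + 0] *= k;
    data[idx + 1] *= k;
    data[idx + 2] *= k;
    data[idx + 3] *= k;
  }
}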
Performance of compute 2.1 hardware varies. Some well-known applications like Folding at Home
perform really well with the compute 2.1 hardware. Other applications such as video encoding
packages, where it’s harder to extract ILP and memory bandwidth is a key factor, typically perform
much worse.
The final details of Kepler and the new compute 3.0 platform were, at the time of writing, still
largely unreleased. A discussion of the Kepler features already announced can be found in Chapter 12,
under ‘Developing for Future GPUs’.

CHAPTER 4

Setting Up CUDA

INTRODUCTION
This chapter is here for anyone who is completely new to CUDA. We look at how to install CUDA on
the various OSs, what tools you can use, and how CUDA compiles. Finally, we look at how to have the
API help you identify the coding and API errors everyone makes.
CUDA is supported on three major OSs: Windows, Mac, and Linux. By far the easiest platform
to use and learn CUDA with is the OS you are most familiar with using for programming
development. For an absolute beginner, the Windows OS in conjunction with Microsoft Visual
Cþþ is likely to be the best choice. Both the Windows and Mac installations are fairly much point
and click. Both provide fairly standard integrated development environments that work well with
CUDA.

INSTALLING THE SDK UNDER WINDOWS
To install CUDA onto a PC running Windows, you’ll need to download the following components
from the NVIDIA developer portal at http://developer.nvidia.com/cuda-toolkit-41. Note that by the time
this book went to press, release 5 of the toolkit was in its release candidate phase. Please check the
NVIDIA website for the latest version.
You will need an already installed version of Microsoft Visual Studio 2005, 2008, or 2010. The first
step is to download and install the latest set of NVIDIA development drivers for your relevant
operating system from the previous link. Then you will need either the 32- or 64-bit version of the
CUDA toolkit and GPU computing and SDK code samples. Make sure you pick the correct version for
your OS. Install them in this order:
1. NVIDIA development drivers
2. CUDA toolkit
3. CUDA SDK
4. GPU computing SDK
5. Parallel Nsight debugger


FIGURE 4.1
“Folder Options” to see hidden files.

Under Windows 7, the SDK installs all of its files into “ProgramData,” which is a hidden directory
of the C drive. To view the files you either need to always go via the CUDA SDK icon created on the
desktop or go to “Folder Options” in Windows and tell it to show hidden files (Figure 4.1).

VISUAL STUDIO
CUDA supports Visual Studio versions from 2005 to 2010 including, for the most part, the express
versions. The express versions are available free of charge from Microsoft. The professional versions
are also available to registered students free of charge via the DreamSpark program at https://www.dreamspark.com.
To register, all you need to do is supply your university or college details and identification
numbers, and you can then download Visual Studio and many other programming tools. The program
is not restricted to U.S.-based academic institutions; it is available to students
worldwide.
On the whole, Visual Studio 2008 has the best support for CUDA and compiles somewhat quicker
than Visual Studio 2010. Visual Studio 2010 has, however, one very useful feature, which is automatic
syntax checking of source code. Thus, if you use a type that is not defined, it underlines the error in red,
just as Microsoft Word underlines spelling errors. This is an incredibly useful feature as it saves a lot of
unnecessary compilation cycles for obvious issues. Thus, I’d recommend the 2010 version, especially
if you can download it for free from DreamSpark.

Projects
One quick way of creating a project is to take one of the SDK examples, remove all the unnecessary
project files, and insert your own source files. Note your CUDA source code should have a “.cu”
extension so that it will be compiled by the NVIDIA compiler instead of Visual C. However, as we see
later, you can also simply create a basic project framework using the project template wizard.

64-bit users
When using Windows 64-bit version, be aware that some of the project files are set up to run as 32-bit
applications by default. Thus, when you try to build them you may get the error message: Fatal Error
LNK1181: cannot open input file ‘cutil32D.lib’.
This was not installed, as you most likely installed only the 64-bit version of the SDK along with
the 64-bit version of Windows. To correct this issue, all we have to do is change the target from 32 bits to 64
bits, which we do using the Build menu in Visual Studio, changing the platform to x64 as
shown in Figure 4.2.

FIGURE 4.2
Visual C platform selection.


You may be prompted at the point you initiate a rebuild to save the project. Just add “_X86” to the
end of the project name and save. The project will then build under a 64-bit environment and link in the
correct library files.
You may also find an issue with a missing library, such as “cutil32.lib,” for example. When the SDK
is installed, it sets an environment variable, $(CUDA_LIB_PATH). This is usually set to: C:\Program
Files\NVIDIA GPU Computing Toolkit\CUDA\v4.1\lib\X64.
You may find the path setup in the default project files may not have $(CUDA_LIB_PATH) as one of
the entries. To add it, click on the project and then select “Project/Properties.” This brings up the
dialog box shown in Figure 4.3.
Clicking on the “...” button on the far right brings up a dialog where you can add the library path
(Figure 4.4). Simply add “$(CUDA_LIB_PATH)” as a new line and the project should now link.
If you wish to build both 64-bit CUDA applications and 32-bit CUDA applications, both the 32- and
64-bit CUDA toolkits need to be installed. The samples from the SDK also require both the 32- and 64-bit
versions of the SDK to be installed to be able to build both 32- and 64-bit versions of the samples.
You can build the necessary libraries by going to the following directories and building the
solution files:
C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\common
C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\shared

FIGURE 4.3
Additional library path.


FIGURE 4.4
Adding library directories.

You will find the necessary libraries in
C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\common\lib\X64.
You can also add these manually to any project that is missing them. Unfortunately, the SDK
samples are not set up to automatically build the necessary libraries when needed. The
binaries for the libraries are also not supplied, which makes actually building the SDK samples
a little frustrating.

Creating projects
To create a new CUDA-enabled application, simply create a CUDA application using the
“File/New/Project Wizard” as shown in Figure 4.5. The wizard will then create a single project
containing the file “kernel.cu,” which contains a mix of code, some of which executes on the CPU
and some of which executes on the GPU. The GPU code is contained in the function addKernel.
This function simply takes a pointer to a destination array, c, and a couple of pointers to two input
arrays, a and b. It then adds the contents of the a and b arrays together and stores the result in the
destination array, c. It’s a very simple example of the framework needed to execute a CUDA
program.
Also included is the basic code to copy data to a device, invoke the kernel, and copy data back from
the device to the host. It’s a very useful starter project to get you compiling something under CUDA.
We cover the standard framework needed to get a CUDA program working later in the text. It’s useful
to look at the code and try to understand it if you can. However, don’t worry at this stage if it doesn’t
make sense as we’ll build gradually on how to write programs for CUDA.


FIGURE 4.5
CUDA Project Wizard.

LINUX
CUDA is supported for the following Linux distributions. The supported versions will vary depending
on which version of the CUDA toolkit you are installing.
• Fedora 14
• Redhat 6.0 and 5.5/CentOS 6.2 (the free version of Redhat)
• Ubuntu 11.04
• OpenSUSE 11.2
The first step in installing CUDA on a Linux platform is to make sure you have the latest set of kernel
software. Use the following command from a terminal window to do this:
sudo yum update
The sudo command will run the update as the administrator. The yum command is the standard
installation tool for RPM-based Linux distributions. You are simply asking it to check for all installed
packages and see if any updates are available. This ensures your system is fully up to date before
installing any drivers. Many of the GUI-based installations also have GUI-based versions of the
software updates that replace the older command line update interface.
Once the kernel has been updated to the latest level, run the following command:

sudo yum install gcc-c++ kernel-devel


This will install the standard GNU C++ environment as well as the kernel source you'll need to
rebuild the kernel. Be aware that package names are case-sensitive. This will prompt you for around
a 21 MB download and take a couple of minutes to install. Again, if you prefer, you can install the
package via the GUI software installer for the particular OS.
Finally, as you are likely to be drawing some graphical output, you’ll need an OpenGL development environment. Install this with the following command:
sudo yum install freeglut-devel libXi-devel libXmu-devel

Now you’re ready to install the CUDA drivers. Make sure you install at least version 4.1 of the CUDA
toolkit. There are a number of ways to install the updated NVIDIA drivers. NVIDIA does not release the
source code to the drivers, so by default most Linux distributions install a very basic graphics driver.

Kernel base driver installation (CentOS, Ubuntu 10.4)
The CUDA releases should be used with a specific set of development drivers. Installing drivers by
methods other than the one listed here may result in CUDA not working. Note the versions of the OS
supported for the given version of the CUDA toolkit. These may not be the latest version of the
particular Linux distribution. Using a later distribution will likely not work. Thus, the first installation
step is to replace any existing drivers with the version specified for your specific Linux distribution.
See Figure 4.6.
Once the download is complete, you need to boot Linux in text-only mode. Unlike Windows, which
is always in graphics mode, text mode is required to install the drivers under Linux. You can make the
system boot into text on most distributions using the following command from a Terminal window
(usually under the Systems menu in the GUI):
sudo init 3

This will reboot the Linux machine and bring it back up in text mode. You can use sudo init 5 to
restore the graphics mode later.
If you get an error such as “User <your_user_name> is not in sudoers file,” log in as root using the su
command. Edit the “/etc/sudoers” file and append the following line:
your_user_name ALL=(ALL) ALL

Be careful to replace your_user_name with your login name.
Certain distributions (e.g., Ubuntu) insist on booting to the GUI, regardless of the init mode. One
method of resolving this is as follows, from a terminal window. Edit the grub startup file:
sudo chmod +w /etc/default/grub
sudo nano /etc/default/grub

Change the following lines:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
GRUB_CMDLINE_LINUX_DEFAULT=""

to
# GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
GRUB_CMDLINE_LINUX_DEFAULT="text"


FIGURE 4.6
Supported Linux downloads and supported driver versions as of September 2012.

Now update grub using
sudo update-grub

Finally, reboot your machine and it should come up in text-only mode. Use the original lines to boot
to the GUI again once the drivers are installed.
Now navigate to the area where you stored the “.run” file you downloaded from the NVIDIA website.
Then type
sudo sh NVIDIA-Linux-x86_64-285.05.33.run


The exact version of the driver you download will of course be different. You will be asked to
agree to the NVIDIA license and will then have to wait a few minutes while everything installs.
During this process the installer will attempt to replace the default Nouveau driver with the necessary
NVIDIA drivers. If asked if you want to do this, select “Yes.” This is an error-prone process and not
every distribution works out of the box. If the NVIDIA installer is unable to remove the Nouveau
driver then it may be necessary to blacklist the driver so the NVIDIA installer can install the correct
drivers.
When you have the NVIDIA drivers installed correctly, type
sudo init 5

The machine will then reboot into the regular graphics mode. See earlier for Ubuntu.
The next task is to install the toolkit. There are a number available; select Fedora, Red Hat,
Ubuntu, OpenSUSE, or SUSE depending on your distribution. As before, simply navigate to where
you installed the SDK and run it by typing
sudo sh <sdk_version>.run

where <sdk_version>.run is the file you downloaded. It will then install all the tools needed and print
a message saying the installation was successful. It then mentions you have to update the PATH and
LD_LIBRARY_PATH environment variables, which you have to do by hand. To do this, you need to edit
the “/etc/profile” startup file. Add the following lines:
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH

Note that the file has to be writable. Use the “sudo chmod +w /etc/profile” to make it writable if
required. You can edit this file with your favorite editor using a command such as “sudo nano /etc/profile”.
Now log out and log back in again and type
env

This will list all of the current environment variable settings. Check for the two new entries you just
added. CUDA is now installed into the “/usr/local/cuda” directory.
Next we'll need the GNU C++ compiler. Install the package “g++” from whatever software
installer you are using on your system.
The next step is to install the SDK sample codes, so we have something to build and test. Download
these from the NVIDIA site and run them, again using the sh sdk_version.run command (replace
sdk_version with the actual one you download). Do not run this install as root as you will otherwise
have to be logged in as root to build any of the samples.
By default the SDK will install to a subdirectory of your user account area. It may complain it can’t
find the CUDA installation and will use the default directory (the same one CUDA was installed to
earlier). You can safely ignore this message.
Once the GPU computing SDK is installed, you then need to go to the “Common” subdirectory and
run make to create a set of libraries.
Once this is done the SDK samples should build, allowing you to execute your first CUDA program
in Linux and of course see if the driver is working correctly.


MAC
The Macintosh version is available, as with the other versions, from http://developer.nvidia.com/cuda-toolkit-41. Simply download and install the packages in the following order:
• Development drivers
• CUDA toolkit
• CUDA tools SDK and code samples
CUDA 4.1 requires Mac OS release 10.6.8 (Snow Leopard) or later. The latest release (10.7.x) or
Lion release is available as a download from the Apple store or via a separate purchase from
Apple.
The SDK installs into the “GPU Computing” directory under the “Developer” higher-level
directory. Simply browse the “Developer/GPU Computing/C/bin/darwin/release” directory and you
will find precompiled executables. Running the deviceQuery tool is useful to verify you have correctly
installed the drivers and runtime environment.
To compile the samples, you will need XCode installed. This is the equivalent of GCC (GNU C
Compiler) for the Mac. XCode can be downloaded from the Apple store. It’s not a free product, but is
available free of charge to anyone on the Apple Developer program, which includes both development
of Macintosh and iPhone/iPad applications. It was also released shortly after the Lion OS as a free
download for Lion OS owners.
Once XCode is installed, simply open a terminal window. To do this, go to Finder, open Utilities,
and then double-click on the Terminal window. Type the following:
cd /Developer/'GPU Computing/C/src/project'
make -i

Replace project with the name of the particular SDK application you wish to compile. If you
receive compilation errors, you have either not downloaded the XCode package or have an older
version than is required.

INSTALLING A DEBUGGER
CUDA provides a debug environment called Parallel Nsight on the Windows platform. This provides
support for debugging CPU and GPU code and highlights areas where things are working less than
efficiently. It also helps tremendously when trying to debug multithreaded applications.
Nsight is completely free and is a hugely useful tool. All it requires is that you register as a CUDA-registered developer, which is again entirely free. Once registered, you will be able to download the
tool from the NVIDIA website.
Note that you must have Visual Studio 2008 or later (not the express version) and you must have
installed Service Pack 1. There is a link within the release notes of Nsight to the SP1 download you
need to install.
Parallel Nsight comes as two parts, an application that integrates itself into Visual Studio as
shown in Figure 4.7, and a separate monitoring application. The monitoring application works in
conjunction with the main application. The monitor is usually resident, but does not have to be, on


FIGURE 4.7
Nsight integrated into Microsoft Visual Studio.

the same machine as the Visual Studio environment. Parallel Nsight works best with two CUDA
capable GPUs, a dedicated GPU to run the code on and one to use as the regular display. Thus, the
GPU running the target code cannot be used to run a second display. As most GPU cards have
dual-monitor outputs, you can simply run two monitors off the display card should you have
a dual-monitor setup. Note in the latest release, 2.2, the need for two GPUs was dropped.
It’s also possible to set up the tool to acquire data from a remote GPU. However, in most cases it’s
easier to buy a low-end GPU and install it into your PC or workstation. The first step needed to set
up Parallel Nsight on Windows is to disable TDR (Figure 4.8). TDR (Timeout Detection and
Recovery) is a mechanism in Windows that detects crashes in the driver-level code. If the driver stops
responding to events, Windows resets the driver. As the driver will halt when you hit a breakpoint in the GPU code,
this feature needs to be disabled.
To set the value, simply run the monitor and click on the “Nsight Monitor Options” hyperlink at the
bottom right of the monitor dialog box. This will bring up the dialog shown in Figure 4.8. Setting the


FIGURE 4.8
Disabling Windows kernel timeout.

“WDDM TDR enabled” option to false will modify the registry to disable this feature. Reboot your PC, and Parallel
Nsight will no longer warn you TDR is enabled.
To use Parallel Nsight on a remote machine, simply install the monitor package only on the remote
Windows PC. When you first run the monitor, it will warn you Windows Firewall has blocked “Public
network” (Internet based) access to the monitor, which is entirely what you want. However, the tool
needs to have access to the local network, so allow this exception to any firewall rules you have set up
on the monitor machine. As with a local node, you will have to fix the TDR issue and reboot once
installed.

FIGURE 4.9
Parallel Nsight remote connection.


FIGURE 4.10
Parallel Nsight connected remotely.

The next step is to run Visual Studio on the host PC and select a new analysis activity. You will see
a section near the top of the window that looks like Figure 4.9. Notice the “Connection Name” says
localhost, which just means your local machine. Open Windows Explorer and browse the local
network to see the name of the Windows PC you would like to use to remotely debug. Replace
localhost with the name shown in Windows Explorer. Then press the “Connect” button. You should
see two confirmations that the connection has been made as shown in Figure 4.10.
First, the “Connect” button will change to a “Disconnect.” Second, the “Connection Status” box
should turn green and show all the possible GPUs on the target machine (Figure 4.11). In this case
we’re connecting to a test PC that has five GTX470 GPU cards set up on it.

FIGURE 4.11
Parallel Nsight connection status.

Clicking on the “Launch” button on the “Application Control” panel next to the “Connection
Status” panel will remotely launch the application on the target machine. However, prior to this all the
necessary files need to be copied to the remote machine. This takes a few seconds or so, but is all
automatic. Overall, it’s a remarkably simple way of analyzing/debugging a remote application.


You may wish to set up Parallel Nsight in this manner if, for example, you have a laptop and wish to
debug, or simply remotely run, an application that will run on a GPU server. Such usage includes cases where
a GPU server or servers are shared by people who use them at different times, in teaching classes, for
example. You may also have remote developers who need to run code on specially set up test servers,
perhaps because those servers also contain huge quantities of data and it's not practical or desirable to
transfer that data to a local development machine. It also means you don't need to install Visual C++
on each of the remote servers you might have.
On the Linux and Mac side the debugger environment is CUDA-GDB. This provides an extended
GNU debugger package. As with Parallel Nsight it allows debugging of both host and CUDA code,
which includes setting breakpoints in the CUDA code, single-stepping, selecting a debug thread, and so on. Both
CUDA-GDB and the Visual Profiler tools are installed by default when you install the SDK, rather than
being a separate download as with Parallel Nsight. As of 2012, Parallel Nsight was also released under
the Eclipse environment for Linux.
The major difference between Windows and Mac/Linux was the profiling tool support. The Parallel
Nsight tool is in this respect vastly superior to the Visual Profiler. The Visual Profiler is also available
on Windows. It provides a fairly high-level overview and recommendations as to what to address in the
code, and therefore is very suited to those starting out using CUDA. Parallel Nsight, by contrast, is
aimed at a far more advanced user. We cover usage of both Parallel Nsight and Visual Profiler later in
subsequent chapters. However, the focus throughout this text is on the use of Parallel Nsight as the
primary debugging/analysis tool for GPU development.
For advanced CUDA development I’d strongly recommend using Parallel Nsight for debugging
and analysis. For most people new to CUDA, the combination of the Visual Profiler and CUDA-GDB
works well enough to allow for development.

COMPILATION MODEL
The NVIDIA compiler, NVCC, sits in the background and is invoked when a CUDA source file needs
to be compiled. The file extensions shown in Table 4.1 are used to identify files as either CUDA source
files or regular source files. This determines which compiler will be invoked, NVCC or the host
compiler.
The generated executable file, or fat binary, contains one or more binary executable images for the
different GPU generations. It also contains a PTX image, allowing the CUDA runtime to do just-in-time (JIT) compilation. This is very similar to Java byte code, where the target is a virtual architecture
that is compiled to the actual target hardware at the point the program is invoked.
Table 4.1 Different CUDA File Types

File Extension    Meaning                                          Processed By
.cu               Mixed host and device source file.               NVCC
.cup              A preprocessed expanded version of .cu file.     NVCC
.c, .cc, .cpp     A host C or C++ source file.                     Host compiler
.ptx, .gpu        Intermediate virtual assembly files.             NVCC
.cubin            Binary image of GPU code.                        NVCC


The PTX JIT compilation only happens if the executable does not contain a binary image that is identical to the GPU
in use. Consequently, all future architectures are backward compatible with the basic-level virtual
architecture. Even GPUs for which the program was not compiled will execute legacy GPU code by
simply compiling at runtime the PTX code embedded in the executable.
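As a hedged illustration (the source file name, output name, and the particular compute capabilities are arbitrary and will depend on your toolkit version and hardware), a command line such as the following builds a fat binary containing real code for two architectures plus PTX for JIT compilation on GPUs released later:

nvcc -gencode arch=compute_10,code=sm_10 -gencode arch=compute_20,code=sm_20 -gencode arch=compute_20,code=compute_20 -o my_app kernel.cu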
Just as with Java, code depositories are supported. Defining the environment variable
CUDA_DEVCODE_CACHE to point to a directory will cause the runtime to save the compiled binary for later
use, thus avoiding the startup delay necessary to compile the PTX code for the unknown GPU variant
every time it is invoked.
We cover in the later chapters how you can view the real target assembly code, the result of the PTX
to target translation.

ERROR HANDLING
Error handling in CUDA, as with C in general, is not as good as it could be. There are few runtime
checks performed, and if you do something stupid, the runtime will usually allow it. This results in
GPU programs that exit strangely. If you are lucky, you will get an error message which, like compiler
errors, you learn to interpret over time.
Almost all function calls in CUDA return the error type cudaError_t, which is simply an
integer value. Any value other than cudaSuccess will indicate a fatal error. This is usually caused
by your program not setting up something correctly prior to use, or using an object after it
has been destroyed. It can also be caused by the GPU kernel timeout present in Microsoft
Windows if the kernel runs for more than a few seconds and you have not disabled this when
installing tools such as Parallel Nsight (see previous section). Out-of-bounds memory accesses
may generate exceptions that will often print various error messages to stderr (standard error
output).
As every function returns an error code, every function call must be checked and some handler
written. This makes for very tiresome and highly indented programming. For example,
if (cudaMalloc(...) == cudaSuccess)
{
  if (cudaEventCreate(&event) == cudaSuccess)
  {
    ...
  }
}
else
{
  ...
}

To avoid this type of repetitive programming, throughout the book we will use the following macro
definition to make calls to the CUDA API:
#define CUDA_CALL(x) {const cudaError_t a = (x); if (a != cudaSuccess) { printf("\nCUDA
Error: %s (err_num=%d) \n", cudaGetErrorString(a), a); cudaDeviceReset(); assert(0);} }
What this macro does is to allow you to specify x as some function call, for example,


CUDA_CALL(cudaEventCreate(&kernel_start));

This then creates a temporary variable a and assigns to it the return value of the function, which is of
type cudaError_t. It then checks if this is not equal to cudaSuccess, that is, the call encountered some
error. If there was an error detected, it prints to the screen the error returned plus a short description of
what the error means. It also uses the assert macro, which identifies the source file and line in which the
error occurs so you can easily track down the point at which the error is being detected.
This technique works for all the CUDA calls except for the invocation of kernels. Kernels are the
programs you write to run on the GPU. These are executed using the <<< and >>> operators as follows:
my_kernel <<<num_blocks, num_threads>>>(param1, param2, ...);

For error checking of kernels, we’ll use the following function:
__host__ void cuda_error_check(const char * prefix, const char * postfix)
{
  if (cudaPeekAtLastError() != cudaSuccess)
  {
    printf("\n%s%s%s", prefix, cudaGetErrorString(cudaGetLastError()), postfix);
    cudaDeviceReset();
    wait_exit();
    exit(1);
  }
}

This function should be called immediately after executing the kernel call. It checks for any
immediate errors and, if one is found, prints an error message, resets the GPU, optionally waits for a key press via
the wait_exit function, and then exits the program.
Note that this is not foolproof, as the kernel call is asynchronous with the CPU code. That is, the
GPU code is running in the background at the time we call cudaPeekAtLastError. If there has been
no error detected at this time, then we see no error printed and the function continues to the next
code line. Often that next code line will be a copy back from GPU memory to CPU memory. The
error in the kernel may cause a subsequent API call to fail, which is almost always the next API call
after the kernel call. Surrounding all calls to the API with the CUDA_CALL macro will flag the error at
this point.
You can also force the kernel to complete prior to the error checking by simply inserting a call to
cudaDeviceSynchronize prior to the cudaPeekAtLastError call. However, only do this on the debug
version of the program or where you want the CPU to idle while the GPU is busy. As you should
understand by the end of this text, such synchronous operation is good for debugging, but will harm
performance, so you should be careful these calls do not remain in production code if they were
inserted solely for debugging.
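As a hedged sketch of that advice (the kernel name, launch parameters, and the CUDA_DEBUG macro are illustrative only; the CUDA_CALL macro and cuda_error_check function are the ones defined above), the synchronization can be limited to debug builds so the check disappears from production code:

my_kernel<<<num_blocks, num_threads>>>(param1, param2);

#ifdef CUDA_DEBUG
/* Debug builds only: force the kernel to finish so any error is reported here. */
CUDA_CALL(cudaDeviceSynchronize());
#endif

cuda_error_check("Error in my_kernel: ", "\n");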

CONCLUSION
You should now have a working installation of the CUDA SDK, including the GPU computing SDK
samples and a debugging environment. You should be able to build a simple GPU SDK sample, such as
the deviceQuery project, and have it identify the GPUs in your system when run.

CHAPTER 5

Grids, Blocks, and Threads

WHAT IT ALL MEANS
NVIDIA chose a rather interesting model for its scheduling, a variant of SIMD it calls SPMD (single
program, multiple data). This is based on the underlying hardware implementation in many respects.
At the heart of parallel programming is the idea of a thread, a single flow of execution through the
program in the same way a piece of cotton flows through a garment. In the same way threads of cotton
are woven into cloth, threads used together make up a parallel program. The CUDA programming
model groups threads into special groups it calls warps, blocks, and grids, which we will look at in turn.

THREADS
A thread is the fundamental building block of a parallel program. Most C programmers are familiar
with the concept if they have done any multicore programming. Even if you have never launched
a thread in any code, you will be familiar with executing at least one thread, the single thread of
execution through any serial piece of code.
With the advent of dual, quad, hex core processors, and beyond, more emphasis is explicitly placed
on the programmer to make use of such hardware. Most programs written in the past few decades, with
the exception of perhaps the past decade, were single-thread programs because the primary hardware
on which they would execute was a single-core CPU. Sure, you had clusters and supercomputers that
sought to exploit a high level of parallelism by duplicating the hardware and having thousands of
commodity servers instead of a handful of massively powerful machines. However, these were mostly
restricted to universities and large institutions, not generally
available to the masses.
Thinking in terms of lots of threads is hard. It’s much easier to think in terms of one task at a time.
Serial programming languages like C/C++ were born from a time when serial processing speed
doubled every few years. There was little need to do the hard parallel programming. That stopped
almost a decade ago, and now, like it or not, improving program speed requires us to think in terms of
parallel design.

Problem decomposition
Parallelism in the CPU domain tends to be driven by the desire to run more than one (single-threaded)
program on a single CPU. This is the task-level parallelism that we covered earlier. Programs, which

are data intensive, like video encoding, for example, use the data parallelism model and split the task in
N parts where N is the number of CPU cores available. You might, for example, have each CPU core
calculate one “frame” of data where there are no interdependencies between frames. You may also
choose to split each frame into N segments and allocate each one of the segments to an individual core.
In the GPU domain, you see exactly these choices when attempting to speed up rendering of 3D
worlds in computer games by using more than one GPU. You can send complete, alternate frames to
each GPU (Figure 5.1). Alternatively, you can have each GPU render a different part of the screen.

FIGURE 5.1
Alternate frame rendering (AFR) vs. split frame rendering (SFR).

FIGURE 5.2
Coarse-grained parallelism.

However, there is a tradeoff here. If the dataset is self-contained, you can use less memory and transfer
less data by only providing the GPU (or CPU) with the subset of the data you need to calculate. In the SFR
GPU example used here, there may be no need for GPU3, which is rendering the floor, to know the content
of data from GPU0, which is probably rendering the sky. However, there may be shadows from a flying
object, or the lighting level of the floor may need to vary based on the time of day. In such instances, it
might be more beneficial to go with the alternate frame rendering approach because of this shared data.
We refer to SFR type splits as coarse-grained parallelism. Large chunks of data are split in some
way between N powerful devices and then reconstructed later as the processed data. When designing
applications for a parallel environment, choices at this level seriously impact the performance of your
programs. The best choice here is very much linked to the actual hardware you will be using, as you
will see with the various applications we develop throughout this book.
With a small number of powerful devices, such as in CPUs, the issue is often how to split the
workload evenly. This is often easier to reason with because you are typically talking about only
a small number of devices. With huge numbers of smaller devices, as with GPUs, they average out
peaks in workload much better, but suffer from issues around synchronization and coordination.
In the same way as you have macro (large-scale) and micro (small-scale) economics, you have
coarse and fine-grained parallelism. However, you only really find fine-grained parallelism at the


programmer level on devices that support huge numbers of threads, such as GPUs. CPUs, by contrast,
also support threads, but with a large overhead and thus are considered to be useful for more coarse-grained parallelism problems. CPUs, unlike GPUs, follow the MIMD (Multiple Instruction
Multiple Data) model in that they support multiple independent instruction streams. This is a more
flexible approach, but incurs additional overhead in terms of fetching multiple independent instruction
streams as opposed to amortizing the single instruction stream over multiple processors.
To put this in context, let’s consider a digital photo where you apply an image correction function to
increase the brightness. On a GPU you might choose to assign one thread for every pixel in the image.
On a quad-core CPU, you would likely assign one-quarter of the image to each CPU core.

How CPUs and GPUs are different
GPUs and CPUs are architecturally very different devices. CPUs are designed for running a small
number of potentially quite complex tasks. GPUs are designed for running a large number of quite
simple tasks. The CPU design is aimed at systems that execute a number of discrete and unconnected
tasks. The GPU design is aimed at problems that can be broken down into thousands of tiny fragments
and worked on individually. Thus, CPUs are very suitable for running operating systems and application software where there is a vast variety of tasks a computer may be performing at any given time.
CPUs and GPUs consequently support threads in very different ways. The CPU has a small number
of registers per core that must be used to execute any given task. To achieve this, they rapidly context
switch between tasks. Context switching on CPUs is expensive in terms of time, in that the entire
register set must be saved to RAM and the next one restored from RAM. GPUs, by comparison, also
use the same concept of context switching, but instead of having a single set of registers, they have
multiple banks of registers. Consequently, a context switch simply involves setting a bank selector to
switch in and out the current set of registers, which is several orders of magnitude faster than having to
save to RAM.
Both CPUs and GPUs must deal with stall conditions. These are generally caused by I/O operations
and memory fetches. The CPU does this by context switching. Providing there are enough tasks and
the runtime of a thread is not too small, this works reasonably well. If there are not enough processes to
keep the CPU busy, it will idle. If there are too many small tasks, each blocking after a short period, the
CPU will spend most of its time context switching and very little time doing useful work. CPU
scheduling policies are often based on time slicing, dividing the time equally among the threads. As the
number of threads increases, the percentage of time spent context switching becomes increasingly
large and the efficiency starts to rapidly drop off.
GPUs are designed to handle stall conditions and expect this to happen with high frequency. The
GPU model is a data-parallel one and thus it needs thousands of threads to work efficiently. It uses
this pool of available work to ensure it always has something useful to work on. Thus, when it hits
a memory fetch operation or has to wait on the result of a calculation, the streaming processors
simply switch to another instruction stream and return to the stalled instruction stream sometime
later.
One of the major differences between CPUs and GPUs is the sheer number of processors on each
device. CPUs are typically dual- or quad-core devices. That is to say they have a number of execution
cores available to run programs on. The current Fermi GPUs have 16 SMs, which can be thought of as
being a lot like CPU cores. CPUs often run single-thread programs, meaning they calculate just a single data


point per core, per iteration. GPUs run in parallel by default. Thus, instead of calculating just a single
data point per SM, GPUs calculate 32 per SM. This gives a 4 times advantage in terms of number of
cores (SMs) over a typical quad core CPU, but also a 32 times advantage in terms of data throughput.
Of course, CPU programs can also use all the available cores and extensions like MMX, SSE, and
AVX. The question is how many CPU applications actually use these types of extensions.
GPUs also provide something quite unique: high-speed memory next to the SM, so-called
shared memory. In many respects this implements the design philosophy of the Connection Machine
and the Cell processor, in that it provides local workspace for the device outside of the standard
register file. Thus, the programmer can leave data in this memory, safe in the knowledge the
hardware will not evict it behind his or her back. It is also the primary mechanism for communication
between threads.

Task execution model
There are two major differences in the task execution model. The first is that groups of N SPs execute
in lock-step (Figure 5.3), running the same program but on different data. The second is that,
because of this huge register file, switching threads has effectively zero overhead. Thus, the GPU can
support a very large number of threads and is designed in this way.
Now what exactly do we mean by lock-step basis? Each instruction in the instruction queue is
dispatched to every SP within an SM. Remember each SM can be thought of as a single processor with
N cores (SPs) embedded within it.
A conventional CPU will fetch a separate instruction stream for each CPU core. The GPU SPMD
model used here allows an instruction fetch for N logical execution units, meaning you have 1/N the
instruction memory bandwidth requirements of a conventional processor. This is a very similar
approach to the vector or SIMD processors found in many high-end supercomputers.
However, this is not without its costs. As you will see later, if the program does not follow a nice
neat execution flow where all N threads follow the same control path, for each branch, you will require
additional execution cycles.
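As a hedged illustration (a contrived kernel, not taken from the text), if half the threads of each warp take one branch and half take the other, the warp must execute both paths one after the other, with the threads not on the current path masked off:

__global__ void divergent_kernel(int *data)
{
  const int idx = blockIdx.x * blockDim.x + threadIdx.x;

  /* Threads 0-15 of each warp take the first path, threads 16-31 the
     second; the two halves are serialized, costing extra cycles. */
  if ((threadIdx.x & 31) < 16)
    data[idx] *= 2;
  else
    data[idx] += 1;
}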

FIGURE 5.3
Lock-step instruction dispatch (one instruction stream feeding SP 0 through SP 7 of SM 0).

Threading on GPUs
So coming back to threads, let’s look at a section of code and see what this means from a programming
perspective.
void some_func(void)
{
  int i;

  for (i=0; i<128; i++)
  {
    a[i] = b[i] * c[i];
  }
}

This piece of code is very simple. It stores the result of multiplying the b and c values at a given
index into the result array a at that same index. The for loop iterates 128 times (indexes 0 to 127). In
CUDA you could translate this to 128 threads, each of which executes the line
a[i] = b[i] * c[i];

This is possible because there is no dependency between one iteration of the loop and the next.
Thus, to transform this into a parallel program is actually quite easy. This is called loop parallelization
and is very much the basis for one of the more popular parallel language extensions, OpenMP.
On a quad-core CPU you could also translate this to four blocks, where CPU core 1 handles
indexes 0–31, core 2 indexes 32–63, core 3 indexes 64–95, and core 4 indexes 96–127. Some
compilers will either automatically translate such blocks or translate them where the programmer
marks that this loop can be parallelized. The Intel compiler is particularly good at this. Such
compilers can be used to create embedded SSE instructions to vectorize a loop in this way, in
addition to spawning multiple threads. This gives two levels of parallelism and is not too different
from the GPU model.
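As a hedged aside (a sketch only; the a, b, and c arrays are assumed to be globals, as in the listing above), the OpenMP version of the same loop needs nothing more than a pragma, with the compiler splitting the iterations across the available CPU cores:

#include <omp.h>

void some_func_omp(void)
{
  int i;

  /* The compiler divides the 128 iterations among the CPU cores. */
  #pragma omp parallel for
  for (i = 0; i < 128; i++)
  {
    a[i] = b[i] * c[i];
  }
}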
In CUDA, you translate this loop by creating a kernel function, which is a function that executes on
the GPU only and cannot be executed directly on the CPU. In the CUDA programming model the CPU
handles the serial code execution, which is where it excels. When you come to a computationally
intense section of code the CPU hands it over to the GPU to make use of the huge computational power
it has. Some of you might remember the days when CPUs would use a floating-point coprocessor.
Applications that used a large amount of floating-point math ran many times faster on machines fitted
with such coprocessors. Exactly the same is true for GPUs. They are used to accelerate computationally intensive sections of a program.
The GPU kernel function, conceptually, looks identical to the loop body, but with the loop structure
removed. Thus, you have the following:
__global__ void some_kernel_func(int * const a, const int * const b, const int * const c)
{
  a[i] = b[i] * c[i];
}

Notice you have lost the loop and the loop control variable, i. You also have a __global__
prefix added to the C function that tells the compiler to generate GPU code and not CPU


code when compiling this function, and to make that GPU code globally visible from within
the CPU.
The CPU and GPU have separate memory spaces, meaning you cannot access CPU parameters in
the GPU code and vice versa. There are some special ways of doing exactly this, which we’ll cover
later in the book, but for now we will deal with them as separate memory spaces. As a consequence, the
global arrays a, b, and c at the CPU level are no longer visible on the GPU level. You have to declare
memory space on the GPU, copy over the arrays from the CPU, and pass the kernel function pointers to
the GPU memory space to both read and write from. When you are done, you copy that memory back
into the CPU. We’ll look at this a little later.
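As a hedged preview of those steps (a sketch only, with error checking omitted; the wrapper function name and the single-block launch are assumptions, and it relies on the thread-indexed version of the kernel shown a little further on), the host-side code follows an allocate, copy in, launch, copy back, free pattern:

void run_some_kernel(int *host_a, const int *host_b, const int *host_c,
                     const int num_elements)
{
  const size_t bytes = num_elements * sizeof(int);
  int *gpu_a, *gpu_b, *gpu_c;

  /* Declare memory space on the GPU. */
  cudaMalloc((void **)&gpu_a, bytes);
  cudaMalloc((void **)&gpu_b, bytes);
  cudaMalloc((void **)&gpu_c, bytes);

  /* Copy the input arrays over from the CPU. */
  cudaMemcpy(gpu_b, host_b, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(gpu_c, host_c, bytes, cudaMemcpyHostToDevice);

  /* Launch one thread per element in a single block. */
  some_kernel_func<<<1, num_elements>>>(gpu_a, gpu_b, gpu_c);

  /* Copy the result back into the CPU and free the GPU memory. */
  cudaMemcpy(host_a, gpu_a, bytes, cudaMemcpyDeviceToHost);
  cudaFree(gpu_a);
  cudaFree(gpu_b);
  cudaFree(gpu_c);
}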
The next problem you have is that i is no longer defined; instead, the value of i is defined for you
by the thread you are currently running. You will be launching 128 instances of this function, and
initially this will be in the form of 128 threads. CUDA provides a special parameter, different for each
thread, which defines the thread ID or number. You can use this to directly index into the array. This is
very similar to MPI, where you get the process rank for each process.
The thread information is provided in a structure. As it’s a structure element, we will store it in
a variable, thread_idx for now to avoid having to reference the structure every time. Thus, the code
becomes:
__global__ void some_kernel_func(int * const a, const int * const b, const int * const c)
{
  const unsigned int thread_idx = threadIdx.x;
  a[thread_idx] = b[thread_idx] * c[thread_idx];
}

Note, some people prefer idx or tid as the name for the thread index since these are somewhat
shorter to type.
What is happening, now, is that for thread 0, the thread_idx calculation returns 0. For thread 1, it
returns 1, and so on, up to thread 127, which uses index 127. Each thread does exactly two reads from
memory, one multiply and one store operation, and then terminates. Notice how the code executed by
each thread is identical, but the data changes. This is at the heart of the CUDA and SPMD model.
In OpenMP and MPI, you have similar blocks of code. They extract, for a given iteration of the
loop, the thread ID or thread rank allocated to that thread. This is then used to index into the dataset.

A peek at hardware
Now remember you only actually have N cores on each SM, so how can you run 128 threads? Well,
like the CPU, each thread group is placed into the SM and the N SPs start running the code. The first
thing you do after extracting the thread index is fetch a parameter from the b and c array. Unfortunately, this doesn’t happen immediately. In fact, some 400–600 GPU clocks can go by before the
memory subsystem comes back with the requested data. During this time the set of N threads gets
suspended.
Threads are, in practice, grouped into groups of 32 threads, and when all 32 threads are
waiting on something, such as a memory access, they are suspended. The technical term for these groups
of threads is a warp (32 threads) and a half warp (16 threads), something we’ll return to later.
Thus, the 128 threads translate into four groups of 32 threads. The first set all run together to extract
the thread ID and then calculate the address in the arrays and issue a memory fetch request (see


Figure 5.4). The next instruction, a multiply, requires both operands to have been provided, so the
thread is suspended. When all 32 threads in that block of 32 threads are suspended, the hardware
switches to another warp.
In Figure 5.5, you can see that when warp 0 is suspended pending its memory access completing,
warp 1 becomes the executing warp. The GPU continues in this manner until all warps have moved to
the suspended state (see Figure 5.6).
Prior to issuing the memory fetch, fetches from consecutive threads are usually coalesced or grouped
together. This reduces the overall latency (time to respond to the request), as there is an overhead
associated in the hardware with managing each request. As a result of the coalescing, the memory fetch
returns with the data for a whole group of threads, usually enough to enable an entire warp.
These threads are then placed in the ready state and become available for the GPU to switch in the
next time it hits a blocking operation, such as another memory fetch from another set of threads.
Having executed all the warps (groups of 32 threads) the GPU becomes idle waiting for any one of
the pending memory accesses to complete. At some point later, you’ll get a sequence of memory
blocks being returned from the memory subsystem. It is likely, but not guaranteed, that these will come
back in the order in which they were requested.
Let’s assume that addresses 0–31 were returned at the same time. Warp 0 moves to the ready queue,
and since there is no warp currently executing, warp 0 automatically moves to the executing state (see
Figure 5.7). Gradually all the pending memory requests will complete, resulting in all of the warp
blocks moving back to the ready queue.

FIGURE 5.4
Cycle 0 (warp 0 executing; warps 1 to 3 in the ready queue).


FIGURE 5.5
Cycle 1 (warp 1 executing; warp 0 suspended, pending its memory request for addresses 0 to 31).

FIGURE 5.6
Cycle 8 (all four warps suspended, each pending a memory request).

FIGURE 5.7
Cycle 9: warp 0 is executing again, warp 1 is in the ready queue, and warps 2 and 3 remain suspended with memory requests pending for addresses 64 to 95 and 96 to 127.

Once warp 0 has executed, its final instruction is a write to the destination array a. As there are no
dependent instructions on this operation, warp 0 is then complete and is retired. The other warps move
through this same cycle and eventually they have all issued a store request. Each warp is then retired,
and the kernel completes, returning control to the CPU.

CUDA kernels
Now let’s look a little more at how exactly you invoke a kernel. CUDA defines an extension to the C
language used to invoke a kernel. Remember, a kernel is just a name for a function that executes on the
GPU. To invoke a kernel you use the following syntax:
kernel_function<<<num_blocks, num_threads>>>(param1, param2, ...)

There are some other parameters you can pass, and we’ll come back to this, but for now you have two
important parameters to look at: num_blocks and num_threads. These can be either variables or literal
values. I’d recommend the use of variables because you’ll use them later when tuning performance.
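As a minimal sketch of that recommendation (the values are illustrative, and some_kernel_func and its parameters are those used in the earlier example), the launch might look like this:

/* A minimal sketch: hold the launch parameters in variables rather than literals */
const unsigned int num_blocks  = 1;
const unsigned int num_threads = 128;

some_kernel_func<<<num_blocks, num_threads>>>(a, b, c);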
The num_blocks parameter is something you have not yet covered and is covered in detail in the
next section. For now all you need to do is ensure you have at least one block of threads.
The num_threads parameter is simply the number of threads you wish to launch into the kernel. For
this simple example, this directly translates to the number of iterations of the loop. However, be aware
that the hardware limits you to 512 threads per block on the early hardware and 1024 on the later hardware. In this example, it is not an issue, but for any real program it is almost certainly an issue.
You’ll see in the following section how to overcome this.
The next part of the kernel call is the parameters passed. Parameters can be passed via registers or constant memory, the choice of which is made by the compiler. If using registers, you will use one register for every thread per parameter passed. Thus, for 128 threads with three parameters, you use 3 × 128 = 384 registers. This may sound like a lot, but remember that you have at least 8192 registers in each SM and potentially more on later hardware revisions. So with 128 threads, you have a total of 64 registers per thread (8192 registers / 128 threads) available to you, if you run just one block of threads on an SM.
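Rather than working this out by hand, you can also ask the CUDA runtime what the compiler actually assigned. The following is a hedged sketch (not from the text) using cudaFuncGetAttributes; it assumes some_kernel_func from the earlier example and must be compiled with nvcc:

cudaFuncAttributes attr;
cudaFuncGetAttributes(&attr, some_kernel_func);
/* numRegs is the number of registers the compiler assigned per thread */
printf("some_kernel_func uses %d registers per thread\n", attr.numRegs);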
However, running one block of 128 threads per SM is a very bad idea, even if you can use
64 registers per thread. As soon as you access memory, the SM would effectively idle. Only in
the very limited case of heavy arithmetic intensity utilizing the 64 registers should you even
consider this sort of approach. In practice, multiple blocks are run on each SM to avoid any
idle states.

BLOCKS
Now 512 threads are not really going to get you very far on a GPU. This may sound like a huge number
to many programmers from the CPU domain, but on a GPU you usually need thousands or tens of
thousands of concurrent threads to really achieve the throughput available on the device.
We touched on this previously in the last section on threads, with the num_blocks parameter for the
kernel invocation. This is the first parameter within the <<< and >>> symbols:
kernel_function<<<num_blocks, num_threads>>>(param1, param2,...)

If you change this from one to two, you double the number of threads you are asking the GPU to
invoke on the hardware. Thus, the same call,
some_kernel_func<<< 2, 128 >>>(a, b, c);

will call the GPU function named some_kernel_func 2 × 128 times, each with a different thread. This, however, complicates the calculation of the thread_idx parameter, effectively the array index position. The previous, simple kernel needs a slight amendment to account for this.
__global__ void some_kernel_func(int * const a, const int * const b, const int * const c)
{
const unsigned int thread_idx = (blockIdx.x * blockDim.x) + threadIdx.x;
a[thread_idx] = b[thread_idx] * c[thread_idx];
}

To calculate the thread_idx parameter, you must now take into account the number of blocks. For
the first block, blockIdx.x will contain zero, so effectively the thread_idx parameter is equal to the
threadIdx.x parameter you used earlier. However, for block two, blockIdx.x will hold the value 1.
The parameter blockDim.x holds the value 128, which is, in effect, the number of threads you
requested per block in this example. Thus, for block two you have a base address of 1 × 128 = 128, before adding in
the thread offset from the threadIdx.x parameter.


Have you noticed the small error we have introduced in adding in another block? You will now
launch 256 threads in total and index the array from 0 to 255. If you don’t also change the size of the
array, from 128 elements to 256 elements, you will access and write beyond the end of the array. This
array out-of-bounds error will not be caught by the compiler and the code may actually run, depending
on what is located after the destination array, a. Be careful when invoking a kernel that you do not
access out of bounds elements.
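One common guard against this, sketched below on the assumption that you pass the element count as an extra parameter (which the original example does not do), is to test the calculated index before using it:

/* A hedged sketch: same kernel with an explicit bounds check, so launching more
   threads than there are elements cannot write past the end of the arrays. */
__global__ void some_kernel_func_guarded(int * const a,
                                         const int * const b,
                                         const int * const c,
                                         const unsigned int num_elements)
{
	const unsigned int thread_idx = (blockIdx.x * blockDim.x) + threadIdx.x;

	/* Threads beyond the end of the arrays simply do nothing */
	if (thread_idx < num_elements)
	{
		a[thread_idx] = b[thread_idx] * c[thread_idx];
	}
}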
For this example, we will stick with the 128-element array size and change the kernel to invoke two
blocks of 64 threads each:
some_kernel_func<<< 2, 64 >>>(a, b, c);

Thus, you get what is shown in Figure 5.8.
Notice how, despite now having two blocks, the thread_idx parameter still equates to the array
index, exactly as before. So what is the point of using blocks? In this trivial example, absolutely
nothing. However, in any real-world problem, you have far more than 512 elements to deal with. In
fact, if you look at the limit on the number of blocks, you find you have 65,536 blocks you can use.
At 65,536 blocks, with 512 threads per block, you can schedule 33,554,432 (around 33.5 million)
threads in total. At 512 threads, you can have up to three blocks per SM. Actually, this limit is based on
the total number of threads per SM, which is 1536 in the latest Fermi hardware, and as little as 768 in
the original G80 hardware.
If you schedule the maximum of 1024 threads per block on the Fermi hardware, 65,536 blocks
would translate into around 64 million threads. Unfortunately, at 1024 threads, you only get one thread
block per SM. Consequently, you’d need some 65,536 SMs in a single GPU before you could not
allocate at least one block per SM. Currently, the maximum number of SMs found on any card is 30.
Thus, there is some provision for the number of SMs to grow before you have more SMs than the
number of blocks the hardware can support. This is one of the beauties of CUDA: the fact it can scale
to thousands of execution units. The limit of the parallelism is only really the limit of the amount of
parallelism that can be found in the application.
With 64 million threads, assuming one thread per array element, you can process up to 64 million
elements. Assuming each element is a single-precision floating-point number, requiring 4 bytes of
data, you’d need around 256 million bytes, or 256 MB, of data storage space. Almost all GPU cards
support at least this amount of memory space, so working with threads and blocks alone you can
achieve quite a large amount of parallelism and data coverage.

FIGURE 5.8
Block mapping to address: block 0, warp 0 (threads 0 to 31) covers addresses 0 to 31; block 0, warp 1 (threads 32 to 63) covers addresses 32 to 63; block 1, warp 0 (threads 64 to 95) covers addresses 64 to 95; block 1, warp 1 (threads 96 to 127) covers addresses 96 to 127.


For anyone worried about large datasets, where large problems can run into gigabytes, terabytes, or
petabytes of data, there is a solution. For this, you generally either process more than one element per
thread or use another dimension of blocks, which we’ll cover in the next section.
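As a hedged sketch of the first option (this is not the text's own example; the kernel name and the num_elements parameter are illustrative), each thread can step through the data in strides of the total thread count, so a fixed-size grid can cover an arbitrarily large dataset:

/* A hedged sketch: each thread processes several elements by striding through the
   data in steps of the total thread count. */
__global__ void some_kernel_func_multi(int * const a,
                                       const int * const b,
                                       const int * const c,
                                       const unsigned int num_elements)
{
	const unsigned int total_threads = gridDim.x * blockDim.x;

	/* This thread handles elements idx, idx + total_threads, idx + 2 * total_threads, ... */
	for (unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;
	     i < num_elements;
	     i += total_threads)
	{
		a[i] = b[i] * c[i];
	}
}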

Block arrangement
To ensure that we understand the block arrangement, we’re going to write a short kernel program to
print the block, thread, warp, and thread index to the screen. Now, unless you have at least version 3.2 of
the SDK, the printf statement is not supported in kernels. So we’ll ship the data back to the CPU and
print it to the console window. The kernel program is thus as follows:
__global__ void what_is_my_id(unsigned int * const block,
unsigned int * const thread,
unsigned int * const warp,
unsigned int * const calc_thread)
{
/* Thread id is block index * block size + thread offset into the block */
const unsigned int thread_idx = (blockIdx.x * blockDim.x) + threadIdx.x;
block[thread_idx] = blockIdx.x;
thread[thread_idx] = threadIdx.x;
/* Calculate warp using built in variable warpSize */
warp[thread_idx] = threadIdx.x / warpSize;
calc_thread[thread_idx] = thread_idx;
}

Now on the CPU you have to run a section of code, as follows, to allocate memory for the arrays on
the GPU and then transfer the arrays back from the GPU and display them on the CPU.
#include <stdio.h>
#include <stdlib.h>
#include <conio.h>
__global__ void what_is_my_id(unsigned int * const block,
unsigned int * const thread,
unsigned int * const warp,
unsigned int * const calc_thread)
{
/* Thread id is block index * block size + thread offset into the block */
const unsigned int thread_idx = (blockIdx.x * blockDim.x) + threadIdx.x;
block[thread_idx] = blockIdx.x;
thread[thread_idx] = threadIdx.x;
/* Calculate warp using built in variable warpSize */
warp[thread_idx] = threadIdx.x / warpSize;
calc_thread[thread_idx] = thread_idx;
}

#define ARRAY_SIZE 128
#define ARRAY_SIZE_IN_BYTES (sizeof(unsigned int) * (ARRAY_SIZE))
/* Declare statically four arrays of ARRAY_SIZE each */
unsigned int cpu_block[ARRAY_SIZE];
unsigned int cpu_thread[ARRAY_SIZE];
unsigned int cpu_warp[ARRAY_SIZE];
unsigned int cpu_calc_thread[ARRAY_SIZE];

int main(void)
{
/* Total thread count = 2 * 64 = 128 */
const unsigned int num_blocks = 2;
const unsigned int num_threads = 64;
char ch;
/* Declare pointers for GPU based params */
unsigned int * gpu_block;
unsigned int * gpu_thread;
unsigned int * gpu_warp;
unsigned int * gpu_calc_thread;
/* Declare loop counter for use later */
unsigned int i;
/* Allocate four arrays on the GPU */
cudaMalloc((void **)&gpu_block, ARRAY_SIZE_IN_BYTES);
cudaMalloc((void **)&gpu_thread, ARRAY_SIZE_IN_BYTES);
cudaMalloc((void **)&gpu_warp, ARRAY_SIZE_IN_BYTES);
cudaMalloc((void **)&gpu_calc_thread, ARRAY_SIZE_IN_BYTES);

/* Execute our kernel */
what_is_my_id<<<num_blocks, num_threads>>>(gpu_block, gpu_thread, gpu_warp,
gpu_calc_thread);
/* Copy back the gpu results to the CPU */
cudaMemcpy(cpu_block, gpu_block, ARRAY_SIZE_IN_BYTES,
cudaMemcpyDeviceToHost);
cudaMemcpy(cpu_thread, gpu_thread, ARRAY_SIZE_IN_BYTES,
cudaMemcpyDeviceToHost);
cudaMemcpy(cpu_warp, gpu_warp, ARRAY_SIZE_IN_BYTES,
cudaMemcpyDeviceToHost);
cudaMemcpy(cpu_calc_thread, gpu_calc_thread, ARRAY_SIZE_IN_BYTES,
cudaMemcpyDeviceToHost);
/* Free the arrays on the GPU as now we’re done with them */


cudaFree(gpu_block);
cudaFree(gpu_thread);
cudaFree(gpu_warp);
cudaFree(gpu_calc_thread);
/* Iterate through the arrays and print */
for (i=0; i < ARRAY_SIZE; i++)
{
printf("Calculated Thread: %3u - Block: %2u - Warp %2u - Thread %3u\n",
cpu_calc_thread[i], cpu_block[i], cpu_warp[i], cpu_thread[i]);
}
ch = getch();
}

In this example, what you see is that each block is located immediately after the one before
it. As you have only a single dimension to the array, laying out the thread blocks in a similar
way is an easy way to conceptualize a problem. The output of the previous program is as
follows:
Calculated Thread:   0 - Block:  0 - Warp  0 - Thread   0
Calculated Thread:   1 - Block:  0 - Warp  0 - Thread   1
Calculated Thread:   2 - Block:  0 - Warp  0 - Thread   2
Calculated Thread:   3 - Block:  0 - Warp  0 - Thread   3
Calculated Thread:   4 - Block:  0 - Warp  0 - Thread   4
...
Calculated Thread:  30 - Block:  0 - Warp  0 - Thread  30
Calculated Thread:  31 - Block:  0 - Warp  0 - Thread  31
Calculated Thread:  32 - Block:  0 - Warp  1 - Thread  32
Calculated Thread:  33 - Block:  0 - Warp  1 - Thread  33
Calculated Thread:  34 - Block:  0 - Warp  1 - Thread  34
...
Calculated Thread:  62 - Block:  0 - Warp  1 - Thread  62
Calculated Thread:  63 - Block:  0 - Warp  1 - Thread  63
Calculated Thread:  64 - Block:  1 - Warp  0 - Thread   0
Calculated Thread:  65 - Block:  1 - Warp  0 - Thread   1
Calculated Thread:  66 - Block:  1 - Warp  0 - Thread   2
Calculated Thread:  67 - Block:  1 - Warp  0 - Thread   3
...
Calculated Thread:  94 - Block:  1 - Warp  0 - Thread  30
Calculated Thread:  95 - Block:  1 - Warp  0 - Thread  31
Calculated Thread:  96 - Block:  1 - Warp  1 - Thread  32
Calculated Thread:  97 - Block:  1 - Warp  1 - Thread  33
Calculated Thread:  98 - Block:  1 - Warp  1 - Thread  34
Calculated Thread:  99 - Block:  1 - Warp  1 - Thread  35
Calculated Thread: 100 - Block:  1 - Warp  1 - Thread  36
...
Calculated Thread: 126 - Block:  1 - Warp  1 - Thread  62
Calculated Thread: 127 - Block:  1 - Warp  1 - Thread  63


As you can see, the calculated thread, or the thread ID, goes from 0 to 127. Within that you allocate
two blocks of 64 threads each. The thread indexes within each of these blocks go from 0 to 63. You also
see that each block generates two warps.

GRIDS
A grid is simply a set of blocks where you have an X and a Y axis, in effect a 2D mapping. The final Y mapping gives you Y × X × T possibilities for a thread index. Let's look at this using an example, but limiting the Y axis to a single row to start off with.
If you were to look at a typical HD image, you have a 1920 × 1080 resolution. The number of
threads in a block should always be a multiple of the warp size, which is currently defined as 32. As
you can only schedule a full warp on the hardware, if you don’t do this, then the remaining part of the
warp goes unused and you have to introduce a condition to ensure you don’t process elements off the
end of the X axis. This, as you’ll see later, slows everything down.
To avoid poor memory coalescing, you should always try to arrange the memory and thread usage
so they map. This will be covered in more detail in the next chapter on memory. Failure to do so will
result in something in the order of a five times drop in performance.
To avoid tiny blocks, as they don’t make full use of the hardware, we’ll pick 192 threads per block.
In most cases, this is the minimum number of threads you should think about using. This gives you
exactly 10 blocks across each row of the image, which is an easy number to work with (Figure 5.9).
Using a block size that is a multiple of the warp size, and that divides the X axis evenly, makes life a lot easier.
Along the top on the X axis, you have the thread index. The row index forms the Y axis. The height of the row is exactly one pixel. As you have 1080 rows of 10 blocks, you have in total 1080 × 10 = 10,800 blocks. As each block has 192 threads, you are scheduling just over two million threads, one for each pixel.
This particular layout is useful where you have one operation on a single pixel or data point, or where you have some operation on a number of data points in the same row. On the Fermi hardware, at eight blocks per SM, you'd need a total of 1350 SMs (10,800 total blocks / 8 scheduled blocks) to run out of parallelism at the application level. On the Fermi hardware currently available, you have only 16 SMs (GTX580), so each SM would be given 675 blocks to process.
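As a hedged sketch of that layout (the kernel and variable names are mine, and the dim3 type used here is introduced a little later in this chapter), the launch would be:

/* One thread per pixel of a 1920 x 1080 image, 192 threads per block */
const dim3 threads_per_block(192, 1);    /* 192 threads = 6 whole warps of 32 */
const dim3 num_blocks(1920 / 192, 1080); /* 10 blocks across each of the 1080 rows */

process_pixel_kernel<<<num_blocks, threads_per_block>>>(gpu_image_data);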
FIGURE 5.9
Block allocation to rows: row 0 holds blocks 0 to 9, row 1 holds blocks 10 to 19, row 2 holds blocks 20 to 29, and so on down to row 1079, which holds blocks 10,790 to 10,799.

This is all very well, but what if your data is not row based? As with arrays, you are not limited to a single dimension. You can have a 2D thread block arrangement. A lot of image algorithms, for
example, use 8 × 8 blocks of pixels. We're using pixels here to show this arrangement, as it's easy for
most people to conceptualize. Your data need not be pixel based. You typically represent pixels as
a red, green, and blue component. You could equally have x, y, and z spatial coordinates as a single data
point, or a simple 2D or 3D matrix holding the data points.

Stride and offset
As with arrays in C, thread blocks can be thought of as 2D structures. However, for 2D thread blocks, we need to introduce some new concepts. Just like in array indexing, to index into a Y element of a 2D array, you need to know the width of the array, the number of X elements. Consider the array in Figure 5.10.
The width of the array is referred to as the stride of the memory access. The offset is the column value being accessed, starting at the left, which is always element 0. Thus, you have array element 5 being accessed with the index [1][0] or via the address calculation (row × (sizeof(array_element) × width)) + (sizeof(array_element) × offset). This is the calculation the compiler effectively uses, in an optimized form, when you do multidimensional array indexing in C code.
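A minimal sketch of that calculation, expressed in elements rather than bytes (the names are illustrative), is:

#define ARRAY_WIDTH 5 /* the stride, in elements */

unsigned int element_index(const unsigned int row, const unsigned int offset)
{
	/* Same calculation the compiler performs for multidimensional indexing */
	return (row * ARRAY_WIDTH) + offset;
}

/* element_index(1, 0) is 5, matching array element 5 in Figure 5.10 */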

FIGURE 5.10
Array mapping to elements: a five-element-wide array in which elements 0 to 4 sit in row Y=0, elements 5 to 9 in row Y=1, and elements 10 to 14 in row Y=2.


Now, how is this relevant to threads and blocks in CUDA? CUDA is designed to allow for data
decomposition into parallel threads and blocks. It allows you to define 1D, 2D, or 3D indexes (Y × X × T) when referring to the parallel structure of the program. This maps directly onto the way a typical area
of memory is set out, allowing the data you are processing to be allocated to individual SMs. The process
of keeping data close to the processor hugely increases performance, both on the GPU and CPU.
However, there is one caveat you must be aware of when laying out such arrays. The width value of
the array must always be a multiple of the warp size. If it is not, pad the array to the next largest
multiple of the warp size. Padding to the next multiple of the warp size should introduce only a very
modest increase in the size of the dataset. Be aware, however, you’ll need to deal with the padded
boundary, or halo cells, differently than the rest of the cells. You can do this using divergence in the
execution flow (e.g., using an if statement) or you can simply calculate the padded cells and discard
the result. We’ll cover divergence and the problems it causes later in the book.
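A minimal sketch of the padding calculation (the helper name is mine) simply rounds the width up to the next multiple of the warp size:

unsigned int pad_to_warp_size(const unsigned int width, const unsigned int warp_size)
{
	/* Round width up to the next multiple of warp_size */
	return ((width + (warp_size - 1)) / warp_size) * warp_size;
}

/* e.g. pad_to_warp_size(1000, 32) returns 1024 */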

X and Y thread indexes
Having a 2D array in terms of blocks means you get two thread indexes, as you will be accessing the
data in a 2D way:
const unsigned int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
const unsigned int idy = (blockIdx.y * blockDim.y) + threadIdx.y;
some_array[idy][idx] += 1.0;

Notice the use of blockDim.x and blockDim.y, which the CUDA runtime completes for you,
specifying the dimension on the X and Y axis. So let’s modify the existing program to work on
a 32 × 16 array. As you want to schedule four blocks, you can schedule them as stripes across the array,
or as squares within the array, as shown in Figure 5.11.
You could also rotate the striped version 90 degrees and have a column per thread block. Never do
this, as it will result in completely noncoalesced memory accesses that will drop the performance of
your application by an order of magnitude or more. Be careful when parallelizing loops so that the
access pattern always runs sequentially through memory in rows and never columns. This applies
equally to CPU and GPU code.
Now why might you choose the square layout over the rectangular layout? Well, there are two things to consider. The first is that threads within the same block can communicate using shared memory, which is a very quick way to cooperate with one another. The second consideration is memory access: you get marginally quicker memory access with a single 128-byte transaction instead of two 64-byte transactions, because accesses within a warp are coalesced and 128 bytes is the size of a cache line in the Fermi hardware. In the square layout, notice you have threads 0 to 15 mapped to one block, while the next memory location belongs to another block. As a consequence you get two transactions instead of one, as with the rectangular layout. However, if the array were slightly larger, say 64 × 16, then you would not see this issue, as you'd have 32 threads accessing contiguous memory, and thus a single 128-byte fetch from memory would be issued.
Use the following to modify the program to use either of the two layouts:
dim3 threads_rect(32,4);
dim3 blocks_rect(1,4);


or

dim3 threads_square(16,8);
dim3 blocks_square(2,2);

FIGURE 5.11
Alternative thread block layouts: four 128-thread blocks arranged over the 32 × 16 array either as 32 × 4 stripes or as 16 × 8 squares.

In either arrangement you have the same total number of threads (32 × 4 = 128, 16 × 8 = 128). It's simply the layout of the threads that is different.
The dim3 type is simply a special CUDA type that you have to use to create a 2D layout of threads. In the rectangle example, you're saying you want 32 threads along the X axis by 4 threads along the Y axis, within a single block. You're then saying you want the blocks to be laid out as one block wide by four blocks high.
You’ll need to invoke the kernel with
some_kernel_func<<< blocks_rect, threads_rect >>>(a, b, c);

or
some_kernel_func<<< blocks_square, threads_square >>>(a, b, c);

As you no longer want just a single thread ID, but an X and Y position, you’ll need to update the
kernel to reflect this. However, you also need to linearize the thread ID because there are situations
where you may want an absolute thread index. For this we need to introduce a couple of new concepts,
shown in Figure 5.12.
You can see a number of new parameters, which are:
gridDim.x–The size in blocks of the X dimension of the grid.
gridDim.y–The size in blocks of the Y dimension of the grid.
blockDim.x–The size in threads of the X dimension of a single block.
blockDim.y–The size in threads of the Y dimension of a single block.

threadIdx.x–The offset within a block of the X thread index.
threadIdx.y–The offset within a block of the Y thread index.

FIGURE 5.12
Grid, block, and thread dimensions.

You can work out the absolute thread index by working out the Y position and multiplying this by the number of threads in a row. You then simply add in the X offset from the start of the row. Thus, the thread index calculation is
thread_idx = ((gridDim.x * blockDim.x) * idy) + idx;

So you need to modify the kernel to additionally return the X and Y positions plus some other useful
bits of information, as follows:
__global__ void what_is_my_id_2d_A(
	unsigned int * const block_x,
	unsigned int * const block_y,
	unsigned int * const thread,
	unsigned int * const calc_thread,
	unsigned int * const x_thread,
	unsigned int * const y_thread,
	unsigned int * const grid_dimx,
	unsigned int * const block_dimx,
	unsigned int * const grid_dimy,
	unsigned int * const block_dimy)
{
	const unsigned int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
	const unsigned int idy = (blockIdx.y * blockDim.y) + threadIdx.y;
	const unsigned int thread_idx = ((gridDim.x * blockDim.x) * idy) + idx;

	block_x[thread_idx] = blockIdx.x;
	block_y[thread_idx] = blockIdx.y;
	thread[thread_idx] = threadIdx.x;
	calc_thread[thread_idx] = thread_idx;
	x_thread[thread_idx] = idx;
	y_thread[thread_idx] = idy;
	grid_dimx[thread_idx] = gridDim.x;
	block_dimx[thread_idx] = blockDim.x;
	grid_dimy[thread_idx] = gridDim.y;
	block_dimy[thread_idx] = blockDim.y;
}

We’ll call the kernel twice to demonstrate how you can arrange array blocks and threads.
As you’re now passing an additional dataset to compute, you need an additional cudaMalloc,
cudaFree, and cudaMemcpy to copy the data from the device. As you’re using two dimensions, you’ll
also need to modify the array size to allocate and transfer the correct size of data.
#define ARRAY_SIZE_X 32
#define ARRAY_SIZE_Y 16
#define ARRAY_SIZE_IN_BYTES ((ARRAY_SIZE_X) * (ARRAY_SIZE_Y) * (sizeof(unsigned int)))
/* Declare statically eleven arrays of ARRAY_SIZE_X by ARRAY_SIZE_Y each */
unsigned int cpu_block_x[ARRAY_SIZE_Y][ARRAY_SIZE_X];
unsigned int cpu_block_y[ARRAY_SIZE_Y][ARRAY_SIZE_X];
unsigned int cpu_thread[ARRAY_SIZE_Y][ARRAY_SIZE_X];
unsigned int cpu_warp[ARRAY_SIZE_Y][ARRAY_SIZE_X];
unsigned int cpu_calc_thread[ARRAY_SIZE_Y][ARRAY_SIZE_X];
unsigned int cpu_xthread[ARRAY_SIZE_Y][ARRAY_SIZE_X];
unsigned int cpu_ythread[ARRAY_SIZE_Y][ARRAY_SIZE_X];
unsigned int cpu_grid_dimx[ARRAY_SIZE_Y][ARRAY_SIZE_X];
unsigned int cpu_block_dimx[ARRAY_SIZE_Y][ARRAY_SIZE_X];
unsigned int cpu_grid_dimy[ARRAY_SIZE_Y][ARRAY_SIZE_X];
unsigned int cpu_block_dimy[ARRAY_SIZE_Y][ARRAY_SIZE_X];
int main(void)
{
/* Total thread count = 32 * 4 = 128 */
const dim3 threads_rect(32, 4); /* 32 * 4 */
const dim3 blocks_rect(1,4);


/* Total thread count = 16 * 8 = 128 */
const dim3 threads_square(16, 8); /* 16 * 8 */
const dim3 blocks_square(2,2);
/* Needed to wait for a character at exit */
char ch;
/* Declare pointers for GPU based params */
unsigned int * gpu_block_x;
unsigned int * gpu_block_y;
unsigned int * gpu_thread;
unsigned int * gpu_warp;
unsigned int * gpu_calc_thread;
unsigned int * gpu_xthread;
unsigned int * gpu_ythread;
unsigned int * gpu_grid_dimx;
unsigned int * gpu_block_dimx;
unsigned int * gpu_grid_dimy;
unsigned int * gpu_block_dimy;
/* Allocate the arrays on the GPU */
cudaMalloc((void **)&gpu_block_x, ARRAY_SIZE_IN_BYTES);
cudaMalloc((void **)&gpu_block_y, ARRAY_SIZE_IN_BYTES);
cudaMalloc((void **)&gpu_thread, ARRAY_SIZE_IN_BYTES);
cudaMalloc((void **)&gpu_calc_thread, ARRAY_SIZE_IN_BYTES);
cudaMalloc((void **)&gpu_xthread, ARRAY_SIZE_IN_BYTES);
cudaMalloc((void **)&gpu_ythread, ARRAY_SIZE_IN_BYTES);
cudaMalloc((void **)&gpu_grid_dimx, ARRAY_SIZE_IN_BYTES);
cudaMalloc((void **)&gpu_block_dimx, ARRAY_SIZE_IN_BYTES);
cudaMalloc((void **)&gpu_grid_dimy, ARRAY_SIZE_IN_BYTES);
cudaMalloc((void **)&gpu_block_dimy, ARRAY_SIZE_IN_BYTES);

for (int kernel=0; kernel < 2; kernel++)
{
	switch (kernel)
	{
		case 0:
		{
			/* Execute our kernel */
			what_is_my_id_2d_A<<<blocks_rect, threads_rect>>>(gpu_block_x, gpu_block_y,
				gpu_thread, gpu_calc_thread, gpu_xthread, gpu_ythread, gpu_grid_dimx,
				gpu_block_dimx, gpu_grid_dimy, gpu_block_dimy);
		} break;

		case 1:
		{
			/* Execute our kernel */
			what_is_my_id_2d_A<<<blocks_square, threads_square>>>(gpu_block_x, gpu_block_y,
				gpu_thread, gpu_calc_thread, gpu_xthread, gpu_ythread, gpu_grid_dimx,
				gpu_block_dimx, gpu_grid_dimy, gpu_block_dimy);
		} break;

		default: exit(1); break;
	}
/* Copy back the gpu results to the CPU */
cudaMemcpy(cpu_block_x, gpu_block_x, ARRAY_SIZE_IN_BYTES,
cudaMemcpyDeviceToHost);
cudaMemcpy(cpu_block_y, gpu_block_y, ARRAY_SIZE_IN_BYTES,
cudaMemcpyDeviceToHost);
cudaMemcpy(cpu_thread, gpu_thread, ARRAY_SIZE_IN_BYTES,
cudaMemcpyDeviceToHost);
cudaMemcpy(cpu_calc_thread, gpu_calc_thread, ARRAY_SIZE_IN_BYTES,
cudaMemcpyDeviceToHost);
cudaMemcpy(cpu_xthread, gpu_xthread, ARRAY_SIZE_IN_BYTES,
cudaMemcpyDeviceToHost);
cudaMemcpy(cpu_ythread, gpu_ythread, ARRAY_SIZE_IN_BYTES,
cudaMemcpyDeviceToHost);
cudaMemcpy(cpu_grid_dimx, gpu_grid_dimx, ARRAY_SIZE_IN_BYTES,
cudaMemcpyDeviceToHost);
cudaMemcpy(cpu_block_dimx,gpu_block_dimx, ARRAY_SIZE_IN_BYTES,
cudaMemcpyDeviceToHost);
cudaMemcpy(cpu_grid_dimy, gpu_grid_dimy, ARRAY_SIZE_IN_BYTES,
cudaMemcpyDeviceToHost);
cudaMemcpy(cpu_block_dimy, gpu_block_dimy, ARRAY_SIZE_IN_BYTES,
cudaMemcpyDeviceToHost);
printf("\nKernel %d\n", kernel);
/* Iterate through the arrays and print */
for (int y=0; y < ARRAY_SIZE_Y; y++)
{
	for (int x=0; x < ARRAY_SIZE_X; x++)
	{
		printf("CT: %2u BKX: %1u BKY: %1u TID: %2u YTID: %2u XTID: %2u GDX: %1u BDX: %1u GDY %1u BDY %1u\n",
			cpu_calc_thread[y][x], cpu_block_x[y][x], cpu_block_y[y][x],
			cpu_thread[y][x], cpu_ythread[y][x], cpu_xthread[y][x], cpu_grid_dimx[y][x],
			cpu_block_dimx[y][x], cpu_grid_dimy[y][x], cpu_block_dimy[y][x]);

		/* Wait for any key so we can see the console window */
		ch = getch();
	}
}
/* Wait for any key so we can see the console window */
printf("Press any key to continue\n");


ch = getch();
}
/* Free the arrays on the GPU as now we’re done with them */
cudaFree(gpu_block_x);
cudaFree(gpu_block_y);
cudaFree(gpu_thread);
cudaFree(gpu_calc_thread);
cudaFree(gpu_xthread);
cudaFree(gpu_ythread);
cudaFree(gpu_grid_dimx);
cudaFree(gpu_block_dimx);
cudaFree(gpu_grid_dimy);
cudaFree(gpu_block_dimy);
}

The output is too large to list here. If you run the program in the downloadable source code section,
you’ll see you iterate through the threads and blocks as illustrated in Figure 5.12.

WARPS
We touched a little on warp scheduling when talking about threads. Warps are the basic unit of
execution on the GPU. The GPU is effectively a collection of SIMD vector processors. Each group of
threads, or warps, is executed together. This means, in the ideal case, only one fetch from memory for
the current instruction and a broadcast of that instruction to the entire set of SPs in the warp. This is
much more efficient than the CPU model, which fetches independent execution streams to support
task-level parallelism. In the CPU model, for every core you have running an independent task, you
can conceptually divide the memory bandwidth, and thus the effective instruction throughput, by the
number of cores. In practice, on CPUs, the multilevel, on-chip caches hide a lot of this, provided the program fits within the cache.
You find vector-type instructions on conventional CPUs, in the form of SSE, MMX, and AVX
instructions. These execute the same single instruction on multiple data operands. Thus, you can
say, for N values, increment all values by one. With SSE, you get 128-bit registers, so you can
operate on four parameters at any given time. AVX extends this to 256 bits. This is quite powerful,
but until recently, unless you were using the Intel compiler, there was little native support for this
type of optimization. AVX is now supported by the current GNU gcc compiler. Microsoft Visual
Studio 2010 supports it through the use of a “/arch:AVX” compiler switch. Given this lack of
support until relatively recently, vector-type instructions are not as widely used as they could be,
although this is likely to change significantly now that support is no longer restricted to the Intel
compiler.
With GPU programming, you have no choice: It’s vector architecture and expects you to write code
that runs on thousands of threads. You can actually write a single-thread GPU program with a simple
if statement checking if the thread ID is zero, but this will get you terrible performance compared with
the CPU. It can, however, be useful just to get an initial serial CPU implementation working. This approach allows you to check things, such as whether memory copying to/from the GPU is working
correctly, before introducing parallelism into the application.
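A hedged sketch of that single-thread approach (the kernel name is mine, and this is purely a debugging aid, not something you would keep) looks like this:

/* Only thread 0 of block 0 does any work; every other thread exits immediately */
__global__ void serial_debug_kernel(int * const a,
                                    const int * const b,
                                    const int * const c,
                                    const unsigned int num_elements)
{
	if ((blockIdx.x == 0) && (threadIdx.x == 0))
	{
		for (unsigned int i = 0; i < num_elements; i++)
			a[i] = b[i] * c[i];
	}
}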
Warps on the GPU are currently 32 elements, although nVidia reserves the right to change this in
the future. Therefore, they provide an intrinsic variable, warpSize, for you to use to obtain the warp
size on the current hardware. As with any magic number, you should not hard code an assumed warp
size of 32. Many SSE-optimized programs were hard coded to assume an SSE size of 128 bits. When
AVX was released, simply recompiling the code was not sufficient. Don’t make the same mistake and
hard code such details into your programs.
So why should you be interested in the size of a warp? The reasons are many, so we’ll look briefly
at each.

Branching
The first reason to be interested in the size of a warp is because of branching. Because a warp is a single
unit of execution, branching (e.g., if, else, for, while, do, switch, etc.) causes a divergence in the
flow of execution. On a CPU there is complex hardware to do branch prediction, predicting from past
execution which path a given piece of code will take. The instruction flow is then prefetched and
pumped into the CPU instruction pipeline ahead of time. Assuming the prediction is correct, the CPU
avoids a “stall event.” Stall events are very bad, as the CPU then has to undo any speculative instruction
execution, fetch instructions from the other branch path, and refill the pipeline.
The GPU is a much simpler device and has none of this complexity. It simply executes one path of
the branch and then the other. Those threads that take the branch are executed and those that do not are
marked as inactive. Once the taken branch is resolved, the other side of the branch is executed, until the
threads converge once more. Take the following code:
__global__ void some_func(void)
{
if (some_condition)
{
action_a();
}
else
{
action_b();
}
}

As soon as you evaluate some_condition, you will have divergence in at least one block or there is
no point in having the test in the program. Let’s say all the even thread numbers take the true path and
all the odd threads take the false path. The warp scoreboard then looks as shown in Figure 5.13.
FIGURE 5.13
Predicate thread/branch selection: even-numbered threads are marked + (true path) and odd-numbered threads are marked - (false path).

For simplicity, I’ve drawn only 16 of the 32 threads, and you’ll see why in a minute. All those
threads marked + take the true or positive path and all those marked - take the false or negative path.
As the hardware can only fetch a single instruction stream per warp, half of the threads stall and
half progress down one path. This is really bad news as you now have only 50% utilization of the
hardware. This is a bit like having a dual-core CPU and only using one core. Many lazy programmers
get away with it, but the performance is terrible compared to what it could be.
Now as it happens, there is a trick here that can avoid this issue. The actual scheduler in terms of
instruction execution is half-warp based, not warp based. This means if you can arrange the divergence
to fall on a half warp (16-thread) boundary, you can actually execute both sides of the branch
condition, the if-else construct in the example program. You can achieve 100% utilization of the
device in this way.
If you have two types of processing of the data, interleaving the data on a 16-word boundary can
result in quite good performance. The code would simply branch on the thread ID, as follows:
if ((thread_idx % 32) < 16)
{
action_a();
}
else
{
action_b();
}

The modulus operator in C (%) returns the remainder of the integer division of the operand. In
effect, you count from 0 to 31 and then loop back to 0 again. Ideally, the function action_a() has each
of its 16 threads access a single float or integer value. This causes a single 64-byte memory fetch. The
following half warp does the same and thus you issue a single 128-byte memory fetch, which it just so
happens is the size of the cache line and therefore the optimal memory fetch size for a warp.

GPU utilization
So why else might you be interested in warps? To avoid underutilizing the GPU. The CUDA model
uses huge numbers of threads to hide memory latency (the time it takes for a memory request to
come back). Typically, latency to the global memory (DRAM) is around 400–600 cycles. During
this time the GPU is busy doing other tasks, rather than idly waiting for the memory fetch to
complete.
When you allocate a kernel to a GPU, the maximum number of threads you can put onto an SM is
currently 768 to 2048, depending on the compute level. This is implementation dependent, so it may
change with future hardware revisions. Take a quick look at utilization with different numbers of
threads in Table 5.1.
Compute 1.0 and 1.2 devices are the G80/G92 series devices. Compute 1.3 devices are the GT200
series. Compute 2.0/2.1 devices are the Fermi range. Compute 3.0 is Kepler.
Notice that the only consistent value that gets you 100% utilization across all levels of the hardware
is 256 threads. Thus, for maximum compatibility, you should aim for either 192 or 256 threads. The
dataset should, however, match the thread layout to achieve certain optimizations. You should,
therefore, also consider the 192-thread layout where you have a three-point data layout.


Table 5.1 Utilization %

Threads per Block /
Compute Capability     1.0    1.1    1.2    1.3    2.0    2.1    3.0
  64                    67     67     50     50     33     33     50
  96                   100    100     75     75     50     50     75
 128                   100    100    100    100     67     67    100
 192                   100    100     94     94    100    100     94
 256                   100    100    100    100    100    100    100
 384                   100    100     75     75    100    100     94
 512                    67     67    100    100    100    100    100
 768                   N/A    N/A    N/A    N/A    100    100     75
1024                   N/A    N/A    N/A    N/A     67     67    100

Another alternative to having a fixed number of threads is to simply look up the compute level from
the device and select the smallest number of threads that gives the highest device utilization.
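A hedged sketch of that idea (the function name is mine; the values come from Table 5.1) queries the compute capability through the runtime API and returns the smallest block size that reaches 100% utilization:

unsigned int pick_threads_per_block(const int device)
{
	cudaDeviceProp prop;
	cudaGetDeviceProperties(&prop, device);

	if (prop.major == 1 && prop.minor <= 1) return 96;  /* compute 1.0 / 1.1 */
	if (prop.major == 1)                    return 128; /* compute 1.2 / 1.3 */
	if (prop.major == 2)                    return 192; /* compute 2.0 / 2.1 */

	return 128;                                         /* compute 3.0 */
}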
Now you might want to also consider the number of blocks that can be scheduled into a given SM.
This really only makes a difference when you have synchronization points in the kernel. These
are points where every thread must wait on every other thread to reach the same point, for example,
when you’re doing a staged read and all threads must do the read. Due to the nature of the execution,
some warps may make good progress and some may make poor progress to the
synchronization point.
The time, or latency, to execute a given block is undefined. This is not good from a load balancing
point of view. You want lots of threads available to be run. With 256 threads, 32 threads per warp give
you 8 warps on compute 2.x hardware. You can schedule up to 24 warps (32 × 24 = 768 threads) at any one time into a given SM for compute 1.x devices and 48 (32 × 48 = 1536 threads) for compute 2.x devices. A block cannot be retired from an SM until it's completed its entire execution. With compute 2.x devices or higher that support 1024 threads per block, you can be waiting for that single
warp to complete while all other warps are idle, effectively making the SM also idle.
Thus, the larger the thread block, the more potential you have to wait for a slow warp to catch up,
because the GPU can’t continue until all threads have passed the checkpoint. Therefore, you might
have chosen a smaller number of threads, say 128 threads in the past, to reduce this potential waiting
time. However, this hurts the performance on Fermi-level hardware as the device utilization drops to
two-thirds. As you can see from Table 5.1, on compute 2.0 devices (Fermi), you need to have at least
192 threads per block to make good use of the SM.
However, you should not get too tied up concerning the number of warps, as they are really just
a measure of the overall number of threads present on the SMs. Table 5.3 shows the total number of
threads running, and it’s this total number that is really the interesting part, along with the percentage
utilization shown in Table 5.1.
Notice that with 128 or fewer threads per block, as you move from the compute 1.3 hardware (the GT200
series) to the compute 2.x hardware (Fermi), you see no difference in the total number of threads
running. This is because there are limits to the number of blocks an SM can schedule. The number of


Table 5.2 Blocks per SM

Threads per Block /
Compute Capability     1.0    1.1    1.2    1.3    2.0    2.1    3.0
  64                     8      8      8      8      8      8     16
  96                     8      8      8      8      8      8     12
 128                     6      6      8      8      8      8     16
 192                     4      4      5      5      8      8     10
 256                     3      3      4      4      6      6      8
 384                     2      2      2      2      4      4      5
 512                     1      1      2      2      3      3      4
 768                   N/A    N/A      1      1      2      2      2
1024                   N/A    N/A      1      1      1      1      2

Table 5.3 Total Threads per SM

Threads per Block /
Compute Capability     1.0    1.1    1.2    1.3    2.0    2.1    3.0
  64                   512    512    512    512    512    512   1024
  96                   768    768    768    768    768    768   1536
 128                   768    768   1024   1024   1024   1024   2048
 192                   768    768    960    960   1536   1536   1920
 256                   768    768   1024   1024   1536   1536   2048
 384                   768    768    768    768   1536   1536   1920
 512                   512    512   1024   1024   1536   1536   2048
 768                   N/A    N/A    N/A    N/A   1536   1536   1536
1024                   N/A    N/A    N/A    N/A   1024   1024   2048

threads an SM could support was increased, but not the number of blocks. Thus, to achieve better
scaling you need to ensure you have at least 192 threads and preferably considerably more.

BLOCK SCHEDULING
Suppose you have 1024 blocks to schedule, and eight SMs to schedule these onto. With the Fermi
hardware, each SM can accept up to 8 blocks, but only if there is a low thread count per block. With
a reasonable thread count, you typically see 6 to 8 blocks per SM.
Now 1024 blocks divided between six SMs is 170 complete blocks each, plus 4 blocks left over.
We’ll look at the leftover blocks in a minute, because it causes an interesting problem.


With the 1020 blocks that can be allocated to the SMs, how should they be allocated? The hardware
could allocate 6 blocks to the first SM, 6 to the second, and so on. Alternatively, it could distribute
1 block to each SM in turn, so SM 0 gets block 0, SM 1 gets block 1, SM 2 gets block 2, etc. NVIDIA
doesn’t specify what method it uses, but it’s fairly likely to be the latter to achieve a reasonable load
balance across the SMs.
If you have 19 blocks and four SMs, allocating blocks to an SM until it’s full is not a good idea. The
first three SMs would get 6 blocks each, and the last SM, a single block. The last SM would likely
finish quickly and sit idle waiting for the other SMs. The utilization of the available hardware is poor.
If you allocate blocks to alternate SMs on a rotating basis, each SM gets 4 blocks (4 SMs × 4 blocks = 16 total) and three SMs get an additional block each. Assuming each block takes the same time to execute, you have reduced the execution time by 17%, simply by balancing the blocks among
the SMs, rather than overloading some SMs while underloading others.
Now in practice you will usually have thousands or tens of thousands of blocks to get through in
a typical application. Having done the initial allocation of blocks to an SM, the block dispatcher is then
idle until one block finishes on any of the SMs. At this point the block is retired and the resources used
by that block become free. As all the blocks are the same size, any block in the list of waiting blocks
can be scheduled. The order of execution of blocks is deliberately undefined and there should be no
implicit assumption that blocks will execute in any order when programming a solution to a problem.
This can cause serious problems if there is some associative operation being performed, such as
floating-point addition, which is not in practice associative. The order of execution of adds through
an array in floating-point math will affect the result. This is due to the rounding errors and the way in
which floating-point math works. The result is correct in all cases. It’s not a parallel execution
problem, but an ordering problem. You see exactly the same issue with single-thread CPU code. If
you add a set of random numbers from bottom to top, or top to bottom, in a floating-point array on
a CPU or GPU, you will get different answers. Perhaps worse still is that on a GPU, due to the
undefined block scheduling, multiple runs on the same data can result in different but correct
answers. There are methods to deal with this and it is something we cover later in the book. So for
now, just be aware that because the result is different than before, it doesn’t necessarily make the
result incorrect.
Coming back to the problem of having leftover blocks, you will have this scenario anytime the
number of blocks is not a multiple of the number of SMs. Typically you see CUDA devices ship with an
odd number of SMs, due to it being difficult to make large, complex processors. As the physical amount
of silicon used in creating a processor increases, the likelihood there is a failure in some section
increases considerably. NVIDIA, like many processor manufacturers, simply disables faulty SMs and
ships devices as lower-specification units. This increases yields and provides some economic value to
otherwise faulty devices. However, for the programmer, this means the total number of SMs is not
always even a multiple of two. The Fermi 480 series cards, and also the Tesla S2050/S2070/C2050/
C2070 series, have a 16 SM device with 1 SM disabled, thus making 15 SMs. This was resolved in the
580 series, but this problem is likely to be repeated as we see future GPU generations released.
Having a few leftover blocks is really only an issue if you have a very long kernel and need to wait
for each kernel to complete. You might see this, for example, in a finite time step simulation. If you had
16 blocks, assuming a Fermi 480 series card, 15 blocks would be allocated to each of the SMs. The
remaining block will be scheduled only after one of the other 15 blocks has completed. If each kernel
took 10 minutes to execute, it's likely all the blocks would finish at approximately the same time. The GPU would then schedule one additional block and the complete kernel invocation would wait for an
additional 10 minutes for this single block to execute. At the same time, the other 14 available SMs
would be idle. The solution to this problem is to provide better granularity to break down the small
number of blocks into a much larger number.
In a server environment you may not have just 15 SMs, but actually multiple nodes each having
multiple GPUs. If their only task is this kernel, then they will likely sit idle toward the end of the kernel
invocation. In this instance it might prove better to redesign the kernel in some way to ensure the
number of blocks is an exact multiple of the number of SMs on each node.
From a load balancing perspective, this problem is clearly not good. As a consequence, in the later
CUDA runtime, you have support for overlapping kernels and running multiple, separate kernels on
the same CUDA device. Using this method, you can maintain the throughput if you have more than one
source of jobs to schedule onto the cluster of GPUs. As the CUDA devices start to idle, they instead
pick up another kernel from a stream of available kernels.

A PRACTICAL EXAMPLE: HISTOGRAMS
Histograms are commonly found in programming problems. They work by counting the distribution of
data over a number of “bins.” Where the data point contains a value that is associated with a given bin,
the value in that bin is incremented.
In the simplest example, you have 256 bins and data that range from 0 to 255. You iterate through
an array of bytes. If the value of the element in the array is 0, you increment bin 0. If the value of the
element is 10, you increment bin 10, etc.
The algorithm from a serial perspective is quite simple:
for (unsigned int i=0; i < max; i++)
{
	bin[array[i]]++;
}

Here you extract the value from the array, indexed by i. You then increment the appropriate bin
using the ++ operator.
The serial implementation suffers from a problem when you convert it to a parallel problem. If you
execute this with 256 threads, you get more than one thread simultaneously incrementing the value in
the same bin.
If you look at how the C language gets converted to assembler, you see it can take a series of assembler instructions to execute this code (see the sketch after this list). These would break down into

1. Read the value from the array into a register.
2. Work out the base address and offset to the correct bin element.
3. Fetch the existing bin value.
4. Increment the bin value by one.
5. Write the new bin value back to the bin in memory.
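To make that concrete, here is a hedged C sketch of those five steps written out as separate statements (the variable names are illustrative); nothing stops another thread from running between steps three and five:

const unsigned char value = array[i];        /* step 1: read the data value              */
unsigned int * const bin_ptr = &bin[value];  /* step 2: address of the target bin        */
unsigned int tmp = *bin_ptr;                 /* step 3: fetch the existing count         */
tmp = tmp + 1;                               /* step 4: increment it                     */
*bin_ptr = tmp;                              /* step 5: write it back (last writer wins) */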

The problem is steps three, four, and five are not atomic. An atomic operation is one that cannot be
interrupted prior to completion. If you execute this pseudocode in a lockstep manner, as CUDA does with its thread model, you hit a problem. Two or more threads fetch the same value at step three. They
all increment it and write it back. The last thread to do the write wins. The value should have been
incremented N times, but it’s incremented only once. All threads read the same value to apply the
increment to, thus you lose N increments to the value.
The problem here is that you have a data dependency you do not see on the serial execution version.
Each increment of the bin value must complete before the read and increment by the next thread. You
have a shared resource between threads.
This is not an uncommon problem and CUDA provides a primitive for this called
atomicAdd(&value, 1);

This operation guarantees the addition operation is serialized among all threads.
Having now solved this problem, you come to the real choice heredhow to structure the tasks you
have to cover into threads, blocks, and grids. There are two approaches: the task decomposition model
or the data decomposition model. Both generally need to be considered.
With the task decomposition model, you simply allocate one thread to every element in the input array
and have it do an atomic add. This is the simplest solution to program, but has some major disadvantages. You must remember that this is actually a shared resource. If you have 256 bins and an array
of 1024 elements, assuming an equal distribution, you have 4 elements contending for each bin. With
large arrays (there is no point in processing small arrays with CUDA) this problem becomes the
dominant factor determining the total execution time.
If you assume an equal distribution of values in the histogram, which is often not the case,
the number of elements contending for any single bin is simply the array size in elements
divided by the number of bins. With a 512 MB array (536,870,912 single-byte elements) you would have 2,097,152 elements contending for each bin. In the worst case, all elements write to the same
bin, so you have, in effect, a serial program due to the serialization of the atomic memory
writes.
In either example, the execution time is limited by the hardware’s ability to handle this contention
and the read/write memory bandwidth.
Let’s see how this works in reality. Here is the GPU program to do this.
/* Each thread writes to one block of 256 elements of global memory and contends for
write access */
__global__ void myhistogram256Kernel_01(
	const unsigned char * const d_hist_data,
	unsigned int * const d_bin_data)
{
	/* Work out our thread id */
	const unsigned int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
	const unsigned int idy = (blockIdx.y * blockDim.y) + threadIdx.y;
	const unsigned int tid = idx + idy * blockDim.x * gridDim.x;

	/* Fetch the data value */
	const unsigned char value = d_hist_data[tid];

	atomicAdd(&(d_bin_data[value]),1);
}

With a GTX460 card, we measured 1025 MB/s with this approach. What is interesting is that it
does not scale with the number of elements in the array. You get a consistently poor performance,
regardless of the array size. Note that the GPU used for this test, a 1 GB GTX460, has a memory
bandwidth of 115 GB/s, so this shows just how terrible a performance you can achieve by implementing the naive solution.
This figure, although bad, simply tells you that you are limited by some factor and it’s your job as
a programmer to figure out which factor and eliminate it. The most likely factor affecting performance
in this type of program is memory bandwidth. You are fetching N values from the input array and
compressing those down to N writes to a small, 1 K (256 elements × 4 bytes per integer counter)
memory section.
If you look at the memory reads first, you will see each thread reads one byte element of the array.
Reads are combined together (coalesced) at the half-warp level (16 threads). The minimum transfer
size is 32 bytes, so you’re wasting read memory bandwidth by about 50%, which is pretty poor. The
optimal memory fetch for a half warp is the maximum supported size, which is 128 bytes. For this,
each thread has to fetch 4 bytes of memory. You can do this by having each thread process four
histogram entries instead of one.
We can issue a 4-byte read, by reading a single integer, and then extracting the component parts of
that integer as shown in Figure 5.14. This should provide better read coalescing and therefore better
performance. The modified kernel is as follows:
/* Each read is 4 bytes, not one, 32 x 4 = 128 byte reads */
__global__ void myhistogram256Kernel_02(
	const unsigned int * const d_hist_data,
	unsigned int * const d_bin_data)
{
	/* Work out our thread id */
	const unsigned int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
	const unsigned int idy = (blockIdx.y * blockDim.y) + threadIdx.y;
	const unsigned int tid = idx + idy * blockDim.x * gridDim.x;

	/* Fetch the data value as 32 bit */
	const unsigned int value_u32 = d_hist_data[tid];

	atomicAdd(&(d_bin_data[ ((value_u32 & 0x000000FF)       ) ]),1);
	atomicAdd(&(d_bin_data[ ((value_u32 & 0x0000FF00) >> 8  ) ]),1);
	atomicAdd(&(d_bin_data[ ((value_u32 & 0x00FF0000) >> 16 ) ]),1);
	atomicAdd(&(d_bin_data[ ((value_u32 & 0xFF000000) >> 24 ) ]),1);
}

FIGURE 5.14
Word-to-byte mapping: one 32-bit word holding bytes 0 to 3.

When running the kernel, we notice that for all our effort we have achieved zero speedup. This is, in fact, quite common when trying to optimize programs. It's a pretty strong indicator you did not understand the cause of the bottleneck.
One issue to note here is that compute 2.x hardware does not suffer from only being able to coalesce data from a half warp and can do full-warp coalescing. Thus, on the test device, a GTX460 (compute 2.1 hardware), the 32 single-byte fetches issued by a single warp were already coalesced into a 32-byte read.
The obvious candidate is the atomic write operation, rather than the usual memory bandwidth
culprit. For this you need to look at the alternative approach given by the data decomposition
model. Here you look at the data flow side of the equation, looking for data reuse and optimizing
the data size into that which works effectively with shared resources, such as a cache or shared
memory.
You can see that the contention for the 256 bins is a problem. With multiple blocks writing to
memory from multiple SMs, the hardware needs to sync the value of the bin array across the
caches in all processors. To do this it needs to fetch the current value from memory, increment it,
and then write it back. There is some potential for this to be held permanently in the L2 cache,
which is shared between the SMs in the Fermi generation of hardware. With compute 1.x hardware, you are reading and writing to the global memory, so this approach is an order of magnitude
slower.
Even if you can use the L2 cache on the Fermi hardware, you are still having to go out of the SM to
sync with all the other SMs. On top of this the write pattern you are generating is a scattered pattern,
dependent very much on the nature of the input data for the histogram. This means no or very little
coalescing, which again badly hurts performance.
An alternative approach is to build the histogram within each SM and then write out the histogram
to the main memory at the end. This is the approach you must always try to achieve, whether for CPU
or GPU programming. The more you make use of resources close to the processor (SM in this case),
the faster the program runs.
We mentioned earlier that we can use shared memory, a special form of memory that is on chip and
thus very fast. You can create a 256-bin histogram in the shared memory and then do the atomic add at
the end to the global memory. Assuming you process only one histogram per block, you do not


decrease the number of global memory reads or writes, but you do coalesce all the writes to memory.
The kernel for this approach is as follows:
__shared__ unsigned int d_bin_data_shared[256];

/* Each read is 4 bytes, not one, 32 x 4 = 128 byte reads */
__global__ void myhistogram256Kernel_03(
const unsigned int * const d_hist_data,
unsigned int * const d_bin_data)
{
/* Work out our thread id */
const unsigned int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
const unsigned int idy = (blockIdx.y * blockDim.y) + threadIdx.y;
const unsigned int tid = idx + idy * blockDim.x * gridDim.x;

/* Clear shared memory */
d_bin_data_shared[threadIdx.x] = 0;

/* Fetch the data value as 32 bit */
const unsigned int value_u32 = d_hist_data[tid];

/* Wait for all threads to clear shared memory */
__syncthreads();

atomicAdd(&(d_bin_data_shared[ ((value_u32 & 0x000000FF)       ) ]),1);
atomicAdd(&(d_bin_data_shared[ ((value_u32 & 0x0000FF00) >> 8  ) ]),1);
atomicAdd(&(d_bin_data_shared[ ((value_u32 & 0x00FF0000) >> 16 ) ]),1);
atomicAdd(&(d_bin_data_shared[ ((value_u32 & 0xFF000000) >> 24 ) ]),1);

/* Wait for all threads to update shared memory */
__syncthreads();

/* Then write the accumulated data back to global memory in blocks, not scattered */
atomicAdd(&(d_bin_data[threadIdx.x]), d_bin_data_shared[threadIdx.x]);
}

The kernel must do an additional clear operation on the shared memory, as you otherwise have
random data left there from other kernels. Notice also you need to wait (__syncthreads) until all the
threads in a block have managed to clear their memory cell in the shared memory before you start
allowing threads to update any of the shared memory cells. You need to do the same sync operation at
the end, to ensure every thread has completed before you write the result back to the global memory.
You should see that you suddenly get a sixfold jump in performance, simply by virtue of
arranging the writes in order so they can be coalesced. You can now achieve 6800 MB/s processing
speed. Note, however, you can only do this with compute 1.2 or higher devices as only these support
shared memory atomic operations.
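If you need to guard against running this code on hardware without shared memory atomics, you can query the device's compute capability at runtime. The following is a minimal sketch using the standard runtime API; the function name is illustrative.

#include <cuda_runtime.h>

/* Returns nonzero if the current device supports shared memory atomics
   (compute capability 1.2 or higher). */
int supports_shared_atomics(void)
{
  int dev = 0;
  cudaDeviceProp prop;

  cudaGetDevice(&dev);
  cudaGetDeviceProperties(&prop, dev);

  if (prop.major > 1)
    return 1;
  return (prop.major == 1 && prop.minor >= 2);
}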
Now that you have the ordering correct, you need to look at reducing the global memory traffic. You
have to read every value from the source data, and you only read each value once. You are already using the


optimal transfer size for read accesses, so let’s look at the data being written. If you process N histograms
per block instead of one histogram per block you reduce the write bandwidth by a factor of N.
Table 5.4 shows the values achieved on the 512 MB histogram data when processing different values
of N with a Fermi 460 card (which contains seven SMs). You can see a peak of 7886 MB/s at an N value
of 64. The kernel is as follows:
/* Each read is 4 bytes, not one, 32 x 4 = 128 byte reads */
/* Accumulate into shared memory N times */
__global__ void myhistogram256Kernel_07(const unsigned int * const d_hist_data,
unsigned int * const d_bin_data,
unsigned int N)
{
/* Work out our thread id */
const unsigned int idx = (blockIdx.x * (blockDim.x*N) ) + threadIdx.x;
const unsigned int idy = (blockIdx.y * blockDim.y ) + threadIdx.y;
const unsigned int tid = idx + idy * (blockDim.x*N) * (gridDim.x);

/* Clear shared memory */
d_bin_data_shared[threadIdx.x] = 0;

/* Wait for all threads to clear shared memory */
__syncthreads();

for (unsigned int i=0, tid_offset=0; i < N; i++, tid_offset += 256)
{
const unsigned int value_u32 = d_hist_data[tid+tid_offset];

atomicAdd(&(d_bin_data_shared[ ((value_u32 & 0x000000FF)       ) ]),1);
atomicAdd(&(d_bin_data_shared[ ((value_u32 & 0x0000FF00) >> 8  ) ]),1);
atomicAdd(&(d_bin_data_shared[ ((value_u32 & 0x00FF0000) >> 16 ) ]),1);
atomicAdd(&(d_bin_data_shared[ ((value_u32 & 0xFF000000) >> 24 ) ]),1);
}

/* Wait for all threads to update shared memory */
__syncthreads();

/* Then write the accumulated data back to global memory in blocks, not scattered */
atomicAdd(&(d_bin_data[threadIdx.x]), d_bin_data_shared[threadIdx.x]);
}

Let's examine this a little, because it's important to understand what you are doing here. You
have a loop, i, that runs for N iterations. This is the number of times each block folds another
256-element chunk of data into its shared memory histogram. There are 256 threads invoked for the kernel, one for each
bin. As such, the only loop you need is a loop over the number of histograms to process. When
you've done one iteration, you move on 256 elements in memory to process the next histogram's worth of data
(tid_offset += 256).
Notice also that as you're using atomic operations throughout, you need sync points only at the start
and end of the kernel. Adding unnecessary synchronization points typically slows down the program,
but can lead to a more uniform access pattern in memory.
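To see where the Total Blocks column of Table 5.4 comes from, consider a minimal host-side sketch for launching this kernel. The function name and the assumption that the data size divides exactly are illustrative only.

#include <cuda_runtime.h>
#include <stddef.h>

/* Illustrative launch for the N-histograms-per-block kernel, not from the text.
   Each block has 256 threads, each thread reads 4 bytes per loop iteration, and
   the loop runs N times, so one block consumes 256 * 4 * N bytes of input. */
void launch_histogram_07(const unsigned int * const d_hist_data,
                         unsigned int * const d_bin_data,
                         const size_t num_bytes, /* assumed to divide exactly */
                         const unsigned int N)
{
  const unsigned int threads_per_block = 256;
  const unsigned int bytes_per_block = threads_per_block * 4 * N;
  const unsigned int num_blocks = (unsigned int)(num_bytes / bytes_per_block);

  cudaMemset(d_bin_data, 0, 256 * sizeof(unsigned int));
  myhistogram256Kernel_07<<<num_blocks, threads_per_block>>>(d_hist_data, d_bin_data, N);
}

With 512 MB of input this gives num_blocks = 524,288 / N, which is where the Total Blocks column of Table 5.4 comes from.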


Table 5.4 Histogram Results

Factor    MB/s    Total Blocks    Whole Blocks per SM    Remainder Blocks
1         6766    524,288         74,898                 3
2         7304    262,144         37,449                 1
4         7614    131,072         18,724                 6
8         7769    65,536          9362                   3
16        7835    32,768          4681                   1
32        7870    16,384          2340                   6
64        7886    8192            1170                   3
128       7884    4096            585                    1
256       7868    2048            292                    6
512       7809    1024            146                    3
1024      7737    512             73                     1
2048      7621    256             36                     6
4096      7093    128             18                     3
8192      6485    64              9                      1
16,384    6435    32              4                      6
32,768    5152    16              2                      3
65,536    2756    8               1                      1
Now what is interesting here is that, after you start to process 32 or more histograms per block, you
see no effective increase in throughput. The global memory write traffic halves every time you double
the value of N. If global memory bandwidth were indeed the problem, you would see a linear speedup
here for every factor of N you add. So what is going on?
The main problem is the atomic operations. Every thread must contend with every other thread for
access to the shared data area. The data pattern therefore has a huge influence on the execution time, which is
not a desirable property for a design.
We'll return to this issue later when we look at how you can write such algorithms without having
to use atomic operations.
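In the meantime, if you want to see the data-pattern effect for yourself, a simple host-side experiment is to fill the input buffer with uniform random bytes in one run and a single repeated byte value in another, then time the same kernel on both. The following is a sketch only, with an illustrative function name.

#include <stdlib.h>
#include <string.h>

/* Fill a host buffer with either a uniform random pattern (a good case for the
   shared memory histogram) or a single repeated value (the worst case, where
   every thread hits the same bin). */
void fill_test_pattern(unsigned char * const host_data, const size_t num_bytes,
                       const int worst_case)
{
  if (worst_case)
  {
    memset(host_data, 0x7F, num_bytes); /* every byte lands in the same bin */
  }
  else
  {
    for (size_t i = 0; i < num_bytes; i++)
      host_data[i] = (unsigned char)(rand() & 0xFF); /* bins hit roughly evenly */
  }
}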

CONCLUSION
We covered a lot in this chapter and you should now be familiar with how CUDA breaks tasks into
grids, blocks, and threads. We covered the scheduling of blocks and warps on the hardware and the
need to ensure you always have enough threads on the hardware.
The threading model used in CUDA is fundamental to understanding how to program GPUs
efficiently. You should understand how CPUs and GPUs are fundamentally different beasts to program,
but at the same time how they are related to one another.
You have seen how the arrangement of threads relative to the data you are going to process is important
and impacts performance. You have also seen, in particular with applications that need to share data, that it
is not always an easy task to parallelize a particular problem. You should note that often taking time to
consider the correct approach is somewhat more important than diving in with the first solution that
seems to fit.
We also covered the use of atomics and some of the problems of serialization these cause. We
touched on the problems branching can cause and you should have in the back of your mind the need to
ensure all threads follow the same control path. We look at atomics and branching in more detail later
in the book.
You have had some exposure to the extended C syntax used within CUDA and should feel
comfortable in writing a CUDA program with a clear understanding of what will happen.
By reading this chapter you have gained a great deal of knowledge and hopefully should no longer
feel that CUDA or parallel programming is a bit like a black art.

Questions
1. Identify the best and worst data pattern for the histogram algorithm developed in this chapter. Is
there a common usage case that is problematic? How might you overcome this?
2. Without running the algorithm, what do you think is the likely impact of running this code on older
hardware based on the G80 design?
3. When processing an array in memory on a CPU, is it best to traverse in row-column order or
column-row order? Does this change when you move the code to a GPU?
4. Consider a section of code that uses four blocks of 256 threads and the same code that uses one block
of 1024 threads. Which is likely to complete first and why? Each block uses four syncthreads()
calls at various points through the code. The blocks require no interblock cooperation.
5. What are the advantages and disadvantages of an SIMD-based implementation that we find in
GPUs versus the MIMD implementation we find in multicore CPUs?

Answers
1. The best case is uniform distribution of data. This is because this loads the buckets equally and you
therefore get an equal distribution of atomic operations on the available shared memory banks.
The worst case is identical data values. This causes all threads to continuously hit the same shared
memory bucket, causing serialization of the entire program through both the atomic operations
and bank conflicts in the shared memory.
Unfortunately, one very common usage is with sorted data. This provides a variation on the worst-case
usage. Here one bank after another gets continuously hit with atomic writes, effectively serializing
the problem.
One solution is to step through the dataset such that each iteration writes to a new bucket. This requires
knowledge of the data distribution. For example, consider the case of 256 data points modeling
a linear function using 32 buckets. Let’s assume data points 0 to 31 fall into the first bucket and
this is replicated for every bucket. By processing one value for each bucket, you can distribute
writes to the buckets and avoid contention. In this example, you would read data points 0, 32,
64, 96, 1, 33, 65, 97, 2, 34, 66, 98, etc.
2. The G80 devices (compute 1.0, compute 1.1) don’t support shared memory atomics, so the code
will not compile. Assuming you modified it to use global memory atomics, we saw a seven-fold
decrease in performance in the example provided earlier in the chapter.


3. The row-column ordering is best because the CPU will likely use a prefetch technique, ensuring the
subsequent data to be accessed will be in the cache. At the very least, an entire cache line will be
fetched from memory. Thus, when the CPU comes to the second iteration of the row-based access,
the earlier access to a[0] will already have fetched a[1] into the cache.
Column traversal will result in much slower code because the fetch of a single cache line on the
CPU is unlikely to fetch data used in the subsequent loop iteration unless the row size is very small.
On the GPU each thread fetches one or more elements of the row, so the loop traversal, at a high
level, is usually by column, with an entire row being made up of individual threads.
CPU the entire cache line will be fetched on compute 2.x hardware. However, unlike the CPU,
this cache line will likely be immediately consumed by the multiple threads.
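As a concrete illustration of the CPU case, here is a minimal sketch (not from the text) of the two traversal orders over a row-major C array; the array dimensions and function names are illustrative.

#define ROWS 1024
#define COLS 1024

/* Row-column order: consecutive iterations touch adjacent addresses, so each
   cache line fetched from memory is fully used before it is evicted. */
void sum_row_major(const float a[ROWS][COLS], float * const total)
{
  float sum = 0.0f;
  for (int row = 0; row < ROWS; row++)
    for (int col = 0; col < COLS; col++)
      sum += a[row][col];
  *total = sum;
}

/* Column-row order: consecutive iterations jump COLS * sizeof(float) bytes, so
   each cache line is typically used for only one element before eviction. */
void sum_col_major(const float a[ROWS][COLS], float * const total)
{
  float sum = 0.0f;
  for (int col = 0; col < COLS; col++)
    for (int row = 0; row < ROWS; row++)
      sum += a[row][col];
  *total = sum;
}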
4. During a syncthreads() operation, the entire block stalls until every one of the threads meets the
syncthreads() checkpoint. At this point they all become available for scheduling again. Having
a very large number of threads per block can mean the SM runs out of other available warps to
schedule while waiting for the threads in a single block to meet the checkpoint. The execution
flow as to which thread gets to execute when is undefined. This means some threads can make
much better progress than others to the syncthreads() checkpoint. This is the result of a design
decision in favor of throughput over latency at the hardware level. A very high thread count per
block is generally only useful where the threads in the block need to communicate with one
another, without having to do interblock communication via the global memory.
5. The SIMD model amortizes the instruction fetch time over many execution units where the
instruction stream is identical. However, where the instruction stream diverges, execution must
be serialized. The MIMD model is designed for divergent execution flow and doesn’t need to
stall threads when the flow diverges. However, the multiple fetch and decode units require
more silicon and higher instruction bandwidth to maintain multiple independent
execution paths.
A mixture of SIMD and MIMD is often the best way of dealing with both control flow and identical
operations on large datasets. You see this in CPUs in terms of SSE/MMX/AVX support. You see it
in GPUs in terms of warps and blocks allowing for divergence at a higher granularity.


CHAPTER 6 Memory Handling with CUDA

INTRODUCTION
In the conventional CPU model we have what is called a linear or flat memory model. This is where any
single CPU core can access any memory location without restriction. In practice, for CPU hardware, you
typically see a level one (L1), level two (L2), and level three (L3) cache. Those people who have
optimized CPU code or come from a high-performance computing (HPC) background will be all too
familiar with this. For most programmers, however, it’s something they can easily abstract away.
Abstraction has been a trend in modern programming languages, where the programmer is further
and further removed from the underlying hardware. While this can lead to higher levels of productivity,
as problems can be specified at a very high level, it relies hugely on clever compilers to translate
these abstractions into a level understood by the hardware. While this is great in theory, the reality can
be somewhat less than the marketing dream. I’m sure in the decades to come we’ll see huge
improvements in compilers and languages such that they will take advantage of parallel hardware
automatically. However, until we get there, the need to understand how
the hardware functions will be key to extracting the best performance from any platform.
For real performance on a CPU-based system, you need to understand how the cache works. We’ll
look at this on the CPU side and then look at the similarities with the GPU. The idea of a cache is that
most programs execute in a serial fashion, with various looping constructs, in terms of their execution
flow. If the program calls a function, the chances are the program will call it again soon. If the program
accesses a particular memory location, the chances are most programs will access that same location
again within a short time period. This is the principle of temporal locality, that it is highly likely that
you will reuse data and reexecute the same code having used/executed it once already.
Fetching data from DRAM, the main memory of a computer system, is very slow. DRAM has
historically always been slow compared to processor clock speeds. As processor clock speeds
have increased, DRAM speeds have fallen further and further behind.
DDR-3 DRAM in today's systems runs at up to 1.6 GHz as standard, although this can be pushed to
up to 2.6 GHz with certain high-speed modules and the correct processor. However, each of the CPU
cores is typically running at around 3 GHz. Without a cache to provide quick access to areas of memory,
the bandwidth of the DRAM will be insufficient for the CPU. As both code and data exist in the DRAM
space, the CPU is effectively instruction throughput limited (how many instructions it executes in
a given timeframe) if it cannot fetch either the program or data from the DRAM fast enough.
This is the concept of memory bandwidth, the amount of data we can read or store to DRAM in
a given period of time. However, there is another important concept, latency. Latency is the amount of
time it takes to respond to a fetch request. This can be hundreds of processor cycles. If the program
wants four elements from memory it makes sense therefore to issue all requests together and then wait
for them to arrive, rather than issue one request, wait until it arrives, issue the next request, wait, and so
on. Without a cache, a processor would be very much memory bandwidth and latency limited.
To think of bandwidth and latency in everyday terms, imagine a supermarket checkout process.
There are N checkouts available in a given store, not all of which may be staffed. With only two
checkouts active (staffed), a big queue will form behind them as the customers back up, having to wait to
pay for their shopping. The throughput or bandwidth is the number of customers processed in a given
time period (e.g., one minute). The time the customer has to wait in the queue is a measure of the latency,
that is, how long after joining the queue did the customer wait to pay for his or her shopping and leave.
As the queue becomes large, the shop owner may open more checkout points and the queue
disperses between the new checkout points and the old ones. With two new checkout points opened,
the bandwidth of the checkout area is doubled, because now twice as many people can be served in the
same time period. The latency is also halved, because, on average, the queue is only half as big and
everyone therefore waits only half the time.
However, this does not come for free. It costs money to employ more checkout assistants and more
of the retail space has to be allocated to checkout points rather than shelf space for products. The same
tradeoff occurs in processor design, in terms of the memory bus width and the clock rate of the memory
devices. There is only so much silicon space on the device and often the width of the external memory
bus is limited by the number of physical pins on the processor.
One other concept we also need to think about is transaction overhead. There is a certain overhead
in processing the payment for every customer. Some may have two or three items in a basket while
others may have overflowing shopping carts. The shop owners love the shopping cart shoppers because
they can be processed efficiently, that is, more of the checkout person’s time is spent checking out
groceries, rather than in the overhead of processing the payment.
We see the same in GPUs. Some memory transactions are lightweight compared to the fixed
overhead to process them. The number of memory cells fetched relative to the overhead time is low, or,
in other words, the percentage of peak efficiency is poor. Others are large and take a bunch of time to
serve, but can be serviced efficiently and achieve near peak memory transfer rates. These translate to
byte-based memory transactions at one end of the spectrum and to long word-based transactions at the
other end. To achieve peak memory efficiency, we need lots of large transactions and very few, if any,
small ones.

CACHES
A cache is a high-speed memory bank that is physically close to the processor core. Caches are
expensive in terms of silicon real estate, which in turn translates into bigger chips, lower yields, and
more expensive processors. Thus, the Intel Xeon chips with the huge L3 caches found in a lot of server
machines are far more expensive to manufacture than the desktop version that has less cache on the
processor die.
The maximum speed of a cache is inversely related to the size of the cache. The L1 cache is the fastest,
but is limited in size to usually around 16 K, 32 K, or 64 K. It is usually allocated to a single CPU core.
The L2 cache is slower, but much larger, typically 256 K to 512 K. The L3 cache may or may not be
present and is often several megabytes in size. The L2 and/or L3 cache may be shared between


processor cores or maintained as separate caches linked directly to given processor cores. Generally, at
least the L3 cache is a shared cache between processor cores on a conventional CPU. This allows for
fast intercore communication via this shared memory within the device.
The G80 and GT200 series GPUs have no equivalent CPU-like cache to speak of. They do,
however, have a hardware-managed cache that behaves largely like a read-only CPU cache in terms of
constant and texture memory. The GPU relies instead primarily on a programmer-managed cache, or
shared memory section.
The Fermi GPU implementation was the first to introduce the concept of a non-programmer-managed data cache. The architecture additionally has, per SM, an L1 cache that is both programmer
managed and hardware managed. It also has a shared L2 cache across all SMs.
So does it matter if the cache is shared across processor cores or SMs? Why is this arrangement
relevant? This has an interesting implication for communicating with other devices using the same
shared cache. It allows interprocessor communication, without having to go all the way out to global
memory. This is particularly useful for atomic operations where, because the L2 cache is unified, all
SMs see a consistent version of the value at a given memory location. The processor does not have to
write to the slow global memory, to read it back again, just to ensure consistency between processor
cores.

(FIGURE 6.1 SM L1/L2 data path: each SM has its own independent L1 cache (16-48K) serving its SPs, and all SMs share a 256K L2 cache.)

On G80/GT200 series hardware, where there is no unified cache, we see exactly this deficiency
and consequently quite slow atomic operations compared with Fermi and later hardware.
Caches are useful for most programs. Significant numbers of programmers either care little for or
have a limited understanding of how to achieve good performance in software. Introducing a cache
means most programs work reasonably well and the programmer does not have to care too much about
how the hardware works. This ease of programming is useful for initial development, but in most cases
you can do somewhat better.
The difference between a novice CUDA programmer and someone who is an expert can be up to an
order of magnitude. I hope that through reading this book, you'll be able to get a several-times speedup
from your existing code and move toward routinely being able to write CUDA code that significantly outperforms the equivalent serial code.

Types of data storage
On a GPU, we have a number of levels of areas where you can place data, each defined by its potential
bandwidth and latency, as shown in Table 6.1.
At the highest and most preferred level are registers inside the device. Then we have shared
memory, effectively a programmer-managed L1 cache, constant memory, texture memory, regular
device memory, and finally host memory. Notice how the order of magnitude changes between the
slowest and fastest type of storage. We will now look at the usage of each of these in turn and how you
can maximize the gain from using each type.
Traditionally, most texts would start off by looking at global memory, as this often plays a key role
in performance. If you get the global memory pattern wrong then you can forget anything else until you
get the correct pattern. We take a different approach here, in that we look first at how to use the device
efficiently internally, and from there move out toward global and host memory. Thus, you will
understand efficiency at each level and have an idea of how to extract it.
Most CUDA programs are developed progressively, using global memory exclusively at least
initially. Once there is an initial implementation, then the use of other memory types such as zero copy
and shared, constant, and ultimately registers is considered. For an optimal program, you need to be
thinking about these issues while you are developing a program. Thus, instead of the faster memory
types being an afterthought, they are considered at the outset and you know exactly where and how to
improve the program. You should be continuously thinking about not only how to access global
memory efficiently, but also how those accesses, especially for data that is reused in some way, can be
eliminated.

Table 6.1 Access Time by Memory Type

Storage Type       Bandwidth     Latency
Registers          ~8 TB/s       1 cycle
Shared Memory      ~1.5 TB/s     1 to 32 cycles
Texture Memory     ~200 GB/s     ~400 to 600 cycles
Constant Memory    ~200 GB/s     ~400 to 600 cycles
Global Memory      ~200 GB/s     ~400 to 600 cycles


REGISTER USAGE
The GPU, unlike its CPU cousin, has thousands of registers per SM (streaming multiprocessor). An
SM can be thought of as a multithreaded CPU core. On a typical CPU we have two, four, six, or eight
cores. On a GPU we have N SM cores. On a Fermi GF100 series, there are 16 SMs on the top-end
device. The GT200 series has up to 32 SMs per device. The G80 series has up to 16 SMs per device.
It may seem strange that Fermi has fewer SMs than its predecessors, until you realize that
each Fermi SM contains more SPs (streaming processors), and it is these that do the “grunt” work.
Due to the different number of SPs per core, you see a major difference in the number of threads per
core. A typical CPU will support one or two hardware threads per core. A GPU by contrast has
between 8 and 192 SPs per core, meaning each SM can at any time be executing this number of
concurrent hardware threads.
In practice on GPUs, application threads are pipelined, context switched, and dispatched to
multiple SMs, meaning the number of active threads across all SMs in a GPU device is usually in the
tens of thousands range.
One major difference we see between CPU and GPU architectures is how CPUs and GPUs map
registers. The CPU runs lots of threads by using register renaming and the stack. To run a new task the
CPU needs to do a context switch, which involves storing the state of all registers onto the stack (the
system memory) and then restoring the state from the last run of the new thread. This can take several
hundred CPU cycles. If you load too many threads onto a CPU it will spend all of the time simply
swapping out and in registers as it context switches. The effective throughput of useful work rapidly
drops off as soon as you load too many threads onto a CPU.
The GPU by contrast is the exact opposite. It uses threads to hide memory fetch and instruction
execution latency, so too few threads on the GPU means the GPU will become idle, usually waiting on
memory transactions. The GPU also does not use register renaming, but instead dedicates real registers
to each and every thread. Thus, when a context switch is required, it has near zero overhead. All that
happens on a context switch is the selector (or pointer) to the current register set is updated to point to
the register set of the next warp that will execute.
Notice I used the concept of warps here, which was covered in detail in the Chapter 5 on threading.
A warp is simply a grouping of threads that are scheduled together. In the current hardware, this is 32
threads. Thus, we swap in or swap out, and schedule, groups of 32 threads within a single SM.
Each SM can schedule a number of blocks. Blocks at the SM level are simply logical groups of
independent warps. The number of registers per kernel thread is calculated at compile time. All blocks
are of the same size and have a known number of threads, and the register usage per block is known and
fixed. Consequently, the GPU can allocate a fixed set of registers for each block scheduled onto the
hardware.
At a thread level, this is transparent to the programmer. However, a kernel that requests too many
registers per thread can limit the number of blocks the GPU can schedule on an SM, and thus the total
number of threads that will be run. Too few threads and you start underutilizing the hardware and the
performance starts to rapidly drop off. Too many threads can mean you run short of resources and
whole blocks of threads are dropped from being scheduled to the SM.
Be careful of this effect, as it can cause sudden performance drops in the application. If previously
the application was using four blocks and now it uses more registers, causing only three blocks to be


available, you may well see a one-quarter drop in GPU throughput. You can see this type of problem
with various profiling tools available, covered in Chapter 7 in the profiling section.
Depending on the particular hardware you are using, there is 8 K, 16 K, 32 K or 64 K of register
space per SM for all threads within an SM. You need to remember that one register is required per
thread. Thus, a simple local float variable in C results in N registers being used, where N is the number of
threads that are scheduled. With the Fermi-level hardware, you get 32 K of register space per SM. With
256 threads per block, you would have ((32,768/4 bytes per register)/256 threads) = 32 registers per
thread available. To achieve the maximum number of registers available on Fermi, 64 (128 on G80/
GT200), you'd need to halve the thread count to just 128 threads. You could have a single block per SM,
with the maximum permitted number of registers in that block. Equally, you could have eight blocks of
32 threads (8 x 32 = 256 threads in total), each using the maximum number of registers.
If you can make use of the maximum number of registers, for example, using them to work on
a section of an array, then this approach can work quite well. It works because such a set of values is
usually N elements from a dataset. If each element is independent, you can create instruction-level
parallelism (ILP) within a single thread. This is exploited by the hardware in terms of pipelining many
independent instructions. You’ll see later an example of this working in practice.
However, for most kernels, the number of registers required is somewhat lower. If you drop your
register requirements from 128 to 64, you can schedule another block into the same SM. For example,
with 32 registers, you can schedule four blocks. In doing so, you are increasing the total thread count.
On Fermi, you can have up to 1536 threads per SM and, for the general case, the higher the level of
occupancy you can achieve, the faster your program will execute. You will reach a point where you
have enough thread-level parallelism (TLP) to hide the memory latency. To continue to increase
performance further, you either need to move to larger memory transactions or introduce ILP, that is,
process more than one element of the dataset within a single thread.
There is, however, a limit on the number of warps that can be scheduled to an SM. Thus, dropping
the number of registers from 32 to 16 does not get eight blocks. For that we are limited to 192 threads,
as shown in Table 6.2.
Table 6.2 refers to the Fermi architecture. For the Kepler architecture, simply double the number of
registers and blocks shown here. We’ve used 192 and 256 threads here as they provide good utilization
of the hardware. Notice that the kernel usage of 16 versus 20 registers does not introduce any additional blocks to the SM. This is due to the limit on the number of warps that can be allocated to an SM.
So in this case, you can easily increase register usage without impacting the total number of threads
that are running on a given SM.

Table 6.2 Register Availability by Thread Usage on Fermi

                                Maximum Register Usage
No. of Threads               16     20     24     28     32     64
192   Blocks Scheduled        8      8      7      6      5      2
256   Blocks Scheduled        6      6      5      4      4      2
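The blocks-scheduled figures in Table 6.2 can be reproduced with a little arithmetic. The sketch below is illustrative only and assumes Fermi-style limits of 32,768 registers and 1,536 resident threads per SM; real devices add further limits (shared memory usage, a maximum block count per SM) that are ignored here.

/* Rough blocks-per-SM estimate from register pressure and the resident thread
   limit. Illustrative only: ignores shared memory and other scheduling limits. */
unsigned int blocks_per_sm(unsigned int regs_per_thread,
                           unsigned int threads_per_block)
{
  const unsigned int regs_per_sm = 32768;        /* assumed Fermi register file size  */
  const unsigned int max_threads_per_sm = 1536;  /* assumed Fermi resident thread cap */

  const unsigned int reg_limited = regs_per_sm / (regs_per_thread * threads_per_block);
  const unsigned int thread_limited = max_threads_per_sm / threads_per_block;

  return (reg_limited < thread_limited) ? reg_limited : thread_limited;
}

For example, blocks_per_sm(32, 256) returns 4 and blocks_per_sm(16, 192) returns 8, matching the entries in the table above.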


You want to use registers to avoid usage of the slower memory types, but you have to be careful that
you use them effectively. For example, suppose we had a loop that set each bit in turn, depending on the
value of some Boolean variable. Effectively, we’d be packing and unpacking 32 Booleans into 32 bits
of a word. We could write this as a loop, each time modifying the memory location by the new
Boolean, shifted to the correct position within the word, as shown in the following:
for (i=0; i<32; i++)
{
packed_result |= (pack_array[i] << i);
}

Here we are reading array element i from an array of elements and packing it into an integer,
packed_result. We're left-shifting the Boolean by the necessary number of bits and then using
a bitwise or operation with the previous result.
If the parameter packed_result exists in memory, you'd be doing 32 memory reads and writes. We
could equally place the parameter packed_result in a local variable, which in turn the compiler would
place into a register. As we accumulate into the register instead of in main memory, and later write only
the result to main memory, we save 31 of the 32 memory reads and writes.
Looking back at Table 6.1, you can see it takes several hundred cycles to do a global memory
operation. Let's assume 500 cycles for one global memory read or write operation. For every value
you'd need to read, apply the or operation, and write the result back. Therefore, you'd have 32 × read
+ 32 × write = 64 × 500 cycles = 32,000 cycles. The register version eliminates 31 of those reads and 31 of those
writes, replacing the 500-cycle operations with single-cycle operations. Thus, you'd have

(1 × memory read) + (1 × memory write) + (31 × register read) + (31 × register write), or
(1 × 500) + (1 × 500) + (31 × 1) + (31 × 1) = 1062 cycles versus 32,000 cycles.
Clearly, this is a huge reduction in the number of cycles. We have a 31 times improvement to perform
a relatively common operation in certain problem domains.
We see similar relationships with common reduction operations like sum, min, max, etc. A reduction
operation is where a dataset is reduced by some function to a smaller set, typically a single item. Thus,
max (10, 12, 1, 4, 5) would return a single value, 12, the maximum of the given dataset.
Accumulating into a register saves huge numbers of memory writes. In our bit packing example,
we reduce our memory writes by a factor of 31. Whether you are using a CPU or GPU, this type of
register optimization will make a huge difference in the speed of execution of your programs.
However, this burdens the programmer with having to think about which parameters are in registers
and which are in memory, which registers need to be copied back to memory, etc. This might seem like
quite a bit of trouble to go to, and for the average programmer, often it is. Therefore, we see
a proliferation of code that works directly on memory. For the most part, cache memory you find on
CPUs significantly masks this problem. The accumulated value is typically held in the L1 cache. If
a write-back policy is used on the cache, where the values do not need to be written out to main
memory until later, the performance is not too bad. Note that the L1 cache is still slower than registers,
so the solution will be suboptimal and may be several times slower than it could be.
Some compilers may detect such inefficiencies and implement a load into a register during the
optimizer phase. Others may not. Relying on the optimizer to fix poor programming puts you at the

114

CHAPTER 6 Memory Handling with CUDA

mercy of how good the compiler is, or is not. You may find that, as the optimization level is increased,
errors creep into the program. This may not be the fault of the compiler. The C language definition is
quite complex. As the optimization level is increased, subtle bugs may appear due to a missed volatile
qualifier or the like. Automatic test scripts and back-to-back testing against a nonoptimized version are
good solutions to ensure correctness.
You should also be aware that optimizing compiler vendors don’t always choose to implement the
best solution. If just 1% of programs fail when a certain optimization strategy is employed by the
compiler vendor, then it’s unlikely to be employed due to the support issues this may generate.
The GPU has a computation rate many times in excess of its memory bandwidth capacity. The
Fermi hardware has around 190 GB/s peak bandwidth to memory, with a peak compute performance of
over one teraflop. This is over five times the memory bandwidth. On the Kepler GTX680/Tesla K10 the
compute power increases to 3 Teraflops, yet with a memory bandwidth almost identical to the
GTX580. In the bit packing example, without register optimization and on a system with no cache, you
would require one read and one write per loop iteration. Each integer or floating-point value is 4 bytes
in length. The best possible performance we could, theoretically, achieve in this example, due to the
need to read and write a total of 8 bytes, would be one-eighth of the memory bandwidth. Using the 190
GB/s figure, this would equate to around 25 billion operations per second.
In practice you’d never get near this, because there are loop indexes and iterations to take into
account as well as simply the raw memory bandwidth. However, this sort of back-of-the-envelope
calculation provides you with some idea of the upper bounds of your application before you start
coding anything.
Applying our factor of 31 reductions to the number of memory operations allows you to achieve
a theoretical peak of 31 times this figure, some 775 billion iterations per second. We’ll in practice hit
other limits, within the device. However, you can see we’d easily achieve many times better performance than a simple global memory version by simply accumulating to or making use of registers
wherever possible.
To get some real figures here, we’ll write a program to do this bit packing on global memory and
then with registers. The results are as follows:
ID:0 GeForce GTX 470:Reg. version faster by: 2.22ms (Reg=0.26ms, GMEM=2.48ms)
ID:1 GeForce 9800 GT:Reg. version faster by: 52.87ms (Reg=9.27ms, GMEM=62.14ms)
ID:2 GeForce GTX 260:Reg. version faster by: 5.00ms (Reg=0.62ms, GMEM=5.63ms)
ID:3 GeForce GTX 460:Reg. version faster by: 1.56ms (Reg=0.34ms, GMEM=1.90ms)

The two kernels to generate these are as follows:

__global__ void test_gpu_register(u32 * const data, const u32 num_elements)
{
const u32 tid = (blockIdx.x * blockDim.x) + threadIdx.x;
if (tid < num_elements)
{
u32 d_tmp = 0;
for (int i=0; i<32; i++)
{
d_tmp |= (packed_array[i] << i);
}
data[tid] = d_tmp;
}
}

The global memory version is the same except that it accumulates directly into data[tid] in global memory on every loop iteration, rather than into the register-held d_tmp. The PTX (virtual assembly) generated for the register version is as follows:

.entry _Z18test_gpu_register1Pjj (
.param .u64 __cudaparm__Z18test_gpu_register1Pjj_data,
.param .u32 __cudaparm__Z18test_gpu_register1Pjj_num_elements)
{
.reg .u32 %r<26>;
.reg .u64 %rd<9>;
.reg .pred %p<5>;
// __cuda_local_var_108903_15_non_const_tid = 0
// __cuda_local_var_108906_13_non_const_d_tmp = 4
// i = 8
.loc 16 36 0
$LDWbegin__Z18test_gpu_register1Pjj:
$LDWbeginblock_180_1:
.loc 16 38 0
mov.u32 %r1, %tid.x;
mov.u32 %r2, %ctaid.x;
mov.u32 %r3, %ntid.x;
mul.lo.u32 %r4, %r2, %r3;
add.u32 %r5, %r1, %r4;
mov.s32 %r6, %r5;
.loc 16 39 0
ld.param.u32 %r7, [__cudaparm__Z18test_gpu_register1Pjj_num_elements];
mov.s32 %r8, %r6;
setp.le.u32 %p1, %r7, %r8;
@%p1 bra $L_0_3074;
$LDWbeginblock_180_3:
.loc 16 41 0
mov.u32 %r9, 0;
mov.s32 %r10, %r9;
$LDWbeginblock_180_5:
.loc 16 43 0
mov.s32 %r11, 0;
mov.s32 %r12, %r11;
mov.s32 %r13, %r12;
mov.u32 %r14, 31;
setp.gt.s32 %p2, %r13, %r14;
@%p2 bra $L_0_3586;
$L_0_3330:
.loc 16 45 0
mov.s32 %r15, %r12;
cvt.s64.s32 %rd1, %r15;
cvta.global.u64 %rd2, packed_array;
add.u64 %rd3, %rd1, %rd2;
ld.s8 %r16, [%rd3+0];
mov.s32 %r17, %r12;
shl.b32 %r18, %r16, %r17;
mov.s32 %r19, %r10;
or.b32 %r20, %r18, %r19;


mov.s32 %r10, %r20;
.loc 16 43 0
mov.s32 %r21, %r12;
add.s32 %r22, %r21, 1;
mov.s32 %r12, %r22;
$Lt_0_1794:
mov.s32 %r23, %r12;
mov.u32 %r24, 31;
setp.le.s32 %p3, %r23, %r24;
@%p3 bra $L_0_3330;
$L_0_3586:
$LDWendblock_180_5:
.loc 16 48 0
mov.s32 %r25, %r10;
ld.param.u64 %rd4, [__cudaparm__Z18test_gpu_register1Pjj_data];
cvt.u64.u32 %rd5, %r6;
mul.wide.u32 %rd6, %r6, 4;
add.u64 %rd7, %rd4, %rd6;
st.global.u32 [%rd7+0], %r25;
$LDWendblock_180_3:
$L_0_3074:
$LDWendblock_180_1:
.loc 16 50 0
exit;
$LDWend__Z18test_gpu_register1Pjj:
}

Thus, the PTX code first tests if the for loop will actually enter the loop. This is done in the block
labeled $LDWbeginblock_180_5. The code at the $Lt_0_1794 label then performs the loop operation,
jumping back to label $L_0_3330 until such time as the loop has completed 32 iterations. The other
code in the section labeled $L_0_3330 performs the operation:
d_tmp |= (packed_array[i] << i);

Notice, in addition to the loop overhead, because packed_array is indexed by a variable the code
has to work out the address on every iteration of the loop:
cvt.s64.s32 %rd1, %r15;
cvta.global.u64 %rd2, packed_array;
add.u64 %rd3, %rd1, %rd2;

This is rather inefficient. Compare this to a loop unrolled version and we see something quite
interesting:
.entry _Z18test_gpu_register2Pjj (
.param .u64 __cudaparm__Z18test_gpu_register2Pjj_data,
.param .u32 __cudaparm__Z18test_gpu_register2Pjj_num_elements)
{
.reg .u32 %r<104>;
.reg .u64 %rd<6>;


.reg .pred %p<3>;
// __cuda_local_var_108919_15_non_const_tid = 0
.loc 16 52 0
$LDWbegin__Z18test_gpu_register2Pjj:
$LDWbeginblock_181_1:
.loc 16 54 0
mov.u32 %r1, %tid.x;
mov.u32 %r2, %ctaid.x;
mov.u32 %r3, %ntid.x;
mul.lo.u32 %r4, %r2, %r3;
add.u32 %r5, %r1, %r4;
mov.s32 %r6, %r5;
.loc 16 55 0
ld.param.u32 %r7, [__cudaparm__Z18test_gpu_register2Pjj_num_elements];
mov.s32 %r8, %r6;
setp.le.u32 %p1, %r7, %r8;
@%p1 bra $L_1_1282;
.loc 16 57 0
ld.global.s8 %r9, [packed_array+0];
ld.global.s8 %r10, [packed_array+1];
shl.b32 %r11, %r10, 1;
or.b32 %r12, %r9, %r11;
ld.global.s8 %r13, [packed_array+2];
shl.b32 %r14, %r13, 2;
or.b32 %r15, %r12, %r14;
[Repeated code for packed_array+3 to packed_array+29 removed for clarity]
ld.global.s8 %r97, [packed_array+30];
shl.b32 %r98, %r97, 30;
or.b32 %r99, %r96, %r98;
ld.global.s8 %r100, [packed_array+31];
shl.b32 %r101, %r100, 31;
or.b32 %r102, %r99, %r101;
ld.param.u64 %rd1, [__cudaparm__Z18test_gpu_register2Pjj_data];
cvt.u64.u32 %rd2, %r6;
mul.wide.u32 %rd3, %r6, 4;
add.u64 %rd4, %rd1, %rd3;
st.global.u32 [%rd4+0], %r102;
$L_1_1282:
$LDWendblock_181_1:
.loc 16 90 0
exit;
$LDWend__Z18test_gpu_register2Pjj:
}

Almost all the instructions now contribute to the result. The loop overhead is gone. The address
calculation for packed_array is reduced to a compile time–resolved base plus offset type address.


Everything is much simpler, but much longer, both in the C code and also in the virtual PTX
assembly code.
The point here is not to understand PTX, but to see the vast difference small changes in C code can
have on the virtual assembly generated. It’s to understand that techniques like loop unrolling can be
hugely beneficial in many cases. We look at PTX and how it gets translated into the actual code that gets
executed in more detail in Chapter 9 on optimization.
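The loop was unrolled by hand here; as an aside, nvcc also accepts a #pragma unroll directive that asks the compiler to do the unrolling for you. The sketch below is illustrative only (the kernel name and the packed_array declaration are assumptions), and the generated PTX should be checked in the same way as the listings above to confirm what the compiler actually did.

typedef unsigned int u32;

__device__ char packed_array[32]; /* assumed to hold the 32 Boolean values, as in the listings above */

__global__ void test_gpu_register_unrolled(u32 * const data, const u32 num_elements)
{
  const u32 tid = (blockIdx.x * blockDim.x) + threadIdx.x;
  if (tid < num_elements)
  {
    u32 d_tmp = 0;

    /* Ask the compiler to fully unroll this fixed-trip-count loop */
    #pragma unroll
    for (int i = 0; i < 32; i++)
    {
      d_tmp |= (packed_array[i] << i);
    }
    data[tid] = d_tmp;
  }
}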
So what does this do in terms of speedup? See Table 6.5. You can see that on the 9800GT or the
GTX260, there was no effect at all. However, on the more modern compute 2.x hardware, the GTX460
and GTX470, you see a 2.4x and 3.4x speedup, respectively. If you look back to the pure GMEM
implementation, on the GTX470 this is a 6.4x speedup. To put this in perspective, if the original
program took six and a half hours to run, then the optimized version would take just one hour.
Register optimization can have a huge impact on your code execution timing. Take the time to look
at the PTX code being generated for the inner loops of your program. Can you unroll the loop to
expand it into a single, or set, of expressions? Think about this with your code and you’ll see a huge
performance leap. Consider register usage, whether to eliminate memory accesses or to provide
additional ILP, as one of the best ways to speed up a GPU kernel.
Table 6.5 Effects of Loop Unrolling

Card       Register Version (ms)    Unrolled Version (ms)    Speedup
GTX470     0.27                     0.08                     3.4
9800GT     9.28                     9.27                     1
GTX260     0.62                     0.62                     1
GTX460     0.34                     0.14                     2.4
Average                                                      2

SHARED MEMORY
Shared memory is effectively a user-controlled L1 cache. The L1 cache and shared memory share a 64 K
memory segment per SM. In Kepler this can be configured in 16 K blocks in favor of the L1 cache or shared
memory as you prefer for your application. In Fermi the choice is 16 K or 48 K in favor of the L1 cache or shared
memory. Pre-Fermi hardware (compute 1.x) has a fixed 16 K of shared memory and no L1 cache. The
shared memory has on the order of 1.5 TB/s bandwidth with extremely low latency. Clearly, this is hugely
superior to the up to 190 GB/s available from global memory, but around one-fifth of the speed of registers.
In practice, global memory speeds on low-end cards are as little as one-tenth that of the high-end cards.
However, the shared memory speed is driven by the core clock rate, which remains much more consistent
(around a 20% variation) across the entire range of GPUs. This means that to get the most from any card,
not just the high-end cards, you must use shared memory effectively in addition to using registers.
In fact, just by looking at the bandwidth figures, 1.5 TB/s for shared memory and 190 GB/s for
the best global memory access, you can see that there is a 7:1 ratio. To put it another way, there is
potential for a 7x speedup if you can make effective use of shared memory. Clearly, shared memory is
a concept that every CUDA programmer who cares about performance needs to understand well.


However, the GPU operates a load-store model of memory, in that any operand must be loaded into
a register prior to any operation. Thus, the loading of a value into shared memory, as opposed to just
loading it into a register, must be justified by data reuse, coalescing global memory, or data sharing
between threads. Otherwise, better performance is achieved by directly loading the global memory
values into registers.
Shared memory is a bank-switched architecture. On Fermi it is 32 banks wide, and on G200 and
G80 hardware it is 16 banks wide. Each bank of data is 4 bytes in size, enough for a single-precision
floating-point data item or a standard 32-bit integer value. Kepler also introduces a special 64 bit wide
mode so larger double precision values no longer span two banks. Each bank can service only a single
operation per cycle, regardless of how many threads initiate this action. Thus, if every thread in a warp
accesses a separate bank address, every thread’s operation is processed in that single cycle. Note there
is no need for a one-to-one sequential access, just that every thread accesses a separate bank in the
shared memory. There is, effectively, a crossbar switch connecting any single bank to any single
thread. This is very useful when you need to swap the words, for example, in a sorting algorithm, an
example of which we’ll look at later.
There is also one other very useful case with shared memory and that is where every thread in
a warp reads the same bank address. As with constant memory, this triggers a broadcast mechanism to
all threads within the warp. Usually thread zero writes the value to communicate a common value with
the other threads in the warp. See Figure 6.2.
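A minimal sketch of that broadcast pattern (illustrative only, not code from the text) looks like this:

__global__ void broadcast_example(const int * const d_in, int * const d_out)
{
  __shared__ int common_value;

  /* Thread zero of the block writes the value to be shared */
  if (threadIdx.x == 0)
  {
    common_value = d_in[blockIdx.x];
  }

  /* Make the write visible to every thread in the block */
  __syncthreads();

  /* Every thread reads the same shared memory address; within a warp this is
     serviced as a single broadcast rather than as 32 separate accesses */
  d_out[(blockIdx.x * blockDim.x) + threadIdx.x] = common_value * 2;
}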
However, if we have any other pattern, we end up with bank conflicts of varying degrees. This
means you stall the other threads in the warp that idle while the threads accessing the shared memory
address queue up one after another. One important aspect of this is that it is not hidden by a switch to
another warp, so we do in fact stall the SM. Thus, bank conflicts are to be avoided if at all possible as
the SM will idle until all the bank requests have been fulfilled.
However, this is often not practical, such as in the histogram example we looked at in Chapter 5.
Here the data is unknown, so which bank it falls into is entirely dependent on the data pattern.
The worst case is where every thread writes to the same bank, in which case we get 32 serial
accesses to the same bank. We typically see this where threads access shared memory with a stride that
is a multiple of the bank count, 32. Where the stride grows by a power of two from one round to the next
(e.g., in a naive parallel reduction), we can also see this, with each successive round causing more and more bank conflicts.
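The following device-code sketch (illustrative only, assuming a single warp of 32 threads per block) contrasts a conflict-free access pattern with a fully serialized one:

__global__ void bank_access_patterns(int * const d_out)
{
  __shared__ int data[32 * 32];

  /* Conflict free: thread i touches bank (i % 32), so all 32 banks are hit once */
  data[threadIdx.x] = threadIdx.x;

  /* 32-way conflict: thread i touches element i * 32, and every one of those
     addresses falls into bank 0, so the 32 accesses are serialized */
  data[threadIdx.x * 32] = threadIdx.x;

  __syncthreads();
  d_out[(blockIdx.x * blockDim.x) + threadIdx.x] = data[threadIdx.x];
}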

Sorting using shared memory
Let’s introduce a practical example here, using sorting. A sorting algorithm works by taking a random
dataset and generating a sorted dataset. We thus need N input data items and N output data items. The
key aspect with sorting is to ensure you minimize the number of reads and writes to memory. Many
sorting algorithms are actually multipass, meaning we read every element of N, M times, which is
clearly not good.
The quicksort algorithm is the preferred algorithm for sorting in the serial world. Being a divide-and-conquer algorithm, it would appear to be a good choice for a parallel approach. However, by
default it uses recursion, which is only supported in CUDA compute 2.x devices. Typical parallel
implementations spawn a new thread for every split of the data. The current CUDA model (see also
discussion on Kepler’s Dynamic Parallelism in chapter 12) requires a specification of the total number
of threads at kernel launch, or a series of kernel launches per level. The data causes significant branch

122

CHAPTER 6 Memory Handling with CUDA

Thread 00

Bank 00

Thread 00

Bank 00

Thread 00

Bank 00

Thread 00

Bank 00

Thread 01

Bank 01

Thread 01

Bank 01

Thread 01

Bank 01

Thread 01

Bank 01

Thread 02

Bank 02

Thread 02

Bank 02

Thread 02

Bank 02

Thread 02

Bank 02

Thread 03

Bank 03

Thread 03

Bank 03

Thread 03

Bank 03

Thread 03

Bank 03

Thread 04

Bank 04

Thread 04

Bank 04

Thread 04

Bank 04

Thread 04

Bank 04

Thread 05

Bank 05

Thread 05

Bank 05

Thread 05

Bank 05

Thread 05

Bank 05

Thread 06

Bank 06

Thread 06

Bank 06

Thread 06

Bank 06

Thread 06

Bank 06

Thread 07

Bank 07

Thread 07

Bank 07

Thread 07

Bank 07

Thread 07

Bank 07

Thread 08

Bank 08

Thread 08

Bank 08

Thread 08

Bank 08

Thread 08

Bank 08

Thread 09

Bank 09

Thread 09

Bank 09

Thread 09

Bank 09

Thread 09

Bank 09

Thread 10

Bank 10

Thread 10

Bank 10

Thread 10

Bank 10

Thread 10

Bank 10

Thread 11

Bank 11

Thread 11

Bank 11

Thread 11

Bank 11

Thread 11

Bank 11

Thread 12

Bank 12

Thread 12

Bank 12

Thread 12

Bank 12

Thread 12

Bank 12

Thread 13

Bank 13

Thread 13

Bank 13

Thread 13

Bank 13

Thread 13

Bank 13

Thread 14

Bank 14

Thread 14

Bank 14

Thread 14

Bank 14

Thread 14

Bank 14

Thread 15

Bank 15

Thread 15

Bank 15

Thread 15

Bank 15

Thread 15

Bank 15

Thread 16

Bank 16

Thread 16

Bank 16

Thread 16

Bank 16

Thread 16

Bank 16

Thread 17

Bank 17

Thread 17

Bank 17

Thread 17

Bank 17

Thread 17

Bank 17

Thread 18

Bank 18

Thread 18

Bank 18

Thread 18

Bank 18

Thread 18

Bank 18

Thread 19

Bank 19

Thread 19

Bank 19

Thread 19

Bank 19

Thread 19

Bank 19

Thread 20

Bank 20

Thread 20

Bank 20

Thread 20

Bank 20

Thread 20

Bank 20

Thread 21

Bank 21

Thread 21

Bank 21

Thread 21

Bank 21

Thread 21

Bank 21

Thread 22

Bank 22

Thread 22

Bank 22

Thread 22

Bank 22

Thread 22

Bank 22

Thread 23

Bank 23

Thread 23

Bank 23

Thread 23

Bank 23

Thread 23

Bank 23

Thread 24

Bank 24

Thread 24

Bank 24

Thread 24

Bank 24

Thread 24

Bank 24

Thread 25

Bank 25

Thread 25

Bank 25

Thread 25

Bank 25

Thread 25

Bank 25

Thread 26

Bank 26

Thread 26

Bank 26

Thread 26

Bank 26

Thread 26

Bank 26

Thread 27

Bank 27

Thread 27

Bank 27

Thread 27

Bank 27

Thread 27

Bank 27

Thread 28

Bank 28

Thread 28

Bank 28

Thread 28

Bank 28

Thread 28

Bank 28

Thread 29

Bank 29

Thread 29

Bank 29

Thread 29

Bank 29

Thread 29

Bank 29

Thread 30

Bank 30

Thread 30

Bank 30

Thread 30

Bank 30

Thread 30

Bank 30

Thead 31

Bank 31

Thead 31

Bank 31

Thead 31

Bank 31

Thead 31

Bank 31

1:1 Write = Ideal case

FIGURE 6.2
Shared memory patterns.

1:1 Write = Ideal case

1:1 Read = Ideal case

1:4 Read = 4 Bank Conflicts

Shared Memory

1

5

1

5

2

8

2

8

9

3

2

1

9

3

2

1

1

5

2

8

9

3

2

1

1

5

2

8

3

9

1

2

1

2

5

8

1

1

2

2

3

5

1

2

8

9

3

123

9

FIGURE 6.3
Simple merge sort example.

divergence, which again is not good for GPUs. There are ways to address some of these issues.
However, these issues make quicksort not the best algorithm to use on a pre-Kepler GK110/Tesla K20
GPU. In fact, you often find the best serial algorithm is not the best parallel algorithm and it is better to
start off with an open mind about what will work best.
One common algorithm found in the parallel world is the merge sort (Figure 6.3). It works by
recursively partitioning the data into smaller and smaller packets, until eventually you have only two
values to sort. Each sorted list is then merged together to produce an entire sorted list.
Recursion is not supported in CUDA prior to compute 2.x, so how can such an algorithm be
performed? Any recursive algorithm will at some point have a dataset of size N. On GPUs the thread
block size or the warp size is the ideal size for N. Thus, to implement a recursive algorithm all you have
to do is break the data into blocks of 32 or larger elements as the smallest case of N.
With merge sort, if you take a set of elements such as {1,5,2,8,9,3,2,1} we can split the data at
element four and obtain two datasets, {1,5,2,8} and {9,3,2,1}. You can now use two threads to apply
a sorting algorithm to the two datasets. Instantly you have gone from p = 1 to p = 2, where p is the
number of parallel execution paths.
Splitting the data from two sets into four sets gives you {1,5}, {2,8}, {9,3}, and {2,1}. It’s now
trivial to execute four threads, each of which compares the two numbers and swaps them if necessary.
Thus, you end up with four sorted datasets: {1,5}, {2,8}, {3,9}, and {1,2}. The sorting phase is now
complete. The maximum parallelism that can be expressed in this phase is N/2 independent threads.
Thus, with a 512 MB dataset, you have 128K 32-bit elements, for which we can use a maximum of
64K threads (N = 128K, N/2 = 64K). Since a GTX580 GPU has 16 SMs, each of which can support up
to 1536 threads, we get up to 24K threads supported per GPU. With around two and a half passes, you
can therefore iterate through the 64K data pairs that need to be sorted with such a decomposition.
However, you now run into the classic problem with merge sort, the merge phase. Here the lists are
combined by moving the smallest element of each list into the output list. This is then repeated until all
members of the input lists are consumed. With the previous example, the sorted lists are {1,5}, {2,8},
{3,9}, and {1,2}. In a traditional merge sort, these get combined into {1,2,5,8} and {1,2,3,9}. These
two lists are then further combined in the same manner to produce one final sorted list,
{1,1,2,2,3,5,8,9}.

(FIGURE 6.4 Merging N lists simultaneously: values are taken from the front of every sorted list and placed directly into the final output, with the intermediate merge stage highlighted for elimination.)
Thus, as each merge stage is completed, the amount of available parallelism halves. As an alternative
approach where N is small, you can simply scan N sets of lists and immediately place the value in the
correct output list, skipping any intermediate merge stages as shown in Figure 6.4. The issue is that the
sort performed at the stage highlighted for elimination in Figure 6.4 is typically done with two threads.
As anything below 32 threads means we’re using less than one warp, this is inefficient on a GPU.
The downside of this approach is that you would need to read the first element of the sorted
list set from every set. With 64 K sets, this is 64 K reads, or 256 MB of data that has to be fetched from
memory. Clearly, this is not a good solution when the number of lists is very large.
Thus, our approach is to try to achieve a much better solution to the merge problem by limiting the
amount of recursion applied to the original problem and stopping at the number of threads in a warp,
32, instead of two elements per sorted set, as with a traditional merge sort. This reduces the number of
sets in the previous example from 64 K sorted sets to just 4 K sets. It also increases the maximum
amount of parallelism available from N/2 to N/32. In the 128 K element example we looked at
previously, this would mean we would need 4 K processing elements. This would distribute 256
processing elements (warps) to every SM on a GTX580. As each Fermi SM can execute a maximum of
48 warps, multiple blocks will need to be iterated through, which allows for smaller problem sizes and
speedups on future hardware. See Figure 6.5.
Shared memory is arranged as a set of banks. There are 32 threads within a single warp; if any of those
threads access different addresses within the same bank, there will be a bank conflict. Equally, if any of the threads diverge in execution
flow, you could be running at as little as 1/32 of the potential speed in the worst case. Threads can use registers that are
private to a thread. They can only communicate with one another using shared memory.
By arranging a dataset in rows of 32 elements in the shared memory, and accessing it in columns by
thread, you can achieve bank conflict–free access to the memory (Figure 6.6).
For coalesced access to global memory, something we’ll cover in the next section, you’d need to
fetch the data from global memory in rows of 32 elements. Then you can apply any sorting algorithm

FIGURE 6.5
Shared memory–based decomposition (128 elements split into 2 x 64 and then 4 x 32 element sets).

FIGURE 6.6
Shared memory bank access.

to the column without worrying about shared memory conflicts. The only thing you need to consider is
branch divergence. You need to try to ensure that every thread follows the same execution flow, even
though they are processing quite different data elements.
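A small sketch of that layout (assumed names, not the book's code): each thread of the warp owns one column of a row-major tile of 32-element rows, so the 32 simultaneous accesses of a warp map to 32 different banks.

typedef unsigned int u32;   // the chapter's unsigned 32-bit type

#define ROWS_PER_TILE 64    // assumed tile height

// Each thread (tid 0..31) walks down its own column of the tile. In
// any one iteration the 32 threads of the warp read one element from
// each of the 32 banks, so the accesses are conflict free.
__device__ u32 sum_my_column(u32 tile[][32], const u32 tid)
{
    u32 sum = 0;
    for (u32 row = 0; row < ROWS_PER_TILE; row++)
        sum += tile[row][tid];
    return sum;
}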
One side effect of this strategy is we will end up having to make a tradeoff. Assuming we have
a single warp per SM, we will have no shared memory bank conflicts. However, a single warp per SM
will not hide the latency of global memory reads and writes. At least for the memory fetch and writeback stage, we need lots of threads. However, during the sort phase, multiple warps may conflict with
one another. A single warp would not have any bank conflicts, yet this would not hide the instruction
execution latency. So in practice, we’ll need multiple warps in all phases of the sort.

Radix sort
One algorithm that has a fixed number of iterations and a consistent execution flow is the radix sort. It
works by sorting based on the least significant bit and then working up to the most significant bit. With
a 32-bit integer, using a single radix bit, you will have 32 iterations of the sort, no matter how large the
dataset. Let’s consider an example with the following dataset:
{ 122, 10, 2, 22, 12, 9 }

The binary representation of each of these would be

122 = 01111010
10  = 00001010
2   = 00000010
22  = 00010110
12  = 00001100
9   = 00001001


In the first pass of the list, all elements with a 0 in the least significant bit (the right side) would
form the first list. Those with a 1 as the least significant bit would form the second list. Thus, the two
lists are
0 = { 122, 10, 2, 22, 12 }
1 = { 9 }

The two lists are appended in this order, becoming
{ 122, 10, 2, 22, 12, 9 }

The process is then repeated for bit one, generating the next two lists based on the ordering of the
previous cycle:
0 = { 12, 9 }
1 = { 122, 10, 2, 22 }

The combined list is then
{ 12, 9, 122, 10, 2, 22 }

Scanning the list by bit two, we generate
0 = { 9, 122, 10, 2 }
1 = { 12, 22 }
= { 9, 122, 10, 2, 12, 22 }

And so the program continues until it has processed all 32 bits of the list in 32 passes. To build the
lists you need N + 2N memory cells: one set for the source data, one for the 0 list, and one for the 1 list. We
do not strictly need 2N additional cells, as we could, for example, count from the start of the memory
for the 0 list and count backward from the end of the memory for the 1 list. However, to keep it simple,
we'll use two separate lists.
The serial code for the radix sort is shown as follows:
__host__ void cpu_sort(u32 * const data,
const u32 num_elements)
{
static u32 cpu_tmp_0[NUM_ELEM];
static u32 cpu_tmp_1[NUM_ELEM];
for (u32 bit=0; bit<32; bit++)
{
u32 base_cnt_0 = 0;
u32 base_cnt_1 = 0;
for (u32 i=0; i<num_elements; i++)
{
const u32 d = data[i];
const u32 bit_mask = (1u << bit);
if ( (d & bit_mask) > 0 )
{
cpu_tmp_1[base_cnt_1] = d;
base_cnt_1++;
}
else
{
cpu_tmp_0[base_cnt_0] = d;
base_cnt_0++;
}
}
// Copy data back to source - first the zero list
for (u32 i=0; i<base_cnt_0; i++)
{
data[i] = cpu_tmp_0[i];
}
// Copy data back to source - then the one list
for (u32 i=0; i<base_cnt_1; i++)
{
data[base_cnt_0+i] = cpu_tmp_1[i];
}
}
}
On the GPU the same partitioning is done by num_lists threads in parallel, each thread sorting its own list made up of every num_lists-th element of the data held in shared memory. For each bit, the per-thread partitioning loop takes the form:
for (u32 i=0; i<num_elements; i+=num_lists)
{
const u32 elem = sort_tmp[i+tid];
const u32 bit_mask = (1u << bit);
if ( (elem & bit_mask) > 0 )
{
sort_tmp_1[base_cnt_1+tid] = elem;
base_cnt_1+=num_lists;
}
else
{
sort_tmp_0[base_cnt_0+tid] = elem;
base_cnt_0+=num_lists;
}
}
The zero and one lists are then appended back into the source data, exactly as in the serial version. A further refinement uses only a single temporary list, writing the zero elements straight back into sort_tmp and keeping just the one list in sort_tmp_1:
if ( (elem & bit_mask) > 0 )
{
sort_tmp_1[base_cnt_1+tid] = elem;
base_cnt_1+=num_lists;
}
else
{
sort_tmp[base_cnt_0+tid] = elem;
base_cnt_0+=num_lists;
}
// Copy data back to source from the one's list


The merge step itself was first written as a merge_array function that calls a find_min device function to select the smallest list head for each output element. Compiling this version produces the following output from the compiler:

1>ptxas info : Function properties for _Z12merge_arrayPKjPjjjj
1> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads

When a function makes a call into a subfunction and passes parameters, those parameters must
somehow be provided to the called function. The program makes just such a call:
dest_array[i] = find_min(src_array,
list_indexes,
num_lists,
num_elements_per_list);

There are two options that can be employed: pass the necessary values through registers, or
create an area of memory called a stack frame. Most modern processors have a large register set (32
or more registers), so for a single level of calls this is often enough. Older architectures use
stack frames and push the values onto the stack. The called function then pops the values off the
stack. As you require memory to do this, on the GPU this would mean using “local” memory, which
is local only in terms of which thread can access it. In fact, “local” memory can be held in global
memory, so this is hugely inefficient, especially on the older architectures (1.x) where it’s not
cached. At this point we need to rewrite the merge routine to avoid the function call. The new
routine is thus:
// Uses a single thread for merge
__device__ void merge_array1(const u32 * const src_array,
u32 * const dest_array,
const u32 num_lists,
const u32 num_elements,
const u32 tid)
{
__shared__ u32 list_indexes[MAX_NUM_LISTS];
// Multiple threads
list_indexes[tid] = 0;
__syncthreads();
// Single threaded
if (tid == 0)
{
const u32 num_elements_per_list = (num_elements / num_lists);
for (u32 i=0; i<num_elements; i++)
{
// Scan the head of every list, write the smallest value
// to dest_array[i], and advance that list's index
// (the work previously done by the find_min call)
}
}
}

A parallel alternative has every thread take part in selecting the minimum, using a reduction in shared memory via the shared reduction_val and reduction_idx arrays. Within its loop over the output elements, each thread reads the head of its own list and the candidates are then reduced down to thread zero:
for (u32 i=0; i<num_elements; i++)
{
const u32 tid_max = num_lists >> 1;
u32 data;
// If the current list has already been
// emptied then ignore it


if (list_indexes[tid] < num_elements_per_list)
{
// Work out from the list_index, the index into
// the linear array
const u32 src_idx = tid + (list_indexes[tid] * num_lists);
// Read the data from the list for the given
// thread
data = src_array[src_idx];
}
else
{
data = 0xFFFFFFFF;
}
// Store the current data value and index
reduction_val[tid] = data;
reduction_idx[tid] = tid;
// Wait for all threads to copy
__syncthreads();
// Reduce from num_lists to one thread zero
while (tid_max != 0)
{
// Gradually reduce tid_max from
// num_lists to zero
if (tid < tid_max)
{
// Calculate the index of the other half
const u32 val2_idx = tid + tid_max;
// Read in the other half
const u32 val2 = reduction_val[val2_idx];
// If this half is bigger
if (reduction_val[tid] > val2)
{
// Then store the smaller value
reduction_val[tid] = val2;
reduction_idx[tid] = reduction_idx[val2_idx];
}
}
// Divide tid_max by two
tid_max >>= 1;
__syncthreads();


}
if (tid == 0)
{
// Increment the list pointer for this thread
list_indexes[reduction_idx[0]]++;
// Store the winning value
dest_array[i] = reduction_val[0];
}
// Wait for tid zero
__syncthreads();
}
}

This code works by creating a temporary list of data in shared memory, which it populates with
a dataset from each cycle from the num_list datasets. Where a list has already been emptied, the
dataset is populated with 0xFFFFFFFF, which will exclude the value from the list. The while loop
gradually reduces the number of active threads until there is only a single thread active, thread zero.
This then copies the data and increments the list indexes to ensure the value is not processed twice.
Notice the use of the __syncthreads directive within the loop and at the end. The program needs to
sync across warps when there are more than 32 threads (one warp) in use.
So how does this perform? As you can see from Table 6.11 and Figure 6.13, this approach is
significantly slower than the atomicMin version, the fastest reduction being 8.4 ms versus the 5.86 ms
atomicMin (GTX460, 16 threads). This is almost 50% slower than the atomicMin version. However,
one thing to note is that it’s a little under twice the speed of the atomicMin when using 256 threads
(12.27 ms versus 21.58 ms). This is, however, still twice as slow as the 16-thread version.
Although this version is slower, it has the advantage of not requiring the use of the atomicMin
function. This function is only available on compute 1.2 or later devices, which is generally only an issue if
you need to consider the consumer market or you need to support really old Tesla systems. The main
issue, however, is that atomicMin can only be used with integer values. A significant number of real-world problems are floating-point based. In such cases we need both algorithms.
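For floating-point data, a shared memory reduction of the kind shown above is the usual fallback. The sketch below is not the book's code: a minimal block-wide minimum for floats, assuming all the threads of the block call it on a power-of-two number of candidate values.

typedef unsigned int u32;   // the chapter's unsigned 32-bit type

// vals holds num_vals candidate values in shared memory. Every thread
// of the block must call this so the __syncthreads() calls match up.
// The minimum ends up in vals[0].
__device__ float block_min_float(volatile float * const vals,
                                 const u32 tid,
                                 const u32 num_vals)
{
    for (u32 active = (num_vals >> 1); active != 0; active >>= 1)
    {
        if (tid < active)
        {
            // Keep the smaller of the pair in the lower half
            if (vals[tid + active] < vals[tid])
                vals[tid] = vals[tid + active];
        }
        __syncthreads();
    }
    return vals[0];
}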
However, what we can take from both the atomicMin and the parallel reduction method is that the
traditional merge sort using two lists is not the ideal case on a GPU. You get increasing performance from

Table 6.11 Parallel Reduction Results (ms)

Device/Threads   1      2      4      8      16     32     64     128    256
GTX470           28.4   17.67  12.44  10.32  9.98   10.59  11.62  12.94  14.61
9800GT           45.66  28.35  19.82  16.25  15.61  17.03  19.03  21.45  25.33
GTX260           56.07  34.71  24.22  19.84  19.04  20.6   23.2   26.28  31.01
GTX460           23.22  14.52  10.3   8.63   8.4    8.94   9.82   10.96  12.27


FIGURE 6.13
Parallel reduction graph.

the increasing parallelism in the radix sort as you increase the number of lists. However, you get
decreasing performance from the merge stage as you increase the parallelism and move beyond 16 lists.

A hybrid approach
There is potential here to exploit the benefits of both algorithms by creating a hybrid approach. We can
rewrite the merge sort as follows:
#define REDUCTION_SIZE 8
#define REDUCTION_SIZE_BIT_SHIFT 3
#define MAX_ACTIVE_REDUCTIONS ( (MAX_NUM_LISTS) / REDUCTION_SIZE )
// Uses multiple threads for merge
// Does reduction into a warp and then into a single value
__device__ void merge_array9(const u32 * const src_array,
u32 * const dest_array,
const u32 num_lists,
const u32 num_elements,
const u32 tid)
{
// Read initial value from the list
u32 data = src_array[tid];
// Shared memory index
const u32 s_idx = tid >> REDUCTION_SIZE_BIT_SHIFT;
// Calculate number of 1st stage reductions
const u32 num_reductions = num_lists >> REDUCTION_SIZE_BIT_SHIFT;


const u32 num_elements_per_list = (num_elements / num_lists);
// Declare a number of list pointers and
// set to the start of the list
__shared__ u32 list_indexes[MAX_NUM_LISTS];
list_indexes[tid] = 0;
// Iterate over all elements
for (u32 i=0; i<num_elements; i++)
{
// (initialization of the shared min_val array and min_tid, plus
// the first-stage atomicMin of each thread's list head into
// min_val[s_idx], are performed here)
if (num_reductions > 0)
{
{
// Wait for all threads
__syncthreads();
// Have each thread in warp zero do an
// additional min over all the partial
// mins to date
if ( (tid < num_reductions) )
{
atomicMin(&min_val[0], min_val[tid]);
}


// Make sure all threads have taken their turn.
__syncthreads();
}
// If this thread was the one with the minimum
if (min_val[0] == data)
{
// Check for equal values
// Lowest tid wins and does the write
atomicMin(&min_tid, tid);
}
// Make sure all threads have taken their turn.
__syncthreads();
// If this thread has the lowest tid
if (tid == min_tid)
{
// Increment the list pointer for this thread
list_indexes[tid]++;
// Store the winning value
dest_array[i] = data;
// If the current list has not already been
// emptied then read from it, else ignore it
if (list_indexes[tid] < num_elements_per_list)
data = src_array[tid + (list_indexes[tid] * num_lists)];
else
data = 0xFFFFFFFF;
}
// Wait for min_tid thread
__syncthreads();
}
}

One of the main problems of the simple 1-to-N reduction is that it becomes increasingly slower as the
value of N increases. We can see from the previous tests that the ideal value of N is around 16 elements. The
kernel works by creating a partial reduction of N values and then a final reduction of those N values into
a single value. In this way it's similar to the reduction example, but skips most of the iterations.
Notice that min_val has been extended from a single value into an array of shared values. This is
necessary so each independent thread can minimize the values over its own dataset. Each min value is 32
bits wide, so it sits in a separate shared memory bank, meaning there are no bank conflicts provided
the first-level reduction results in 32 or fewer elements.
The value of REDUCTION_SIZE has been set to eight, which means the program will do a min over
groups of eight values prior to a final min. With the maximum of 256 elements, we get exactly 32

FIGURE 6.14
Hybrid parallel reduction.

Table 6.12 Hybrid Atomic and Parallel Reduction Results (ms)

Device/Threads   1      2      4      8      16     32     64     128    256
GTX470           29.41  17.62  11.24  8.98   7.2    6.49   6.46   7.01   8.57
GTX260           56.85  33.54  20.83  15.29  11.87  10.5   10.36  11.34  14.65
GTX460           24.12  14.54  9.36   7.64   6.22   5.67   5.68   6.27   7.81
FIGURE 6.15
Hybrid atomic and parallel reduction graph.


separate banks being used to do the reduction. With 256 elements we have a 256:32:1 reduction; with
a 128-element list we have a 128:16:1 reduction, and so on.
The other major change is that now only the thread that writes out the winning element reads a new
value into data, a register-based value that is private to each thread. Previously, all threads re-read the value
from their respective lists. As only one thread won each round, only one list pointer changed. Thus, as
N increased, this became increasingly inefficient. However, this doesn't help as much as you might at
first imagine.
So how does this version perform? Notice in Table 6.12 that the minimum time, 5.86 ms from the
atomicMin example, has fallen to 5.67 ms. This is not spectacular, but what is interesting to note is the
shape of the graph (Figure 6.15). No longer is the graph such an inclined U shape. Both the 32- and 64-thread versions beat the simple atomicMin based on 16 threads. We're starting to smooth out the
upward incline introduced by the merge step, as shown in Table 6.12 and Figure 6.15.

Shared memory on different GPUs
Not all GPUs are created equal. With the move to compute 2.x devices, the amount of shared memory
became configurable. By default, compute 2.x (Fermi) devices are configured to provide 48K of shared
memory instead of the 16 K of shared memory on compute 1.x devices.
The amount of shared memory can change between hardware releases. To write programs that scale
in performance with new GPU releases, you have to write portable code. To support this, CUDA allows
you to query the device for the amount of shared memory available with the following code:
struct cudaDeviceProp device_prop;
CUDA_CALL(cudaGetDeviceProperties(&device_prop, device_num));
printf("\nSharedMemory: %u", device_prop.sharedMemPerBlock);

Having more shared memory available allows us to select one of two strategies. We can either
extend the amount of shared memory used from 16 K to 48 K or we can simply schedule more blocks
into a single SM. The best choice will really depend on the application at hand. With our sorting
example, 48 K of shared memory would allow the number of lists per SM to be reduced by a factor of
three. As we saw earlier, the number of lists to merge has a significant impact on the overall execution
time.
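A short sketch of putting that query to use (my_kernel and device_num are assumed placeholders, and CUDA_CALL is the error-checking macro used elsewhere in this chapter): on compute 2.x devices the per-SM split between L1 cache and shared memory can be requested per kernel with cudaFuncSetCacheConfig.

__global__ void my_kernel(void);   // assumed kernel, defined elsewhere

__host__ void configure_shared_memory(const int device_num)
{
    struct cudaDeviceProp device_prop;
    CUDA_CALL(cudaGetDeviceProperties(&device_prop, device_num));
    printf("\nShared memory per block: %u bytes",
           (unsigned int) device_prop.sharedMemPerBlock);

    // On Fermi (compute 2.x) and later, request the 48 K shared
    // memory / 16 K L1 split for this kernel
    if (device_prop.major >= 2)
    {
        CUDA_CALL(cudaFuncSetCacheConfig(my_kernel,
                                         cudaFuncCachePreferShared));
    }
}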

Shared memory summary
So far we have looked only at sorting within a single SM, in fact within a single block. Moving from
a single-block version to a multiple-block version introduces another set of merges. Each block will
produce an independent sorted list. These lists then have to be merged, but this time in global memory.
The list size moves outside that which can be held in shared memory. The same then becomes true
when using multiple GPUs: you generate N or more sorted lists, where N equals the number of GPUs
in the system.
We've looked primarily at interthread cooperation with shared memory in this section. The
merging example was selected to demonstrate this in a manner that was not too complex and was easy to
follow. Parallel sorting has a large body of research behind it. More complex algorithms may well be
more efficient, in terms of the memory usage and/or SM utilization. The point here was to use

Shared Memory

149

a practical example that could be easily followed and process lots of data that did not simply reduce to
a single value.
We’ll continue to look at sorting later and look at how interblock communication and coordination
can be achieved in addition to thread-level communication.

Questions on shared memory
1. Looking at the radix_sort algorithm, how might the use of shared memory be reduced? Why
would this be useful?
2. Are all the synchronization points necessary? In each instance a synchronization primitive is used.
Discuss why. Are there conditions where they are not necessary?
3. What would be the effect of using C++ templates in terms of execution time?
4. How would you further optimize this sorting algorithm?

Answers for shared memory
1. There are a number of solutions. One is to use only the memory allocated to the sort. This can be
done using an MSB radix sort and swapping the 1s with elements at the end of the list. The 0 list
counts forward and the 1 list counts backward. When they meet, the next digit is sorted until the
LSB is sorted. Reducing the memory usage is useful because it allows larger lists in the shared
memory, reducing the total number of lists needed, which significantly impacts execution time.
2. The main concept to understand here is the synchronization points are necessary only when more
than one warp is used. Within a warp all instructions execute synchronously. A branch causes the
nonbranched threads to stall. At the point the branch converges, you are guaranteed all instructions
are in sync, although the warps can then instantly diverge again. Note that memory must be
declared as volatile or you must have syncthread points within the warp if you wish to
guarantee visibility of writes between threads. See Chapter 12 on common problems for
a discussion on the use of the volatile qualifier.
3. Templates would allow much of the runtime evaluation of the num_lists parameter to be replaced
with compile time substitution. The parameter must always be a power of 2, and in practice will be
limited to a maximum of 256. Thus, a number of templates can be created and the appropriate
function called at runtime. Given a fixed number of iterations known at compile time instead of
runtime, the compiler can efficiently unroll loops and substitute variable reads with literals.
Additionally, templates can be used to support multiple implementations for different data
types, for example, using the atomicMin version for integer data while using a parallel reduction
for floating-point data.
4. This is rather an open-ended question. There are many valid answers. As the number of sorted
lists to merge increases, the problem becomes significantly larger. Elimination of the merge
step would be a good solution. This could be achieved by partially sorting the original list
into N sublists by value. Each sublist can then be sorted and the lists concatenated, rather than
merged. This approach is the basis of another type of sort, sample sort, an algorithm we look
at later in this chapter.
Consider also the size of the dataset in the example, 1024 elements. With 256 threads there are just
four elements per list. A radix sort using a single bit is very inefficient for this number of


elements, requiring 128 iterations. A comparison-based sort is much quicker for such small
values of N.
In this example, we used a single bit for the radix sort. Multiple bits can be used, which reduces the
number of passes over the dataset at the expense of more intermediate storage. We currently use an
iterative method to sort elements into sequential lists. It’s quite possible to work where the data will
move to by counting the radix bits and using a prefix sum calculation to work out the index of
where the data should be written. We look at prefix sum later in this chapter.
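As a sketch of that last idea (a serial, host-side illustration, not the book's code): counting the zero bits first gives the base offset of the one list, so each element's destination index can be computed directly in a second pass rather than by appending to two intermediate lists.

typedef unsigned int u32;   // the chapter's unsigned 32-bit type

// One radix pass over a single bit using a count plus prefix offsets.
__host__ void radix_pass_by_count(u32 * const dst,
                                  const u32 * const src,
                                  const u32 num_elements,
                                  const u32 bit)
{
    const u32 bit_mask = (1u << bit);

    // Pass 1: count how many elements have a zero in this bit
    u32 num_zeros = 0;
    for (u32 i = 0; i < num_elements; i++)
        if ((src[i] & bit_mask) == 0)
            num_zeros++;

    // Pass 2: zeros are written from index 0, ones from num_zeros
    u32 idx_zero = 0;
    u32 idx_one  = num_zeros;
    for (u32 i = 0; i < num_elements; i++)
    {
        if ((src[i] & bit_mask) == 0)
            dst[idx_zero++] = src[i];
        else
            dst[idx_one++]  = src[i];
    }
}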

CONSTANT MEMORY
Constant memory is a form of virtual addressing of global memory. There is no special reserved
constant memory block. Constant memory has two special properties you might be interested in. First,
it is cached, and second, it supports broadcasting a single value to all the elements within a warp.
Constant memory, as its name suggests, is for read-only memory. This is memory that is
either declared at compile time as read only or defined at runtime as read only by the host. It is,
therefore, constant only in respect of the GPU’s view onto memory. The size of constant memory is
restricted to 64 K.
To declare a section of memory as constant at compile time, you simply use the __constant__
keyword. For example:
__constant__ float my_array[1024] = { 0.0F, 1.0F, 1.34F, ... };

To change the contents of the constant memory section at runtime, you simply use the
cudaMemcpyToSymbol function call prior to invoking the GPU kernel. If you do not define the constant
memory at either compile time or host runtime, then the contents of the memory section are undefined.
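For example, a minimal sketch of such a runtime update (const_table, host_table, and update_const_table are assumed names; CUDA_CALL is the error-checking macro used later in this chapter):

typedef unsigned int u32;   // the chapter's unsigned 32-bit type

__constant__ static u32 const_table[4];

// Copy four words from host memory into the constant symbol before
// launching any kernel that reads const_table.
__host__ void update_const_table(const u32 * const host_table)
{
    CUDA_CALL(cudaMemcpyToSymbol(const_table, host_table,
                                 4 * sizeof(u32)));
}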

Constant memory caching
Compute 1.x devices
On compute 1.x devices (pre-Fermi), constant memory has the property of being cached in a small
8K L1 cache, so subsequent accesses can be very fast. This is providing that there is some potential
for data reuse in the memory pattern the application is using. It is also highly optimized for
broadcast access such that threads accessing the same memory address can be serviced in a single
cycle.
With a 64 K segment size and an 8 K cache size, you have an 8:1 ratio of memory size to cache,
which is really very good. If you can contain or localize accesses to 8 K chunks within this constant
section you’ll achieve very good program performance. On certain devices you will find localizing the
data to even smaller chunks will provide higher performance.
With a nonuniform access to constant memory a cache miss results in N fetches from global
memory in addition to the fetch from the constant cache. Thus, a memory pattern that exhibits poor
locality and/or poor data reuse should not be accessed as constant memory. Also, each divergence in
the memory fetch pattern causes serialization in terms of having to wait for the constant memory. Thus,
a warp with 32 separate fetches to the constant cache would take at least 32 times longer than an access
to a single data item. This would grow significantly if it also included cache misses.
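The two access patterns can be contrasted with a small sketch (assumed names, not the book's code). In the first kernel every thread of a warp reads the same constant element, which is serviced as a single broadcast; in the second, the threads of a warp read 32 different elements, so the fetches are serialized.

typedef unsigned int u32;   // the chapter's unsigned 32-bit type

__constant__ static u32 const_table[256];

__global__ void const_broadcast_good(u32 * const out, const u32 num_elements)
{
    const u32 tid = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (tid < num_elements)
    {
        u32 sum = 0;
        for (u32 i = 0; i < 256; i++)
            sum += const_table[i];          // same address across the warp
        out[tid] = sum;
    }
}

__global__ void const_broadcast_poor(u32 * const out, const u32 num_elements)
{
    const u32 tid = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (tid < num_elements)
        out[tid] = const_table[tid & 255];  // 32 different addresses per warp
}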


Single-cycle access is a huge improvement on the several hundred cycles required for a fetch from
global memory. However, the several hundred–cycle access to global memory will likely be hidden by
task switches to other warps, if there are enough available warps for the SM to execute. Thus, the
benefit of using constant memory for its cache properties relies on the time taken to fetch data from
global memory and the amount of data reuse the algorithm has. As with shared memory, the low-end
devices have much less global memory bandwidth, so they benefit proportionally more from such
techniques than the high-end devices.
Most algorithms can have their data broken down into “tiles” (i.e., smaller datasets) from a much
larger problem. In fact, as soon as you have a problem that can’t physically fit on one machine, you
have to do tiling of the data. The same tiling can be done on a multicore CPU with each one of the N
cores taking 1/N of the data. You can think of each SM on the GPU as being a core on a CPU that is
able to support hundreds of threads.
Imagine overlaying a grid onto the data you are processing where the total number of cells, or
blocks, in the grid equals the number of cores (SMs) you wish to split the data into. Take these SMbased blocks and further divide them into at least eight additional blocks. You’ve now decomposed
your data area into N SMs, each of which is allocated M blocks.
In practice, this split is usually too large and would not allow for future generations of GPUs to
increase the number of SMs or the number of available blocks and see any benefit. It also does not
work well where the number of SMs is unknown, for example, when writing a commercial program
that will be run on consumer hardware. The largest number of SMs per device to date has been 32
(GT200 series). The Kepler and Fermi ranges aimed at compute have a maximum of 15 and 16 SMs,
respectively. The ranges designed primarily for gaming have up to 8 SMs.
One other important consideration is what interthread communication you need, if any. This can
only reasonably be done using threads and these are limited to 1024 per block on Fermi and Kepler,
less on earlier devices. You can, of course, process multiple items of data per thread, so this is not such
a hard limit as it might first appear.
Finally, you need to consider load balancing. Many of the early card releases of GPU families had
non-power-of-two numbers of SMs (GTX460 = 7, GTX260 = 30, etc.). Therefore, using too few
blocks leads to too little granularity and thus unoccupied SMs in the final stages of computation.
Tiling, in terms of constant memory, means splitting the data into blocks of no more than 64 K
each. Ideally, the tiles should be 8 K or less. Sometimes tiling involves having to deal with halo or
ghost cells that occupy the boundaries, so values have to be propagated between tiles. Where halos are
required, larger block sizes work better than smaller ones because the area that needs to be communicated
between blocks is much smaller.
When using tiling there is actually quite a lot to think about. Often the best solution is simply to run
through all combinations of number of threads, elements processed per thread, number of blocks, and
tile widths, and search for the optimal solution for the given problem. We look at how to do this in
Chapter 9 on optimization.

Compute 2.x devices
On Fermi (compute 2.x) hardware and later, there is a level two (L2) cache that is
shared by all the SMs. All memory accesses are cached automatically by the L2 cache. Additionally,
the L1 cache size can be increased from 16 K to 48 K by sacrificing 32 K of the shared memory per SM.
Because all memory is cached on Fermi, how constant memory is used needs some consideration.


Fermi, unlike compute 1.x devices, allows any constant section of data to be treated as constant
memory, even if it is not explicitly declared as such. Constant memory on 1.x devices has to be
explicitly managed with special-purpose calls like cudaMemcpyToSymbol or declared at compile time.
With Fermi, any nonthread-based access to an area of memory declared as constant (simply with the
standard const keyword) goes through the constant cache. By nonthread-based access, this is an access
that does not include threadIdx.x in the array indexing calculation.
If you need access to constant data on a per-thread-based access, then you need to use the compile
time (__constant__) or runtime function (cudaMemcpyToSymbol) as with compute 1.x devices.
However, be aware that the L2 cache will still be there and is much larger than the constant cache. If
you are implementing a tiling algorithm that needs halo or ghost cells between blocks, the solution will
often involve copying the halo cells into constant or shared memory. Due to Fermi’s L2 cache, this
strategy will usually be slower than simply copying the tiled cells to shared or constant memory and
then accessing the halo cells from global memory. The L2 cache will have collected the halo cells from
the prior block’s access of the memory. Therefore, the halo cells are quickly available from the L2
cache and come into the device much quicker than you would on compute 1.x hardware where a global
memory fetch would have to go all the way out to the global memory.

Constant memory broadcast
Constant memory has one very useful feature. It can be used for the purpose of distributing, or
broadcasting, data to every thread in a warp. This broadcast takes place in just a single cycle, making
this ability very useful. In comparison, a coalesced access to global memory on compute 1.x hardware
would require a memory fetch taking hundreds of cycles of latency to complete. Once it has arrived
from the memory subsystem, it would be distributed in the same manner to all threads, but only after
a significant wait for the memory subsystem to provide the data. Unfortunately, this is an all too
common problem, in that memory speeds have failed to keep pace with processor clock speeds.
Think of fetching data from global memory in the same terms as you might consider fetching data
from disk. You would never write a program that fetched the data from disk multiple times, because it
would be far too slow. You have to think about what data to fetch, and once you have it, how to reuse
that data as much as possible, while some background process triggers the next block of data to be
brought in from the disk.
By using the broadcast mechanism, which is also present on Fermi for L2 cache–based accesses,
you can distribute data very quickly to multiple threads within a warp. This is particularly useful
where you have some common transformation being performed by all threads. Each thread reads
element N from constant memory, which triggers a broadcast to all threads in the warp. Some
processing is performed on the value fetched from constant memory, perhaps in combination with
a read/write to global memory. You then fetch element N þ 1 from constant memory, again via
a broadcast, and so on. As the constant memory area is providing almost L1 cache speeds, this type
of algorithm works well.
However, be aware that if a constant is really a literal value, it is better to define it as a literal value
using a #define statement, as this frees up constant memory. So don’t place literals like PI into
constant memory, rather define them as literal #define instead. In practice, it makes little difference in
speed, only memory usage, as to which method is chosen. Let’s look at an example program:

#include "const_common.h"
#include "stdio.h"
#include "conio.h"
#include "assert.h"

#define CUDA_CALL(x) {const cudaError_t a = (x); if (a != cudaSuccess) { printf("\nCUDA Error: %s (err_num=%d) \n", cudaGetErrorString(a), a); cudaDeviceReset(); assert(0);} }
#define KERNEL_LOOP 65536
__constant__ static const u32 const_data_01 = 0x55555555;
__constant__ static const u32 const_data_02 = 0x77777777;
__constant__ static const u32 const_data_03 = 0x33333333;
__constant__ static const u32 const_data_04 = 0x11111111;

__global__ void const_test_gpu_literal(u32 * const data, const u32 num_elements)
{
const u32 tid = (blockIdx.x * blockDim.x) + threadIdx.x;
if (tid < num_elements)
{
u32 d = 0x55555555;
for (int i=0;i<KERNEL_LOOP;i++)
{
d ^= 0x55555555;
d |= 0x77777777;
d &= 0x33333333;
d |= 0x11111111;
}
data[tid] = d;
}
}

__global__ void const_test_gpu_const(u32 * const data, const u32 num_elements)
{
const u32 tid = (blockIdx.x * blockDim.x) + threadIdx.x;
if (tid < num_elements)
{
u32 d = const_data_01;
for (int i=0;i<KERNEL_LOOP;i++)
{
d ^= const_data_01;
d |= const_data_02;
d &= const_data_03;
d |= const_data_04;
}
data[tid] = d;
}
}

The host code iterates over each CUDA device, allocates data_gpu on the device, and creates a pair of start and stop events for timing. It then launches a warm-up run of each kernel before the timed run:
// Warm up run
// printf("\nLaunching literal kernel warm-up");
const_test_gpu_literal <<<num_blocks, num_threads>>>(data_gpu, num_elements);
cuda_error_check("Error ", " returned from literal startup kernel");
// Do the literal kernel
// printf("\nLaunching literal kernel");
CUDA_CALL(cudaEventRecord(kernel_start1,0));
const_test_gpu_literal <<<num_blocks, num_threads>>>(data_gpu, num_elements);
cuda_error_check("Error ", " returned from literal runtime kernel");
CUDA_CALL(cudaEventRecord(kernel_stop1,0));
CUDA_CALL(cudaEventSynchronize(kernel_stop1));
CUDA_CALL(cudaEventElapsedTime(&delta_time1, kernel_start1, kernel_stop1));
// printf("\nLiteral Elapsed time: %.3fms", delta_time1);
// Warm up run
// printf("\nLaunching constant kernel warm-up");
const_test_gpu_const <<<num_blocks, num_threads>>>(data_gpu, num_elements);
cuda_error_check("Error ", " returned from constant startup kernel");
// Do the constant kernel
// printf("\nLaunching constant kernel");
CUDA_CALL(cudaEventRecord(kernel_start2,0));
const_test_gpu_const <<<num_blocks, num_threads>>>(data_gpu, num_elements);
cuda_error_check("Error ", " returned from constant runtime kernel");
CUDA_CALL(cudaEventRecord(kernel_stop2,0));
CUDA_CALL(cudaEventSynchronize(kernel_stop2));
CUDA_CALL(cudaEventElapsedTime(&delta_time2, kernel_start2, kernel_stop2));
// printf("\nConst Elapsed time: %.3fms", delta_time2);


if (delta_time1 > delta_time2)
printf("\n%sConstant version is faster by: %.2fms (Const¼%.2fms vs. Literal¼
%.2fms)", device_prefix, delta_time1-delta_time2, delta_time1, delta_time2);
else
printf("\n%sLiteral version is faster by: %.2fms (Const¼%.2fms vs. Literal¼
%.2fms)", device_prefix, delta_time2-delta_time1, delta_time1, delta_time2);
CUDA_CALL(cudaEventDestroy(kernel_start1));
CUDA_CALL(cudaEventDestroy(kernel_start2));
CUDA_CALL(cudaEventDestroy(kernel_stop1));
CUDA_CALL(cudaEventDestroy(kernel_stop2));
CUDA_CALL(cudaFree(data_gpu));
}
CUDA_CALL(cudaDeviceReset());
printf("\n");
}
wait_exit();
}

This program consists of two GPU kernels, const_test_gpu_literal and const_test_gpu_const.
Notice how each is declared with the __global__ prefix to say this function has public scope. Each of
these kernels fetches some data as either constant data or literal data within the for loop, and uses it to
manipulate the local variable d. It then writes this manipulated value out to global memory. This is
necessary only to avoid the compiler optimizing away the code.
The next section of code gets the number of CUDA devices present and iterates through the devices
using the cudaSetDevice call. Note that this is possible because at the end of the loop the host code
calls cudaDeviceReset to clear the current context.
Having set the device, the program allocates some global memory and creates two events, a start
and a stop timer event. These events are fed into the execution stream, along with the kernel call. Thus,
you end up with the stream containing a start event, a kernel call, and a stop event. These events would
normally happen asynchronously with the CPU, that is, they do not block the execution of the CPU and
execute in parallel. This causes some problems when trying to do timing, as a CPU timer would see no
elapsed time. The program, therefore, calls cudaEventSynchronize to wait on the last event, the kernel
stop event, to complete. It then calculates the delta time between the start and stop events and thus
knows the execution time of the kernel.
This is repeated for the constant and literal kernels, including the execution of a warm-up call to
avoid any initial effects of filling any caches. The results are shown as follows:
ID:0 GeForce GTX 470:Constant version is faster by: 0.00ms (C=345.23ms, L=345.23ms)
ID:0 GeForce GTX 470:Constant version is faster by: 0.01ms (C=330.95ms, L=330.94ms)
ID:0 GeForce GTX 470:Literal version is faster by: 0.01ms (C=336.60ms, L=336.60ms)
ID:0 GeForce GTX 470:Constant version is faster by: 5.67ms (C=336.60ms, L=330.93ms)
ID:0 GeForce GTX 470:Constant version is faster by: 5.59ms (C=336.60ms, L=331.01ms)
ID:0 GeForce GTX 470:Constant version is faster by: 14.30ms (C=345.23ms, L=330.94ms)

ID:1 GeForce 9800 GT:Literal version is faster by: 4.04ms (C=574.85ms, L=578.89ms)
ID:1 GeForce 9800 GT:Literal version is faster by: 3.55ms (C=578.18ms, L=581.73ms)
ID:1 GeForce 9800 GT:Literal version is faster by: 4.68ms (C=575.85ms, L=580.53ms)
ID:1 GeForce 9800 GT:Constant version is faster by: 5.25ms (C=581.06ms, L=575.81ms)
ID:1 GeForce 9800 GT:Literal version is faster by: 4.01ms (C=572.08ms, L=576.10ms)
ID:1 GeForce 9800 GT:Constant version is faster by: 8.47ms (C=578.40ms, L=569.93ms)

ID:2 GeForce GTX 260:Literal version is faster by: 0.27ms (C=348.74ms, L=349.00ms)
ID:2 GeForce GTX 260:Literal version is faster by: 0.26ms (C=348.72ms, L=348.98ms)
ID:2 GeForce GTX 260:Literal version is faster by: 0.26ms (C=348.74ms, L=349.00ms)
ID:2 GeForce GTX 260:Literal version is faster by: 0.26ms (C=348.74ms, L=349.00ms)
ID:2 GeForce GTX 260:Literal version is faster by: 0.13ms (C=348.83ms, L=348.97ms)
ID:2 GeForce GTX 260:Literal version is faster by: 0.27ms (C=348.73ms, L=348.99ms)

ID:3 GeForce GTX 460:Literal version is faster by: 0.59ms (C=541.43ms, L=542.02ms)
ID:3 GeForce GTX 460:Literal version is faster by: 0.17ms (C=541.20ms, L=541.37ms)
ID:3 GeForce GTX 460:Constant version is faster by: 0.45ms (C=542.29ms, L=541.83ms)
ID:3 GeForce GTX 460:Constant version is faster by: 0.27ms (C=542.17ms, L=541.89ms)
ID:3 GeForce GTX 460:Constant version is faster by: 1.17ms (C=543.55ms, L=542.38ms)
ID:3 GeForce GTX 460:Constant version is faster by: 0.24ms (C=542.92ms, L=542.68ms)

What is interesting to note is that there is very little, if any, difference in the execution time if you
look at this as a percentage of the total execution time. Consequently we see a fairly random distribution as to which version, the constant or the literal, is faster. Now how does this compare with using
global memory? To test this, we simply replace the literal kernel with one that uses global memory as
shown in the following:
__device__ static u32 data_01 = 0x55555555;
__device__ static u32 data_02 = 0x77777777;
__device__ static u32 data_03 = 0x33333333;
__device__ static u32 data_04 = 0x11111111;

__global__ void const_test_gpu_gmem(u32 * const data, const u32 num_elements)
{
const u32 tid = (blockIdx.x * blockDim.x) + threadIdx.x;
if (tid < num_elements)
{
u32 d = 0x55555555;
for (int i=0;i<KERNEL_LOOP;i++)
{
d ^= data_01;
d |= data_02;
d &= data_03;
d |= data_04;
}
data[tid] = d;
}
}

If we look at the PTX code generated for the constant kernel, we can see what the compiler has actually done with the constant accesses:
.entry _Z20const_test_gpu_constPjj (
.param .u64 __cudaparm__Z20const_test_gpu_constPjj_data,
.param .u32 __cudaparm__Z20const_test_gpu_constPjj_num_elements)
{
.reg .u32 %r<28>;
.reg .u64 %rd<6>;
.reg .pred %p<5>;
// __cuda_local_var_108907_15_non_const_tid ¼ 0
// __cuda_local_var_108910_13_non_const_d ¼ 4
// i ¼ 8
.loc 16 40 0
$LDWbegin__Z20const_test_gpu_constPjj:
$LDWbeginblock_181_1:
.loc 16 42 0
mov.u32 %r1, %tid.x;
mov.u32 %r2, %ctaid.x;
mov.u32 %r3, %ntid.x;
mul.lo.u32 %r4, %r2, %r3;
add.u32 %r5, %r1, %r4;
mov.s32 %r6, %r5;
.loc 16 43 0
ld.param.u32 %r7, [__cudaparm__Z20const_test_gpu_constPjj_num_elements];
mov.s32 %r8, %r6;
setp.le.u32 %p1, %r7, %r8;
@%p1 bra $L_1_3074;
$LDWbeginblock_181_3:
.loc 16 45 0
mov.u32 %r9, 1431655765;
mov.s32 %r10, %r9;
$LDWbeginblock_181_5:
.loc 16 47 0
mov.s32 %r11, 0;
mov.s32 %r12, %r11;
mov.s32 %r13, %r12;
mov.u32 %r14, 4095;
setp.gt.s32 %p2, %r13, %r14;
@%p2 bra $L_1_3586;
$L_1_3330:
.loc 16 49 0
mov.s32 %r15, %r10;
xor.b32 %r16, %r15, 1431655765;
mov.s32 %r10, %r16;
.loc 16 50 0


mov.s32 %r17, %r10;
or.b32 %r18, %r17, 2004318071;
mov.s32 %r10, %r18;
.loc 16 51 0
mov.s32 %r19, %r10;
and.b32 %r20, %r19, 858993459;
mov.s32 %r10, %r20;
.loc 16 52 0
mov.s32 %r21, %r10;
or.b32 %r22, %r21, 286331153;
mov.s32 %r10, %r22;
.loc 16 47 0
mov.s32 %r23, %r12;
add.s32 %r24, %r23, 1;
mov.s32 %r12, %r24;
$Lt_1_1794:
mov.s32 %r25, %r12;
mov.u32 %r26, 4095;
setp.le.s32 %p3, %r25, %r26;
@%p3 bra $L_1_3330;
$L_1_3586:
$LDWendblock_181_5:
.loc 16 55 0
mov.s32 %r27, %r10;
ld.param.u64 %rd1, [__cudaparm__Z20const_test_gpu_constPjj_data];
cvt.u64.u32 %rd2, %r6;
mul.wide.u32 %rd3, %r6, 4;
add.u64 %rd4, %rd1, %rd3;
st.global.u32 [%rd4+0], %r27;
$LDWendblock_181_3:
$L_1_3074:
$LDWendblock_181_1:
.loc 16 57 0
exit;
$LDWend__Z20const_test_gpu_constPjj:
} // _Z20const_test_gpu_constPjj

Understanding the exact meaning of the assembly code is not necessary. We’ve shown the function
in full to give you some idea of how a small section of C code actually expands to the assembly level.
PTX code uses the format

operation.type destination, source1, source2

Thus,

xor.b32 %r16, %r15, 1431655765;

takes the value in register 15 and does a 32-bit, bitwise xor operation with the literal value
1431655765. It then stores the result in register 16. Notice the literal values within the


previous PTX listing. The compiler has replaced the constant values used on the kernel with literals.
This is why it’s always worthwhile looking into what is going on if the results are not what are
expected. An extract of the GMEM PTX code for comparison is as follows:
ld.global.u32 %r16, [data_01];
xor.b32 %r17, %r15, %r16;

The program is now loading a value from global memory. The constant version was not actually
doing any memory reads at all. The compiler had done a substitution of the constant values for literal
values when translating the C code into PTX assembly. This can be solved by declaring the constant
version as an array, rather than a number of scalar variables. Thus, the new function becomes:
__constant__ static const u32 const_data[4] = { 0x55555555, 0x77777777, 0x33333333,
0x11111111 };

__global__ void const_test_gpu_const(u32 * const data, const u32 num_elements)
{
const u32 tid = (blockIdx.x * blockDim.x) + threadIdx.x;
if (tid < num_elements)
{
u32 d = const_data[0];
for (int i=0;i<KERNEL_LOOP;i++)
{
d ^= const_data[0];
d |= const_data[1];
d &= const_data[2];
d |= const_data[3];
}
data[tid] = d;
}
}

In the host code the literal kernel is replaced with the gmem kernel; the warm-up and timing sequence is otherwise unchanged:
// Warm up run
// printf("\nLaunching gmem kernel warm-up");
const_test_gpu_gmem <<<num_blocks, num_threads>>>(data_gpu, num_elements);
cuda_error_check("Error ", " returned from gmem startup kernel");
// Do the gmem kernel
// printf("\nLaunching gmem kernel");
CUDA_CALL(cudaEventRecord(kernel_start1,0));
const_test_gpu_gmem <<<num_blocks, num_threads>>>(data_gpu, num_elements);
cuda_error_check("Error ", " returned from gmem runtime kernel");
CUDA_CALL(cudaEventRecord(kernel_stop1,0));
CUDA_CALL(cudaEventSynchronize(kernel_stop1));
CUDA_CALL(cudaEventElapsedTime(&delta_time1, kernel_start1, kernel_stop1));
// printf("\nGMEM Elapsed time: %.3fms", delta_time1);
// Copy host memory to global memory section in GPU
CUDA_CALL(cudaMemcpyToSymbol(gmem_data_gpu, const_data_host,
KERNEL_LOOP * sizeof(u32)));
// Warm up run
// printf("\nLaunching constant kernel warm-up");
const_test_gpu_const <<<num_blocks, num_threads>>>(data_gpu, num_elements);
cuda_error_check("Error ", " returned from constant startup kernel");
// Do the constant kernel
// printf("\nLaunching constant kernel");
CUDA_CALL(cudaEventRecord(kernel_start2,0));


const_test_gpu_const <<<num_blocks, num_threads>>>(data_gpu, num_elements);
cuda_error_check("Error ", " returned from constant runtime kernel");
CUDA_CALL(cudaEventRecord(kernel_stop2,0));
CUDA_CALL(cudaEventSynchronize(kernel_stop2));
CUDA_CALL(cudaEventElapsedTime(&delta_time2, kernel_start2, kernel_stop2));
// printf("\nConst Elapsed time: %.3fms", delta_time2);
if (delta_time1 > delta_time2)
printf("\n%sConstant version is faster by: %.2fms (G¼%.2fms, C¼%.2fms)",
device_prefix, delta_time1-delta_time2, delta_time1, delta_time2);
else
printf("\n%sGMEM version is faster by: %.2fms (G¼%.2fms, C¼%.2fms)",
device_prefix, delta_time2-delta_time1, delta_time1, delta_time2);
}
CUDA_CALL(cudaEventDestroy(kernel_start1));
CUDA_CALL(cudaEventDestroy(kernel_start2));
CUDA_CALL(cudaEventDestroy(kernel_stop1));
CUDA_CALL(cudaEventDestroy(kernel_stop2));
CUDA_CALL(cudaFree(data_gpu));
CUDA_CALL(cudaDeviceReset());
printf("\n");
}
wait_exit();
}

Notice how the cudaMemcpyToSymbol call works. You can copy to any named global symbol on the
GPU, regardless of whether that symbol is in global memory or constant memory. Thus, if you chunk
the data to 64 K chunks, you can access it from the constant cache. This is very useful if all threads are
accessing the same data element, as you get the broadcast and cache effect from the constant memory
section.
Notice also that the memory allocation, creation of events, destruction of the events and freeing of
device memory is now done outside the main loop. CUDA API calls such as these are actually very
costly in terms of CPU time. The CPU load of this program drops considerably with this simple
change. Always try to set up everything at the start and destroy or free it at the end. Never do this in the
loop body or it will greatly slow down the application.

Constant question
1. If you have a data structure that is 16 K in size and exhibits a random pattern of access per block but
a unified access pattern per warp, would it be best to place it into registers, constant memory, or
shared memory? Why?


Constant answer
1. Although it is a little tricky to get a large array into registers, tiling into blocks of registers per
thread would allow for the fastest access, regardless of access pattern. However, you are limited
to 32 K (compute < 1.2), 64 K (compute 1.2, 1.3), 128 K (compute 2.x), or 256 K (compute
3.x) of register space per SM. You have to allocate some of this to working registers on a per-thread basis. On Fermi you can have a maximum of 64 registers per thread, so with 32 allocated
to data and 32 as the working set, you would have just 128 active threads, or four active warps.
As soon as the program accessed off-chip memory (e.g., global memory) the latency may stall
the SM. Therefore, the kernel would need a high ratio of operations on the register block to
make this a good solution.
Placing it into shared memory would likely be the best case, although depending on the actual access
pattern you may see shared memory bank conflicts. The uniform warp access would allow
broadcast from the shared memory to all the threads in a single warp. It is only in the case where
the warps from two blocks accessed the same bank that you would get a shared memory conflict.
However, 16 K of shared memory would consume the entire shared memory of one SM on compute
1.x devices and limit you to three blocks maximum on compute 2.x/3.x hardware.
Constant memory would also be a reasonable choice on compute 1.x devices. Constant memory would
have the benefit of broadcast to the threads. However, the 16 K of data may well swamp the cache
memory. Also, and more importantly, the constant cache is optimized for linear access, that is, it
fetches cache lines upon a single access. Thus, accesses near the original access are cached.
Accesses to a noncached cache line result in a cache miss penalty that is larger than a fetch to
global memory without a cache miss.
Global memory may well be faster on compute 2.x/3.x devices, as the unified access per warp should
be translated by the compiler into the uniform warp-level global memory access. This provides the
broadcast access constant memory would have provided on compute 1.x devices.

GLOBAL MEMORY
Global memory is perhaps the most interesting of the memory types in that it’s the one you absolutely
have to understand. GPU global memory is global because it’s writable from both the GPU and the
CPU. It can actually be accessed from any device on the PCI-E bus. GPU cards can transfer data to and
from one another, directly, without needing the CPU. This peer-to-peer feature, introduced in the
CUDA 4.x SDK, is not yet supported on all platforms. Currently, the Windows 7/Vista platforms are
only supported on Tesla hardware, via the TCC driver model. Those using Linux or Windows XP can
use this feature with both consumer and Tesla cards.
The memory from the GPU is accessible to the CPU host processor in one of three ways, the first two of which are sketched just after this list:
• Explicitly with a blocking transfer.
• Explicitly with a nonblocking transfer.
• Implicitly using zero memory copy.
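A brief sketch of the first two options (copy_back, host/device buffer names, and the stream are assumed placeholders; CUDA_CALL is the error macro used elsewhere in this chapter):

// dst_host must be page-locked (e.g., allocated with cudaMallocHost)
// for the asynchronous copy to actually overlap with other work.
__host__ void copy_back(void * const dst_host, const void * const src_device,
                        const size_t num_bytes, cudaStream_t stream)
{
    // Blocking: returns only when the copy has completed
    CUDA_CALL(cudaMemcpy(dst_host, src_device, num_bytes,
                         cudaMemcpyDeviceToHost));

    // Nonblocking: queued into the stream and returns immediately;
    // the host is free to do other work until the synchronize call
    CUDA_CALL(cudaMemcpyAsync(dst_host, src_device, num_bytes,
                              cudaMemcpyDeviceToHost, stream));
    CUDA_CALL(cudaStreamSynchronize(stream));
}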
The memory on the GPU device sits on the other side of the PCI-E bus. This is a bidirectional bus that,
in theory, supports transfers of up to 8 GB/s (PCI-E 2.0) in each direction. In practice, the PCI-E
bandwidth is typically 4–5 GB/s in each direction. Depending on the hardware you are using,


FIGURE 6.16
Overlapping kernel and memory transfers.

FIGURE 6.17
Addresses accessed by thread ID.

nonblocking and implicit memory transfers may not be supported. We’ll look at these issues in more
detail in Chapter 9.
The usual model of execution involves the CPU transferring a block of data to the GPU, the GPU
kernel processing it, and then the CPU initiating a transfer of the data back to the host memory. A
slightly more advanced model of this is where we use streams (covered later) to overlap transfers and
kernels to ensure the GPU is always kept busy, as shown in Figure 6.16.
Once you have the data in the GPU, the question then becomes how do you access it efficiently on
the GPU? Remember the GPU can be rated at over 3 teraflops in terms of compute power, but typically
the main memory bandwidth is in the order of 190 GB/s down to as little as 25 GB/s. By comparison,
a typical Intel I7 Nehalem or AMD Phenom CPU achieves in the order of 25–30 GB/s, depending on
the particular device speed and width of the memory bus used.
Graphics cards use high-speed GDDR, or graphics dynamic memory, which achieves very high
sustained bandwidth, but like all memory, has a high latency. Latency is the time taken to return the
first byte of the data access. Therefore, in the same way that we can pipeline kernels, as is shown in


Figure 6.16, the memory accesses are pipelined. By creating a ratio of typically 10:1 of threads to
number of memory accesses, you can hide memory latency, but only if you access global memory in
a pattern that is coalesced.
So what is a coalescable pattern? This is where all the threads access a contiguous and aligned
memory block, as shown in Figure 6.17. Here we have shown Addr as the logical address offset from
the base location, assuming we are accessing byte-based data. TID represents the thread number. If we
have a one-to-one sequential and aligned access to memory, the address accesses of each thread are
combined together and a single memory transaction is issued. Assuming we’re accessing a single
precision float or integer value, each thread will be accessing 4 bytes of memory. Memory is coalesced
on a warp basis (the older G80 hardware uses half warps), meaning we get 32 × 4 = 128 byte access to
memory.
Coalescing sizes supported are 32, 64, and 128 bytes, meaning warp accesses to byte, 16-bit, and 32-bit data will always be coalesced if the access is a sequential pattern and aligned to a 32-byte boundary.
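As a short sketch (assumed names, not from the text), the difference between a coalesced and a noncoalesced pattern is simply whether consecutive threads touch consecutive addresses:

typedef unsigned int u32;   // the chapter's unsigned 32-bit type

// Coalesced: thread N reads element N, so a warp reads one
// contiguous, aligned 128-byte block.
__global__ void copy_coalesced(u32 * const dst, const u32 * const src,
                               const u32 num_elements)
{
    const u32 tid = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (tid < num_elements)
        dst[tid] = src[tid];
}

// Noncoalesced: thread N reads element N * 32, so the 32 accesses of
// a warp fall into 32 separate memory transactions.
__global__ void copy_strided(u32 * const dst, const u32 * const src,
                             const u32 num_elements)
{
    const u32 tid = (blockIdx.x * blockDim.x) + threadIdx.x;
    const u32 idx = tid * 32;
    if (idx < num_elements)
        dst[tid] = src[idx];
}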
The alignment is achieved by using a special malloc instruction, replacing the standard cudaMalloc
with cudaMallocPitch, which has the following syntax:
extern __host__ cudaError_t CUDARTAPI cudaMallocPitch(void **devPtr, size_t *pitch,
size_t width, size_t height);

This translates to cudaMallocPitch (pointer to device memory pointer, pointer to pitch, desired
width of the row in bytes, height of the array in bytes).
Thus, if you have an array of 100 rows of 60 float elements, using the conventional cudaMalloc,
you would allocate 100 × 60 × sizeof(float) bytes, or 100 × 60 × 4 = 24,000 bytes. Accessing array
index [1][0] (i.e., row one, element zero) would result in noncoalesced access. This is because the
length of a single row of 60 elements would be 240 bytes, which is of course not a power of two.
The first address in the series of addresses from each thread would not meet the alignment
requirements for coalescing. Using the cudaMallocPitch function the size of each row is padded by an
amount necessary for the alignment requirements of the given device (Figure 6.18). In our example, it
would in most cases be extended to 64 elements per row, or 256 bytes. The pitch the device actually
uses is returned in the pitch parameters passed to cudaMallocPitch.
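A minimal sketch of allocating and indexing such a pitched array (rows, cols, and the helper names are assumptions; CUDA_CALL is the error macro used elsewhere in this chapter):

typedef unsigned int u32;   // the chapter's unsigned 32-bit type

// Allocate a rows x cols array of floats with padded rows; the padded
// row size in bytes is returned through pitch.
__host__ float * alloc_pitched_2d(size_t * const pitch,
                                  const u32 rows, const u32 cols)
{
    float * device_ptr = NULL;
    CUDA_CALL(cudaMallocPitch((void **) &device_ptr, pitch,
                              cols * sizeof(float), rows));
    return device_ptr;
}

// Device-side read of element [row][col]: step down by whole pitched
// rows in bytes, then index within the row as normal.
__device__ float read_pitched(const float * const base, const size_t pitch,
                              const u32 row, const u32 col)
{
    const float * const row_ptr =
        (const float *) ( ((const char *) base) + (row * pitch) );
    return row_ptr[col];
}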
Let’s have a look at how this works in practice. Nonaligned accesses result in multiple memory
fetches being issued. While waiting for a memory fetch, all threads in a warp are stalled until all
memory fetches are returned from the hardware. Thus, to achieve the best throughput you need to issue
a small number of large memory fetch requests, as a result of aligned and sequential coalesced accesses.

FIGURE 6.18
Padding achieved with cudaMallocPitch.


So what happens if you have data that is interleaved in some way, for example, a structure?
typedef struct
{
unsigned int a;
unsigned int b;
unsigned int c;
unsigned int d;
} MY_TYPE_T;
MY_TYPE_T some_array[1024]; /* 1024 * 16 bytes = 16K */

FIGURE 6.19
Array elements in memory.

Figure 6.19 shows how C will lay this structure out in memory.
Elements are laid out in memory in the sequence in which they are defined within the structure. The
access pattern for such a structure is shown in Figure 6.20. As you can see from the figure, the
addresses of the structure elements are not contiguous in memory. This means you get no coalescing
and the memory bandwidth suddenly drops off by an order of magnitude. Depending on the size of our
data elements, it may be possible to have each thread read a larger value and then internally within the
threads mask off the necessary bits. For example, if you have byte-based data you can do the following:

FIGURE 6.20
Words accessed by thread (no coalescing).

const unsigned int  value_u32 = some_data[tid];
const unsigned char value_01 = ( (value_u32 & 0x000000FF) );
const unsigned char value_02 = ( (value_u32 & 0x0000FF00) >> 8 );
const unsigned char value_03 = ( (value_u32 & 0x00FF0000) >> 16 );
const unsigned char value_04 = ( (value_u32 & 0xFF000000) >> 24 );

It’s also possible to maintain the one thread to one data element mapping by simply treating the array
of structure elements as an array of words. We can then allocate one thread to each element of the
structure. This type of solution is, however, not suitable if there is some data flow relationship between the
structure members, so thread 1 needs the x, y, and z coordinate of a structure, for example. In this case, it’s
best to reorder the data, perhaps in the loading or transfer phase on the CPU, into N discrete arrays. In this
way, the arrays individually sit concurrently in memory. We can simply access array a, b, c, or d instead of
the struct->a notation we’d use with a structure dereference. Instead of an interleaved and uncoalesced
pattern, we get four coalesced accesses from each thread into different memory regions, maintaining
optimal global memory bandwidth usage.
Let’s look at an example of global memory reads. In this example, we’ll sum the values of all the
elements in the structure using the two methods. First, we’ll add all the values from an array of
structures and then from a structure of arrays.
// Define the number of elements we’ll use
#define NUM_ELEMENTS 4096
// Define an interleaved type
// 16 bytes, 4 bytes per member
typedef struct
{
u32 a;
u32 b;
u32 c;
u32 d;
} INTERLEAVED_T;
// Define an array type based on the interleaved structure
typedef INTERLEAVED_T INTERLEAVED_ARRAY_T[NUM_ELEMENTS];
// Alternative - structure of arrays
typedef u32 ARRAY_MEMBER_T[NUM_ELEMENTS];
typedef struct
{
ARRAY_MEMBER_T a;
ARRAY_MEMBER_T b;
ARRAY_MEMBER_T c;
ARRAY_MEMBER_T d;
} NON_INTERLEAVED_T;

In this section of code, we declare two types; the first is INTERLEAVED_T, an array of structures of
which the members are a to d. We then declare NON_INTERLEAVED_T as a structure that contains four


arrays, a to d. As the types are named, with the first one we expect the data to be interleaved in memory.
With the second one, we expect a number of contiguous memory areas.
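A small host-side sketch of the reordering mentioned earlier (interleaved_to_soa is an assumed helper name, not the book's code), converting the interleaved layout into the structure of arrays during the load or transfer phase:

// Scatter each member of the interleaved records into its own
// contiguous array so the GPU threads can read it coalesced.
__host__ void interleaved_to_soa(NON_INTERLEAVED_T * const dst,
                                 const INTERLEAVED_T * const src,
                                 const u32 num_elements)
{
    for (u32 i = 0; i < num_elements; i++)
    {
        dst->a[i] = src[i].a;
        dst->b[i] = src[i].b;
        dst->c[i] = src[i].c;
        dst->d[i] = src[i].d;
    }
}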
Let’s look first at the CPU code.
__host__ float add_test_non_interleaved_cpu(
NON_INTERLEAVED_T * const host_dest_ptr,
const NON_INTERLEAVED_T * const host_src_ptr,
const u32 iter,
const u32 num_elements)
{
float start_time = get_time();
for (u32 tid = 0; tid < num_elements; tid++)
{
for (u32 i=0; i<iter; i++)
{
host_dest_ptr->a[tid] += host_src_ptr->a[tid];
host_dest_ptr->b[tid] += host_src_ptr->b[tid];
host_dest_ptr->c[tid] += host_src_ptr->c[tid];
host_dest_ptr->d[tid] += host_src_ptr->d[tid];
}
}
const float delta ¼ get_time() - start_time;
return delta;
}
__host__ float add_test_interleaved_cpu(
  INTERLEAVED_T * const host_dest_ptr,
  const INTERLEAVED_T * const host_src_ptr,
  const u32 iter,
  const u32 num_elements)
{
  float start_time = get_time();

  for (u32 tid = 0; tid < num_elements; tid++)
  {
    for (u32 i = 0; i < iter; i++)
    {
      host_dest_ptr[tid].a += host_src_ptr[tid].a;
      host_dest_ptr[tid].b += host_src_ptr[tid].b;
      host_dest_ptr[tid].c += host_src_ptr[tid].c;
      host_dest_ptr[tid].d += host_src_ptr[tid].d;
    }
  }

  const float delta = get_time() - start_time;
  return delta;
}
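
The GPU kernels follow the same pattern, but with one thread per element in place of the outer loop.
A minimal sketch of the two kernels is shown below; add_kernel_interleaved is the name used by the
caller that follows, while the non-interleaved kernel's name and the exact form of the bounds check
are assumptions.

__global__ void add_kernel_interleaved(
  INTERLEAVED_T * const dest_ptr,
  const INTERLEAVED_T * const src_ptr,
  const u32 iter,
  const u32 num_elements)
{
  const u32 tid = (blockIdx.x * blockDim.x) + threadIdx.x;
  if (tid < num_elements)
  {
    for (u32 i = 0; i < iter; i++)
    {
      // Array of structures: adjacent threads touch words 16 bytes apart
      dest_ptr[tid].a += src_ptr[tid].a;
      dest_ptr[tid].b += src_ptr[tid].b;
      dest_ptr[tid].c += src_ptr[tid].c;
      dest_ptr[tid].d += src_ptr[tid].d;
    }
  }
}

__global__ void add_kernel_non_interleaved(
  NON_INTERLEAVED_T * const dest_ptr,
  const NON_INTERLEAVED_T * const src_ptr,
  const u32 iter,
  const u32 num_elements)
{
  const u32 tid = (blockIdx.x * blockDim.x) + threadIdx.x;
  if (tid < num_elements)
  {
    for (u32 i = 0; i < iter; i++)
    {
      // Structure of arrays: adjacent threads touch adjacent words
      dest_ptr->a[tid] += src_ptr->a[tid];
      dest_ptr->b[tid] += src_ptr->b[tid];
      dest_ptr->c[tid] += src_ptr->c[tid];
      dest_ptr->d[tid] += src_ptr->d[tid];
    }
  }
}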

The caller of the GPU function is a fairly standard copy to device and time routine. I’ll list here only
the interleaved version, as the two functions are largely identical.
__host__ float add_test_interleaved(
  INTERLEAVED_T * const host_dest_ptr,
  const INTERLEAVED_T * const host_src_ptr,
  const u32 iter,
  const u32 num_elements)
{
  // Set launch params
  const u32 num_threads = 256;
  const u32 num_blocks = (num_elements + (num_threads-1)) / num_threads;

  // Allocate memory on the device
  const size_t num_bytes = (sizeof(INTERLEAVED_T) * num_elements);
  INTERLEAVED_T * device_dest_ptr;
  INTERLEAVED_T * device_src_ptr;
  CUDA_CALL(cudaMalloc((void **) &device_src_ptr, num_bytes));
  CUDA_CALL(cudaMalloc((void **) &device_dest_ptr, num_bytes));

  // Create a start and stop event for timing
  cudaEvent_t kernel_start, kernel_stop;
  cudaEventCreate(&kernel_start, 0);
  cudaEventCreate(&kernel_stop, 0);

  // Create a non zero stream
  cudaStream_t test_stream;
  CUDA_CALL(cudaStreamCreate(&test_stream));

  // Copy src data to GPU
  CUDA_CALL(cudaMemcpy(device_src_ptr, host_src_ptr, num_bytes,
            cudaMemcpyHostToDevice));

  // Push start event ahead of kernel call
  CUDA_CALL(cudaEventRecord(kernel_start, 0));

  // Call the GPU kernel
  add_kernel_interleaved<<<num_blocks, num_threads>>>(device_dest_ptr, device_src_ptr,
                                                      iter, num_elements);

  // Push stop event after the kernel call
  CUDA_CALL(cudaEventRecord(kernel_stop, 0));

  // Wait for stop event
  CUDA_CALL(cudaEventSynchronize(kernel_stop));

  // Get delta between start and stop,
  // i.e. the kernel execution time
  float delta = 0.0F;
  CUDA_CALL(cudaEventElapsedTime(&delta, kernel_start, kernel_stop));

  // Clean up
  CUDA_CALL(cudaFree(device_src_ptr));
  CUDA_CALL(cudaFree(device_dest_ptr));
  CUDA_CALL(cudaEventDestroy(kernel_start));
  CUDA_CALL(cudaEventDestroy(kernel_stop));
  CUDA_CALL(cudaStreamDestroy(test_stream));

  return delta;
}

When we run this code, we achieve the following results:
Running Interleaved / Non Interleaved memory test using 65536 bytes (4096 elements)
ID:0 GeForce GTX 470: Interleaved time: 181.83ms
ID:0 GeForce GTX 470: Non Interleaved time: 45.13ms
ID:1 GeForce 9800 GT: Interleaved time: 2689.15ms
ID:1 GeForce 9800 GT: Non Interleaved time: 234.98ms
ID:2 GeForce GTX 260: Interleaved time: 444.16ms
ID:2 GeForce GTX 260: Non Interleaved time: 139.35ms
ID:3 GeForce GTX 460: Interleaved time: 199.15ms
ID:3 GeForce GTX 460: Non Interleaved time: 63.49ms
CPU (serial): Interleaved time: 1216.00ms
CPU (serial): Non Interleaved time: 13640.00ms

What we see is quite interesting, and largely to be expected. The interleaved memory access pattern
has an execution time three to four times longer than the noninterleaved pattern on compute 2.x
hardware. The compute 1.3 GTX260 demonstrates a 3× slowdown when using the interleaved memory
pattern. The compute 1.1 9800GT, however, exhibits an 11× slowdown, due to the more stringent
coalescing requirements of these older devices.
We can look a bit deeper into the difference between the slow interleaved pattern and the much
faster noninterleaved pattern with a tool such as Parallel Nsight. We can see that the number of
memory transactions (CUDA Memory Statistics experiment) used in the noninterleaved version is
approximately one-quarter that of the interleaved version, so the noninterleaved version moves
roughly one-quarter of the data to/from memory that the interleaved version does.
One other interesting thing to note is that the CPU shows exactly the opposite effect. This may seem
strange, until you think about the access pattern and the cache reuse. A CPU accessing element a in the
interleaved example will have brought structure elements b, c, and d into the cache on the access to
a, since they will likely be in the same cache line. The noninterleaved version, however, accesses
memory in four separate and physically dispersed areas. There would be four times the number of
memory bus transactions, and any read-ahead policy the CPU might be using would be less effective.
Thus, if your existing CPU application uses an interleaved arrangement of structure elements,
simply copying it to a GPU will work, but at a considerable cost due to poor memory coalescing.
Simply reordering the declarations and access mechanism, as we’ve done in this example, could allow
you to achieve a significant speedup for very little effort.
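
If an existing application already holds its data in the interleaved form, the reordering can be done
once on the CPU during the loading or transfer phase. A minimal sketch using the types declared
earlier (the function name here is illustrative):

__host__ void reorder_to_non_interleaved(
  NON_INTERLEAVED_T * const dest,
  const INTERLEAVED_T * const src,
  const u32 num_elements)
{
  // Scatter each structure member into its own contiguous array
  for (u32 i = 0; i < num_elements; i++)
  {
    dest->a[i] = src[i].a;
    dest->b[i] = src[i].b;
    dest->c[i] = src[i].c;
    dest->d[i] = src[i].d;
  }
}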

Score boarding
One other interesting property of global memory is that it works with a scoreboard. If we initiate a load
from global memory (e.g., a = some_array[0]), then all that happens is that the memory fetch is
initiated and local variable a is listed as having a pending memory transaction. Unlike traditional CPU
code, we do not see a stall or even a context switch to another warp until such time as the variable a is
later used in an expression. Only at this time do we actually need the contents of variable a. Thus, the
GPU follows a lazy evaluation model.
You can think of this a bit like ordering a taxi and then getting ready to leave. It may take only five
minutes to get ready, but the taxi may take up to 15 minutes to arrive. By ordering it before we actually
need it, it starts its journey while we are busy on the task of getting ready to leave. If we wait until we are
ready before ordering the taxi, we serialize the task of getting ready to leave with waiting for the taxi.
The same is true of the memory transactions. By comparison, they are like the slow taxi, taking
forever in terms of GPU cycles to arrive. Until such time as we actually need the memory transaction to
have arrived, the GPU can be busy calculating other aspects of the algorithm. This is achieved very
simply by placing the memory fetches at the start of the kernel, and then using them much later during
the kernel. We, in effect, overlap the memory fetch latency with useful GPU computations, reducing
the effect of memory latency on our kernel.
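
In code terms this just means issuing the load as early as possible and keeping independent work
between the load and its first use. A small illustrative kernel (the kernel and parameter names here
are hypothetical):

__global__ void overlap_fetch_example(
  u32 * const dest,
  const u32 * const src,
  const u32 num_elements)
{
  const u32 tid = (blockIdx.x * blockDim.x) + threadIdx.x;
  if (tid < num_elements)
  {
    // The fetch is issued here; the result is simply marked as pending
    const u32 a = src[tid];

    // Independent arithmetic overlaps the fetch latency
    u32 b = (tid * 7u) ^ (tid >> 3);

    // First real use of 'a': the warp stalls here only if the data
    // has not yet arrived from global memory
    dest[tid] = a + b;
  }
}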

Global memory sorting
Picking up from where we left off with shared memory sorting, how do you think the same algorithm
would work for global memory–based sorting? What needs to be considered? First and foremost, you
need to think about memory coalescing. Our sorting algorithm was specifically developed to run with
the 32 banks of shared memory and accesses the shared memory in columns. If you look again at
Figure 6.8, you’ll see this also achieves coalesced access to global memory if all threads were to read at
once.
The coalesced access occurs during the radix sort, as each thread marches through its own list.
Every thread's read access is coalesced (combined) together by the hardware. Writes are noncoalesced,
as the 1s list can vary in size. However, the 0s are both read and written to the same address range,
thus providing coalesced access.
In the merge phase, during the startup condition one value from each list is read from global into
shared memory. In every iteration of the merge, a single value is written out to global memory, and
a single value is read into shared memory to replace the value written out. There is a reasonable amount
of work being done for every memory access. Thus, despite the poor coalescing, the memory latency
should be largely hidden. Let’s look at how this works in practice.


Table 6.13 Single SM GMEM Sort (1K Elements)

Threads    GTX470    GTX260    GTX460
1           33.27     66.32     27.47
2           19.21     37.53     15.87
4           11.82     22.29      9.83
8            9.31     16.24      7.88
16           7.41     12.52      6.36
32           6.63     10.95      5.75
64           6.52     10.72      5.71
128          7.06     11.63      6.29
256          8.61     14.88      7.82

What you can see from Table 6.13 and Figure 6.21 is that 32 threads work quite well, but this is
marginally beaten by 64 threads on all the tested devices. It's likely that having another warp to
execute hides a small amount of the latency and also slightly improves the memory bandwidth utilization.
Moving beyond 64 threads slows things down, so if we now fix the number of threads at 64 and
increase the dataset size, what do we see? See Table 6.14 and Figure 6.22 for the results. In fact, we
see an almost perfect linear relationship when using a single SM, as we are currently doing.
As Table 6.14 shows, 1024 KB (1 MB) of data takes 1486 ms to sort on the GTX460. This means
we can sort 1 MB of data in around 1.5 seconds, or around 40 MB per minute, regardless of the size
of the data.
A 1 GB dataset would therefore take around 25–26 minutes to sort, which is not very impressive.
So what is the issue? Well, currently we're using just a single block, which in turn limits us to a single
SM. The test GPUs consist of 14 SMs on the GTX470, 27 SMs on the GTX260, and 7 SMs on the GTX460.
FIGURE 6.21
Graph of single SM GMEM sort (1K elements).

Table 6.14 GMEM Sort by Size

             Absolute Time (ms)               Time per KB (ms)
Size (KB)    GTX470    GTX260    GTX460       GTX470    GTX260    GTX460
1              1.67      2.69      1.47         1.67      2.69      1.47
2              3.28      5.36      2.89         1.64      2.68      1.45
4              6.51     10.73      5.73         1.63      2.68      1.43
8             12.99     21.43      11.4         1.62      2.68      1.43
16            25.92     42.89     22.75         1.62      2.68      1.42
32            51.81     85.82     45.47         1.62      2.68      1.42
64            103.6    171.78     90.94         1.62      2.68      1.42
128          207.24    343.74    181.89         1.62      2.69      1.42
256          414.74    688.04    364.09         1.62      2.69      1.42
512          838.25   1377.23    737.85         1.64      2.69      1.44
1024        1692.07   2756.87   1485.94         1.65      2.69      1.45

Clearly, we're using a small fraction of the real potential of the card. This has been done
largely to simplify the solution, so let's look now at using multiple blocks.
The output of one SM is a single linear sorted list. The output of two SMs is therefore two linear
sorted lists, which is not what we want. Consider the following dump of output from a two-block
version of the sort. The original values were in reverse sorting order from 0x01 to 0x100. The first value
shown is the array index, followed by the value at that array index.

FIGURE 6.22
GMEM graph sorted by size.

000:00000041 001:00000042 002:00000043 003:00000044 004:00000045 005:00000046 006:00000047 007:00000048
008:00000049 009:0000004a 010:0000004b 011:0000004c 012:0000004d 013:0000004e 014:0000004f 015:00000050
016:00000051 017:00000052 018:00000053 019:00000054 020:00000055 021:00000056 022:00000057 023:00000058
024:00000059 025:0000005a 026:0000005b 027:0000005c 028:0000005d 029:0000005e 030:0000005f 031:00000060
032:00000061 033:00000062 034:00000063 035:00000064 036:00000065 037:00000066 038:00000067 039:00000068
040:00000069 041:0000006a 042:0000006b 043:0000006c 044:0000006d 045:0000006e 046:0000006f 047:00000070
048:00000071 049:00000072 050:00000073 051:00000074 052:00000075 053:00000076 054:00000077 055:00000078
056:00000079 057:0000007a 058:0000007b 059:0000007c 060:0000007d 061:0000007e 062:0000007f 063:00000080
064:00000001 065:00000002 066:00000003 067:00000004 068:00000005 069:00000006 070:00000007 071:00000008
072:00000009 073:0000000a 074:0000000b 075:0000000c 076:0000000d 077:0000000e 078:0000000f 079:00000010
080:00000011 081:00000012 082:00000013 083:00000014 084:00000015 085:00000016 086:00000017 087:00000018
088:00000019 089:0000001a 090:0000001b 091:0000001c 092:0000001d 093:0000001e 094:0000001f 095:00000020
096:00000021 097:00000022 098:00000023 099:00000024 100:00000025 101:00000026 102:00000027 103:00000028
104:00000029 105:0000002a 106:0000002b 107:0000002c 108:0000002d 109:0000002e 110:0000002f 111:00000030
112:00000031 113:00000032 114:00000033 115:00000034 116:00000035 117:00000036 118:00000037 119:00000038
120:00000039 121:0000003a 122:0000003b 123:0000003c 124:0000003d 125:0000003e 126:0000003f 127:00000040

We can see there are two sorted lists here, one from 0x41 to 0x80 and the other from 0x01 to 0x40.
You might say that's not a great problem; we just need to merge the lists again. This is where we hit
the second issue: think about the memory access on a per-thread basis.
Assume we use just two threads, one per list. Thread 0 accesses element 0. Thread 1 accesses
element 64. It's not possible for the hardware to coalesce the two accesses, so the hardware has to issue
two independent memory fetches.
Even if we were to do the merge in zero time, assuming we have a maximum of 16 SMs and using all
of them did not flood the bandwidth of the device, in the best case we'd get 16 × 40 MB/min = 640
MB/min, or around 10.5 MB/s. Perhaps an alternative approach is required.

Sample sort
Sample sort tries to get around the problem of merge sort, that is, that you have to perform a merge
step. It works on the principle of splitting the data into N independent blocks of data such that each
block is partially sorted and we can guarantee the numbers in block N are less than those in block N + 1
and larger than those in block N − 1.
We'll look first at an example using three processors sorting 24 data items (Figure 6.23). The first
phase selects S equidistant samples from the dataset, where S is chosen as a fraction of N, the total
number of elements in the entire dataset. It is important that the samples are representative of the
dataset. Equidistant points work best where the data is reasonably uniformly distributed over the data
range. If the data contains large peaks that are not very wide in terms of sample points, a higher number
of samples may have to be used, or a sampling policy that concentrates the samples around the known
peaks. We'll choose equidistant points and assume the more common uniform distribution of points.
The samples are then sorted such that the lowest value is first in the list, assuming an ascending-order
sort. The sample data is then split into bins according to how many processors are available. The
data is scanned to determine how many elements fall into each bin. The number of elements in each bin
is then added to form a prefix sum that is used to index into an array.
A prefix sum is simply the sum of all elements prior to the current element. Looking at the example,
we can see nine elements were allocated to bin 0. Therefore, the start of the second dataset is element
9. The next list size, as it happens with this dataset, was also nine. Nine plus the previous sum is 18, and
thus we know the index of the next dataset, and so on.
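
Expressed in code, the starting index of each bin is just an exclusive prefix sum over the bin counts;
a minimal sketch (the array names here are illustrative):

// bin_count[i] holds how many elements fall into bin i.
// prefix_idx[i] becomes the index at which bin i starts in the output.
u32 sum = 0;
for (u32 bin = 0; bin < num_bins; bin++)
{
  prefix_idx[bin] = sum;  // sum of all earlier bin counts
  sum += bin_count[bin];
}

With the three-processor example, bin counts of 9, 9, and 6 give starting indices of 0, 9, and 18.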
The data is then shuffled, as sketched below, so all the bin 0 elements are written starting at the first
index given by the prefix sum (zero), bin 1 starting at the next, and so on. This achieves a partial sort
of the data such that all the values in bin N − 1 are less than those in bin N, which in turn are less than
those in bin N + 1. The bins are then dispatched to P processors that sort the lists in parallel. If an
in-place sort is used, then the list is fully sorted once the last block of data is sorted, without any merge
step. Figure 6.24 shows this same example using six processing elements.
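
The shuffle itself is a simple scatter using the prefix-sum starting indices; a sketch, assuming a
hypothetical find_bin() helper that returns the bin a given value belongs to:

// Scatter each source element to the next free slot of its bin.
// Note this advances prefix_idx, so keep a copy if the original
// starting offsets are needed again later.
for (u32 i = 0; i < num_elements; i++)
{
  const u32 bin = find_bin(src_data[i], sample_data, num_bins);
  dest_data[prefix_idx[bin]] = src_data[i];
  prefix_idx[bin]++;
}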
Notice that when we used three processors based on six samples, the bin sizes were 9, 9, and 6. With six
processors the bin sizes are 6, 3, 5, 4, 1, and 5. What we're actually interested in is the largest value, as on
P processors the largest block will determine the total time taken. In this example, the maximum is
reduced from nine elements to six, so a doubling of the number of processors has reduced the maximum
number of data points by only one-third.
FIGURE 6.23
Sample sort using three processors.

FIGURE 6.24
Sample sort using six processors.
The actual distribution will depend very much on the dataset. The most common dataset is actually
a mostly sorted list or one that is sorted with some new data items that must be added. This tends to
give a fairly equal distribution for most datasets. For problem datasets it’s possible to adjust the
sampling policy accordingly.
With a GPU we don’t just have six processors; we have N SMs, each of which we need to run
a number of blocks on. Each block would ideally be around 256 threads based simply on ideal memory
latency hiding, although we saw that 64 threads worked best with the radix sort we developed earlier in
the chapter. With the GTX470 device, we have 14 SMs with a maximum of eight blocks per SM.
Therefore, we need at least 112 blocks just to keep every SM busy. We’ll find out in practice which is
the best in due course. It is likely we will need substantially more blocks to load balance the work.
The first task, however, is to develop a CPU version of the sample sort algorithm and to understand
it. We’ll look at each operation in turn and how it could be converted to a parallel solution.
To follow the development of the code in the subsequent sections, it’s important you understand the
sample sort algorithm we just covered. It’s one of the more complex sorting algorithms and was chosen
both for performance reasons and also because it allows us to look at a real problem involving difficult
issues in terms of GPU implementation. If you browsed over the algorithm, please re-read the last few
pages until you are sure you understand how the algorithm works before proceeding.

Selecting samples
The first part of the sample sort is to select the S samples from the source data. The CPU version works
with a standard loop where the source data index is incremented by sample_interval elements each
iteration. The sample index counter, however, is incremented only by one per iteration.
__host__ TIMER_T select_samples_cpu(
  u32 * const sample_data,
  const u32 sample_interval,
  const u32 num_elements,
  const u32 * const src_data)
{
  const TIMER_T start_time = get_time();
  u32 sample_idx = 0;

  for (u32 src_idx = 0; src_idx < num_elements; src_idx += sample_interval)
  {
    sample_data[sample_idx] = src_data[src_idx];
    sample_idx++;
  }

  const TIMER_T end_time = get_time();
  return end_time - start_time;
}
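
The GPU version replaces the loop with one thread per sample. A minimal sketch is shown below; the
kernel name matches the error-check string used when it is invoked, but the host wrapper's exact
signature and the start_device_timer()/stop_device_timer() helpers are assumptions.

__global__ void select_samples_gpu_kernel(
  u32 * const sample_data,
  const u32 sample_interval,
  const u32 * const src_data)
{
  // One thread per sample: stride through the source data
  const u32 tid = (blockIdx.x * blockDim.x) + threadIdx.x;
  sample_data[tid] = src_data[tid * sample_interval];
}

__host__ TIMER_T select_samples_gpu(
  u32 * const sample_data,
  const u32 sample_interval,
  const u32 num_samples,
  const u32 * const src_data,
  const u32 num_threads,
  const char * const prefix)
{
  // Assumes num_samples is a multiple of num_threads
  const u32 num_blocks = num_samples / num_threads;

  start_device_timer();

  select_samples_gpu_kernel<<<num_blocks, num_threads>>>(sample_data,
    sample_interval, src_data);
  cuda_error_check(prefix, "Error invoking select_samples_gpu_kernel");

  const TIMER_T func_time = stop_device_timer();
  return func_time;
}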

Finally, to work out the index into the source data we simply multiply our sample data index (tid)
by the size of the sample interval. For the sake of simplicity we’ll only look at the case where the
dataset sizes are multiples of one another.
Notice that both the CPU and GPU versions return the time taken for the operation, something we'll do
in each phase of the sort so we can compare the timing of the individual operations.

Sorting the samples
Next we need to sort the samples we’ve selected. On the CPU we can simply call the qsort (quicksort)
routine from the standard C library.
__host__ TIMER_T sort_samples_cpu(
  u32 * const sample_data,
  const u32 num_samples)
{
  const TIMER_T start_time = get_time();

  qsort(sample_data, num_samples, sizeof(u32),
        &compare_func);

  const TIMER_T end_time = get_time();
  return end_time - start_time;
}
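
The compare_func routine passed to qsort is not shown above; a minimal sketch of a suitable
comparator for u32 values (this particular implementation is illustrative):

int compare_func(const void *a, const void *b)
{
  // Ascending order; explicit compares avoid the unsigned wrap-around
  // that a simple subtraction could introduce
  const u32 value_a = *((const u32 *) a);
  const u32 value_b = *((const u32 *) b);

  if (value_a < value_b)
    return -1;

  if (value_a > value_b)
    return 1;

  return 0;
}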

On the GPU, however, these standard libraries are not available, so we’ll use the radix sort we
developed earlier. Note, radix sort is also provided by the Thrust library, so you don’t have to write it as
we’ve done here. I won’t replicate the code here since we’ve already looked at it in detail in the shared
memory section.
One thing to note, however, is that the version we developed before does a radix sort on a single SM in
shared memory and then uses a shared memory reduction for the merge operation. This is not an
optimal solution, but we’ll use it for at least the initial tests.
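
For reference, a sorted sample list could also be produced with Thrust in just a few lines; a sketch
(the wrapper name is illustrative, and u32 is the typedef used throughout this chapter):

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>

void sort_samples_thrust(u32 * const sample_data, const u32 num_samples)
{
  // Copy the samples to the device, sort them there, and copy them back
  thrust::device_vector<u32> device_samples(sample_data, sample_data + num_samples);
  thrust::sort(device_samples.begin(), device_samples.end());
  thrust::copy(device_samples.begin(), device_samples.end(), sample_data);
}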

Counting the sample bins
Next we need to know how many values exist in each sample bin. The CPU code for this is as follows:
__host__ TIMER_T count_bins_cpu(const u32 num_samples,
  const u32 num_elements,
  const u32 * const src_data,
  const u32 * const sample_data,
  u32 * const bin_count)
{
  const TIMER_T start_time = get_time();

  for (u32 src_idx = 0; src_idx < num_elements; src_idx++)
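  {
    /* Illustrative completion of the loop body: locate the bin this value
       belongs to and count it. A simple linear scan over the sorted samples
       is shown here, assuming one bin per sample; a binary search could
       equally be used. */
    const u32 data = src_data[src_idx];
    u32 bin = 0;

    while ( ((bin + 1) < num_samples) && (data >= sample_data[bin + 1]) )
    {
      bin++;
    }

    bin_count[bin]++;
  }

  const TIMER_T end_time = get_time();
  return end_time - start_time;
}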
