Vivado HLS Opmizaon
Methodology Guide
UG1270 (v2017.4) December 20, 2017
Revision History
The following table shows the revision history for this document.
Date Version Revision
12/20/2017
2017.4 Initial Xilinx release.
UG1270 (v2017.4) December 20, 2017 www.xilinx.com
Table of Contents

Revision History
Chapter 1: Introduction
    HLS Pragmas
    OpenCL Attributes
    Directives
Chapter 2: Optimizing the Hardware Function
    Hardware Function Optimization Methodology
    Baseline The Hardware Functions
    Optimization for Metrics
    Pipeline for Performance
Chapter 3: Optimize Structures for Performance
    Reducing Latency
    Reducing Area
    Design Optimization Workflow
Chapter 4: Data Access Patterns
    Algorithm with Poor Data Access Patterns
    Algorithm With Optimal Data Access Patterns
Chapter 5: Standard Horizontal Convolution
    Optimal Horizontal Convolution
    Optimal Vertical Convolution
    Optimal Border Pixel Convolution
    Optimal Data Access Patterns
Appendix A: OpenCL Attributes
    always_inline
    opencl_unroll_hint
    reqd_work_group_size
    vec_type_hint
    work_group_size_hint
    xcl_array_partition
    xcl_array_reshape
    xcl_data_pack
    xcl_dataflow
    xcl_dependence
    xcl_max_work_group_size
    xcl_pipeline_loop
    xcl_pipeline_workitems
    xcl_reqd_pipe_depth
    xcl_zero_global_work_offset
Appendix B: HLS Pragmas
    pragma HLS allocation
    pragma HLS array_map
    pragma HLS array_partition
    pragma HLS array_reshape
    pragma HLS clock
    pragma HLS data_pack
    pragma HLS dataflow
    pragma HLS dependence
    pragma HLS expression_balance
    pragma HLS function_instantiate
    pragma HLS inline
    pragma HLS interface
    pragma HLS latency
    pragma HLS loop_flatten
    pragma HLS loop_merge
    pragma HLS loop_tripcount
    pragma HLS occurrence
    pragma HLS pipeline
    pragma HLS protocol
    pragma HLS reset
    pragma HLS resource
    pragma HLS stream
    pragma HLS top
    pragma HLS unroll
Appendix C: Additional Resources and Legal Notices
    References
    Please Read: Important Legal Notices
Chapter 1
Introduction
This guide provides details on how to perform optimizations using Vivado HLS. The optimization process consists of directives, which specify which optimizations are performed, and a methodology, which shows how optimizations may be applied in a deterministic and efficient manner.
HLS Pragmas
Optimizations in Vivado HLS
In both SDAccel and SDSoC projects, the hardware kernel must be synthesized from the OpenCL, C, or C++ language into RTL that can be implemented into the programmable logic of a Xilinx device. Vivado HLS synthesizes the RTL from the OpenCL, C, and C++ language descriptions.
Vivado HLS is intended to work with your SDAccel or SDSoC Development Environment project without interaction. However, Vivado HLS also provides pragmas that can be used to optimize the design: reduce latency, improve throughput performance, and reduce area and device resource utilization of the resulting RTL code. These pragmas can be added directly to the source code for the kernel.
IMPORTANT!:
Although the SDSoC environment supports the use of HLS pragmas, it does not support pragmas applied to any argument of the function interface (interface, array_partition, or data_pack pragmas). Refer to "Optimizing the Hardware Function" in the SDSoC Environment Optimization Guide (UG1235) for more information.
The Vivado HLS pragmas include the optimization types specified below:
Table 1: Vivado HLS Pragmas by Type

Kernel Optimization:
    pragma HLS allocation
    pragma HLS clock
    pragma HLS expression_balance
    pragma HLS latency
    pragma HLS reset
    pragma HLS resource
    pragma HLS top
Function Inlining:
    pragma HLS inline
    pragma HLS function_instantiate
Interface Synthesis:
    pragma HLS interface
    pragma HLS protocol
Task-level Pipeline:
    pragma HLS dataflow
    pragma HLS stream
Pipeline:
    pragma HLS pipeline
    pragma HLS occurrence
Loop Unrolling:
    pragma HLS unroll
    pragma HLS dependence
Loop Optimization:
    pragma HLS loop_flatten
    pragma HLS loop_merge
    pragma HLS loop_tripcount
Array Optimization:
    pragma HLS array_map
    pragma HLS array_partition
    pragma HLS array_reshape
Structure Packing:
    pragma HLS data_pack
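To make the mechanism concrete, the following is a minimal sketch of how such pragmas are embedded in kernel source. The kernel, interface modes, and sizes are illustrative (not taken from this guide); a standard C++ compiler simply ignores the unknown HLS pragmas, so the same source remains portable.

```cpp
#include <cassert>

// Hypothetical kernel: element-wise add of two fixed-size vectors.
// The HLS pragmas are compile-time hints to Vivado HLS only.
void vadd(const int a[64], const int b[64], int out[64]) {
#pragma HLS INTERFACE ap_memory port=a
#pragma HLS INTERFACE ap_memory port=b
#pragma HLS INTERFACE ap_memory port=out
    Loop: for (int i = 0; i < 64; i++) {
#pragma HLS PIPELINE II=1
        // One multiply-free operation per iteration; with II=1 the
        // loop accepts a new iteration every clock cycle in hardware.
        out[i] = a[i] + b[i];
    }
}
```

Note that pragmas are placed inside the scope (function body or loop body) to which they apply.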
OpenCL Attributes
Optimizations in OpenCL
This secon describes OpenCL aributes that can be added to source code to assist system
opmizaon by the SDAccel compiler, xocc, the SDSoC system compilers, sdscc and sds++,
and Vivado HLS synthesis.
SDx provides OpenCL aributes to opmize your code for data movement and kernel
performance. The goal of data movement opmizaon is to maximize the system level data
throughput by maximizing interface bandwidth ulizaon and DDR bandwidth ulizaon. The
goal of kernel computaon opmizaon is to create processing logic that can consume all the
data as soon as they arrive at kernel interfaces. This is generally achieved by expanding the
processing code to match the data path with techniques such as funcon inlining and pipelining,
loop unrolling, array paroning, dataowing, etc.
The OpenCL aributes include the types specied below:
Table 2: OpenCL __attributes__ by Type

Kernel Size:
    reqd_work_group_size
    vec_type_hint
    work_group_size_hint
    xcl_max_work_group_size
    xcl_zero_global_work_offset
Function Inlining:
    always_inline
Task-level Pipeline:
    xcl_dataflow
    xcl_reqd_pipe_depth
Pipeline:
    xcl_pipeline_loop
    xcl_pipeline_workitems
Loop Unrolling:
    opencl_unroll_hint
Array Optimization:
    xcl_array_partition
    xcl_array_reshape

Note: Array variables only accept a single array optimization attribute.
TIP: The SDAccel and SDSoC compilers also support many of the standard attributes supported by gcc, such as always_inline, noinline, unroll, and nounroll.
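As a minimal sketch of the gcc-style attribute syntax mentioned in the tip (the functions themselves are illustrative, not from this guide), attributes are attached directly to the function declaration:

```cpp
#include <cassert>

// gcc-style attributes attach to the declaration; these two compile
// as-is with gcc/clang and are also honored by the SDx compilers.
__attribute__((always_inline)) inline int square(int x) {
    return x * x;  // forced inline at every call site
}

__attribute__((noinline)) int cube(int x) {
    return x * square(x);  // kept as a real function call
}
```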
Directives
To view details on the directives in the following table, see the Command Reference section in Vivado Design Suite User Guide: High-Level Synthesis (UG902).
Table 3: Vivado HLS Directives by Type

Kernel Optimization:
    set_directive_allocation
    set_directive_clock
    set_directive_expression_balance
    set_directive_latency
    set_directive_reset
    set_directive_resource
    set_directive_top
Function Inlining:
    set_directive_inline
    set_directive_function_instantiate
Interface Synthesis:
    set_directive_interface
    set_directive_protocol
Task-level Pipeline:
    set_directive_dataflow
    set_directive_stream
Pipeline:
    set_directive_pipeline
    set_directive_occurrence
Loop Unrolling:
    set_directive_unroll
    set_directive_dependence
Loop Optimization:
    set_directive_loop_flatten
    set_directive_loop_merge
    set_directive_loop_tripcount
Array Optimization:
    set_directive_array_map
    set_directive_array_partition
    set_directive_array_reshape
Structure Packing:
    set_directive_data_pack
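Unlike pragmas, directives live in a Tcl script kept with the solution rather than in the source code, which allows different optimizations to be tried without editing the C/C++. A hypothetical sketch (the function name foo, loop labels, and values are illustrative, not from this guide):

```tcl
# Hypothetical directives.tcl: the same optimizations as pragmas,
# applied externally to function foo and its labeled loops.
set_directive_pipeline -II 1 "foo/Loop2"
set_directive_array_partition -type cyclic -factor 4 "foo" in1
set_directive_loop_tripcount -min 32 -max 1024 "foo/Loop1"
```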
Chapter 2
Optimizing the Hardware Function
The SDSoC environment employs heterogeneous cross-compilation, with ARM CPU-specific cross-compilers for the Zynq-7000 SoC and Zynq UltraScale+ MPSoC CPUs, and Vivado HLS as a PL cross-compiler for hardware functions. This section explains the default behavior and optimization directives associated with the Vivado HLS cross-compiler.
The default behavior of Vivado HLS is to execute functions and loops in a sequential manner such that the hardware is an accurate reflection of the C/C++ code. Optimization directives can be used to enhance the performance of the hardware function, allowing pipelining, which substantially increases the performance of the functions. This chapter outlines a general methodology for optimizing your design for high performance.
There are many possible goals when trying to optimize a design using Vivado HLS. The methodology assumes you want to create a design with the highest possible performance, processing one sample of new input data every clock cycle, and so addresses those optimizations before the ones used for reducing latency or resources.
Detailed explanations of the optimizations discussed here are provided in Vivado Design Suite User Guide: High-Level Synthesis (UG902).
It is highly recommended to review the methodology and obtain a global perspective of hardware function optimization before reviewing the details of specific optimizations.
Hardware Function Optimization Methodology
Hardware functions are synthesized into hardware in the programmable logic (PL) by the Vivado HLS compiler. This compiler automatically translates C/C++ code into an FPGA hardware implementation, and as with all compilers, does so using compiler defaults. In addition to the compiler defaults, Vivado HLS provides a number of optimizations that are applied to the C/C++ code through the use of pragmas in the code. This chapter explains the optimizations that can be applied and a recommended methodology for applying them.
There are two flows for optimizing the hardware functions:
Top-down flow: In this flow, program decomposition into hardware functions proceeds top-down within the SDSoC environment, letting the system compiler create pipelines of functions that automatically operate in dataflow mode. The microarchitecture for each hardware function is optimized using Vivado HLS.
Bottom-up flow: In this flow, the hardware functions are optimized in isolation from the system using the Vivado HLS compiler provided in the Vivado Design Suite. The hardware functions are analyzed, optimization directives can be applied to create an implementation other than the default, and the resulting optimized hardware functions are then incorporated into the SDSoC environment.
The bottom-up flow is often used in organizations where the software and hardware are optimized by different teams, and can be used by software programmers who wish to take advantage of existing hardware implementations from within their organization or from partners. Both flows are supported, and the same optimization methodology is used in either case. Both workflows result in the same high-performance system. Xilinx sees the choice as a workflow decision made by individual teams and organizations and provides no recommendation on which flow to use. Examples of both flows are provided in the SDSoC Environment Optimization Guide (UG1235).
The opmizaon methodology for hardware funcons is shown in the gure below.
Figure: Hardware Function Optimization Methodology

Simulate Design: Validate the C function.
Synthesize Design: Baseline design.
1. Initial Optimizations: Define interfaces (and data packing); define loop trip counts.
2. Pipeline for Performance: Pipeline and dataflow.
3. Optimize Structures for Performance: Partition memories and ports; remove false dependencies.
4. Reduce Latency: Optionally specify latency requirements.
5. Improve Area: Optionally recover resources through sharing.
The gure above details all the steps in the methodology and the subsequent secons in this
chapter explain the opmizaons in detail.
IMPORTANT!: Designs will reach the opmum performance aer step 3.
Step 4 is used to minimize, or specically control, the latency through the design and is only
required for applicaons where this is of concern. Step 5 explains how to reduce the resources
required for hardware implementaon and is typically only applied when larger hardware
funcons fail to implement in the available resources. The FPGA has a xed number of resources,
and there is typically no benet in creang a smaller implementaon if the performance goals
have been met.
Baseline The Hardware Functions
Before seeking to perform any hardware function optimization, it is important to understand the performance achieved with the existing code and compiler defaults, and to appreciate how performance is measured. This is achieved by selecting the functions to implement in hardware and building the project.
After the project has been built, a report is available in the reports section of the IDE (and provided at <project name>/<build_config>/_sds/vhls/<hw_function>/solution/syn/report/<hw_function>.rpt). This report details the performance estimates and utilization estimates.
The key factors in the performance estimates are the timing, interval, and latency, in that order.
The timing summary shows the target and estimated clock frequency. If the estimated clock period is greater than the target, the hardware will not function at the target clock frequency. The clock frequency should be reduced by using the Data Motion Network Clock Frequency option in the Project Settings. Alternatively, because this is only an estimate at this point in the flow, it might be possible to proceed through the remainder of the flow if the estimate only exceeds the target by 20%. Further optimizations are applied when the bitstream is generated, and it might still be possible to satisfy the timing requirements. However, this is an indication that the hardware function is not guaranteed to meet timing.
The initiation interval (II) is the number of clock cycles before the function can accept new inputs and is generally the most critical performance metric in any system. In an ideal hardware function, the hardware processes data at the rate of one sample per clock cycle. If the largest data set passed into the hardware is size N (e.g., my_array[N]), the most optimal II is N + 1. This means the hardware function processes N data samples in N clock cycles and can accept new data one clock cycle after all N samples are processed. It is possible to create a hardware function with an II < N; however, this requires greater resources in the PL with typically little benefit. The hardware function will often be ideal as it consumes and produces data at a rate faster than the rest of the system.
The loop initiation interval is the number of clock cycles before the next iteration of a loop starts to process data. This metric becomes important as you delve deeper into the analysis to locate and remove performance bottlenecks.
The latency is the number of clock cycles required for the function to compute all output values. This is simply the lag from when data is applied until when it is ready. For most applications this is of little concern, especially when the latency of the hardware function vastly exceeds that of the software or system functions, such as DMA. It is, however, a performance metric that you should review and confirm is not an issue for your application.
The loop iteration latency is the number of clock cycles it takes to complete one iteration of a loop, and the loop latency is the number of cycles to execute all iterations of the loop.
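These loop metrics are related by a back-of-the-envelope calculation that is worth keeping in mind when reading reports. The formulas below are the commonly used approximations for sequential and pipelined loops, not figures taken verbatim from this guide; the numbers in the comments are illustrative.

```cpp
#include <cassert>

// A sequential (non-pipelined) loop takes roughly one full iteration
// latency per iteration.
int sequential_loop_latency(int iter_latency, int trip_count) {
    return iter_latency * trip_count;
}

// A pipelined loop overlaps iterations: after the first iteration
// fills the pipeline, a new iteration completes every II cycles.
int pipelined_loop_latency(int iter_latency, int ii, int trip_count) {
    return iter_latency + ii * (trip_count - 1);
}
```

For example, a loop with an iteration latency of 4 and a trip count of 512 takes roughly 2048 cycles sequentially, but only 515 cycles when pipelined with II = 1, which is why pipelining is the first performance optimization applied in this methodology.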
The Area Esmates secon of the report details how many resources are required in the PL to
implement the hardware funcon and how many are available on the device. The key metric here
is the Ulizaon (%). The Ulizaon (%) should not exceed 100% for any of the resources. A gure
greater than 100% means there are not enough resources to implement the hardware funcon,
and a larger FPGA device might be required. As with the ming, at this point in the ow, this is an
esmate. If the numbers are only slightly over 100%, it might be possible for the hardware to be
opmized during bitstream creaon.
You should already have an understanding of the required performance of your system and what
metrics are required from the hardware funcons. However, even if you are unfamiliar with
hardware concepts such as clock cycles, you are now aware that the highest performing
hardware funcons have an II = N + 1, where N is the largest data set processed by the funcon.
With an understanding of the current design performance and a set of baseline performance
metrics, you can now proceed to apply opmizaon direcves to the hardware funcons.
Optimization for Metrics
The following table shows the first directive you should think about adding to your design.

Table 4: Optimization Strategy Step 1: Optimization for Metrics

LOOP_TRIPCOUNT: Used for loops that have variable bounds. Provides an estimate for the loop iteration count. This has no impact on synthesis, only on reporting.
A common issue when hardware functions are first compiled is report files showing the latency and interval as a question mark "?" rather than as numerical values. If the design has loops with variable loop bounds, the compiler cannot determine the latency or II and uses the "?" to indicate this condition. Variable loop bounds are where the loop iteration limit cannot be resolved at compile time, as when the loop iteration limit is an input argument to the hardware function, such as a variable height, width, or depth parameter.
To resolve this condition, use the hardware function report to locate the lowest-level loop that fails to report a numerical value, and use the LOOP_TRIPCOUNT directive to apply an estimated tripcount. The tripcount is the minimum, average, and/or maximum number of expected iterations. This allows values for latency and interval to be reported and allows implementations with different optimizations to be compared.
Because the LOOP_TRIPCOUNT value is only used for reporting and has no impact on the resulting hardware implementation, any value can be used. However, an accurate expected value results in more useful reports.
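A minimal sketch of the pragma form of this directive is shown below. The function and bounds are illustrative, not from this guide; the pragma is ignored by a standard C++ compiler and, in Vivado HLS, affects only the report, not the generated hardware.

```cpp
#include <cassert>

// Hypothetical function with a variable loop bound: without a
// tripcount hint, the report would show "?" for latency and II.
int accumulate(const int *data, int len) {
    int sum = 0;
    Sum_Loop: for (int i = 0; i < len; i++) {
#pragma HLS LOOP_TRIPCOUNT min=32 max=1024 avg=512
        sum += data[i];
    }
    return sum;
}
```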
Pipeline for Performance
The next stage in creating a high-performance design is to pipeline the functions, loops, and operations. Pipelining results in the greatest level of concurrency and the highest level of performance. The following table shows the directives you can use for pipelining.

Table 5: Optimization Strategy Step 2: Pipeline for Performance

PIPELINE: Reduces the initiation interval by allowing the concurrent execution of operations within a loop or function.
DATAFLOW: Enables task-level pipelining, allowing functions and loops to execute concurrently. Used to minimize interval.
RESOURCE: Specifies pipelining on the hardware resource used to implement a variable (array, arithmetic operation).
Config Compile: Allows loops to be automatically pipelined based on their iteration count when using the bottom-up flow.
At this stage of the optimization process, you want to create as much concurrent operation as possible. You can apply the PIPELINE directive to functions and loops. You can use the DATAFLOW directive at the level that contains the functions and loops to make them work in parallel. Although rarely required, the RESOURCE directive can be used to squeeze out the highest levels of performance.
A recommended strategy is to work from the bottom up and be aware of the following:
Some functions and loops contain sub-functions. If the sub-function is not pipelined, the function above it might show limited improvement when it is pipelined. The non-pipelined sub-function will be the limiting factor.
Some functions and loops contain sub-loops. When you use the PIPELINE directive, the directive automatically unrolls all loops in the hierarchy below. This can create a great deal of logic. It might make more sense to pipeline the loops in the hierarchy below.
For cases where it does make sense to pipeline the upper hierarchy and unroll any loops lower in the hierarchy, loops with variable bounds cannot be unrolled, and any loops and functions in the hierarchy above these loops cannot be pipelined. To address this issue, pipeline these loops with variable bounds, and use the DATAFLOW optimization to ensure the pipelined loops operate concurrently to maximize the performance of the task that contains the loops. Alternatively, rewrite the loop to remove the variable bound: apply a maximum upper bound with a conditional break.
The basic strategy at this point in the optimization process is to pipeline the tasks (functions and loops) as much as possible. For detailed information on which functions and loops to pipeline, refer to Hardware Function Pipeline Strategies.
Although not commonly used, you can also apply pipelining at the operator level. For example, wire routing in the FPGA can introduce large and unanticipated delays that make it difficult for the design to be implemented at the required clock frequency. In this case, you can use the RESOURCE directive to pipeline specific operations such as multipliers, adders, and block RAM to add additional pipeline register stages at the logic level and allow the hardware function to process data at the highest possible performance level without the need for recursion.
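A sketch of this operator-level pipelining is shown below. The core name is illustrative; the set of available cores is device- and library-dependent and is listed in Vivado Design Suite User Guide: High-Level Synthesis (UG902). A standard C++ compiler ignores the pragma.

```cpp
#include <cassert>

// Request a pipelined multiplier core with extra register stages for
// the operation that produces r (core name illustrative; see UG902).
int mul_pipe(int a, int b) {
    int r = a * b;
#pragma HLS RESOURCE variable=r core=MulnS latency=3
    return r;
}
```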
Note: The Cong commands are used to change the opmizaon default sengs and are only available
from within Vivado HLS when using a boom-up ow. Refer to Vivado Design Suite User Guide: High-Level
Synthesis (UG902) for more details.
Hardware Function Pipeline Strategies
The key optimization directives for obtaining a high-performance design are the PIPELINE and DATAFLOW directives. This section discusses in detail how to apply these directives for various C code architectures.
Fundamentally, there are two types of C/C++ functions: those that are frame-based and those that are sample-based. No matter which coding style is used, the hardware function can be implemented with the same performance in both cases. The difference is only in how the optimization directives are applied.
Frame-Based C Code
The primary characteristic of a frame-based coding style is that the function processes multiple data samples (a frame of data), typically supplied as an array or pointer with data accessed through pointer arithmetic during each transaction (a transaction is considered to be one complete execution of the C function). In this coding style, the data is typically processed through a series of loops or nested loops.
An example outline of frame-based C code is shown below.

void foo(
    data_t in1[HEIGHT][WIDTH],
    data_t in2[HEIGHT][WIDTH],
    data_t out[HEIGHT][WIDTH]) {
  Loop1: for (int i = 0; i < HEIGHT; i++) {
    Loop2: for (int j = 0; j < WIDTH; j++) {
      out[i][j] = in1[i][j] * in2[i][j];
      Loop3: for (int k = 0; k < NUM_BITS; k++) {
        // ...
      }
    }
  }
}
When seeking to pipeline any C/C++ code for maximum performance in hardware, you want to place the pipeline optimization directive at the level where a sample of data is processed.
The above example is representative of code used to process an image or video frame and can be used to highlight how to effectively pipeline hardware functions. Two sets of input are provided as frames of data to the function, and the output is also a frame of data. There are multiple locations where this function can be pipelined:
At the level of function foo.
At the level of loop Loop1.
At the level of loop Loop2.
At the level of loop Loop3.
Reviewing the advantages and disadvantages of placing the PIPELINE directive at each of these locations helps explain the best location to place the pipeline directive for your code.
Funcon Level: The funcon accepts a frame of data as input (in1 and in2). If the funcon is
pipelined with II = 1—read a new set of inputs every clock cycle—this informs the compiler to
read all HEIGHT*WIDTH values of in1 and in2 in a single clock cycle. It is unlikely this is the
design you want.
If the PIPELINE direcve is applied to funcon foo, all loops in the hierarchy below this level
must be unrolled. This is a requirement for pipelining, namely, there cannot be sequenal logic
inside the pipeline. This would create HEIGHT*WIDTH*NUM_ELEMENT copies of the logic,
which would lead to a large design.
Because the data is accessed in a sequential manner, the arrays on the interface to the hardware
function can be implemented as multiple types of hardware interface:

• Block RAM interface
• AXI4 interface
• AXI4-Lite interface
• AXI4-Stream interface
• FIFO interface

A block RAM interface can be implemented as a dual-port interface supplying two samples per
clock. The other interface types can only supply one sample per clock. This would result in a
bottleneck: a large, highly parallel hardware design unable to process all the data in parallel,
which would waste hardware resources.
Loop1 Level: The logic in Loop1 processes an entire row of the two-dimensional matrix. Placing
the PIPELINE directive here would create a design which seeks to process one row in each clock
cycle. Again, this would unroll the loops below and create additional logic. However, the only way
to make use of the additional hardware would be to transfer an entire row of data each clock
cycle: WIDTH data words, with each word being <number of bits in data_t> bits wide.

Because it is unlikely the host code running on the PS can process such large data words, this
would again result in a case where there are many highly parallel hardware resources that cannot
operate in parallel due to bandwidth limitations.
Loop2 Level: The logic in Loop2 processes one sample from the arrays. In an image
algorithm, this is the level of a single pixel. This is the level to pipeline if the design is to process
one sample per clock cycle. This is also the rate at which the interfaces consume and produce
data to and from the PS.

This will cause Loop3 to be completely unrolled, so that the design still processes one sample per
clock. It is a requirement of pipelining that all the operations in Loop3 execute in parallel. In a
typical design, the logic in Loop3 is a shift register or is processing bits within a word. To execute
at one sample per clock, you want these processes to occur in parallel, and hence you want to
unroll the loop. The hardware function created by pipelining Loop2 processes one data sample
per clock and creates parallel logic only where needed to achieve the required level of data
throughput.
Loop3 Level: As stated above, given that Loop2 operates on each data sample or pixel, Loop3 will
typically be doing bit-level or data shifting tasks, so this level performs multiple operations per
pixel. Pipelining this level would mean performing each operation in the loop once per clock, and
thus NUM_BITS clocks per pixel: processing at the rate of multiple clocks per pixel or data
sample.

For example, Loop3 might contain a shift register holding the previous pixels required for a
windowing or convolution algorithm. Adding the PIPELINE directive at this level informs the
compiler to shift one data value every clock cycle. The design would only return to the logic in
Loop2 and read the next inputs after NUM_BITS iterations, resulting in a very slow data
processing rate.
The ideal location to pipeline in this example is Loop2.
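To make the preferred placement concrete, here is a minimal compilable sketch with the PIPELINE directive inside Loop2; the HEIGHT, WIDTH, and data_t definitions are illustrative assumptions, and Loop3 is omitted for brevity:

```cpp
#include <cassert>

const int HEIGHT = 4;
const int WIDTH = 4;
typedef int data_t;

// Illustrative sketch: PIPELINE placed inside Loop2, the per-sample loop,
// so one output is produced per clock. HEIGHT, WIDTH, and data_t are
// assumptions, and Loop3 is omitted for brevity.
void foo(data_t in1[HEIGHT][WIDTH],
         data_t in2[HEIGHT][WIDTH],
         data_t out[HEIGHT][WIDTH]) {
    Loop1: for (int i = 0; i < HEIGHT; i++) {
        Loop2: for (int j = 0; j < WIDTH; j++) {
#pragma HLS PIPELINE II=1
            out[i][j] = in1[i][j] * in2[i][j];
        }
    }
}
```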
When dealing with frame-based code, you will want to pipeline at the loop level, and typically to
pipeline the loop that operates at the level of a sample. If in doubt, place a print statement into
the C code to confirm this is the level you wish to execute on each clock cycle.
For cases where there are multiple loops at the same level of hierarchy (the example above
shows only a set of nested loops), the best location to place the PIPELINE directive can be
determined for each loop, and the DATAFLOW directive can then be applied to the function to
ensure each of the loops executes in a concurrent manner.
Sample-Based C Code

An example outline of sample-based C code is shown below. The primary characteristic of this
coding style is that the function processes a single data sample during each transaction.
void foo (data_t *in, data_t *out) {
  static data_t acc;
  Loop1: for (int i=N-1;i>=0;i--) {
    acc+= ..some calculation..;
  }
  *out=acc>>N;
}
Another characteristic of the sample-based coding style is that the function often contains a static
variable: a variable whose value must be remembered between invocations of the function, such
as an accumulator or sample counter.

With sample-based code, the location of the PIPELINE directive is clear: to achieve an II of 1
and process one data value each clock cycle, the function itself must be pipelined.

This unrolls any loops inside the function and creates additional hardware logic, but there is no
way around this. If only Loop1 is pipelined, it takes a minimum of N clock cycles to complete, and
only then can the function read the next input value.

When dealing with C code that processes at the sample level, the strategy is always to pipeline
the function.
In this type of coding style, the loops are typically operating on arrays, performing a shift
register or line buffer function. It is not uncommon to partition these arrays into individual
elements, as discussed in Chapter 3: Optimize Structures for Performance, to ensure all samples
are shifted in a single clock cycle. If the array is implemented in a block RAM, only a maximum of
two samples can be read or written in each clock cycle, creating a data processing bottleneck.
The solution here is to pipeline function foo. Doing so results in a design that processes one
sample per clock.
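Putting this together, a minimal compilable sketch of the pipelined sample-based function is shown below; N, data_t, and the simple accumulation standing in for the elided calculation are assumptions:

```cpp
#include <cassert>

const int N = 4;
typedef int data_t;

// Illustrative sketch: the PIPELINE directive is placed on the function
// itself, which unrolls Loop1. N, data_t, and the simple accumulation
// standing in for the elided calculation are assumptions.
void sample_foo(data_t *in, data_t *out) {
#pragma HLS PIPELINE II=1
    static data_t acc = 0;  // value persists between invocations
    Loop1: for (int i = N - 1; i >= 0; i--) {
        acc += *in;
    }
    *out = acc >> N;
}
```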
Chapter 3

Optimize Structures for Performance

C code can contain descriptions that prevent a function or loop from being pipelined with the
required performance. This is often implied by the structure of the C code or the default logic
structures used to implement the PL logic. In some cases, this might require a code modification,
but in most cases these issues can be addressed using additional optimization directives.
The following example shows a case where an optimization directive is used to improve the
structure of the implementation and the performance of pipelining. In this initial example, the
PIPELINE directive is added to a loop to improve its performance. The example code
shows a loop being used inside a function.
#include "bottleneck.h"
dout_t bottleneck(...) {
  ...
  SUM_LOOP: for(i=3;i<N;i=i+4) {
#pragma HLS PIPELINE
    sum += mem[i] + mem[i-1] + mem[i-2] + mem[i-3];
  }
  ...
}
When the code above is compiled into hardware, the following message appears as output:
INFO: [SCHED 61] Pipelining loop 'SUM_LOOP'.
WARNING: [SCHED 69] Unable to schedule 'load' operation ('mem_load_2',
bottleneck.c:62) on array 'mem' due to limited memory ports.
INFO: [SCHED 61] Pipelining result: Target II: 1, Final II: 2, Depth: 3.
The issue in this example is that arrays are implemented using the efficient block RAM resources
in the PL fabric. This results in a small, cost-efficient, fast design. The disadvantage of block RAM
is that, like other memories such as DDR or SRAM, it has a limited number of data ports,
typically a maximum of two.

In the code above, four data values from mem are required to compute the value of sum. Because
mem is an array and implemented in a block RAM that only has two data ports, only two values
can be read (or written) in each clock cycle. With this configuration, it is impossible to compute
the value of sum in one clock cycle and thus consume or produce data with an II of 1 (process
one data sample per clock).
The memory port limitation issue can be solved by using the ARRAY_PARTITION directive on the
mem array. This directive partitions arrays into smaller arrays, improving the data structure by
providing more data ports and allowing a higher performance pipeline.

With the additional directive shown below, array mem is partitioned into two dual-port memories
so that all four reads can occur in one clock cycle. There are multiple options for partitioning an
array. In this case, cyclic partitioning with a factor of two ensures the first partition contains
elements 0, 2, 4, etc., from the original array, and the second partition contains elements 1, 3, 5,
etc. Because the partitioning ensures there are now two dual-port block RAMs (with a total of
four data ports), elements 0, 1, 2, and 3 can all be read in a single clock cycle.
Note: The ARRAY_PARTITION directive cannot be used on arrays which are arguments of the function
selected as an accelerator.
#include "bottleneck.h"
dout_t bottleneck(...) {
#pragma HLS ARRAY_PARTITION variable=mem cyclic factor=2 dim=1
  ...
  SUM_LOOP: for(i=3;i<N;i=i+4) {
#pragma HLS PIPELINE
    sum += mem[i] + mem[i-1] + mem[i-2] + mem[i-3];
  }
  ...
}
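The even/odd split produced by cyclic partitioning with a factor of two can be modeled in plain C++ as a behavioral illustration (this models the element mapping only, not the generated hardware):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Behavioral model of cyclic partitioning with factor = 2 (illustration
// only, not HLS output): bank 0 receives elements 0, 2, 4, ... and bank 1
// receives elements 1, 3, 5, ..., so any four consecutive elements fall
// across the four ports of two dual-port memories.
void cyclic_partition(const std::vector<int> &mem,
                      std::vector<int> &bank0,
                      std::vector<int> &bank1) {
    for (std::size_t i = 0; i < mem.size(); i++) {
        if (i % 2 == 0) {
            bank0.push_back(mem[i]);
        } else {
            bank1.push_back(mem[i]);
        }
    }
}
```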
Other such issues might be encountered when trying to pipeline loops and functions. The
following table lists the directives that are likely to address these issues by helping to reduce
bottlenecks in data structures.
Table 6: Optimization Strategy Step 3: Optimize Structures for Performance
Directives and Configurations Description
ARRAY_PARTITION Partitions large arrays into multiple smaller arrays or into
individual registers to improve access to data and remove
block RAM bottlenecks.
DEPENDENCE Provides additional information that can overcome loop-
carry dependencies and allow loops to be pipelined (or
pipelined with lower intervals).
INLINE Inlines a function, removing all function hierarchy. Enables
logic optimization across function boundaries and improves
latency/interval by reducing function call overhead.
UNROLL Unrolls for-loops to create multiple independent operations
rather than a single collection of operations, allowing
greater hardware parallelism. This also allows for partial
unrolling of loops.
Config Array Partition This configuration determines how arrays are automatically
partitioned, including global arrays, and if the partitioning
impacts array ports.
Config Compile Controls synthesis specific optimizations such as the
automatic loop pipelining and floating point math
optimizations.
Config Schedule Determines the effort level to use during the synthesis
scheduling phase, the verbosity of the output messages,
and whether II should be relaxed in pipelined tasks to
achieve timing.
Config Unroll Allows all loops below the specified number of loop
iterations to be automatically unrolled.
In addition to the ARRAY_PARTITION directive, the configuration for array partitioning can be
used to automatically partition arrays.
The DEPENDENCE directive might be required to remove implied dependencies when pipelining
loops. Such dependencies are reported by message SCHED-68:
@W [SCHED-68] Target II not met due to carried dependence(s)
The INLINE directive removes function boundaries. This can be used to bring logic or loops up
one level of hierarchy. It might be more efficient to pipeline the logic in a function by including it
in the function above it, and to merge loops into the function above them, where the DATAFLOW
optimization can be used to execute all the loops concurrently without the overhead of the
intermediate sub-function call. This might lead to a higher performing design.
The UNROLL directive might be required for cases where a loop cannot be pipelined with the
required II. If a loop can only be pipelined with II = 4, it will constrain the other loops and
functions in the system to be limited to II = 4. In some cases, it might be worth unrolling or
partially unrolling the loop to create more logic and remove a potential bottleneck. If the loop
can only achieve II = 4, unrolling the loop by a factor of 4 creates logic that can process four
iterations of the loop in parallel and achieve II = 1.
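The effect can be sketched manually; the loop below is the hand-written equivalent of applying a 4x partial unroll to an accumulation loop, with the accumulator split four ways to break the loop-carried dependence (N and the data are assumptions):

```cpp
#include <cassert>

const int N = 16;

// Hand-written equivalent of a 4x partial unroll: four partial sums
// accumulate in parallel and are combined at the end. N is an
// illustrative assumption.
int sum16(const int mem[N]) {
    int part0 = 0, part1 = 0, part2 = 0, part3 = 0;
    Sum: for (int i = 0; i < N; i += 4) {
#pragma HLS PIPELINE II=1
        part0 += mem[i];
        part1 += mem[i + 1];
        part2 += mem[i + 2];
        part3 += mem[i + 3];
    }
    return part0 + part1 + part2 + part3;
}
```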
The Cong commands are used to change the opmizaon default sengs and are only available
from within Vivado HLS when using a boom-up ow. Refer to Vivado Design Suite User Guide:
High-Level Synthesis (UG902) for more details.
If opmizaon direcves cannot be used to improve the iniaon interval, it might require
changes to the code. Examples of this are discussed in Vivado Design Suite User Guide: High-Level
Synthesis (UG902).
Reducing Latency

When the compiler finishes minimizing the initiation interval (II), it automatically seeks to
minimize the latency. The optimization directives listed in the following table can help specify a
particular latency or inform the compiler to achieve a latency lower than the one produced; that
is, they instruct the compiler to satisfy the latency directive even if it results in a higher II. This
could result in a lower performance design.

Latency directives are generally not required, because most applications have a required
throughput but no required latency. When hardware functions are integrated with a processor,
the latency of the processor is generally the limiting factor in the system.

If the loops and functions are not pipelined, the throughput is limited by the latency, because the
task does not start reading the next set of inputs until the current task has completed.
Table 7: Optimization Strategy Step 4: Reduce Latency
Directive Description
LATENCY Allows a minimum and maximum latency constraint to be
specified.
LOOP_FLATTEN Allows nested loops to be collapsed into a single loop. This
removes the loop transition overhead and improves the
latency. Nested loops are automatically flattened when the
PIPELINE directive is applied.
LOOP_MERGE Merges consecutive loops to reduce overall latency, increase
logic resource sharing, and improve logic optimization.
The loop optimization directives can be used to flatten a loop hierarchy or merge consecutive
loops together. The benefit to the latency comes from the fact that it typically costs a clock cycle
in the control logic to enter and leave the logic created by a loop. The fewer the transitions
between loops, the fewer clock cycles a design takes to complete.
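The hand-written equivalent of LOOP_MERGE can be sketched as follows: two consecutive loops with the same trip count are combined into one body, removing one set of loop entry/exit transitions (the names and arithmetic are assumptions):

```cpp
#include <cassert>

const int N = 8;

// Illustrative sketch: the hand-written equivalent of LOOP_MERGE. Two
// consecutive loops over the same range become one body, saving one set
// of loop entry/exit clock cycles. Names and arithmetic are assumptions.
void merged(const int in[N], int outA[N], int outB[N]) {
    Merged: for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
        outA[i] = in[i] + 1;  // body of the first original loop
        outB[i] = in[i] * 2;  // body of the second original loop
    }
}
```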
Reducing Area

In hardware, the number of resources required to implement a logic function is referred to as the
design area. Design area also refers to how much of the fixed-size PL fabric the resources
consume. The area is of importance when the hardware is too large to be implemented in the
target device, and when the hardware function consumes a very high percentage (> 90%) of the
available area. This can result in difficulties when trying to wire the hardware logic together,
because the wires themselves require resources.
Aer meeng the required performance target (or II), the next step might be to reduce the area
while maintaining the same performance. This step can be opmal because there is nothing to be
gained by reducing the area if the hardware funcon is operang at the required performance
and no other hardware funcons are to be implemented in the remaining space in the PL.
The most common area opmizaon is the opmizaon of dataow memory channels to reduce
the number of block RAM resources required to implement the hardware funcon. Each device
has a limited number of block RAM resources.
If you used the DATAFLOW optimization and the compiler cannot determine whether the tasks
in the design are streaming data, it implements the memory channels between dataflow tasks
using ping-pong buffers. These require two block RAMs, each of size N, where N is the number of
samples to be transferred between the tasks (typically the size of the array passed between
tasks). If the design is pipelined and the data is in fact streaming from one task to the next, with
values produced and consumed in a sequential manner, you can greatly reduce the area by using
the STREAM directive to specify that the arrays are to be implemented in a streaming manner
that uses a simple FIFO, for which you can specify the depth. FIFOs with a small depth are
implemented using registers, and the PL fabric has many registers.

For most applications, the depth can be specified as 1, resulting in the memory channel being
implemented as a simple register. If, however, the algorithm implements data compression or
extrapolation, where some tasks consume more data than they produce or produce more data
than they consume, some arrays must be specified with a higher depth:
• For tasks which produce and consume data at the same rate, specify the array between them
to stream with a depth of 1.
• For tasks which reduce the data rate by a factor of X-to-1, specify arrays at the input of the
task to stream with a depth of X. All arrays prior to this in the function should also have a
depth of X to ensure the hardware function does not stall because the FIFOs are full.
• For tasks which increase the data rate by a factor of 1-to-Y, specify arrays at the output of the
task to stream with a depth of Y. All arrays after this in the function should also have a depth
of Y to ensure the hardware function does not stall because the FIFOs are full.
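The common first case, a producer and consumer running at the same rate connected by a depth-1 stream, can be sketched as below; the names, N, and the arithmetic are assumptions:

```cpp
#include <cassert>

const int N = 8;

// Illustrative sketch: a producer and a consumer at the same rate inside a
// DATAFLOW region, with the channel between them implemented as a depth-1
// stream. A task consuming X samples per output would need depth = X.
// N and the arithmetic are assumptions.
void chain(const int in[N], int out[N]) {
#pragma HLS DATAFLOW
    int link[N];
#pragma HLS STREAM variable=link depth=1
    Produce: for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
        link[i] = in[i] + 5;
    }
    Consume: for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
        out[i] = link[i] * 2;
    }
}
```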
Note: If the depth is set too small, the symptom is that the hardware function stalls (hangs) during
hardware emulation, resulting in lower performance or even deadlock in some cases, because full FIFOs
cause the rest of the system to wait.

The following table lists the other directives to consider when attempting to minimize the
resources used to implement the design.
Table 8: Optimization Strategy Step 5: Reduce Area
Directives and Configurations Description
ALLOCATION Specifies a limit for the number of operations, hardware
resources, or functions used. This can force the sharing of
hardware resources but might increase latency.
ARRAY_MAP Combines multiple smaller arrays into a single large array to
help reduce the number of block RAM resources.
ARRAY_RESHAPE Reshapes an array from one with many elements to one
with greater word width. Useful for improving block RAM
accesses without increasing the number of block RAM.
DATA_PACK Packs the data fields of an internal struct into a single scalar
with a wider word width, allowing a single control signal to
control all fields.
LOOP_MERGE Merges consecutive loops to reduce overall latency, increase
sharing, and improve logic optimization.
OCCURRENCE Used when pipelining functions or loops to specify that the
code in a location is executed at a lesser rate than the code
in the enclosing function or loop.
RESOURCE Specifies that a specific hardware resource (core) is used to
implement a variable (array, arithmetic operation).
STREAM Specifies that a specific memory channel is to be
implemented as a FIFO with an optional specific depth.
Config Bind Determines the effort level to use during the synthesis
binding phase and can be used to globally minimize the
number of operations used.
Config Dataflow This configuration specifies the default memory channel
and FIFO depth in dataflow optimization.
The ALLOCATION and RESOURCE directives are used to limit the number of operations and to
select which cores (hardware resources) are used to implement the operations. For example, you
could limit the function or loop to using only one multiplier, and specify that it be implemented
using a pipelined multiplier.
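That multiplier example might be sketched as below; the core name Mul and the exact pragma spelling are assumptions for this tool version, so check UG902 for the cores and syntax available in your environment:

```cpp
#include <cassert>

const int N = 4;

// Illustrative sketch: ALLOCATION limits this function to one multiplier
// operation, forcing the four multiplies to share it; RESOURCE requests a
// multiplier core for the product. The core name Mul is an assumption;
// the pragmas are ignored by an ordinary C++ compiler.
int dot4(const int a[N], const int b[N]) {
#pragma HLS ALLOCATION instances=mul limit=1 operation
    int acc = 0;
    Dot: for (int i = 0; i < N; i++) {
        int prod = a[i] * b[i];
#pragma HLS RESOURCE variable=prod core=Mul
        acc += prod;
    }
    return acc;
}
```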
If the ARRAY_PARTITION directive is used to improve the initiation interval, you might want to
consider using the ARRAY_RESHAPE directive instead. The ARRAY_RESHAPE optimization
performs a similar task to array partitioning; however, the reshape optimization recombines the
elements created by partitioning into a single block RAM with wider data ports. This might
prevent an increase in the number of block RAM resources required.
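A sketch of ARRAY_RESHAPE applied to an internal array is shown below; N and the arithmetic are assumptions:

```cpp
#include <cassert>

const int N = 16;

// Illustrative sketch: ARRAY_RESHAPE with a cyclic factor of 4 keeps one
// physical memory but widens its word so four elements arrive per access.
// N and the arithmetic are assumptions; the pragmas are ignored by an
// ordinary C++ compiler.
int sum_reshaped(const int in[N]) {
    int local[N];
#pragma HLS ARRAY_RESHAPE variable=local cyclic factor=4 dim=1
    Copy: for (int i = 0; i < N; i++) {
        local[i] = in[i];
    }
    int sum = 0;
    Sum: for (int i = 0; i < N; i += 4) {
#pragma HLS PIPELINE II=1
        sum += local[i] + local[i + 1] + local[i + 2] + local[i + 3];
    }
    return sum;
}
```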
If the C code contains a series of loops with similar indexing, merging the loops with the
LOOP_MERGE directive might allow some optimizations to occur. Finally, in cases where a
section of code in a pipeline region is only required to operate at an initiation interval lower than
the rest of the region, the OCCURRENCE directive is used to indicate that this logic can be
optimized to execute at a lower rate.

Note: The Config commands are used to change the optimization default settings and are only available
from within Vivado HLS when using a bottom-up flow. Refer to Vivado Design Suite User Guide: High-Level
Synthesis (UG902) for more details.
Design Optimization Workflow

Before performing any optimizations, it is recommended to create a new build configuration
within the project. Using different build configurations allows one set of results to be compared
against a different set of results. In addition to the standard Debug and Release configurations,
custom configurations with more useful names (e.g., Opt_ver1 and UnOpt_ver) might be created
in the Project Settings window using the Manage Build Configurations for the Project toolbar
button.

Different build configurations allow you to compare not only the results, but also the log files and
even the output RTL files used to implement the FPGA (the RTL files are only recommended for
users very familiar with hardware design).
The basic optimization strategy for a high-performance design is:

• Create an initial or baseline design.
• Pipeline the loops and functions. Apply the DATAFLOW optimization to execute loops and
functions concurrently.
• Address any issues that limit pipelining, such as array bottlenecks and loop dependencies
(with the ARRAY_PARTITION and DEPENDENCE directives).
• Specify a particular latency or reduce the size of the dataflow memory channels, and use the
ALLOCATION and RESOURCE directives to further reduce area.

Note: It might sometimes be necessary to make adjustments to the code to meet performance.

In summary, the goal is always to meet performance first, before reducing area. If the strategy is
to create a design with the fewest resources, simply omit the steps to improve performance,
although the baseline results might be very close to the smallest possible design.
Throughout the optimization process, it is highly recommended to review the console output (or
log file) after compilation. When the compiler cannot reach the specified performance goals of an
optimization, it automatically relaxes the goals (except the clock frequency) and creates a design
with the goals that can be satisfied. It is important to review the output from the compilation log
files and reports to understand what optimizations have been performed.
For specific details on applying optimizations, refer to Vivado Design Suite User Guide: High-Level
Synthesis (UG902).
Chapter 4

Data Access Patterns

An FPGA is selected to implement the C code due to the superior performance of the FPGA: the
massively parallel architecture of an FPGA allows it to perform operations much faster than the
inherently sequential operations of a processor, and users typically wish to take advantage of
that performance.

The focus here is on understanding the impact that the access patterns inherent in the C code
might have on the results. Although the access patterns of most concern are those into and out
of the hardware function, it is worth considering the access patterns within functions, as any
bottlenecks within the hardware function will negatively impact the transfer rate into and out of
the function.
To highlight how some data access patterns can negatively impact performance, and to
demonstrate how other patterns can be used to fully embrace the parallelism and high-
performance capabilities of an FPGA, this section reviews an image convolution algorithm.

• The first part reviews the algorithm and highlights the data access aspects that limit the
performance in an FPGA.
• The second part shows how the algorithm might be written to achieve the highest
performance possible.
Algorithm with Poor Data Access Patterns

A standard convolution function applied to an image is used here to demonstrate how the C
code can negatively impact the performance that is possible from an FPGA. In this example, a
horizontal and then a vertical convolution is performed on the data. Because the data at the edge
of the image lies outside the convolution windows, the final step is to address the data around
the border.

The algorithm structure can be summarized as follows:

• A horizontal convolution.
• Followed by a vertical convolution.
• Followed by a manipulation of the border pixels.
static void convolution_orig(
  int width,
  int height,
  const T *src,
  T *dst,
  const T *hcoeff,
  const T *vcoeff) {
  T local[MAX_IMG_ROWS*MAX_IMG_COLS];
  // Horizontal convolution
  HconvH:for(int col = 0; col < height; col++){
    HconvW:for(int row = border_width; row < width - border_width; row++){
      Hconv:for(int i = - border_width; i <= border_width; i++){
      }
    }
  }
  // Vertical convolution
  VconvH:for(int col = border_width; col < height - border_width; col++){
    VconvW:for(int row = 0; row < width; row++){
      Vconv:for(int i = - border_width; i <= border_width; i++){
      }
    }
  }
  // Border pixels
  Top_Border:for(int col = 0; col < border_width; col++){
  }
  Side_Border:for(int col = border_width; col < height - border_width; col++){
  }
  Bottom_Border:for(int col = height - border_width; col < height; col++){
  }
}
Standard Horizontal Convolution

The first step in this is to perform the convolution in the horizontal direction, as shown in the
following figure.
[Figure: Horizontal convolution. A window of K samples (Hsamp) from src is multiplied by the
coefficients in Hcoeff to produce the first, second, and final outputs, which are written to local.
X14296-121417]
The convolution is performed using K samples of data and K convolution coefficients. In the
figure above, K is shown as 5; however, the value of K is defined in the code. To perform the
convolution, a minimum of K data samples are required. The convolution window cannot start at
the first pixel, because the window would need to include pixels that are outside the image.

By performing a symmetric convolution, the first K data samples from input src can be
convolved with the horizontal coefficients and the first output calculated. To calculate the second
output, the next set of K data samples is used. This calculation proceeds along each row until the
final output is written.

The C code for performing this operation is shown below.
const int conv_size = K;
const int border_width = int(conv_size / 2);
#ifndef __SYNTHESIS__
T * const local = new T[MAX_IMG_ROWS*MAX_IMG_COLS];
#else // Static storage allocation for HLS, dynamic otherwise
T local[MAX_IMG_ROWS*MAX_IMG_COLS];
#endif
Clear_Local:for(int i = 0; i < height * width; i++){
  local[i]=0;
}
// Horizontal convolution
HconvH:for(int col = 0; col < height; col++){
  HconvW:for(int row = border_width; row < width - border_width; row++){
    int pixel = col * width + row;
    Hconv:for(int i = - border_width; i <= border_width; i++){
      local[pixel] += src[pixel + i] * hcoeff[i + border_width];
    }
  }
}
The code is straightforward and intuitive. There are, however, some issues with this C code that
will negatively impact the quality of the hardware results.

The first issue is the large storage requirement during C compilation. The intermediate results in
the algorithm are stored in an internal local array. This requires an array of HEIGHT*WIDTH,
which for a standard video image of 1920*1080 will hold 2,073,600 values.

For the cross-compilers targeting Zynq®-7000 All Programmable SoC or Zynq UltraScale+
MPSoC, as well as many host systems, this amount of local storage can lead to stack
overflows at run time (for example, when running on the target device, or when running co-sim
flows within Vivado HLS). The data for a local array is placed on the stack and not the heap, which
is managed by the OS. When cross-compiling with arm-linux-gnueabihf-g++, use the -
Wl,"-z stacksize=4194304" linker option to allocate sufficient stack space. (Note that
the syntax for this option varies for different linkers.) When a function will only be run in
hardware, a useful way to avoid such issues is to use the __SYNTHESIS__ macro. This macro is
automatically defined by the system compiler when the hardware function is synthesized into
hardware. The code shown above uses dynamic memory allocation during C simulation to
avoid any compilation issues and only uses static storage during synthesis. A downside of
using this macro is that the code verified by C simulation is not the same code that is synthesized.
In this case, however, the code is not complex, and the behavior will be the same.
The main issue with this local array is the quality of the FPGA implementation. Because this is
an array, it will be implemented using internal FPGA block RAM. This is a very large memory to
implement inside the FPGA and might require a larger and more costly FPGA device. The use of
block RAM can be minimized by using the DATAFLOW optimization and streaming the data
through small efficient FIFOs, but this would require the data to be used in a streaming,
sequential manner. There is currently no such requirement.
The next issue relates to performance: the initialization of the local array. The loop
Clear_Local is used to set the values in array local to zero. Even if this loop is pipelined in the
hardware to execute in a high-performance manner, this operation still requires approximately
two million clock cycles (HEIGHT*WIDTH) to implement. While this memory is being initialized,
the system cannot perform any image processing. This same initialization of the data could
instead be performed using a temporary variable inside loop Hconv to initialize the accumulation
before the write.
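That rewrite can be sketched as follows, showing only the horizontal pass; the scaled-down constant K, the type T, and the test data are assumptions:

```cpp
#include <cassert>

typedef int T;
const int K = 3;
const int border_width = K / 2;

// Illustrative sketch (horizontal pass only): a temporary accumulator acc
// is zeroed inside HconvW and written once per pixel, making the separate
// Clear_Local loop unnecessary. The scaled-down K and type T are
// assumptions.
void hconv(int width, int height, const T *src, T *local, const T *hcoeff) {
    HconvH: for (int col = 0; col < height; col++) {
        HconvW: for (int row = border_width; row < width - border_width; row++) {
            int pixel = col * width + row;
            T acc = 0;  // replaces the Clear_Local initialization pass
            Hconv: for (int i = -border_width; i <= border_width; i++) {
                acc += src[pixel + i] * hcoeff[i + border_width];
            }
            local[pixel] = acc;
        }
    }
}
```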
Finally, the throughput of the data, and thus the system performance, is fundamentally limited by
the data access pattern:

• To create the first convolved output, the first K values are read from the input.
• To calculate the second output, a new value is read, and then the same K-1 values are re-read.
One of the keys to a high-performance FPGA is to minimize the accesses to and from the PS. Each
access for data which has previously been fetched negatively impacts the performance of the
system. An FPGA is capable of performing many concurrent calculations at once and reaching
very high performance, but not while the flow of data is constantly interrupted by re-reading
values.
Note: To maximize performance, data should only be accessed once from the PS and small units of local
storage - small to medium sized arrays - should be used for data which must be reused.
With the code shown above, the data cannot be connuously streamed directly from the
processor using a DMA operaon because the data is required to be re-read me and again.
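The temporary-variable initialization suggested above can be sketched in plain C++. This is a minimal software model, not the document's HLS code: the function name conv_no_clear and the std::vector interface are illustrative, and the border pixels are simply left at zero, as the cleared local array would leave them.

```cpp
#include <cassert>
#include <vector>

// Illustrative sketch: reset the accumulation in a local variable
// instead of running a separate clearing pass over the whole output
// array before the convolution.
std::vector<int> conv_no_clear(const std::vector<int>& src,
                               const std::vector<int>& coeff,
                               int width, int height) {
    const int border = (int)coeff.size() / 2;
    std::vector<int> out(src.size(), 0);   // border pixels stay 0
    for (int col = 0; col < height; col++) {
        for (int row = border; row < width - border; row++) {
            int acc = 0;                   // on-the-fly reset, no Clear loop
            for (int i = -border; i <= border; i++)
                acc += src[col * width + row + i] * coeff[i + border];
            out[col * width + row] = acc;  // single write per pixel
        }
    }
    return out;
}
```

The accumulator costs one register in hardware rather than two million initialization cycles over block RAM.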
Chapter 4: Data Access Patterns
Vivado HLS Optimization Methodology Guide 37
UG1270 (v2017.4) December 20, 2017 www.xilinx.com
Standard Vertical Convolution
The next step is to perform the vertical convolution shown in the following figure.
[Figure X14299-110617: Vertical convolution - the Vsamp window convolves local with Vcoeff into dst at the first, second, and final outputs]
The process for the vertical convolution is similar to the horizontal convolution. A set of K data samples is required to convolve with the convolution coefficients, Vcoeff in this case. After the first output is created using the first K samples in the vertical direction, the next set of K values is used to create the second output. The process continues down through each column until the final output is created.
After the vertical convolution, the image is now smaller than the source image src due to both the horizontal and vertical border effect.
The code for performing these operations is shown below.
Clear_Dst:for(int i = 0; i < height * width; i++){
dst[i]=0;
}
// Vertical convolution
VconvH:for(int col = border_width; col < height - border_width; col++){
VconvW:for(int row = 0; row < width; row++){
int pixel = col * width + row;
Vconv:for(int i = - border_width; i <= border_width; i++){
int offset = i * width;
dst[pixel] += local[pixel + offset] * vcoeff[i + border_width];
}
}
}
This code highlights similar issues to those already discussed with the horizontal convolution code:
Many clock cycles are spent setting the values in the output image dst to zero: in this case, approximately another two million cycles for a 1920*1080 image size.
There are multiple accesses per pixel to re-read data stored in array local.
There are multiple writes per pixel to the output array/port dst.
The access patterns in the code above in fact create the requirement for such a large local array. The algorithm requires the data on row K to be available to perform the first calculation. Processing data down the rows before proceeding to the next column requires the entire image to be stored locally. This requires that all values be stored and results in large local storage on the FPGA.
In addition, when you reach the stage where you wish to use compiler directives to optimize the performance of the hardware function, the flow of data between the horizontal and vertical loop cannot be managed via a FIFO (a high-performance and low-resource unit) because the data is not streamed out of array local: a FIFO can only be used with sequential access patterns. Instead, this code, which requires arbitrary/random accesses, requires a ping-pong block RAM to improve performance. This doubles the memory requirements for the implementation of the local array to approximately four million data samples, which is too large for an FPGA.
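The four-million figure can be checked with simple arithmetic. A minimal sketch, with illustrative constant names:

```cpp
#include <cassert>
#include <cstddef>

// Back-of-the-envelope check: a ping-pong (double) buffer for a full
// 1920x1080 frame stores two complete copies of the image.
constexpr std::size_t WIDTH_PX         = 1920;
constexpr std::size_t HEIGHT_PX        = 1080;
constexpr std::size_t FRAME_SAMPLES    = WIDTH_PX * HEIGHT_PX;  // one frame
constexpr std::size_t PINGPONG_SAMPLES = 2 * FRAME_SAMPLES;     // double buffer
```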
Standard Border Pixel Convolution
The final step in performing the convolution is to create the data around the border. These pixels can be created by simply reusing the nearest pixel in the convolved output. The following figure shows how this is achieved.
[Figure X14294-121417: Border pixel replication into dst - top left, top row, top right, left and right edges, bottom left, bottom row, and bottom right regions]
The border region is populated with the nearest valid value. The following code performs the operations shown in the figure.
int border_width_offset = border_width * width;
int border_height_offset = (height - border_width - 1) * width;
// Border pixels
Top_Border:for(int col = 0; col < border_width; col++){
int offset = col * width;
for(int row = 0; row < border_width; row++){
int pixel = offset + row;
dst[pixel] = dst[border_width_offset + border_width];
}
for(int row = border_width; row < width - border_width; row++){
int pixel = offset + row;
dst[pixel] = dst[border_width_offset + row];
}
for(int row = width - border_width; row < width; row++){
int pixel = offset + row;
dst[pixel] = dst[border_width_offset + width - border_width - 1];
}
}
Side_Border:for(int col = border_width; col < height - border_width; col++)
{
int offset = col * width;
for(int row = 0; row < border_width; row++){
int pixel = offset + row;
dst[pixel] = dst[offset + border_width];
}
for(int row = width - border_width; row < width; row++){
int pixel = offset + row;
dst[pixel] = dst[offset + width - border_width - 1];
}
}
Bottom_Border:for(int col = height - border_width; col < height; col++){
int offset = col * width;
for(int row = 0; row < border_width; row++){
int pixel = offset + row;
dst[pixel] = dst[border_height_offset + border_width];
}
for(int row = border_width; row < width - border_width; row++){
int pixel = offset + row;
dst[pixel] = dst[border_height_offset + row];
}
for(int row = width - border_width; row < width; row++){
int pixel = offset + row;
dst[pixel] = dst[border_height_offset + width - border_width - 1];
}
}
The code suers from the same repeated access for data. The data stored outside the FPGA in
the array dst must now be available to be read as input data re-read mulple mes. Even in the
rst loop, dst[border_width_offset + border_width] is read mulple mes but the
values of border_width_offset and border_width do not change.
This code is very intuive to both read and write. When implemented with the SDSoC
environment it is approximately 120M clock cycles, which meets or slightly exceeds the
performance of a CPU. However, as shown in the next secon, opmal data access paerns
ensure this same algorithm can be implemented on the FPGA at a rate of one pixel per clock
cycle, or approximately 2M clock cycles.
The summary from this review is that the following poor data access paerns negavely impact
the performance and size of the FPGA implementaon:
Mulple accesses to read and then re-read data. Use local storage where possible.
Accessing data in an arbitrary or random access manner. This requires the data to be stored
locally in arrays and costs resources.
Seng default values in arrays costs clock cycles and performance.
Algorithm With Optimal Data Access Patterns
The key to implementing the convolution example reviewed in the previous section as a high-performance design with minimal resources is to:
Maximize the flow of data through the system. Refrain from using any coding techniques or algorithm behavior that inhibits the continuous flow of data.
Maximize the reuse of data. Use local caches to ensure there are no requirements to re-read data and the incoming data can keep flowing.
Embrace conditional branching. This is expensive on a CPU, GPU, or DSP but optimal in an FPGA.
The first step is to understand how data flows through the system into and out of the FPGA. The convolution algorithm is performed on an image. When data from an image is produced and consumed, it is transferred in a standard raster-scan manner as shown in the following figure.
[Figure X14298-121417: Raster-scan transfer order across the image Width and down its Height]
If the data is transferred to the FPGA in a streaming manner, the FPGA should process it in a streaming manner and transfer it back from the FPGA in the same manner.
The convoluon algorithm shown below embraces this style of coding. At this level of abstracon
a concise view of the code is shown. However, there are now intermediate buers, hconv and
vconv, between each loop. Because these are accessed in a streaming manner, they are
opmized into single registers in the nal implementaon.
template<typename T, int K>
static void convolution_strm(
int width,
int height,
T src[TEST_IMG_ROWS][TEST_IMG_COLS],
T dst[TEST_IMG_ROWS][TEST_IMG_COLS],
const T *hcoeff,
const T *vcoeff)
{
T hconv_buffer[MAX_IMG_COLS*MAX_IMG_ROWS];
T vconv_buffer[MAX_IMG_COLS*MAX_IMG_ROWS];
T *phconv, *pvconv;
// These assertions let HLS know the upper bounds of loops
assert(height < MAX_IMG_ROWS);
assert(width < MAX_IMG_COLS);
assert(vconv_xlim < MAX_IMG_COLS - (K - 1));
// Horizontal convolution
HConvH:for(int col = 0; col < height; col++) {
HConvW:for(int row = 0; row < width; row++) {
HConv:for(int i = 0; i < K; i++) {
}
}
}
// Vertical convolution
VConvH:for(int col = 0; col < height; col++) {
VConvW:for(int row = 0; row < vconv_xlim; row++) {
VConv:for(int i = 0; i < K; i++) {
}
}
}
Border:for (int i = 0; i < height; i++) {
for (int j = 0; j < width; j++) {
}
}
All three processing loops now embrace conditional branching to ensure the continuous processing of data.
Chapter 5
Standard Horizontal Convolution
The rst step in this is to perform the convoluon in the horizontal direcon as shown in the
following gure.
[Figure X14296-121417: Horizontal convolution - the Hsamp window convolves src with Hcoeff into local at the first, second, and final outputs]
The convoluon is performed using K samples of data and K convoluon coecients. In the
gure above, K is shown as 5, however, the value of K is dened in the code. To perform the
convoluon, a minimum of K data samples are required. The convoluon window cannot start at
the rst pixel because the window would need to include pixels that are outside the image.
By performing a symmetric convoluon, the rst K data samples from input src can be
convolved with the horizontal coecients and the rst output calculated. To calculate the second
output, the next set of K data samples is used. This calculaon proceeds along each row unl the
nal output is wrien.
The C code for performing this operaon is shown below.
const int conv_size = K;
const int border_width = int(conv_size / 2);
#ifndef __SYNTHESIS__
T * const local = new T[MAX_IMG_ROWS*MAX_IMG_COLS];
#else // Static storage allocation for HLS, dynamic otherwise
T local[MAX_IMG_ROWS*MAX_IMG_COLS];
#endif
Clear_Local:for(int i = 0; i < height * width; i++){
local[i]=0;
}
// Horizontal convolution
HconvH:for(int col = 0; col < height; col++){
HconvW:for(int row = border_width; row < width - border_width; row++){
int pixel = col * width + row;
Hconv:for(int i = - border_width; i <= border_width; i++){
local[pixel] += src[pixel + i] * hcoeff[i + border_width];
}
}
}
The code is straightforward and intuitive. There are, however, some issues with this C code that will negatively impact the quality of the hardware results.
The first issue is the large storage requirements during C compilation. The intermediate results in the algorithm are stored in an internal local array. This requires an array of HEIGHT*WIDTH, which for a standard video image of 1920*1080 will hold 2,073,600 values.
For the cross-compilers targeting Zynq®-7000 All Programmable SoC or Zynq UltraScale+ MPSoC, as well as many host systems, this amount of local storage can lead to stack overflows at run time (for example, running on the target device, or running co-sim flows within Vivado HLS). The data for a local array is placed on the stack and not the heap, which is managed by the OS. When cross-compiling with arm-linux-gnueabihf-g++, use the -Wl,"-z stacksize=4194304" linker option to allocate sufficient stack space. (Note that the syntax for this option varies for different linkers.) When a function will only be run in hardware, a useful way to avoid such issues is to use the __SYNTHESIS__ macro. This macro is
automacally dened by the system compiler when the hardware funcon is synthesized into
hardware. The code shown above uses dynamic memory allocaon during C simulaon to
avoid any compilaon issues and only uses stac storage during synthesis. A downside of
using this macro is the code veried by C simulaon is not the same code that is synthesized.
In this case, however, the code is not complex and the behavior will be the same.
The main issue with this local array is the quality of the FPGA implementaon. Because this is
an array it will be implemented using internal FPGA block RAM. This is a very large memory to
implement inside the FPGA. It might require a larger and more costly FPGA device. The use of
block RAM can be minimized by using the DATAFLOW opmizaon and streaming the data
through small ecient FIFOs, but this will require the data to be used in a streaming
sequenal manner. There is currently no such requirement.
The next issue relates to the performance: the inializaon for the local array. The loop
Clear_Local is used to set the values in array local to zero. Even if this loop is pipelined in the
hardware to execute in a high-performance manner, this operaon sll requires approximately
two million clock cycles (HEIGHT*WIDTH) to implement. While this memory is being inialized,
the system cannot perform any image processing. This same inializaon of the data could be
performed using a temporary variable inside loop HConv to inialize the accumulaon before the
write.
Finally, the throughput of the data, and thus the system performance, is fundamentally limited by
the data access paern.
To create the rst convolved output, the rst K values are read from the input.
To calculate the second output, a new value is read and then the same K-1 values are re-read.
One of the keys to a high-performance FPGA is to minimize the access to and from the PS. Each
access for data, which has previously been fetched, negavely impacts the performance of the
system. An FPGA is capable of performing many concurrent calculaons at once and reaching
very high performance, but not while the ow of data is constantly interrupted by re-reading
values.
Note: To maximize performance, data should only be accessed once from the PS and small units of local
storage - small to medium sized arrays - should be used for data which must be reused.
With the code shown above, the data cannot be connuously streamed directly from the
processor using a DMA operaon because the data is required to be re-read me and again.
Optimal Horizontal Convolution
To perform the calculation in a more efficient manner for FPGA implementation, the horizontal convolution is computed as shown in the following figure.
[Figure X14297-110617: Optimal horizontal convolution - samples from src are cached in the Hwin window and convolved into hconv at the first calculation, first output, and final output]
The algorithm must use the K previous samples to compute the convolution result. It therefore copies each sample into a temporary cache, hwin. This use of local storage means there is no need to re-read values from the PS and interrupt the flow of data. For the first calculation there are not enough values in hwin to compute a result, so, conditionally, no output values are written.
The algorithm keeps reading input samples and caching them into hwin. Each time it reads a new sample, it pushes an unneeded sample out of hwin. The first time an output value can be written is after the Kth input has been read. An output value can now be written. The algorithm proceeds in this manner along the rows until the final sample has been read. At that point, only the last K samples are stored in hwin: all that is required to compute the convolution.
As shown below, the code to perform these operations uses both local storage to prevent re-reads from the PS – the reads from local storage can be performed in parallel in the final implementation – and the extensive use of conditional branching to ensure each new data sample can be processed in a different manner.
// Horizontal convolution
phconv=hconv_buffer; // set / reset pointer to start of buffer
// These assertions let HLS know the upper bounds of loops
assert(height < MAX_IMG_ROWS);
assert(width < MAX_IMG_COLS);
assert(vconv_xlim < MAX_IMG_COLS - (K - 1));
T hwin[K]; // cache of the K most recent input samples
HConvH:for(int col = 0; col < height; col++) {
HConvW:for(int row = 0; row < width; row++) {
#pragma HLS PIPELINE
T in_val = *src++;
// Reset pixel value on-the-fly - eliminates an O(height*width) loop
T out_val = 0;
HConv:for(int i = 0; i < K; i++) {
hwin[i] = i < K - 1 ? hwin[i + 1] : in_val;
out_val += hwin[i] * hcoeff[i];
}
if (row >= K - 1) {
*phconv++=out_val;
}
}
}
An interesng point to note in the code above is the use of the temporary variable out_val to
perform the convoluon calculaon. This variable is set to zero before the calculaon is
performed, negang the need to spend two million clock cycles to reset the values, as in the
previous example.
Throughout the enre process, the samples in the src input are processed in a raster-streaming
manner. Every sample is read in turn. The outputs from the task are either discarded or used, but
the task keeps constantly compung. This represents a dierence from code wrien to perform
on a CPU.
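The behavior of the streaming loop above can be modeled in ordinary C++ to check the window arithmetic. This is a software sketch, not the synthesizable code: hconv_stream and its std::vector interface are illustrative, and, as in the code above, outputs are suppressed until the window is full.

```cpp
#include <cassert>
#include <vector>

// Software model of the streaming horizontal convolution: every input
// sample is read exactly once, a K-element window (hwin) shifts by one
// sample per iteration, and an output is emitted only once the window
// holds K valid samples.
template <typename T, int K>
std::vector<T> hconv_stream(const std::vector<T>& src,
                            const T (&hcoeff)[K],
                            int width, int height) {
    std::vector<T> out;
    T hwin[K] = {};
    for (int col = 0; col < height; col++) {
        for (int row = 0; row < width; row++) {
            T in_val = src[col * width + row];
            T out_val = 0;                    // on-the-fly reset
            for (int i = 0; i < K; i++) {
                hwin[i] = (i < K - 1) ? hwin[i + 1] : in_val; // shift window
                out_val += hwin[i] * hcoeff[i];
            }
            if (row >= K - 1)                 // window full: emit
                out.push_back(out_val);
        }
    }
    return out;
}
```

Because the window shifts every iteration and the conditional only gates the write, the loop body is identical for every sample, which is what allows the HLS pipeline to run without stalls.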
Optimal Vertical Convolution
The vertical convolution represents a challenge to the streaming data model preferred by an FPGA. The data must be accessed by column, but you do not wish to store the entire image. The solution is to use line buffers, as shown in the following figure.
[Figure X14300-110617: Optimal vertical convolution - the Vconv window reads from hconv and the line buffers at the first calculation, first output, and final output, writing into vconv]
Once again, the samples are read in a streaming manner, this time from the local buffer hconv. The algorithm requires at least K-1 lines of data before it can process the first sample. All the calculations performed before this are discarded through the use of conditionals.
A line buffer allows K-1 lines of data to be stored. Each time a new sample is read, another sample is pushed out of the line buffer. An interesting point to note here is that the newest sample is used in the calculation, and then the sample is stored into the line buffer and the old sample ejected. This ensures that only K-1 lines are required to be cached, rather than an unknown number of lines, and minimizes the use of local storage. Although a line buffer does require multiple lines to be stored locally, the convolution kernel size K is always much less than the 1080 lines in a full video image.
The rst calculaon can be performed when the rst sample on the Kth line is read. The
algorithm then proceeds to output values unl the nal pixel is read.
// Vertical convolution
phconv=hconv_buffer; // set/reset pointer to start of buffer
pvconv=vconv_buffer; // set/reset pointer to start of buffer
T linebuf[K - 1][MAX_IMG_COLS]; // K-1 cached lines of the image
VConvH:for(int col = 0; col < height; col++) {
VConvW:for(int row = 0; row < vconv_xlim; row++) {
#pragma HLS DEPENDENCE variable=linebuf inter false
#pragma HLS PIPELINE
T in_val = *phconv++;
// Reset pixel value on-the-fly - eliminates an O(height*width) loop
T out_val = 0;
VConv:for(int i = 0; i < K; i++) {
T vwin_val = i < K - 1 ? linebuf[i][row] : in_val;
out_val += vwin_val * vcoeff[i];
if (i > 0)
linebuf[i - 1][row] = vwin_val;
}
if (col >= K - 1) {
*pvconv++ = out_val;
}
}
}
The code above once again processes all the samples in the design in a streaming manner. The task is constantly running. Following a coding style where you minimize the number of re-reads (or re-writes) forces you to cache the data locally. This is an ideal strategy when targeting an FPGA.
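The use-then-store line buffer update can likewise be modeled in plain C++. This is an illustrative sketch (vconv_linebuf and its container types are not from the document); it mirrors the VConv loop above, caching K-1 lines and emitting outputs only once K-1 full lines have been seen.

```cpp
#include <cassert>
#include <vector>

// Software model of the K-1 line buffer used by the vertical
// convolution: the newest sample is used in the calculation first,
// then stored into the buffer while the oldest sample is pushed out.
template <typename T, int K>
std::vector<T> vconv_linebuf(const std::vector<T>& in,
                             const T (&vcoeff)[K],
                             int width, int height) {
    std::vector<T> out;
    std::vector<std::vector<T>> linebuf(K - 1, std::vector<T>(width, 0));
    for (int col = 0; col < height; col++) {
        for (int row = 0; row < width; row++) {
            T in_val = in[col * width + row];
            T out_val = 0;
            for (int i = 0; i < K; i++) {
                T vwin_val = (i < K - 1) ? linebuf[i][row] : in_val;
                out_val += vwin_val * vcoeff[i];
                if (i > 0)
                    linebuf[i - 1][row] = vwin_val; // use, then store
            }
            if (col >= K - 1)          // K-1 lines cached: emit
                out.push_back(out_val);
        }
    }
    return out;
}
```

The storage cost is (K-1)*width samples rather than a full frame, which is why a line buffer fits comfortably in FPGA block RAM.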
Optimal Border Pixel Convolution
The final step in the algorithm is to replicate the edge pixels into the border region. To ensure the constant flow of data and data reuse, the algorithm makes use of local caching. The following figure shows how the border samples are aligned into the image:
Each sample is read from the vconv output from the vertical convolution.
The sample is then cached as one of four possible pixel types.
The sample is then written to the output stream.
[Figure X14295-110617: Border sample alignment into dst - at the first, middle, and final outputs, each pixel from vconv is handled as a left edge, right edge, border, or raw pixel]
The code for determining the location of the border pixels is shown here.
// Border pixels
pvconv=vconv_buffer; // set/reset pointer to start of buffer
T borderbuf[MAX_IMG_COLS]; // cache of the most recent valid line
Border:for (int i = 0; i < height; i++) {
for (int j = 0; j < width; j++) {
T pix_in, l_edge_pix, r_edge_pix, pix_out;
#pragma HLS PIPELINE
if (i == 0 || (i > border_width && i < height - border_width)) {
// read a pixel out of the video stream and cache it for
// immediate use and later replication purposes
if (j < width - (K - 1)) {
pix_in = *pvconv++;
borderbuf[j] = pix_in;
}
if (j == 0) {
l_edge_pix = pix_in;
}
if (j == width - K) {
r_edge_pix = pix_in;
}
}
// Select output value from the appropriate cache resource
if (j <= border_width) {
pix_out = l_edge_pix;
} else if (j >= width - border_width - 1) {
pix_out = r_edge_pix;
} else {
pix_out = borderbuf[j - border_width];
}
*dst++=pix_out;
}
}
A notable dierence with this new code is the extensive use of condionals inside the tasks. This
allows the task, aer it is pipelined, to connuously process data. The result of the condionals
does not impact the execuon of the pipeline. The result will impact the output values, but the
pipeline with keep processing as long as input samples are available.
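The select-don't-stall pattern described above can be reduced to a minimal 1-D sketch. This is an illustrative model, not the document's border code: replicate_edges takes the already-convolved core pixels of one row and writes one output per iteration, with the conditionals only choosing the source of each value.

```cpp
#include <cassert>
#include <vector>

// Minimal 1-D sketch of border replication: the loop body runs
// identically for every output position; conditionals select whether
// the value comes from the left-edge cache, the right-edge cache, or
// the stream of core pixels. bw is an illustrative border width.
std::vector<int> replicate_edges(const std::vector<int>& core,
                                 int out_width, int bw) {
    std::vector<int> out;
    int l_edge = core.front();
    int r_edge = core.back();
    for (int j = 0; j < out_width; j++) {
        int pix;
        if (j < bw)
            pix = l_edge;                 // replicate left edge
        else if (j >= out_width - bw)
            pix = r_edge;                 // replicate right edge
        else
            pix = core[j - bw];           // pass-through
        out.push_back(pix);               // one write on every iteration
    }
    return out;
}
```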
Optimal Data Access Patterns
The following summarizes how to ensure your data access patterns result in the most optimal performance on an FPGA:
Minimize data input reads. After data has been read into the block, it can easily feed many parallel paths, but the inputs to the hardware function can be bottlenecks to performance. Read data once and use a local cache if the data must be reused.
Minimize accesses to arrays, especially large arrays. Arrays are implemented in block RAM, which, like I/O ports, only has a limited number of ports and can be a bottleneck to performance. Arrays can be partitioned into smaller arrays and even individual registers, but partitioning large arrays will result in many registers being used. Use small localized caches to hold results such as accumulations and then write the final result to the array.
Seek to perform conditional branching inside pipelined tasks rather than conditionally executing tasks, even pipelined tasks. Conditionals are implemented as separate paths in the pipeline. Allowing the data from one task to flow into the next task, with the conditional performed inside the next task, will result in a higher performing system.
Minimize output writes for the same reason as input reads, namely, that ports are bottlenecks. Replicating additional accesses only pushes the issue further back into the system.
For C code which processes data in a streaming manner, consider employing a coding style that promotes read-once/write-once to function arguments, because this ensures the function can be efficiently implemented in an FPGA. It is much more productive to design an algorithm in C that results in a high-performance FPGA implementation than to debug why the FPGA is not operating at the required performance.
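The read-once/write-once style can be made concrete with a small, checkable sketch. The Counted wrapper and scale_stream function are illustrative, not part of any Xilinx API; the counters exist only to demonstrate that each element is read and written exactly once, which is the property that lets function arguments map onto streams or FIFOs.

```cpp
#include <cassert>
#include <vector>

// Instrumented storage: counts every element access so the
// read-once/write-once property can be verified in software.
struct Counted {
    std::vector<int> data;
    int reads;
    int writes;
    int read(int i) { reads++; return data[i]; }
    void write(int i, int v) { writes++; data[i] = v; }
};

// Each input element is read exactly once and each output element is
// written exactly once, in order - a stream-friendly access pattern.
void scale_stream(Counted& src, Counted& dst, int n, int gain) {
    for (int i = 0; i < n; i++) {
        int v = src.read(i);       // one read per element
        dst.write(i, v * gain);    // one write per element
    }
}
```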
Appendix A
OpenCL Attributes
Optimizations in OpenCL
This section describes OpenCL attributes that can be added to source code to assist system optimization by the SDAccel compiler, xocc, the SDSoC system compilers, sdscc and sds++, and Vivado HLS synthesis.
SDx provides OpenCL attributes to optimize your code for data movement and kernel performance. The goal of data movement optimization is to maximize the system-level data throughput by maximizing interface bandwidth utilization and DDR bandwidth utilization. The goal of kernel computation optimization is to create processing logic that can consume all the data as soon as it arrives at the kernel interfaces. This is generally achieved by expanding the processing code to match the data path, with techniques such as function inlining and pipelining, loop unrolling, array partitioning, and dataflow.
The OpenCL attributes include the types specified below:
Table 9: OpenCL __attributes__ by Type
Type Attributes
Kernel Size reqd_work_group_size
vec_type_hint
work_group_size_hint
xcl_max_work_group_size
xcl_zero_global_work_offset
Function Inlining always_inline
Task-level Pipeline xcl_dataflow
xcl_reqd_pipe_depth
Pipeline xcl_pipeline_loop
xcl_pipeline_workitems
Loop Unrolling opencl_unroll_hint
Array Optimization xcl_array_partition
xcl_array_reshape
Note: Array variables only accept a single array optimization attribute.
TIP: The SDAccel and SDSoC compilers also support many of the standard attributes supported by gcc, such as always_inline, noinline, unroll, and nounroll.
always_inline
Description
The always_inline attribute indicates that a function must be inlined. This attribute is a standard feature of GCC and of the SDx compilers.
This attribute enables a compiler optimization to have a function inlined into the calling function. The inlined function is dissolved and no longer appears as a separate level of hierarchy in the RTL.
In some cases, inlining a function allows operations within the function to be shared and optimized more effectively with surrounding operations in the calling function. However, an inlined function can no longer be shared with other functions, so the logic may be duplicated between the inlined function and a separate instance of the function which could otherwise be more broadly shared. While this can improve performance, it will also increase the area required for implementing the RTL.
In some cases the compiler may choose to ignore the always_inline attribute and not inline a function.
By default, inlining is only performed on the next level of function hierarchy, not sub-functions.
Syntax
Place the attribute in the OpenCL source before the function definition to always have it inlined whenever the function is called.
__attribute__((always_inline))
Examples
This example adds the always_inline attribute to function foo:
__attribute__((always_inline))
void foo (int a, int b, int c, int d) {
...
}
See Also
https://gcc.gnu.org
SDAccel Environment Optimization Guide (UG1207)
opencl_unroll_hint
Description
IMPORTANT!: This is a compiler hint which the compiler may ignore.
Loop unrolling is the first optimization technique available in SDAccel. The purpose of the loop unroll optimization is to expose concurrency to the compiler. This newly exposed concurrency reduces latency and improves performance, but also consumes more FPGA fabric resources.
The opencl_unroll_hint attribute is part of the OpenCL Language Specification, and specifies that loops (for, while, do) can be unrolled by the OpenCL compiler. See "Unrolling Loops" in SDAccel Environment Optimization Guide (UG1207) for more information.
The opencl_unroll_hint attribute qualifier must appear immediately before the loop to be affected. You can use this attribute to specify full unrolling of the loop, partial unrolling by a specified amount, or to disable unrolling of the loop.
Syntax
Place the attribute in the OpenCL source before the loop definition:
__attribute__((opencl_unroll_hint(n)))
Where:
n is an optional loop unrolling factor and must be a positive integer, or a compile-time constant expression. An unroll factor of 1 disables unrolling.
TIP: If n is not specified, the compiler automatically determines the unrolling factor for the loop.
Example 1
The following example unrolls the for loop by a factor of 2. This results in two parallel loop iterations instead of four sequential iterations for the compute unit to complete the operation.
__attribute__((opencl_unroll_hint(2)))
for(int i = 0; i < LENGTH; i++) {
bufc[i] = bufa[i] * bufb[i];
}
Conceptually, the compiler transforms the loop above into the code below:
for(int i = 0; i < LENGTH; i+=2) {
bufc[i] = bufa[i] * bufb[i];
bufc[i+1] = bufa[i+1] * bufb[i+1];
}
See Also
SDAccel Environment Optimization Guide (UG1207)
https://www.khronos.org/
The OpenCL C Specification
reqd_work_group_size
Description
When OpenCL kernels are submitted for execution on an OpenCL device, they execute within an index space, called an ND range, which can have 1, 2, or 3 dimensions. This is called the global size in the OpenCL API. The work-group size defines the amount of the ND range that can be processed by a single invocation of a kernel compute unit. The work-group size is also called the local size in the OpenCL API. The OpenCL compiler can determine the work-group size based on the properties of the kernel and selected device. Once the work-group size (local size) has been determined, the ND range (global size) is divided automatically into work-groups, and the work-groups are scheduled for execution on the device.
Although the OpenCL compiler can define the work-group size, specifying the reqd_work_group_size attribute on the kernel to define the work-group size is highly recommended for FPGA implementations of the kernel. The attribute is recommended for performance optimization during the generation of the custom logic for a kernel. See "OpenCL Execution Model" in SDAccel Environment Optimization Guide (UG1207) for more information.
TIP: In the case of an FPGA implementation, the specification of the reqd_work_group_size attribute is highly recommended as it can be used for performance optimization during the generation of the custom logic for a kernel.
OpenCL kernel funcons are executed exactly one me for each point in the ND range index
space. This unit of work for each point in the ND range is called a work-item. Work-items are
organized into work-groups, which are the unit of work scheduled onto compute units. The
oponal reqd_work_group_size denes the work-group size of a compute unit that must be
used as the local_work_size argument to clEnqueueNDRangeKernel. This allows the
compiler to opmize the generated code appropriately for this kernel.
Syntax
Place this aribute before the kernel denion, or before the primary funcon specied for the
kernel:
__attribute__((reqd_work_group_size(
X
,
Y
,
Z
)))
Where:
X, Y, Z: Species the ND range of the kernel. This represents each dimension of a three
dimensional matrix specifying the size of the work-group for the kernel.
Examples
The following OpenCL API C kernel code shows a vector addition design where two arrays of
data are summed into a third array. The required size of the work-group is 16x1x1. This kernel
will execute 16 times to produce a valid result.
#include <clc.h>
// For VHLS OpenCL C kernels, the full work group is synthesized
__attribute__ ((reqd_work_group_size(16, 1, 1)))
__kernel void
vadd(__global int* a,
__global int* b,
__global int* c)
{
int idx = get_global_id(0);
c[idx] = a[idx] + b[idx];
}
See Also
SDAccel Environment Opmizaon Guide (UG1207)
hps://www.khronos.org/
The OpenCL C Specicaon
vec_type_hint
Description
IMPORTANT!: This is a compiler hint which the compiler may ignore.
The oponal __attribute__((vec_type_hint(
<type>
))) is part of the OpenCL
Language Specicaon, and is a hint to the OpenCL compiler represenng the computaonal
width of the kernel, providing a basis for calculang processor bandwidth ulizaon when the
compiler is looking to autovectorize the code.
By default, the kernel is assumed to have the __attribute__((vec_type_hint(int)))
qualier. This lets you specify a dierent vectorizaon type.
Implicit in autovectorizaon is the assumpon that any libraries called from the kernel must be
re-compilable at run me to handle cases where the compiler decides to merge or separate
workitems. This probably means that such libraries can never be hard coded binaries or that hard
coded binaries must be accompanied either by source or some re-targetable intermediate
representaon. This may be a code security queson for some.
Syntax
Place this aribute before the kernel denion, or before the primary funcon specied for the
kernel:
__attribute__((vec_type_hint(
<type>
)))
Where:
<type>: is one of the built-in vector types listed in the following table, or the constuent scalar
element types.
Note: When not specied, the kernel is assumed to have an INT type.
Table 10: Vector Types
Type Description
charn: A vector of n 8-bit signed two's complement integer values.
ucharn: A vector of n 8-bit unsigned integer values.
shortn: A vector of n 16-bit signed two's complement integer values.
ushortn: A vector of n 16-bit unsigned integer values.
intn: A vector of n 32-bit signed two's complement integer values.
uintn: A vector of n 32-bit unsigned integer values.
longn: A vector of n 64-bit signed two's complement integer values.
ulongn: A vector of n 64-bit unsigned integer values.
floatn: A vector of n 32-bit floating-point values.
doublen: A vector of n 64-bit floating-point values.
Note: n is assumed to be 1 when not specified. The vector data type names defined above where
n is any value other than 2, 3, 4, 8, and 16 are also reserved. That is to say, n can only be
specified as 2, 3, 4, 8, or 16.
Examples
The following example autovectorizes assuming double as the basic computation width:
#include <clc.h>
// For VHLS OpenCL C kernels, the full work group is synthesized
__attribute__((vec_type_hint(double)))
__attribute__ ((reqd_work_group_size(16, 1, 1)))
__kernel void
...
See Also
SDAccel Environment Optimization Guide (UG1207)
https://www.khronos.org/
The OpenCL C Specification
work_group_size_hint
Description
IMPORTANT!: This is a compiler hint which the compiler may ignore.
The work-group size in the OpenCL standard defines the size of the ND range space that can be
handled by a single invocation of a kernel compute unit. When OpenCL kernels are submitted for
execution on an OpenCL device, they execute within an index space, called an ND range, which
can have 1, 2, or 3 dimensions. See "OpenCL Execution Model" in SDAccel Environment
Optimization Guide (UG1207) for more information.
OpenCL kernel funcons are executed exactly one me for each point in the ND range index
space. This unit of work for each point in the ND range is called a work-item. Unlike for loops in
C, where loop iteraons are executed sequenally and in-order, an OpenCL runme and device is
free to execute work-items in parallel and in any order.
Work-items are organized into work-groups, which are the unit of work scheduled onto compute
units. The oponal work_group_size_hint aribute is part of the OpenCL Language
Specicaon, and is a hint to the compiler that indicates the work-group size value most likely to
be specied by the local_work_size argument to clEnqueueNDRangeKernel. This allows
the compiler to opmize the generated code according to the expected value.
TIP: In the case of an FPGA implementaon, the specicaon of the
reqd_work_group_size
aribute instead of the
work_group_size_hint
is highly recommended as it can be used for
performance opmizaon during the generaon of the custom logic for a kernel.
Syntax
Place this aribute before the kernel denion, or before the primary funcon specied for the
kernel:
__attribute__((work_group_size_hint(
X
,
Y
,
Z
)))
Where:
X, Y, Z: Species the ND range of the kernel. This represents each dimension of a three
dimensional matrix specifying the size of the work-group for the kernel.
Examples
The following example is a hint to the compiler that the kernel will most likely be executed with a
work-group size of 1:
__attribute__((work_group_size_hint(1, 1, 1)))
__kernel void
...
See Also
SDAccel Environment Optimization Guide (UG1207)
https://www.khronos.org/
The OpenCL C Specification
xcl_array_partition
Description
IMPORTANT!: Array variables only accept one attribute. While xcl_array_partition does
support multi-dimensional arrays, you can only partition one dimension of the array with a single
attribute.
One of the advantages of the FPGA over other compute devices for OpenCL programs is the
ability for the application programmer to customize the memory architecture throughout the
system and into the compute unit. By default, the SDAccel compiler generates a memory
architecture within the compute unit that maximizes local and private memory bandwidth based
on static code analysis of the kernel code. Further optimization of these memories is possible
based on attributes in the kernel source code, which can be used to specify physical layouts and
implementations of local and private memories. The attribute in the SDAccel compiler to control
the physical layout of memories in a compute unit is array_partition.
For one-dimensional arrays, the array_partition attribute implements an array declared
within kernel code as multiple physical memories instead of a single physical memory. The
selection of which partitioning scheme to use depends on the specific application and its
performance goals. The array partitioning schemes available in the SDAccel compiler are
cyclic, block, and complete.
Syntax
Place the aribute with the denion of the array variable:
__attribute__((xcl_array_partition(
<type>
,
<factor>
,
<dimension>
)))
Where:
<type>: Species one of the following paron types:
cyclic: Cyclic paroning is the implementaon of an array as a set of smaller physical
memories that can be accessed simultaneously by the logic in the compute unit. The array
is paroned cyclically by pung one element into each memory before coming back to
the rst memory to repeat the cycle unl the array is fully paroned.
block: Block paroning is the physical implementaon of an array as a set of smaller
memories that can be accessed simultaneously by the logic inside of the compute unit. In
this case, each memory block is lled with elements from the array before moving on to the
next memory.
complete: Complete paroning decomposes the array into individual elements. For a
one-dimensional array, this corresponds to resolving a memory into individual registers.
The default type is complete.
<factor>: For cyclic type partitioning, the factor specifies how many physical memories to
partition the original array into in the kernel code. For block type partitioning, the factor
specifies the number of elements from the original array to store in each physical memory.
IMPORTANT!: For complete type partitioning, the factor is not specified.
<dimension>: Specifies which array dimension to partition. Specified as an integer from 1 to N.
SDAccel supports arrays of N dimensions and can partition the array on any single dimension.
Example 1
For example, consider the following array declaration:
int buffer[16];
The integer array, named buer, stores 16 values that are 32-bits wide each. Cyclic paroning
can be applied to this array with the following declaraon:
int buffer[16] __attribute__((xcl_array_partition(cyclic,4,1)));
In this example, the cyclic paron_type aribute tells SDAccel to distribute the contents of the
array among four physical memories. This aribute increases the immediate memory bandwidth
for operaons accessing the array buer by a factor of four.
All arrays inside of a compute unit in the context of SDAccel are capable of sustaining a
maximum of two concurrent accesses. By dividing the original array in the code into four physical
memories, the resulng compute unit can sustain a maximum of eight concurrent accesses to the
array buer.
Example 2
Using the same integer array as found in Example 1, block partitioning can be applied to the array
with the following declaration:
int buffer[16] __attribute__((xcl_array_partition(block,4,1)));
Since the size of the block is four, SDAccel will generate four physical memories, sequentially
filling each memory with data from the array.
Example 3
Using the same integer array as found in Example 1, complete partitioning can be applied to the
array with the following declaration:
int buffer[16] __attribute__((xcl_array_partition(complete, 1)));
In this example, the array is completely partitioned into distributed RAM, or 16 independent
registers in the programmable logic of the kernel. Because complete is the default, the same
effect can also be accomplished with the following declaration:
int buffer[16] __attribute__((xcl_array_partition));
While this creates an implementation with the highest possible memory bandwidth, it is not
suited to all applications. The way in which data is accessed by the kernel code through either
constant or data-dependent indexes affects the amount of supporting logic that SDx has to build
around each register to ensure functional equivalence with the usage in the original code. As a
general best practice guideline for SDx, the complete partitioning attribute is best suited for
arrays in which at least one dimension of the array is accessed through the use of constant
indexes.
See Also
xcl_array_reshape
pragma HLS array_paron
SDAccel Environment Opmizaon Guide (UG1207)
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
xcl_array_reshape
Description
IMPORTANT!: Array variables only accept one attribute. While xcl_array_reshape does support
multi-dimensional arrays, you can only reshape one dimension of the array with a single attribute.
Combines array paroning with vercal array mapping.
The ARRAY_RESHAPE aribute combines the eect of ARRAY_PARTITION, breaking an array
into smaller arrays, and concatenang elements of arrays by increasing bit-widths. This reduces
the number of block RAM consumed while providing parallel access to the data. This aribute
creates a new array with fewer elements but with greater bit-width, allowing more data to be
accessed in a single clock cycle.
Given the following code:
void foo (...) {
int array1[N] __attribute__((xcl_array_reshape(block, 2, 1)));
int array2[N] __attribute__((xcl_array_reshape(cyclic, 2, 1)));
int array3[N] __attribute__((xcl_array_reshape(complete, 1)));
...
}
The ARRAY_RESHAPE aribute transforms the arrays into the form shown in the following
gure:
Figure 1: ARRAY_RESHAPE
[Figure: array1[N] is reshaped by block into array4[N/2], with elements 0..(N/2-1) in the LSBs
and elements N/2..N-1 in the MSBs of each word; array2[N] is reshaped by cyclic into
array5[N/2], with even elements in the LSBs and odd elements in the MSBs; array3[N] is
reshaped by complete into the single wide word array6[1], with element 0 at the LSB and
element N-1 at the MSB.]
Syntax
Place the aribute with the denion of the array variable:
__attribute__((xcl_array_reshape(
<type>
,
<factor>
,
<dimension>
)))
Where:
<type>: Species one of the following paron types:
cyclic: Cyclic paroning is the implementaon of an array as a set of smaller physical
memories that can be accessed simultaneously by the logic in the compute unit. The array
is paroned cyclically by pung one element into each memory before coming back to
the rst memory to repeat the cycle unl the array is fully paroned.
block: Block paroning is the physical implementaon of an array as a set of smaller
memories that can be accessed simultaneously by the logic inside of the compute unit. In
this case, each memory block is lled with elements from the array before moving on to the
next memory.
complete: Complete paroning decomposes the array into individual elements. For a
one-dimensional array, this corresponds to resolving a memory into individual registers.
The default type is complete.
<factor>: For cyclic type paroning, the factor species how many physical memories to
paron the original array into in the kernel code. For Block type paroning, the factor
species the number of elements from the original array to store in each physical memory.
IMPORTANT!: For
complete
type paroning, the factor should not be specied.
<dimension>: Species which array dimension to paron. Specied as an integer from 1 to N.
SDAccel supports arrays of N dimensions and can paron the array on any single dimension.
Example 1
Reshapes (paron and maps) an 8-bit array with 17 elements, AB[17], into a new 32-bit array
with ve elements using block mapping.
char AB[17] __attribute__((xcl_array_reshape(block,4,1)));
TIP: A factor of 4 indicates that the array should be divided into four. So 17 elements are reshaped
into an array of 5 elements, with four times the bit-width. In this case, the last element, AB[16], is
mapped to the lower eight bits of the fifth element, and the rest of the fifth element is empty.
Example 2
Reshapes the two-dimensional array AB[6][4] into a new array of dimension [6][2], in which
dimension 2 has twice the bit-width:
int AB[6][4] __attribute__((xcl_array_reshape(block,2,2)));
Example 3
Reshapes the three-dimensional 8-bit array, AB[4][2][2], in function foo into a new single-
element array (a register), 128 bits wide (4*2*2*8):
char AB[4][2][2] __attribute__((xcl_array_reshape(complete,0)));
TIP: A dimension of 0 means to reshape all dimensions of the array.
See Also
xcl_array_paron
pragma HLS array_reshape
SDAccel Environment Opmizaon Guide (UG1207)
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
xcl_data_pack
Description
Packs the data elds of a struct into a single scalar with a wider word width.
The xcl_data_pack aribute is used for packing all the elements of a struct into a single
wide vector to reduce the memory required for the variable. This allows all members of the
struct to be read and wrien to simultaneously. The bit alignment of the resulng new wide-
word can be inferred from the declaraon order of the struct elds. The rst eld takes the
LSB of the vector, and the nal element of the struct is aligned with the MSB of the vector.
TIP: Any arrays declared inside the struct are completely partitioned and reshaped into a wide
scalar and packed with other scalar fields.
If a struct contains arrays, those arrays can be optimized using the xcl_array_partition
attribute to partition the array. The xcl_data_pack attribute performs a similar operation as
the complete partitioning of the xcl_array_partition attribute, reshaping the elements in
the struct to a single wide vector.
A struct cannot be optimized with xcl_data_pack and also partitioned. The
xcl_data_pack and xcl_array_partition attributes are mutually exclusive.
You should exercise some caution when using the xcl_data_pack optimization on structs with
large arrays. If an array has 4096 elements of type int, this will result in a vector (and port) of
width 4096*32=131072 bits. SDx can create this RTL design; however, it is very unlikely logic
synthesis will be able to route this during the FPGA implementation.
Syntax
Place within the region where the struct variable is defined:
__attribute__((xcl_data_pack(<variable>, <name>)))
Where:
<variable>: Specifies the variable to be packed.
<name>: Specifies the name of the resultant variable after packing. If no <name> is specified,
the input <variable> is used.
Example 1
Packs the struct array AB[17], with three 8-bit fields (typedef struct {unsigned char R, G, B;}
pixel), in function foo into a new 17-element array of 24 bits.
typedef struct{
unsigned char R, G, B;
} pixel;
pixel AB[17] __attribute__((xcl_data_pack(AB)));
See Also
pragma HLS data_pack
SDAccel Environment Opmizaon Guide (UG1207)
xcl_dataflow
Description
The xcl_dataflow aribute enables task-level pipelining, allowing funcons and loops to
overlap in their operaon, increasing the concurrency of the RTL implementaon, and increasing
the overall throughput of the design.
All operaons are performed sequenally in a C descripon. In the absence of any direcves that
limit resources (such as pragma HLS allocation), Vivado HLS seeks to minimize latency and
improve concurrency. However, data dependencies can limit this. For example, funcons or loops
that access arrays must nish all read/write accesses to the arrays before they complete. This
prevents the next funcon or loop that consumes the data from starng operaon. The dataow
opmizaon enables the operaons in a funcon or loop to start operaon before the previous
funcon or loop completes all its operaons.
When dataow opmizaon is specied, Vivado HLS analyzes the dataow between sequenal
funcons or loops and create channels (based on pingpong RAMs or FIFOs) that allow consumer
funcons or loops to start operaon before the producer funcons or loops have completed.
This allows funcons or loops to operate in parallel, which decreases latency and improves the
throughput of the RTL.
If no iniaon interval (number of cycles between the start of one funcon or loop and the next)
is specied, Vivado HLS aempts to minimize the iniaon interval and start operaon as soon
as data is available.
TIP: Vivado HLS provides dataflow configuration settings. The config_dataflow command
specifies the default memory channel and FIFO depth used in dataflow optimization. Refer to the
Vivado Design Suite User Guide: High-Level Synthesis (UG902) for more information.
For the DATAFLOW optimization to work, the data must flow through the design from one task
to the next. The following coding styles prevent Vivado HLS from performing the DATAFLOW
optimization; refer to UG902 for more information:
Single-producer-consumer violations
Bypassing tasks
Feedback between tasks
Conditional execution of tasks
Loops with multiple exit conditions
IMPORTANT!: If any of these coding styles are present, Vivado HLS issues a message and does not
perform DATAFLOW optimization.
Finally, the DATAFLOW optimization has no hierarchical implementation. If a sub-function or
loop contains additional tasks that might benefit from the DATAFLOW optimization, you must
apply the optimization to the loop, the sub-function, or inline the sub-function.
Syntax
Assign the dataflow aribute before the funcon denion or the loop denion:
__attribute__((xcl_dataflow))
Examples
Species dataow opmizaon within funcon foo.
#pragma HLS dataflow
See Also
pragma HLS dataow
SDAccel Environment Opmizaon Guide (UG1207)
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
xcl_dependence
Description
The xcl_dependence aribute is used to provide addional informaon that can overcome
loop-carry dependencies and allow loops to be pipelined (or pipelined with lower intervals).
Vivado HLS automacally detects dependencies:
Within loops (loop-independent dependence), or
Between dierent iteraons of a loop (loop-carry dependence).
These dependencies impact when operaons can be scheduled, especially during funcon and
loop pipelining.
Loop-independent dependence: The same element is accessed in the same loop iteration.
for (i=0;i<N;i++) {
A[i]=x;
y=A[i];
}
Loop-carry dependence: The same element is accessed in a different loop iteration.
for (i=0;i<N;i++) {
A[i]=A[i-1]*2;
}
Under certain complex scenarios, automatic dependence analysis can be too conservative and fail
to filter out false dependencies. Under certain circumstances, such as variable-dependent array
indexing, or when an external requirement needs to be enforced (for example, two inputs are
never the same index), the dependence analysis might be too conservative. The
xcl_dependence attribute allows you to explicitly specify the dependence and resolve a false
dependence.
IMPORTANT!: Specifying a false dependency, when in fact the dependency is not false, can result in
incorrect hardware. Be sure dependencies are correct (true or false) before specifying them.
Syntax
This aribute must be assigned at the declaraon of the variable:
__attribute__((xcl_dependence(
<class>
<type>
<direction>
distance=
<int>
<dependent>
)))
Where:
<class>: Species a class of variables in which the dependence needs claricaon. Valid values
include array or pointer.
TIP: <class> is mutually exclusive with
variable=
as you can either specify a variable or a class of
variables.
<type>: Valid values include intra or inter. Species whether the dependence is:
intra: dependence within the same loop iteraon. When dependence <type> is specied
as intra, and <dependent> is false, Vivado HLS may move operaons freely within a loop,
increasing their mobility and potenally improving performance or area. When
<dependent> is specied as true, the operaons must be performed in the order specied.
inter: dependence between dierent loop iteraons. This is the default <type>. If
dependence <type> is specied as inter, and <dependent> is false, it allows Vivado HLS
to perform operaons in parallel if the funcon or loop is pipelined, or the loop is unrolled,
or parally unrolled, and prevents such concurrent operaon when <dependent> is
specied as true.
<direcon>: Valid values include RAW, WAR, or WAW. This is relevant for loop-carry
dependencies only, and species the direcon for a dependence:
RAW (Read-Aer-Write - true dependence) The write instrucon uses a value used by the
read instrucon.
WAR (Write-Aer-Read - an dependence) The read instrucon gets a value that is
overwrien by the write instrucon.
WAW (Write-Aer-Write - output dependence) Two write instrucons write to the same
locaon, in a certain order.
distance=
<int>
: Species the inter-iteraon distance for array access. Relevant only for
loop-carry dependencies where dependence is set to true.
<dependent>: Species whether a dependence needs to be enforced (true) or removed
(false). The default is true.
Example 1
In the following example, Vivado HLS does not have any knowledge about the value of cols and
conservatively assumes that there is always a dependence between the write to buff_A[1][col]
and the read from buff_A[1][col]. In an algorithm such as this, it is unlikely cols will
ever be zero, but Vivado HLS cannot make assumptions about data dependencies. To overcome
this deficiency, you can use the xcl_dependence attribute to state that there is no
dependence between loop iterations (in this case, for both buff_A and buff_B).
void foo(int rows, int cols, ...)
for (row = 0; row < rows + 1; row++) {
for (col = 0; col < cols + 1; col++)
__attribute__((xcl_pipeline_loop(II=1)))
{
if (col < cols) {
buff_A[2][col] = buff_A[1][col] __attribute__((xcl_dependence(inter false))); // read from buff_A
buff_A[1][col] = buff_A[0][col]; // write to buff_A
buff_B[1][col] = buff_B[0][col] __attribute__((xcl_dependence(inter false)));
temp = buff_A[0][col];
}
Example 2
Removes the dependence between Var1 in the same iterations of loop_1 in function foo.
__attribute__((xcl_dependence(intra false)));
Example 3
Denes the dependence on all arrays in loop_2 of funcon foo to inform Vivado HLS that all
reads must happen aer writes (RAW) in the same loop iteraon.
__attribute__((xcl_dependence(array intra RAW true)));
See Also
pragma HLS dependence
SDAccel Environment Optimization Guide (UG1207)
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
xcl_max_work_group_size
Description
Use this aribute instead of reqd_work_group_size when you need to specify a larger kernel
than the 4K size.
Extends the default maximum work group size supported in SDx by the
reqd_work_group_size aribute. SDx supports work size larger than 4096 with the Xilinx
aribute xcl_max_work_group_size.
Note: The actual workgroup size limit is dependent on the Xilinx device selected for the plaorm.
Syntax
Place this aribute before the kernel denion, or before the primary funcon specied for the
kernel:
__attribute__((xcl_max_work_group_size(
X
,
Y
,
Z
)))
Where:
X, Y, Z: Species the ND range of the kernel. This represents each dimension of a three
dimensional matrix specifying the size of the work-group for the kernel.
Example 1
Below is the kernel source code for an unoptimized adder. No attributes were specified for this
design, other than the work size equal to the size of the matrices (i.e., 64x64). That is, iterating
over an entire work-group will fully add the input matrices a and b and output the result to
output. All three are global integer pointers, which means each value in the matrices is four bytes
and is stored in off-chip DDR global memory.
#define RANK 64
__kernel __attribute__ ((reqd_work_group_size(RANK, RANK, 1)))
void madd(__global int* a, __global int* b, __global int* output) {
int index = get_local_id(1)*get_local_size(0) + get_local_id(0);
output[index] = a[index] + b[index];
}
This local work size of (64, 64, 1) is the same as the global work size. It should be noted that this
setting creates a total work size of 4096.
Note: This is the largest work size that SDAccel supports with the standard OpenCL attribute
reqd_work_group_size. SDAccel supports work sizes larger than 4096 with the Xilinx attribute
xcl_max_work_group_size.
Any matrix larger than 64x64 would need to use only one dimension to define the work size.
That is, a 128x128 matrix could be operated on by a kernel with a work size of (128, 1, 1), where
each invocation operates on an entire row or column of data.
See Also
SDAccel Environment Optimization Guide (UG1207)
https://www.khronos.org/
The OpenCL C Specification
xcl_pipeline_loop
Description
Pipeline a loop to improve latency and throughput. Although loop unrolling exposes concurrency,
it does not address the issue of keeping all elements in a kernel data path busy at all times. This is
necessary for maximizing kernel throughput and performance. Even in an unrolled case, loop
control dependencies can lead to sequential behavior. The sequential behavior of operations
results in idle hardware and a loss of performance.
Xilinx addresses this issue by introducing a vendor extension on top of the OpenCL 2.0
specification for loop pipelining. The Xilinx attribute for loop pipelining is xcl_pipeline_loop.
By default, the SDAccel compiler automatically applies this attribute on the innermost loop
with a trip count of more than 64, or on its parent loop when its trip count is less than or equal to 64.
Syntax
Place the aribute in the OpenCL source before the loop denion:
__attribute__((xcl_pipeline_loop))
Examples
The following example pipelines LOOP_1 of function vaccum to improve performance:
__kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void vaccum(__global const int* a, __global const int* b, __global int*
result)
{
int tmp = 0;
__attribute__((xcl_pipeline_loop))
LOOP_1: for (int i=0; i < 32; i++) {
tmp += a[i] * b[i];
}
result[0] = tmp;
}
See Also
pragma HLS pipeline
SDAccel Environment Optimization Guide (UG1207)
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
xcl_pipeline_workitems
Description
Pipeline a work-item to improve latency and throughput. Work-item pipelining is the extension of
loop pipelining to the kernel work-group. This is necessary for maximizing kernel throughput and
performance.
Syntax
Place the aribute in the OpenCL source before the elements to pipeline:
__attribute__((xcl_pipeline_workitems))
Example 1
In order to handle the reqd_work_group_size attribute in the following example, SDAccel
automatically inserts a loop nest to handle the three-dimensional characteristics of the ND range
(3,1,1). As a result of the added loop nest, the execution profile of this kernel is like an
unpipelined loop. Adding the xcl_pipeline_workitems attribute adds concurrency and
improves the throughput of the code.
kernel
__attribute__ ((reqd_work_group_size(3,1,1)))
void foo(...)
{
...
__attribute__((xcl_pipeline_workitems)) {
int tid = get_global_id(0);
op_Read(tid);
op_Compute(tid);
op_Write(tid);
}
...
}
Example 2
The following example adds the work-item pipeline to the appropriate elements of the kernel:
__kernel __attribute__ ((reqd_work_group_size(8, 8, 1)))
void madd(__global int* a, __global int* b, __global int* output)
{
int rank = get_local_size(0);
__local unsigned int bufa[64];
__local unsigned int bufb[64];
__attribute__((xcl_pipeline_workitems)) {
int x = get_local_id(0);
int y = get_local_id(1);
bufa[x*rank + y] = a[x*rank + y];
bufb[x*rank + y] = b[x*rank + y];
}
barrier(CLK_LOCAL_MEM_FENCE);
__attribute__((xcl_pipeline_workitems)) {
int index = get_local_id(1)*rank + get_local_id(0);
output[index] = bufa[index] + bufb[index];
}
}
See Also
pragma HLS pipeline
SDAccel Environment Opmizaon Guide (UG1207)
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
xcl_reqd_pipe_depth
Description
IMPORTANT! Pipes must be declared in lower-case alphanumerics. In addition, printf() is not supported with variables used in pipes.
The OpenCL 2.0 specicaon introduces a new memory object called pipe. A pipe stores data
organized as a FIFO. Pipes can be used to stream data from one kernel to another inside the
FPGA device without having to use the external memory, which greatly improves the overall
system latency.
In the SDAccel development environment, pipes must be stacally dened outside of all kernel
funcons:. The depth of a pipe must be specied by using the xcl_reqd_pipe_depth
aribute in the pipe declaraon:
pipe int p0 __attribute__((xcl_reqd_pipe_depth(512)));
Pipes can only be accessed using the standard OpenCL read_pipe() and write_pipe() built-in functions in non-blocking mode, or using the Xilinx extended read_pipe_block() and write_pipe_block() functions in blocking mode.
IMPORTANT! A given pipe can have one and only one producer and consumer, in different kernels.
Pipe objects are not accessible from the host CPU. The status of pipes can be queried using the OpenCL get_pipe_num_packets() and get_pipe_max_packets() built-in functions. See The OpenCL C Specification from the Khronos OpenCL Working Group for more details on these built-in functions.
Appendix A: OpenCL Attributes
Vivado HLS Optimization Methodology Guide 77
UG1270 (v2017.4) December 20, 2017 www.xilinx.com [placeholder text]
Syntax
This aribute must be assigned at the declaraon of the pipe object:
pipe int
id
__attribute__((xcl_reqd_pipe_depth(
n
)));
Where:
id: Species an idener for the pipe, which must consist of lower-case alphanumerics. For
example infifo1 not inFifo1.
n: Species the depth of the pipe. Valid depth values are 16, 32, 64, 128, 256, 512, 1024,
2048, 4096, 8192, 16384, 32768.
Examples
The following is the dataflow_pipes_ocl example from the Xilinx GitHub repository, which uses pipes to pass data from one processing stage to the next using the blocking read_pipe_block() and write_pipe_block() functions:
pipe int p0 __attribute__((xcl_reqd_pipe_depth(32)));
pipe int p1 __attribute__((xcl_reqd_pipe_depth(32)));
// Input Stage Kernel : Read Data from Global Memory and write into Pipe P0
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void input_stage(__global int *input, int size)
{
__attribute__((xcl_pipeline_loop))
mem_rd: for (int i = 0 ; i < size ; i++)
{
//blocking Write command to pipe P0
write_pipe_block(p0, &input[i]);
}
}
// Adder Stage Kernel: Read Input data from Pipe P0 and write the result
// into Pipe P1
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void adder_stage(int inc, int size)
{
__attribute__((xcl_pipeline_loop))
execute: for(int i = 0 ; i < size ; i++)
{
int input_data, output_data;
//blocking read command to Pipe P0
read_pipe_block(p0, &input_data);
output_data = input_data + inc;
//blocking write command to Pipe P1
write_pipe_block(p1, &output_data);
}
}
// Output Stage Kernel: Read result from Pipe P1 and write the result to
// Global Memory
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void output_stage(__global int *output, int size)
{
__attribute__((xcl_pipeline_loop))
mem_wr: for (int i = 0 ; i < size ; i++)
{
//blocking read command to Pipe P1
read_pipe_block(p1, &output[i]);
}
}
See Also
SDAccel Environment Proling and Opmizaon Guide (UG1207)
hps://www.khronos.org/
The OpenCL C Specicaon
xcl_zero_global_work_offset
Description
If you use clEnqueueNDRangeKernel with the global_work_offset set to NULL or all zeros, you can use this attribute to tell the compiler that the global_work_offset is always zero.
This aribute can improve memory performance when you have memory accesses like:
A[get_global_id(x)] = ...;
Note: You can specify reqd_work_group_size, vec_type_hint, and
xcl_zero_global_work_offset together to maximize performance.
Syntax
Place this aribute before the kernel denion, or before the primary funcon specied for the
kernel:
__kernel __attribute__((xcl_zero_global_work_offset))
void test (__global short *input, __global short *output, __constant short
*constants) { }
See Also
reqd_work_group_size
vec_type_hint
clEnqueueNDRangeKernel
SDAccel Environment Proling and Opmizaon Guide (UG1207)
Appendix B
HLS Pragmas
Optimizations in Vivado HLS
In both SDAccel and SDSoC projects, the hardware kernel must be synthesized from the OpenCL, C, or C++ language into RTL that can be implemented into the programmable logic of a Xilinx device. Vivado HLS synthesizes the RTL from the OpenCL, C, and C++ language descriptions.
Vivado HLS is intended to work with your SDAccel or SDSoC Development Environment project without interaction. However, Vivado HLS also provides pragmas that can be used to optimize the design: reduce latency, improve throughput performance, and reduce area and device resource utilization of the resulting RTL code. These pragmas can be added directly to the source code for the kernel.
IMPORTANT! Although the SDSoC environment supports the use of HLS pragmas, it does not support pragmas applied to any argument of the function interface (the interface, array_partition, or data_pack pragmas). Refer to "Optimizing the Hardware Function" in the SDSoC Environment Optimization Guide (UG1235) for more information.
The Vivado HLS pragmas include the optimization types specified below:
Table 11: Vivado HLS Pragmas by Type
Kernel Optimization: pragma HLS allocation, pragma HLS clock, pragma HLS expression_balance, pragma HLS latency, pragma HLS reset, pragma HLS resource, pragma HLS top
Function Inlining: pragma HLS inline, pragma HLS function_instantiate
Interface Synthesis: pragma HLS interface, pragma HLS protocol
Task-level Pipeline: pragma HLS dataflow, pragma HLS stream
Pipeline: pragma HLS pipeline, pragma HLS occurrence
Loop Unrolling: pragma HLS unroll, pragma HLS dependence
Loop Optimization: pragma HLS loop_flatten, pragma HLS loop_merge, pragma HLS loop_tripcount
Array Optimization: pragma HLS array_map, pragma HLS array_partition, pragma HLS array_reshape
Structure Packing: pragma HLS data_pack
pragma HLS allocation
Description
Species instance restricons to limit resource allocaon in the implemented kernel. This denes,
and can limit, the number of RTL instances and hardware resources used to implement specic
funcons, loops, operaons or cores. The ALLOCATION pragma is specied inside the body of a
funcon, a loop, or a region of code.
For example, if the C source has four instances of a funcon foo_sub, the ALLOCATION pragma
can ensure that there is only one instance of foo_sub in the nal RTL. All four instances of the C
funcon are implemented using the same RTL block. This reduces resources ulized by the
funcon, but negavely impacts performance.
The operaons in the C code, such as addions, mulplicaons, array reads, and writes, can be
limited by the ALLOCATION pragma. Cores, which operators are mapped to during synthesis, can
be limited in the same manner as the operators. Instead of liming the total number of
mulplicaon operaons, you can choose to limit the number of combinaonal mulplier cores,
forcing any remaining mulplicaons to be performed using pipelined mulpliers (or vice versa).
The ALLOCATION pragma applies to the scope it is specied within: a funcon, a loop, or a
region of code. However, you can use the -min_op argument of the config_bind command
to globally minimize operators throughout the design.
TIP: For more informaon refer to "Controlling Hardware Resources" and
config_bind
in Vivado
Design Suite User Guide: High-Level Synthesis (UG902).
Syntax
Place the pragma inside the body of the function, loop, or region where it will apply.
#pragma HLS allocation instances=<list> limit=<value> <type>
Where:
• instances=<list>: Specifies the names of functions, operators, or cores.
• limit=<value>: Optionally specifies the limit of instances to be used in the kernel.
• <type>: Specifies that the allocation applies to a function, an operation, or a core (hardware component) used to create the design (such as adders, multipliers, pipelined multipliers, and block RAM). The type is specified as one of the following:
  • function: Specifies that the allocation applies to the functions listed in the instances= list. The function can be any function in the original C or C++ code that has NOT been:
    - Inlined by the pragma HLS inline, or the set_directive_inline command, or
    - Inlined automatically by Vivado HLS.
  • operation: Specifies that the allocation applies to the operations listed in the instances= list. Refer to Vivado Design Suite User Guide: High-Level Synthesis (UG902) for a complete list of the operations that can be limited using the ALLOCATION pragma.
  • core: Specifies that the ALLOCATION applies to the cores, which are the specific hardware components used to create the design (such as adders, multipliers, pipelined multipliers, and block RAM). The actual core to use is specified in the instances= option. In the case of cores, you can specify which core the tool should use, or you can define a limit for the specified core.
Example 1
Given a design with mulple instances of funcon foo, this example limits the number of
instances of foo in the RTL for the hardware kernel to 2.
#pragma HLS allocation instances=foo limit=2 function
Example 2
Limits the number of mulplier operaons used in the implementaon of the funcon my_func
to 1. This limit does not apply to any mulpliers outside of my_func, or mulpliers that might
reside in sub-funcons of my_func.
TIP: To limit the mulpliers used in the implementaon of any sub-funcons, specify an allocaon
direcve on the sub-funcons or inline the sub-funcon into funcon
my_func
.
void my_func(data_t angle) {
#pragma HLS allocation instances=mul limit=1 operation
...
}
See Also
pragma HLS funcon_instanate
pragma HLS inline
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS array_map
Description
Combines mulple smaller arrays into a single large array to help reduce block RAM resources.
Designers typically use the pragma HLS array_map command (with the same instance= target) to combine multiple smaller arrays into a single larger array. This larger array can then be targeted to a single larger memory (RAM or FIFO) resource.
Each array is mapped into a block RAM or UltraRAM, when supported by the device. The basic block RAM unit provided in an FPGA is 18K. If many small arrays do not use the full 18K, a better use of the block RAM resources is to map many small arrays into a single larger array.
TIP: If a block RAM is larger than 18K, it is automatically mapped into multiple 18K units.
The ARRAY_MAP pragma supports two ways of mapping small arrays into a larger one:
• Horizontal mapping: this corresponds to creating a new array by concatenating the original arrays. Physically, this gets implemented as a single array with more elements.
• Vertical mapping: this corresponds to creating a new array by concatenating the original words in the array. Physically, this gets implemented as a single array with a larger bit-width.
The arrays are concatenated in the order that the pragmas are specified, starting at:
• Target element zero for horizontal mapping, or
• Bit zero for vertical mapping.
Syntax
Place the pragma in the C source within the boundaries of the function where the array variable is defined.
#pragma HLS array_map variable=<name> instance=<instance> <mode> offset=<int>
Where:
• variable=<name>: A required argument that specifies the array variable to be mapped into the new target array <instance>.
• instance=<instance>: Specifies the name of the new array to merge arrays into.
• <mode>: Optionally specifies the array map as being either horizontal or vertical. Horizontal mapping is the default <mode>, and concatenates the arrays to form a new array with more elements. Vertical mapping concatenates the arrays to form a new array with longer words.
• offset=<int>: Applies to horizontal type array mapping only. The offset specifies an integer value offset to apply before mapping the array into the new array <instance>. For example:
  • Element 0 of the array variable maps to element <int> of the new target.
  • Other elements map to <int+1>, <int+2>... of the new target.
IMPORTANT! If an offset is not specified, Vivado HLS calculates the required offset automatically to avoid overlapping array elements.
Example 1
Arrays array1 and array2 in function foo are mapped into a single array, specified as array3 in the following example:
void foo (...) {
int8 array1[M];
int12 array2[N];
#pragma HLS ARRAY_MAP variable=array1 instance=array3 horizontal
#pragma HLS ARRAY_MAP variable=array2 instance=array3 horizontal
...
loop_1: for(i=0;i<M;i++) {
array1[i] = ...;
array2[i] = ...;
...
}
...
}
Example 2
This example provides a horizontal mapping of array A[10] and array B[15] in function foo into a single new array AB[25].
• Element AB[0] will be the same as A[0].
• Element AB[10] will be the same as B[0] because no offset= option is specified.
• The bit-width of array AB[25] will be the maximum bit-width of either A[10] or B[15].
#pragma HLS array_map variable=A instance=AB horizontal
#pragma HLS array_map variable=B instance=AB horizontal
Example 3
The following example performs a vertical concatenation of arrays C and D into a new array CD, with the bit-width of C and D combined. The number of elements in CD is the maximum of the original arrays, C or D:
#pragma HLS array_map variable=C instance=CD vertical
#pragma HLS array_map variable=D instance=CD vertical
See Also
pragma HLS array_paron
pragma HLS array_reshape
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS array_partition
Description
Parons an array into smaller arrays or individual elements.
This paroning:
Results in RTL with mulple small memories or mulple registers instead of one large memory.
Eecvely increases the amount of read and write ports for the storage.
Potenally improves the throughput of the design.
Requires more memory instances or registers.
Syntax
Place the pragma in the C source within the boundaries of the function where the array variable is defined.
#pragma HLS array_partition variable=<name> <type> factor=<int> dim=<int>
Where:
• variable=<name>: A required argument that specifies the array variable to be partitioned.
• <type>: Optionally specifies the partition type. The default type is complete. The following types are supported:
  • cyclic: Cyclic partitioning creates smaller arrays by interleaving elements from the original array. The array is partitioned cyclically by putting one element into each new array before coming back to the first array to repeat the cycle until the array is fully partitioned. For example, if factor=3 is used:
    - Element 0 is assigned to the first new array.
    - Element 1 is assigned to the second new array.
    - Element 2 is assigned to the third new array.
    - Element 3 is assigned to the first new array again.
  • block: Block partitioning creates smaller arrays from consecutive blocks of the original array. This effectively splits the array into N equal blocks, where N is the integer defined by the factor= argument.
  • complete: Complete partitioning decomposes the array into individual elements. For a one-dimensional array, this corresponds to resolving a memory into individual registers. This is the default <type>.
• factor=<int>: Specifies the number of smaller arrays that are to be created.
IMPORTANT! For complete type partitioning, the factor is not specified. For block and cyclic partitioning, the factor= is required.
• dim=<int>: Specifies which dimension of a multi-dimensional array to partition. Specified as an integer from 0 to N, for an array with N dimensions:
  • If a value of 0 is used, all dimensions of a multi-dimensional array are partitioned with the specified type and factor options.
  • Any non-zero value partitions only the specified dimension. For example, if a value of 1 is used, only the first dimension is partitioned.
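The block and cyclic placements described above can be sketched as a small software model. This is illustrative only, not a Xilinx API, and it assumes the factor divides the array size evenly:

```c
#include <assert.h>

/* Illustrative model of ARRAY_PARTITION element placement (not a
   Xilinx API). Assumes 'factor' divides 'size' evenly. */

/* block: consecutive chunks of size/factor elements per sub-array */
static int block_subarray(int i, int size, int factor) {
    return i / (size / factor);
}

/* cyclic: elements dealt round-robin across the sub-arrays */
static int cyclic_subarray(int i, int factor) {
    return i % factor;
}
```

For an 8-element array with factor=2, element 5 lands in the second block sub-array, while cyclic partitioning alternates elements between the two sub-arrays.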
Example 1
This example parons the 13 element array, AB[13], into four arrays using block paroning:
#pragma HLS array_partition variable=AB block factor=4
TIP: Because four is not an integer factor of 13:
• Three of the new arrays have three elements each.
• One array has four elements (AB[9:12]).
Example 2
This example parons dimension two of the two-dimensional array, AB[6][4] into two new
arrays of dimension [6][2]:
#pragma HLS array_partition variable=AB block factor=2 dim=2
Example 3
This example parons the second dimension of the two-dimensional in_local array into
individual elements.
int in_local[MAX_SIZE][MAX_DIM];
#pragma HLS ARRAY_PARTITION variable=in_local complete dim=2
See Also
pragma HLS array_map
pragma HLS array_reshape
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
xcl_array_paron
SDAccel Environment Opmizaon Guide (UG1207)
pragma HLS array_reshape
Description
Combines array paroning with vercal array mapping.
The ARRAY_RESHAPE pragma combines the eect of ARRAY_PARTITION, breaking an array
into smaller arrays, with the eect of the vercal type of ARRAY_MAP, concatenang elements of
arrays by increasing bit-widths. This reduces the number of block RAM consumed while
providing the primary benet of paroning: parallel access to the data. This pragma creates a
new array with fewer elements but with greater bit-width, allowing more data to be accessed in a
single clock cycle.
Given the following code:
void foo (...) {
int array1[N];
int array2[N];
int array3[N];
#pragma HLS ARRAY_RESHAPE variable=array1 block factor=2 dim=1
#pragma HLS ARRAY_RESHAPE variable=array2 cyclic factor=2 dim=1
#pragma HLS ARRAY_RESHAPE variable=array3 complete dim=1
...
}
The ARRAY_RESHAPE pragma transforms the arrays into the form shown in the following figure:
Figure 2: ARRAY_RESHAPE Pragma
[Figure: array1[N] is reshaped with block into array4[N/2], array2[N] with cyclic into array5[N/2], and array3[N] with complete into the single wide-word register array6[1]. In each case the relocated elements are concatenated vertically, MSB to LSB, into wider words.]
Syntax
Place the pragma in the C source within the region of a function where the array variable is defined.
#pragma HLS array_reshape variable=<name> <type> factor=<int> dim=<int>
Where:
• <name>: A required argument that specifies the array variable to be reshaped.
• <type>: Optionally specifies the partition type. The default type is complete. The following types are supported:
  • cyclic: Cyclic reshaping creates smaller arrays by interleaving elements from the original array. For example, if factor=3 is used, element 0 is assigned to the first new array, element 1 to the second new array, element 2 is assigned to the third new array, and then element 3 is assigned to the first new array again. The final array is a vertical concatenation (word concatenation, to create longer words) of the new arrays into a single array.
  • block: Block reshaping creates smaller arrays from consecutive blocks of the original array. This effectively splits the array into N equal blocks, where N is the integer defined by factor=, and then combines the N blocks into a single array with word-width*N.
  • complete: Complete reshaping decomposes the array into temporary individual elements and then recombines them into an array with a wider word. For a one-dimensional array this is equivalent to creating a very-wide register (if the original array was N elements of M bits, the result is a register with N*M bits). This is the default type of array reshaping.
• factor=<int>: Specifies the amount to divide the current array by (or the number of temporary arrays to create). A factor of 2 splits the array in half, while doubling the bit-width. A factor of 3 divides the array into three, with triple the bit-width.
IMPORTANT! For complete type partitioning, the factor is not specified. For block and cyclic reshaping, the factor= is required.
• dim=<int>: Specifies which dimension of a multi-dimensional array to partition. Specified as an integer from 0 to N, for an array with N dimensions:
  • If a value of 0 is used, all dimensions of a multi-dimensional array are partitioned with the specified type and factor options.
  • Any non-zero value partitions only the specified dimension. For example, if a value of 1 is used, only the first dimension is partitioned.
• object: A keyword relevant for container arrays only. When the keyword is specified, the ARRAY_RESHAPE pragma applies to the objects in the container, reshaping all dimensions of the objects within the container, but preserving all dimensions of the container itself. When the keyword is not specified, the pragma applies to the container array and not the objects.
Example 1
Reshapes (paron and maps) an 8-bit array with 17 elements, AB[17], into a new 32-bit array
with ve elements using block mapping.
#pragma HLS array_reshape variable=AB block factor=4
TIP: factor=4 indicates that the array should be divided into four. So 17 elements are reshaped into an array of 5 elements, with four times the bit-width. In this case, the last element, AB[16], is mapped to the lower eight bits of the fifth element, and the rest of the fifth element is empty.
Example 2
Reshapes the two-dimensional array AB[6][4] into a new array of dimension [6][2], in which
dimension 2 has twice the bit-width:
#pragma HLS array_reshape variable=AB block factor=2 dim=2
Example 3
Reshapes the three-dimensional 8-bit array, AB[4][2][2] in function foo, into a new single-element array (a register), 128 bits wide (4*2*2*8):
#pragma HLS array_reshape variable=AB complete dim=0
TIP: dim=0 means to reshape all dimensions of the array.
See Also
pragma HLS array_map
pragma HLS array_paron
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
SDAccel Environment Proling and Opmizaon Guide (UG1207)
pragma HLS clock
Description
Applies the named clock to the specified function.
C and C++ designs support only a single clock. The clock period specified by create_clock is applied to all functions in the design.
SystemC designs support multiple clocks. Multiple named clocks can be specified using the create_clock command and applied to individual SC_MODULEs using pragma HLS clock. Each SC_MODULE is synthesized using a single clock.
Syntax
Place the pragma in the C source within the body of the function.
#pragma HLS clock domain=<clock>
Where:
• domain=<clock>: Specifies the clock name.
IMPORTANT!: The specied clock must already exist by the
create_clock
command. There is no
pragma equivalent of the create_clock command. See the Vivado Design Suite User Guide: High-Level
Synthesis (UG902) for more informaon.
Example 1
Assume a SystemC design in which the top-level, foo_top, has clock ports fast_clock and slow_clock. However, foo_top uses only fast_clock within its function. A sub-block, foo_sub, uses only slow_clock.
In this example, the following create_clock commands are specified in the script.tcl file, which is specified when the Vivado HLS tool is launched:
create_clock -period 15 fast_clock
create_clock -period 60 slow_clock
Then the following pragmas are specified in the C source file to assign the clock to the specified functions, foo_sub and foo_top:
void foo_sub (p, q) {
#pragma HLS clock domain=slow_clock
...
}
void foo_top (a, b, c, d) {
#pragma HLS clock domain=fast_clock
...
}
See Also
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS data_pack
Description
Packs the data elds of a struct into a single scalar with a wider word width.
The DATA_PACK pragma is used for packing all the elements of a struct into a single wide vector to reduce the memory required for the variable, while allowing all members of the struct to be read and written simultaneously. The bit alignment of the resulting new wide-word can be inferred from the declaration order of the struct fields. The first field takes the LSB of the vector, and the final element of the struct is aligned with the MSB of the vector.
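For a struct with three unsigned char fields R, G, and B, this alignment can be modeled in plain C as follows. This is an illustrative software model of the packing order, not generated hardware:

```c
#include <stdint.h>

/* Software model of DATA_PACK bit alignment for
   struct { unsigned char R, G, B; }: the first field (R) occupies the
   LSBs of the packed 24-bit word, and the last field (B) the MSBs. */
static uint32_t pack_pixel(uint8_t r, uint8_t g, uint8_t b) {
    return (uint32_t)r | ((uint32_t)g << 8) | ((uint32_t)b << 16);
}
```

Packing R=0x11, G=0x22, B=0x33 yields the word 0x332211, with R in the low byte.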
If the struct contains arrays, the DATA_PACK pragma performs a similar operation as the ARRAY_RESHAPE pragma and combines the reshaped array with the other elements in the struct. Any arrays declared inside the struct are completely partitioned and reshaped into a wide scalar and packed with other scalar fields. However, a struct cannot be optimized with DATA_PACK and ARRAY_PARTITION or ARRAY_RESHAPE, as those pragmas are mutually exclusive.
IMPORTANT! You should exercise some caution when using the DATA_PACK optimization on struct objects with large arrays. If an array has 4096 elements of type int, this will result in a vector (and port) of width 4096*32 = 131072 bits. Vivado HLS can create this RTL design; however, it is very unlikely that logic synthesis will be able to route this during the FPGA implementation.
In general, Xilinx recommends that you use arbitrary precision (or bit-accurate) data types. Standard C types are based on 8-bit boundaries (8-bit, 16-bit, 32-bit, 64-bit); however, using arbitrary precision data types in a design lets you specify the exact bit-sizes in the C code prior to synthesis. The bit-accurate widths result in hardware operators that are smaller and faster. This allows more logic to be placed in the FPGA and for the logic to execute at higher clock frequencies. However, the DATA_PACK pragma also lets you align data in the packed struct along 8-bit boundaries if needed.
If a struct port is to be implemented with an AXI4 interface, you should consider using the DATA_PACK <byte_pad> option to automatically align member elements of the struct to 8-bit boundaries. The AXI4-Stream protocol requires that TDATA ports of the IP have a width in multiples of 8. It is a specification violation to define an AXI4-Stream IP with a TDATA port width that is not a multiple of 8; therefore, it is a requirement to round up TDATA widths to byte multiples. Refer to "Interface Synthesis and Structs" in Vivado Design Suite User Guide: High-Level Synthesis (UG902) for more information.
Syntax
Place the pragma near the definition of the struct variable to pack:
#pragma HLS data_pack variable=<variable> instance=<name> <byte_pad>
Where:
• variable=<variable>: Specifies the variable to be packed.
• instance=<name>: Specifies the name of the resultant variable after packing. If no <name> is specified, the input <variable> is used.
• <byte_pad>: Optionally specifies whether to pack data on an 8-bit boundary (8-bit, 16-bit, 24-bit...). The two supported values for this option are:
  • struct_level: Pack the whole struct first, then pad it upward to the next 8-bit boundary.
  • field_level: First pad each individual element (field) of the struct on an 8-bit boundary, then pack the struct.
TIP: Deciding whether multiple fields of data should be concatenated together before (field_level) or after (struct_level) alignment to byte boundaries is generally determined by considering how atomic the data is. Atomic information is data that can be interpreted on its own, whereas non-atomic information is incomplete for the purpose of interpreting the data. For example, atomic data can consist of all the bits of information in a floating point number. However, the exponent bits in the floating point number alone would not be atomic. When packing information into TDATA, generally non-atomic bits of data are concatenated together (regardless of bit width) until they form atomic units. The atomic units are then aligned to byte boundaries using pad bits where necessary.
Example 1
Packs the struct array AB[17] with three 8-bit fields (R, G, B) into a new 17-element array of 24 bits.
typedef struct{
unsigned char R, G, B;
} pixel;
pixel AB[17];
#pragma HLS data_pack variable=AB
Example 2
Packs the struct pointer AB with three 8-bit fields (typedef struct {unsigned char R, G, B;} pixel) in function foo into a new 24-bit pointer.
typedef struct{
unsigned char R, G, B;
} pixel;
pixel AB;
#pragma HLS data_pack variable=AB
Example 3
In this example the DATA_PACK pragma is specified for the in and out arguments of the rgb_to_hsv function to instruct the compiler to pack the structure on an 8-bit boundary to improve memory access:
void rgb_to_hsv(RGBcolor* in,  // Access global memory as RGBcolor struct-wise
                HSVcolor* out, // Access global memory as HSVcolor struct-wise
                int size) {
#pragma HLS data_pack variable=in struct_level
#pragma HLS data_pack variable=out struct_level
...
}
See Also
pragma HLS array_paron
pragma HLS array_reshape
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS dataflow
Description
The DATAFLOW pragma enables task-level pipelining, allowing functions and loops to overlap in their operation, increasing the concurrency of the RTL implementation, and increasing the overall throughput of the design.
All operaons are performed sequenally in a C descripon. In the absence of any direcves that
limit resources (such as pragma HLS allocation), Vivado HLS seeks to minimize latency and
improve concurrency. However, data dependencies can limit this. For example, funcons or loops
that access arrays must nish all read/write accesses to the arrays before they complete. This
prevents the next funcon or loop that consumes the data from starng operaon. The
DATAFLOW opmizaon enables the operaons in a funcon or loop to start operaon before
the previous funcon or loop completes all its operaons.
Figure 3: DATAFLOW Pragma
void top (a,b,c,d) {
  ...
  func_A(a,b,i1);
  func_B(c,i1,i2);
  func_C(i2,d);
  return d;
}
(A) Without dataflow pipelining, func_A, func_B, and func_C execute sequentially, completing in 8 cycles. (B) With dataflow pipelining, the three functions overlap: a new execution starts every 3 cycles, and each completes in 5 cycles. (X14266-110217)
When the DATAFLOW pragma is specified, Vivado HLS analyzes the dataflow between sequential
functions or loops and creates channels (based on ping-pong RAMs or FIFOs) that allow consumer
functions or loops to start operation before the producer functions or loops have completed.
This allows functions or loops to operate in parallel, which decreases latency and improves the
throughput of the RTL.
If no initiation interval (number of cycles between the start of one function or loop and the next)
is specified, Vivado HLS attempts to minimize the initiation interval and start operation as soon
as data is available.
TIP: The config_dataflow command specifies the default memory channel and FIFO depth used
in dataflow optimization. Refer to the config_dataflow command in the Vivado Design Suite User
Guide: High-Level Synthesis (UG902) for more information.
For the DATAFLOW opmizaon to work, the data must ow through the design from one task to
the next. The following coding styles prevent Vivado HLS from performing the DATAFLOW
opmizaon:
Single-producer-consumer violations
Bypassing tasks
Feedback between tasks
Conditional execution of tasks
Loops with multiple exit conditions
IMPORTANT!: If any of these coding styles are present, Vivado HLS issues a message and does not
perform DATAFLOW optimization.
Finally, the DATAFLOW optimization has no hierarchical implementation. If a sub-function or loop
contains additional tasks that might benefit from the DATAFLOW optimization, you must apply
the optimization to the loop or the sub-function, or inline the sub-function.
Syntax
Place the pragma in the C source within the boundaries of the region, function, or loop.
#pragma HLS dataflow
Example 1
Species DATAFLOW opmizaon within the loop wr_loop_j.
wr_loop_j: for (int j = 0; j < TILE_PER_ROW; ++j) {
#pragma HLS DATAFLOW
wr_buf_loop_m: for (int m = 0; m < TILE_HEIGHT; ++m) {
wr_buf_loop_n: for (int n = 0; n < TILE_WIDTH; ++n) {
#pragma HLS PIPELINE
// should burst TILE_WIDTH in WORD beat
outFifo >> tile[m][n];
}
}
wr_loop_m: for (int m = 0; m < TILE_HEIGHT; ++m) {
wr_loop_n: for (int n = 0; n < TILE_WIDTH; ++n) {
#pragma HLS PIPELINE
outx[TILE_HEIGHT*TILE_PER_ROW*TILE_WIDTH*i
+TILE_PER_ROW*TILE_WIDTH*m+TILE_WIDTH*j+n] = tile[m][n];
}
  }
}
See Also
pragma HLS allocaon
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
xcl_dataow
SDAccel Environment Opmizaon Guide (UG1207)
pragma HLS dependence
Description
The DEPENDENCE pragma is used to provide additional information that can overcome loop-
carry dependencies and allow loops to be pipelined (or pipelined with lower intervals).
Vivado HLS automatically detects dependencies:
Within loops (loop-independent dependence), or
Between different iterations of a loop (loop-carry dependence).
These dependencies impact when operations can be scheduled, especially during function and
loop pipelining.
Loop-independent dependence: The same element is accessed in the same loop iteration.
for (i=0;i<N;i++) {
A[i]=x;
y=A[i];
}
Loop-carry dependence: The same element is accessed in a different loop iteration.
for (i=0;i<N;i++) {
A[i]=A[i-1]*2;
}
In certain complex scenarios, automatic dependence analysis can be too conservative and fail
to filter out false dependencies, such as with variable-dependent array indexing, or when an
external requirement needs to be enforced (for example, two inputs are never the same index).
The DEPENDENCE pragma allows you to explicitly specify the dependence and resolve a false
dependence.
IMPORTANT!: Specifying a false dependency, when in fact the dependency is not false, can result in
incorrect hardware. Be sure dependencies are correct (true or false) before specifying them.
Syntax
Place the pragma within the boundaries of the function where the dependence is defined.
#pragma HLS dependence variable=<variable> <class> \
    <type> <direction> distance=<int> <dependent>
Where:
variable=<variable>: Optionally specifies the variable to consider for the dependence.
<class>: Oponally species a class of variables in which the dependence needs claricaon.
Valid values include array or pointer.
TIP: <class> and
variable=
do not need to be specied together as you can either specify a variable
or a class of variables within a funcon.
<type>: Valid values include intra or inter. Specifies whether the dependence is:
intra: dependence within the same loop iteration. When dependence <type> is specified
as intra, and <dependent> is false, Vivado HLS may move operations freely within a loop,
increasing their mobility and potentially improving performance or area. When
<dependent> is specified as true, the operations must be performed in the order specified.
inter: dependence between different loop iterations. This is the default <type>. If
dependence <type> is specified as inter, and <dependent> is false, it allows Vivado HLS
to perform operations in parallel if the function or loop is pipelined, or the loop is unrolled,
or partially unrolled, and prevents such concurrent operation when <dependent> is
specified as true.
<direcon>: Valid values include RAW, WAR, or WAW. This is relevant for loop-carry
dependencies only, and species the direcon for a dependence:
RAW (Read-Aer-Write - true dependence) The write instrucon uses a value used by the
read instrucon.
WAR (Write-Aer-Read - an dependence) The read instrucon gets a value that is
overwrien by the write instrucon.
WAW (Write-Aer-Write - output dependence) Two write instrucons write to the same
locaon, in a certain order.
distance=<int>: Specifies the inter-iteration distance for array access. Relevant only for
loop-carry dependencies where dependence is set to true.
<dependent>: Specifies whether a dependence needs to be enforced (true) or removed
(false). The default is true.
Example 1
In the following example, Vivado HLS does not have any knowledge about the value of cols and
conservatively assumes that there is always a dependence between the write to buff_A[1]
[col] and the read from buff_A[1][col]. In an algorithm such as this, it is unlikely cols will
ever be zero, but Vivado HLS cannot make assumptions about data dependencies. To overcome
this deficiency, you can use the DEPENDENCE pragma to state that there is no dependence
between loop iterations (in this case, for both buff_A and buff_B).
void foo(int rows, int cols, ...)
for (row = 0; row < rows + 1; row++) {
for (col = 0; col < cols + 1; col++) {
#pragma HLS PIPELINE II=1
#pragma HLS dependence variable=buff_A inter false
#pragma HLS dependence variable=buff_B inter false
if (col < cols) {
buff_A[2][col] = buff_A[1][col]; // read from buff_A[1][col]
buff_A[1][col] = buff_A[0][col]; // write to buff_A[1][col]
buff_B[1][col] = buff_B[0][col];
temp = buff_A[0][col];
}
Example 2
Removes the dependence on Var1 within the same iterations of loop_1 in function foo.
#pragma HLS dependence variable=Var1 intra false
Example 3
Denes the dependence on all arrays in loop_2 of funcon foo to inform Vivado HLS that all
reads must happen aer writes (RAW) in the same loop iteraon.
#pragma HLS dependence array intra RAW true
See Also
pragma HLS pipeline
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
xcl_pipeline_loop
SDAccel Environment Optimization Guide (UG1207)
pragma HLS expression_balance
Description
Somemes a C-based specicaon is wrien with a sequence of operaons resulng in a long
chain of operaons in RTL. With a small clock period, this can increase the latency in the design.
By default, Vivado HLS rearranges the operaons using associave and commutave properes.
This rearrangement creates a balanced tree that can shorten the chain, potenally reducing
latency in the design at the cost of extra hardware.
The EXPRESSION_BALANCE pragma allows this expression balancing to be disabled, or to be
expressly enabled, within a specied scope.
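To picture what balancing does, the two plain C++ functions below (illustrative only, not HLS-specific code) compute the same sum. The first implies a linear adder chain in RTL whose depth grows with the number of operands; the second is the balanced-tree form Vivado HLS would create by default:

```cpp
#include <cassert>

// Sequential chain: ((((a+b)+c)+d)+e) implies a depth-4 adder chain.
int chain_sum(int a, int b, int c, int d, int e) {
    return ((((a + b) + c) + d) + e);
}

// Balanced tree: ((a+b)+(c+d))+e implies a depth-2 tree plus one adder,
// shortening the critical path at the cost of no extra adders here
// (larger expressions may need extra hardware).
int balanced_sum(int a, int b, int c, int d, int e) {
    return ((a + b) + (c + d)) + e;
}
```

Because addition is associative and commutative, both forms produce identical results, which is what makes the rearrangement safe.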
Syntax
Place the pragma in the C source within the boundaries of the required location.
#pragma HLS expression_balance off
Where:
off: Turns off expression balancing at this location.
TIP: Leaving this option out of the pragma enables expression balancing, which is the default mode.
Example 1
This example explicitly enables expression balancing in function my_func:
void my_func(char inval, char incr) {
#pragma HLS expression_balance
Example 2
Disables expression balancing within function my_func:
void my_func(char inval, char incr) {
#pragma HLS expression_balance off
See Also
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS function_instantiate
Description
The FUNCTION_INSTANTIATE pragma is an optimization technique that has the area benefits
of maintaining the function hierarchy but provides an additional powerful option: performing
targeted local optimizations on specific instances of a function. This can simplify the control logic
around the function call and potentially improve latency and throughput.
By default:
Funcons remain as separate hierarchy blocks in the RTL.
All instances of a funcon, at the same level of hierarchy, make use of a single RTL
implementaon (block).
The FUNCTION_INSTANTIATE pragma is used to create a unique RTL implementation for each
instance of a function, allowing each instance to be locally optimized according to the function
call. This pragma exploits the fact that some inputs to a function may be a constant value when
the function is called, and uses this to both simplify the surrounding control structures and
produce smaller, more optimized function blocks.
Without the FUNCTION_INSTANTIATE pragma, the following code results in a single RTL
implementation of function foo_sub for all three instances of the function in foo. Each
instance of function foo_sub is implemented in an identical manner. This is fine for function
reuse and reducing the area required for each instance call of a function, but means that the
control logic inside the function must be more complex to account for the variation in each call of
foo_sub.
char foo_sub(char inval, char incr) {
#pragma HLS function_instantiate variable=incr
return inval + incr;
}
void foo(char inval1, char inval2, char inval3,
char *outval1, char *outval2, char * outval3)
{
*outval1 = foo_sub(inval1, 1);
*outval2 = foo_sub(inval2, 2);
*outval3 = foo_sub(inval3, 3);
}
In the code sample above, the FUNCTION_INSTANTIATE pragma results in three different
implementations of function foo_sub, each independently optimized for the incr argument,
reducing the area and improving the performance of the function. After
FUNCTION_INSTANTIATE optimization, foo_sub is effectively transformed into three
separate functions, each optimized for the specified values of incr.
Syntax
Place the pragma in the C source within the boundaries of the required location.
#pragma HLS function_instantiate variable=<variable>
Where:
variable=<variable>: A required argument that defines the function argument to use as
a constant.
Example 1
In the following example, the FUNCTION_INSTANTIATE pragma placed in function swInt
allows each instance of function swInt to be independently optimized with respect to the maxv
function argument:
void swInt(unsigned int *readRefPacked, short *maxr, short *maxc, short
*maxv){
#pragma HLS function_instantiate variable=maxv
uint2_t d2bit[MAXCOL];
uint2_t q2bit[MAXROW];
#pragma HLS array_partition variable=d2bit,q2bit cyclic factor=FACTOR
intTo2bit<MAXCOL/16>((readRefPacked + MAXROW/16), d2bit);
intTo2bit<MAXROW/16>(readRefPacked, q2bit);
sw(d2bit, q2bit, maxr, maxc, maxv);
}
See Also
pragma HLS allocaon
pragma HLS inline
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS inline
Description
Removes a funcon as a separate enty in the hierarchy. Aer inlining, the funcon is dissolved
into the calling funcon and no longer appears as a separate level of hierarchy in the RTL. In
some cases, inlining a funcon allows operaons within the funcon to be shared and opmized
more eecvely with surrounding operaons. An inlined funcon cannot be shared. This can
increase area required for implemenng the RTL.
The INLINE pragma applies differently to the scope it is defined in depending on how it is
specified:
INLINE: Without arguments, the pragma means that the function it is specified in should be
inlined upward into any calling functions or regions.
INLINE OFF: Specifies that the function it is specified in should NOT be inlined upward into
any calling functions or regions. This disables the inline of a specific function that may be
automatically inlined, or inlined as part of a region or recursion.
INLINE REGION: This applies the pragma to the region or the body of the function it is
assigned in. It applies downward, inlining the contents of the region or function, but not
inlining recursively through the hierarchy.
INLINE RECURSIVE: This applies the pragma to the region or the body of the function it is
assigned in. It applies downward, recursively inlining the contents of the region or function.
By default, inlining is only performed on the next level of function hierarchy, not sub-functions.
However, the recursive option lets you specify inlining through levels of the hierarchy.
Syntax
Place the pragma in the C source within the body of the function or region of code.
#pragma HLS inline <region | recursive | off>
Where:
region: Oponally species that all funcons in the specied region (or contained within the
body of the funcon) are to be inlined, applies to the scope of the region.
recursive: By default, only one level of funcon inlining is performed, and funcons within
the specied funcon are not inlined. The recursive opon inlines all funcons recursively
within the specied funcon or region.
off: Disables funcon inlining to prevent specied funcons from being inlined. For example,
if recursive is specied in a funcon, this opon can prevent a parcular called funcon
from being inlined when all others are.
TIP: Vivado HLS automacally inlines small funcons and using the INLINE pragma with the
off
opon may be used to prevent this automac inlining.
Example 1
This example inlines all funcons within the region it is specied in, in this case the body of
foo_top, but does not inline any lower level funcons within those funcons.
void foo_top(int a, int b, int c, int d) {
#pragma HLS inline region
...
Example 2
The following example inlines all functions within the body of foo_top, inlining recursively
down through the function hierarchy, except function foo_sub, which is not inlined. The
recursive pragma is placed in function foo_top. The pragma to disable inlining is placed in
function foo_sub:
foo_sub (p, q) {
#pragma HLS inline off
int q1 = q + 10;
foo(p1,q);// foo_3
...
}
void foo_top(int a, int b, int c, int d) {
#pragma HLS inline region recursive
...
foo(a,b);//foo_1
foo(a,c);//foo_2
foo_sub(a,d);
...
}
Note: Noce in this example, that INLINE applies downward to the contents of funcon foo_top, but
applies upward to the code calling foo_sub.
Example 3
This example inlines the copy_output function into any functions or regions calling
copy_output.
void copy_output(int *out, int out_lcl[OSize * OSize], int output) {
#pragma HLS INLINE
// Calculate each work_item's result update location
int stride = output * OSize * OSize;
// Work_item updates output filter/image in DDR
writeOut: for(int itr = 0; itr < OSize * OSize; itr++) {
#pragma HLS PIPELINE
out[stride + itr] = out_lcl[itr];
  }
}
See Also
pragma HLS allocaon
pragma HLS funcon_instanate
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS interface
Description
In C based design, all input and output operations are performed, in zero time, through formal
function arguments. In an RTL design these same input and output operations must be
performed through a port in the design interface and typically operate using a specific I/O (input-
output) protocol. For more information, refer to "Managing Interfaces" in the Vivado Design Suite
User Guide: High-Level Synthesis (UG902).
The INTERFACE pragma species how RTL ports are created from the funcon denion during
interface synthesis.
The ports in the RTL implementaon are derived from:
Any funcon-level protocol that is specied.
Funcon arguments.
Global variables accessed by the top-level funcon and dened outside its scope.
Funcon-level protocols, also called block-level I/O protocols, provide signals to control when
the funcon starts operaon, and indicate when funcon operaon ends, is idle, and is ready for
new inputs. The implementaon of a funcon-level protocol:
Is specied by the <mode> values ap_ctrl_none, ap_ctrl_hs or ap_ctrl_chain. The
ap_ctrl_hs block-level I/O protocol is the default.
Are associated with the funcon name.
Each funcon argument can be specied to have its own port-level (I/O) interface protocol, such
as valid handshake (ap_vld) or acknowledge handshake (ap_ack). Port Level interface protocols
are created for each argument in the top-level funcon and the funcon return, if the funcon
returns a value. The default I/O protocol created depends on the type of C argument. Aer the
block-level protocol has been used to start the operaon of the block, the port-level IO protocols
are used to sequence data into and out of the block.
If a global variable is accessed, but all read and write operations are local to the design, the
resource is created in the design. There is no need for an I/O port in the RTL. If the global
variable is expected to be an external source or destination, specify its interface in a similar
manner as standard function arguments. See the examples below.
When the INTERFACE pragma is used on sub-functions, only the register option can be used.
The <mode> option is not supported on sub-functions.
TIP: Vivado HLS automacally determines the I/O protocol used by any sub-funcons. You cannot
control these ports except to specify whether the port is registered.
Syntax
Place the pragma within the boundaries of the function.
#pragma HLS interface <mode> port=<name> bundle=<string> \
    register register_mode=<mode> depth=<int> offset=<string> \
    clock=<string> name=<string> \
    num_read_outstanding=<int> num_write_outstanding=<int> \
    max_read_burst_length=<int> max_write_burst_length=<int>
Where:
<mode>: Species the interface protocol mode for funcon arguments, global variables used
by the funcon, or the block-level control protocols. For detailed descripons of these
dierent modes see "Interface Synthesis Reference" in the Vivado Design Suite User Guide:
High-Level Synthesis (UG902). The mode can be specied as one of the following:
ap_none: No protocol. The interface is a data port.
ap_stable: No protocol. The interface is a data port. Vivado HLS assumes the data port is
always stable after reset, which allows internal optimizations to remove unnecessary
registers.
ap_vld: Implements the data port with an associated valid port to indicate when the
data is valid for reading or writing.
ap_ack: Implements the data port with an associated acknowledge port to acknowledge
that the data was read or written.
ap_hs: Implements the data port with associated valid and acknowledge ports to
provide a two-way handshake to indicate when the data is valid for reading and writing and
to acknowledge that the data was read or written.
ap_ovld: Implements the output data port with an associated valid port to indicate
when the data is valid for reading or writing.
IMPORTANT!: Vivado HLS implements the input argument or the input half of any read/write
arguments with mode ap_none.
ap_fifo: Implements the port with a standard FIFO interface using data input and output
ports with associated active-Low FIFO empty and full ports.
Note: You can only use this interface on read arguments or write arguments. The ap_fifo mode does
not support bidirectional read/write arguments.
ap_bus: Implements pointer and pass-by-reference ports as a bus interface.
ap_memory: Implements array arguments as a standard RAM interface. If you use the RTL
design in Vivado IP integrator, the memory interface appears as discrete ports.
bram: Implements array arguments as a standard RAM interface. If you use the RTL design
in Vivado IP integrator, the memory interface appears as a single port.
axis: Implements all ports as an AXI4-Stream interface.
s_axilite: Implements all ports as an AXI4-Lite interface. Vivado HLS produces an
associated set of C driver files during the Export RTL process.
m_axi: Implements all ports as an AXI4 interface. You can use the config_interface
command to specify either 32-bit (default) or 64-bit address ports and to control any
address offset.
ap_ctrl_none: No block-level I/O protocol.
Note: Using the ap_ctrl_none mode might prevent the design from being verified using the C/RTL co-
simulation feature.
ap_ctrl_hs: Implements a set of block-level control ports to start the design operation
and to indicate when the design is idle, done, and ready for new input data.
Note: The ap_ctrl_hs mode is the default block-level I/O protocol.
ap_ctrl_chain: Implements a set of block-level control ports to start the design
operation, continue operation, and indicate when the design is idle, done, and ready
for new input data.
Note: The ap_ctrl_chain interface mode is similar to ap_ctrl_hs but provides an additional input
signal ap_continue to apply back pressure. Xilinx recommends using the ap_ctrl_chain block-
level I/O protocol when chaining Vivado HLS blocks together.
port=<name>: Specifies the name of the function argument, function return, or global
variable which the INTERFACE pragma applies to.
TIP: Block-level I/O protocols (ap_ctrl_none, ap_ctrl_hs, or ap_ctrl_chain) can be
assigned to a port for the function return value.
bundle=<string>: Groups function arguments into AXI interface ports. By default, Vivado
HLS groups all function arguments specified as an AXI4-Lite (s_axilite) interface into a
single AXI4-Lite port. Similarly, all function arguments specified as an AXI4 (m_axi) interface
are grouped into a single AXI4 port. This option explicitly groups all interface ports with the
same bundle=<string> into the same AXI interface port and names the RTL port the value
specified by <string>.
register: An oponal keyword to register the signal and any relevant protocol signals, and
causes the signals to persist unl at least the last cycle of the funcon execuon. This opon
applies to the following interface modes:
ap_none
ap_ack
ap_vld
ap_ovld
ap_hs
ap_stable
axis
s_axilite
TIP: The -register_io option of the config_interface command globally controls registering
all inputs/outputs on the top function. Refer to Vivado Design Suite User Guide: High-Level Synthesis
(UG902) for more information.
register_mode=<forward|reverse|both|off>: Used with the register keyword,
this option specifies if registers are placed on the forward path (TDATA and TVALID), the
reverse path (TREADY), on both paths (TDATA, TVALID, and TREADY), or if none of the
port signals are to be registered (off). The default register_mode is both. AXI-Stream
(axis) side-channel signals are considered to be data signals and are registered whenever
TDATA is registered.
depth=<int>: Specifies the maximum number of samples for the test bench to process. This
setting indicates the maximum size of the FIFO needed in the verification adapter that Vivado
HLS creates for RTL co-simulation.
TIP: While depth is usually an option, it is required for m_axi interfaces.
offset=<string>: Controls the address offset in AXI4-Lite (s_axilite) and AXI4
(m_axi) interfaces.
For the s_axilite interface, <string> specifies the address in the register map.
For the m_axi interface, <string> specifies one of the following values:
direct: Generate a scalar input offset port.
slave: Generate an offset port and automatically map it to an AXI4-Lite slave interface.
off: Do not generate an offset port.
TIP: The -m_axi_offset option of the config_interface command globally controls the
offset ports of all M_AXI interfaces in the design.
clock=<name>: Optionally specified only for interface mode s_axilite. This defines the
clock signal to use for the interface. By default, the AXI-Lite interface clock is the same clock
as the system clock. This option is used to specify a separate clock for the AXI-Lite
(s_axilite) interface.
TIP: If the bundle option is used to group multiple top-level function arguments into a single AXI-Lite
interface, the clock option need only be specified on one of the bundle members.
num_read_outstanding=<int>: For AXI4 (m_axi) interfaces, this option specifies how
many read requests can be made to the AXI4 bus, without a response, before the design stalls.
This implies internal storage in the design, a FIFO of size:
num_read_outstanding*max_read_burst_length*word_size.
num_write_outstanding=<int>: For AXI4 (m_axi) interfaces, this option specifies how
many write requests can be made to the AXI4 bus, without a response, before the design
stalls. This implies internal storage in the design, a FIFO of size:
num_write_outstanding*max_write_burst_length*word_size.
max_read_burst_length=<int>: For AXI4 (m_axi) interfaces, this option specifies the
maximum number of data values read during a burst transfer.
max_write_burst_length=<int>: For AXI4 (m_axi) interfaces, this option specifies the
maximum number of data values written during a burst transfer.
name=<string>: This option is used to rename the port based on your own specification.
The generated RTL port will use this name.
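The FIFO sizing formula given for the outstanding-transaction options can be checked with a small helper (illustrative only, not a Vivado HLS API):

```cpp
#include <cassert>

// Illustrative sizing helper: internal FIFO storage implied by the
// outstanding-transaction options, per the formula
// num_outstanding * max_burst_length * word_size.
unsigned fifo_bytes(unsigned num_outstanding,
                    unsigned max_burst_length,
                    unsigned word_size_bytes) {
    return num_outstanding * max_burst_length * word_size_bytes;
}
```

For example, num_read_outstanding=8 with max_read_burst_length=16 and 4-byte words implies 8 * 16 * 4 = 512 bytes of internal read storage.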
Example 1
In this example, both funcon arguments are implemented using an AXI4-Stream interface:
void example(int A[50], int B[50]) {
//Set the HLS native interface types
#pragma HLS INTERFACE axis port=A
#pragma HLS INTERFACE axis port=B
int i;
for(i = 0; i < 50; i++){
B[i] = A[i] + 5;
}
}
Example 2
The following turns off block-level I/O protocols, and is assigned to the function return value:
#pragma HLS interface ap_ctrl_none port=return
The funcon argument InData is specied to use the ap_vld interface, and also indicates the
input should be registered:
#pragma HLS interface ap_vld register port=InData
This exposes the global variable lookup_table as a port on the RTL design, with an
ap_memory interface:
#pragma HLS interface ap_memory port=lookup_table
Example 3
This example denes the INTERFACE standards for the ports of the top-level transpose
funcon. Noce the use of the bundle= opon to group signals.
// TOP LEVEL - TRANSPOSE
void transpose(int* input, int* output) {
#pragma HLS INTERFACE m_axi port=input offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=output offset=slave bundle=gmem1
#pragma HLS INTERFACE s_axilite port=input bundle=control
#pragma HLS INTERFACE s_axilite port=output bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
#pragma HLS dataflow
See Also
pragma HLS protocol
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS latency
Description
Species a minimum or maximum latency value, or both, for the compleon of funcons, loops,
and regions. Latency is dened as the number of clock cycles required to produce an output.
Funcon latency is the number of clock cycles required for the funcon to compute all output
values, and return. Loop latency is the number of cycles to execute all iteraons of the loop. See
"Performance Metrics Example" of Vivado Design Suite User Guide: High-Level Synthesis (UG902).
Vivado HLS always tries to minimize latency in the design. When the LATENCY pragma is
specified, the tool behavior is as follows:
Latency is greater than the minimum, or less than the maximum: The constraint is satisfied. No
further optimizations are performed.
Latency is less than the minimum: If Vivado HLS can achieve less than the minimum specified
latency, it extends the latency to the specified value, potentially increasing sharing.
Latency is greater than the maximum: If Vivado HLS cannot schedule within the maximum
limit, it increases effort to achieve the specified constraint. If it still fails to meet the maximum
latency, it issues a warning, and produces a design with the smallest achievable latency in
excess of the maximum.
TIP: You can also use the LATENCY pragma to limit the efforts of the tool to find an optimum solution.
Specifying latency constraints for scopes within the code (loops, functions, or regions) reduces the
possible solutions within that scope and improves tool runtime. Refer to "Improving Run Time and
Capacity" of Vivado Design Suite User Guide: High-Level Synthesis (UG902) for more information.
Syntax
Place the pragma within the boundary of a function, loop, or region of code where the latency
must be managed.
#pragma HLS latency min=<int> max=<int>
Where:
min=<int>: Optionally specifies the minimum latency for the function, loop, or region of
code.
max=<int>: Optionally specifies the maximum latency for the function, loop, or region of
code.
Note: Although both min and max are described as optional, one must be specified.
Example 1
Funcon foo is specied to have a minimum latency of 4 and a maximum latency of 8:
int foo(char x, char a, char b, char c) {
#pragma HLS latency min=4 max=8
char y;
y = x*a+b+c;
return y;
}
Example 2
In the following example, loop_1 is specified to have a maximum latency of 12. Place the pragma
in the loop body as shown:
void foo (num_samples, ...) {
int i;
...
loop_1: for(i=0;i< num_samples;i++) {
#pragma HLS latency max=12
...
result = a + b;
}
}
Example 3
The following example creates a code region and groups signals that need to change in the same
clock cycle by specifying zero latency:
// create a region { } with a latency = 0
{
#pragma HLS LATENCY max=0 min=0
*data = 0xFF;
*data_vld = 1;
}
See Also
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS loop_flatten
Description
Allows nested loops to be flattened into a single loop hierarchy with improved latency.
In the RTL implementation, it requires one clock cycle to move from an outer loop to an inner
loop, and from an inner loop to an outer loop. Flattening nested loops allows them to be
optimized as a single loop. This saves clock cycles, potentially allowing for greater optimization of
the loop body logic.
Apply the LOOP_FLATTEN pragma to the loop body of the inner-most loop in the loop hierarchy.
Only perfect and semi-perfect loops can be flattened in this manner:
Perfect loop nests:
Only the innermost loop has loop body content.
There is no logic specified between the loop statements.
All loop bounds are constant.
Semi-perfect loop nests:
Only the innermost loop has loop body content.
There is no logic specified between the loop statements.
The outermost loop bound can be a variable.
Imperfect loop nests: When the inner loop has variable bounds (or the loop body is not
exclusively inside the inner loop), try to restructure the code, or unroll the loops in the loop
body to create a perfect loop nest.
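To make the classification concrete, the following is a minimal sketch of a perfect loop nest (the function name, bounds, and data are illustrative, not from the guide): only the innermost loop has body content, nothing sits between the loop statements, and both bounds are constant, so the pair qualifies for flattening into one 32-iteration loop.

```c
#include <assert.h>

#define ROWS 4
#define COLS 8

/* A perfect loop nest: flattening removes the clock cycle otherwise
 * spent re-entering the inner loop on each new row. The pragma is
 * placed in the body of the innermost loop, per the rule above. */
void scale_matrix(int in[ROWS][COLS], int out[ROWS][COLS], int scale) {
    row: for (int i = 0; i < ROWS; i++) {
        col: for (int j = 0; j < COLS; j++) {
            #pragma HLS loop_flatten
            out[i][j] = in[i][j] * scale;
        }
    }
}
```

If the outer bound ROWS were a function argument instead of a constant, the nest would still be semi-perfect and remain flattenable.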
Appendix B: HLS Pragmas
Vivado HLS Optimization Methodology Guide 113
UG1270 (v2017.4) December 20, 2017 www.xilinx.com [placeholder text]
Syntax
Place the pragma in the C source within the boundaries of the nested loop.
#pragma HLS loop_flatten off
Where:
off: An optional keyword that prevents flattening from taking place. It can prevent some
loops from being flattened while all others in the specified location are flattened.
Note: The presence of the LOOP_FLATTEN pragma enables the optimization.
Example 1
Flattens loop_1 in function foo and all (perfect or semi-perfect) loops above it in the loop
hierarchy, into a single loop. Place the pragma in the body of loop_1.
void foo (num_samples, ...) {
int i;
...
loop_1: for(i=0;i< num_samples;i++) {
#pragma HLS loop_flatten
...
result = a + b;
}
}
Example 2
Prevents loop aening in loop_1:
loop_1: for(i=0;i< num_samples;i++) {
#pragma HLS loop_flatten off
...
See Also
pragma HLS loop_merge
pragma HLS loop_tripcount
pragma HLS unroll
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS loop_merge
Description
Merge consecuve loops into a single loop to reduce overall latency, increase sharing, and
improve logic opmizaon. Merging loops:
Reduces the number of clock cycles required in the RTL to transion between the loop-body
implementaons.
Allows the loops be implemented in parallel (if possible).
The LOOP_MERGE pragma will seek to merge all loops within the scope it is placed. For example,
if you apply a LOOP_MERGE pragma in the body of a loop, Vivado HLS applies the pragma to any
sub-loops within the loop but not to the loop itself.
The rules for merging loops are:
If the loop bounds are variables, they must have the same value (number of iterations).
If the loop bounds are constants, the maximum constant value is used as the bound of the
merged loop.
Loops with both variable bounds and constant bounds cannot be merged.
The code between loops to be merged cannot have side effects. Multiple execution of this
code should generate the same results (a=b is allowed, a=a+1 is not).
Loops cannot be merged when they contain FIFO reads. Merging changes the order of the
reads. Reads from a FIFO or FIFO interface must always be in sequence.
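The rules above can be sketched with a hypothetical pair of loops (function and label names are ours): both loops share the same constant bound, and no code sits between them, so they are merge candidates for a pragma placed in the enclosing function scope.

```c
#include <assert.h>

#define N 8

/* Two consecutive loops with identical constant bounds and no
 * side-effecting code between them: these satisfy the merge rules,
 * so the LOOP_MERGE pragma lets Vivado HLS fuse them into a single
 * 8-iteration loop, saving the inter-loop transition cycles. */
void add_and_scale(const int a[N], const int b[N], int sum[N], int scaled[N]) {
    #pragma HLS loop_merge
    add:   for (int i = 0; i < N; i++)
        sum[i] = a[i] + b[i];
    scale: for (int j = 0; j < N; j++)
        scaled[j] = 2 * a[j];
}
```

Had either loop read from a FIFO interface, merging would reorder the reads and the pragma would be rejected (or require force, at your own risk).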
Syntax
Place the pragma in the C source within the required scope or region of code:
#pragma HLS loop_merge force
where
force: An optional keyword to force loops to be merged even when Vivado HLS issues a
warning.
IMPORTANT!: In this case, you must manually ensure that the merged loop will function correctly.
Examples
Merges all consecuve loops in funcon foo into a single loop.
void foo (num_samples, ...) {
#pragma HLS loop_merge
int i;
...
loop_1: for(i=0;i< num_samples;i++) {
...
All loops inside loop_2 (but not loop_2 itself) are merged by using the force option. Place
the pragma in the body of loop_2.
loop_2: for(i=0;i< num_samples;i++) {
#pragma HLS loop_merge force
...
See Also
pragma HLS loop_aen
pragma HLS loop_tripcount
pragma HLS unroll
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS loop_tripcount
Description
The TRIPCOUNT pragma can be applied to a loop to manually specify the total number of
iterations performed by a loop.
IMPORTANT!: The TRIPCOUNT pragma is for analysis only, and does not impact the results of
synthesis.
Vivado HLS reports the total latency of each loop, which is the number of clock cycles to execute
all iterations of the loop. The loop latency is therefore a function of the number of loop
iterations, or tripcount.
The tripcount can be a constant value. It may depend on the value of variables used in the loop
expression (for example, x<y), or depend on control statements used inside the loop. In some
cases Vivado HLS cannot determine the tripcount, and the latency is unknown. This includes
cases in which the variables used to determine the tripcount are:
Input arguments, or
Variables calculated by dynamic operation.
In cases where the loop latency is unknown or cannot be calculated, the TRIPCOUNT pragma lets
you specify minimum and maximum iterations for a loop. This lets the tool analyze how the loop
latency contributes to the total design latency in the reports, and helps you determine
appropriate optimizations for the design.
Syntax
Place the pragma in the C source within the body of the loop:
#pragma HLS loop_tripcount min=<int> max=<int> avg=<int>
Where:
max=<int>: Specifies the maximum number of loop iterations.
min=<int>: Specifies the minimum number of loop iterations.
avg=<int>: Specifies the average number of loop iterations.
Examples
In this example, loop_1 in function foo is specified to have a minimum tripcount of 12 and a
maximum tripcount of 16:
void foo (num_samples, ...) {
int i;
...
loop_1: for(i=0;i< num_samples;i++) {
#pragma HLS loop_tripcount min=12 max=16
...
result = a + b;
}
}
See Also
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS occurrence
Description
When pipelining funcons or loops, the OCCURRENCE pragma species that the code in a region
is executed less frequently than the code in the enclosing funcon or loop. This allows the code
that is executed less oen to be pipelined at a slower rate, and potenally shared within the top-
level pipeline. To determine the OCCURRENCE:
A loop iterates N mes.
However, part of the loop body is enabled by a condional statement, and as a result only
executes M mes, where N is an integer mulple of M.
The condional code has an occurrence that is N/M mes slower than the rest of the loop
body.
For example, in a loop that executes 10 mes, a condional statement within the loop only
executes 2 mes has an occurrence of 5 (or 10/2).
Idenfying a region with the OCCURRENCE pragma allows the funcons and loops in that region
to be pipelined with a higher iniaon interval that is slower than the enclosing funcon or loop.
Syntax
Place the pragma in the C source within a region of code.
#pragma HLS occurrence cycle=<int>
Where:
cycle=<int>: Specifies the occurrence N/M, where:
N is the number of times the enclosing function or loop is executed.
M is the number of times the conditional region is executed.
IMPORTANT!: N must be an integer multiple of M.
Examples
In this example, the region Cond_Region has an occurrence of 4 (it executes at a rate four times
less often than the surrounding code that contains it):
Cond_Region: {
#pragma HLS occurrence cycle=4
...
}
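A fuller sketch of the 10/2 example described above follows (function, label, and data names are illustrative, not from the guide): the guarded region executes on 2 of the 10 iterations, so its occurrence is 5.

```c
#include <assert.h>

/* The loop runs N = 10 times; the guarded region executes only when
 * (i % 5) == 0, i.e. M = 2 times, for an occurrence of N/M = 5. The
 * pragma tells the scheduler the region can be pipelined at one
 * fifth of the enclosing loop's rate and shared accordingly. */
int accumulate(const int data[10]) {
    int acc = 0;
    main_loop: for (int i = 0; i < 10; i++) {
        if (i % 5 == 0) {
            Cond_Region: {
                #pragma HLS occurrence cycle=5
                acc += 100;          /* infrequent work */
            }
        }
        acc += data[i];              /* every-iteration work */
    }
    return acc;
}
```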
See Also
pragma HLS pipeline
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS pipeline
Description
The PIPELINE pragma reduces the initiation interval for a function or loop by allowing the
concurrent execution of operations.
A pipelined function or loop can process new inputs every N clock cycles, where N is the
initiation interval (II) of the loop or function. The default initiation interval for the PIPELINE
pragma is 1, which processes a new input every clock cycle. You can also specify the initiation
interval through the use of the II option for the pragma.
Pipelining a loop allows the operations of the loop to be implemented in a concurrent manner as
shown in the following figure. In this figure, (A) shows the default sequential operation where
there are 3 clock cycles between each input read (II=3), and it requires 8 clock cycles before the
last output write is performed.
Figure 4: Loop Pipeline
void func(m,n,o) {
  for (i=2;i>=0;i--) {
    op_Read;
    op_Compute;
    op_Write;
  }
}
[Figure: (A) Without loop pipelining, each iteration's RD, CMP, and WR operations execute in
sequence: 3 cycles between input reads (II=3) and 8 cycles to the last write. (B) With loop
pipelining, iterations overlap: a new read every cycle (II=1) and 4 cycles to the last write.]
IMPORTANT!: Loop pipelining can be prevented by loop-carry dependencies. You can use the
DEPENDENCE pragma to provide additional information that can overcome loop-carry dependencies
and allow loops to be pipelined (or pipelined with lower intervals).
If Vivado HLS cannot create a design with the specified II, it:
Issues a warning.
Creates a design with the lowest possible II.
You can then analyze this design with the warning message to determine what steps must be
taken to create a design that satisfies the required initiation interval.
Syntax
Place the pragma in the C source within the body of the function or loop.
#pragma HLS pipeline II=<int> enable_flush rewind
Where:
II=<int>: Specifies the desired initiation interval for the pipeline. Vivado HLS tries to meet
this request. Based on data dependencies, the actual result might have a larger initiation
interval. The default II is 1.
enable_flush: An optional keyword which implements a pipeline that will flush and empty
if the data valid at the input of the pipeline goes inactive.
TIP: This feature is only supported for pipelined functions: it is not supported for pipelined loops.
rewind: An optional keyword that enables rewinding, or continuous loop pipelining, with no
pause between one loop iteration ending and the next iteration starting. Rewinding is
effective only if there is one single loop (or a perfect loop nest) inside the top-level function.
The code segment before the loop:
Is considered as initialization.
Is executed only once in the pipeline.
Cannot contain any conditional operations (if-else).
TIP: This feature is only supported for pipelined loops: it is not supported for pipelined functions.
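The rewind constraints above can be sketched as follows (function and array names are ours, not from the guide): a single loop inside the top-level function, with one unconditional initialization statement before it, so successive executions of the loop can overlap with no pause.

```c
#include <assert.h>

#define N 16

/* Candidate for continuous (rewind) pipelining: a single loop inside
 * the top-level function. The statement before the loop is pure
 * initialization, runs once, and contains no if-else, so it satisfies
 * the rewind preconditions. */
void running_sum(const int in[N], int out[N]) {
    int acc = 0;                     /* initialization, executed once */
    acc_loop: for (int i = 0; i < N; i++) {
        #pragma HLS pipeline II=1 rewind
        acc += in[i];
        out[i] = acc;
    }
}
```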
Example 1
In this example funcon foo is pipelined with an iniaon interval of 1:
void foo { a, b, c, d} {
#pragma HLS pipeline II=1
...
}
Note: The default value for II is 1, so II=1 is not required in this example.
See Also
pragma HLS dependence
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
xcl_pipeline_loop
SDAccel Environment Optimization Guide (UG1207)
pragma HLS protocol
Description
The PROTOCOL pragma specifies a region of the code to be a protocol region, in which no clock
operations are inserted by Vivado HLS unless explicitly specified in the code. A protocol region
can be used to manually specify an interface protocol to ensure the final design can be
connected to other hardware blocks with the same I/O protocol.
Note: See "Specifying Manual Interface" in the Vivado Design Suite User Guide: High-Level Synthesis (UG902)
for more information.
Vivado HLS does not insert any clocks between the operations, including those that read from, or
write to, function arguments, unless explicitly specified in the code. The order of reads and writes
is therefore obeyed in the RTL.
A clock operation may be specified:
In C, by using an ap_wait() statement (include ap_utils.h).
In C++ and SystemC designs, by using the wait() statement (include systemc.h).
The ap_wait and wait statements have no effect on the simulation of C and C++ designs
respectively. They are only interpreted by Vivado HLS.
To create a region of C code:
1. Enclose the region in braces, {}.
2. Optionally name it to provide an identifier.
For example, the following defines a region called io_section:
io_section:{
...
}
Syntax
Place the pragma inside the boundaries of a region to define the protocol for the region.
#pragma HLS protocol <floating | fixed>
Where:
floating: Protocol mode that allows statements outside the protocol region to overlap with
the statements inside the protocol region in the final RTL. The code within the protocol region
remains cycle accurate, but other operations can occur at the same time. This is the default
protocol mode.
fixed: Protocol mode that ensures that there is no overlap of statements inside or outside
the protocol region.
IMPORTANT!: If no protocol mode is specified, the default of floating is assumed.
Example 1
This example denes region io_section as a xed protocol region. Place the pragma inside
region:
io_section:{
#pragma HLS protocol fixed
...
}
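A fuller sketch of a fixed protocol region follows (the function name and signals are illustrative): a data word and its valid flag are grouped so the RTL drives them in the exact cycle order written, with no overlap from statements outside the region. In plain C the pragma is inert and the function simply performs the two writes.

```c
#include <assert.h>

/* A named region with a fixed protocol: inside it, Vivado HLS inserts
 * no clock cycles beyond those written in the code, so *data and
 * *data_vld follow the statement order cycle-accurately in the RTL. */
void send_word(volatile int *data, volatile int *data_vld) {
    io_section: {
        #pragma HLS protocol fixed
        *data = 0xFF;
        *data_vld = 1;
    }
}
```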
See Also
pragma HLS array_map
pragma HLS array_reshape
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
xcl_array_partition
SDAccel Environment Optimization Guide (UG1207)
pragma HLS reset
Description
Adds or removes resets for specific state variables (global or static).
The reset port is used in an FPGA to restore the registers and block RAM connected to the reset
port to an initial value any time the reset signal is applied. The presence and behavior of the RTL
reset port is controlled using the config_rtl configuration file. The reset settings include the
ability to set the polarity of the reset, and specify whether the reset is synchronous or
asynchronous, but more importantly it controls, through the reset option, which registers are
reset when the reset signal is applied. See Clock, Reset, and RTL Output in the Vivado Design
Suite User Guide: High-Level Synthesis (UG902) for more information.
Greater control over reset is provided through the RESET pragma. If a variable is static or
global, the RESET pragma is used to explicitly add a reset, or the variable can be removed from
the reset by turning off the pragma. This can be particularly useful when static or global arrays
are present in the design.
Syntax
Place the pragma in the C source within the boundaries of the variable life cycle.
#pragma HLS reset variable=<a> off
Where:
variable=<a>: Specifies the variable to which the pragma is applied.
off: Indicates that reset is not generated for the specified variable.
Example 1
This example adds reset to the variable a in function foo even when the global reset setting is
none or control:
void foo(int in[3], char a, char b, char c, int out[3]) {
#pragma HLS reset variable=a
...
}
Example 2
Removes reset from variable a in function foo even when the global reset setting is state or
all:
void foo(int in[3], char a, char b, char c, int out[3]) {
#pragma HLS reset variable=a off
...
}
See Also
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS resource
Description
Specify that a specic library resource (core) is used to implement a variable (array, arithmec
operaon or funcon argument) in the RTL. If the RESOURCE pragma is not specied, Vivado
HLS determines the resource to use.
Vivado HLS implements the operations in the code using hardware cores. When multiple cores in
the library can implement the operation, you can specify which core to use with the RESOURCE
pragma. To generate a list of available cores, use the list_core command.
TIP: The list_core command is used to obtain details on the cores available in the library. The
list_core command can only be used in the Vivado HLS Tcl command interface, and a Xilinx device
must be specified using the set_part command. If a device has not been selected, the list_core
command does not have any effect.
For example, to specify which memory element in the library to use to implement an array, use
the RESOURCE pragma. This lets you control whether the array is implemented as a single or a
dual-port RAM. This usage is important for arrays on the top-level function interface, because
the memory type associated with the array determines the ports needed in the RTL.
You can use the latency= option to specify the latency of the core. For block RAMs on the
interface, the latency= option allows you to model off-chip, non-standard SRAMs at the
interface, for example supporting an SRAM with a latency of 2 or 3. See Arrays on the Interface
in the Vivado Design Suite User Guide: High-Level Synthesis (UG902) for more information. For
internal operations, the latency= option allows the operation to be implemented using more
pipelined stages. These additional pipeline stages can help resolve timing issues during RTL
synthesis.
IMPORTANT!: To use the latency= option, the operation must have an available multi-stage core.
Vivado HLS provides a multi-stage core for all basic arithmetic operations (add, subtract, multiply and
divide), all floating-point operations, and all block RAMs.
For best results, Xilinx recommends that you use -std=c99 for C and -fno-builtin for C and
C++. To specify the C compile options, such as -std=c99, use the Tcl command add_files
with the -cflags option. Alternatively, use the Edit CFLAGs button in the Project Settings
dialog box. See Creating a New Synthesis Project in the Vivado Design Suite User Guide: High-Level
Synthesis (UG902).
Syntax
Place the pragma in the C source within the body of the function where the variable is defined.
#pragma HLS resource variable=<variable> core=<core> latency=<int>
Where:
variable=<variable>: A required argument that specifies the array, arithmetic operation,
or function argument to assign the RESOURCE pragma to.
core=<core>: A required argument that specifies the core, as defined in the technology
library.
latency=<int>: Specifies the latency of the core.
Example 1
In the following example, a 2-stage pipelined multiplier is specified to implement the
multiplication for variable c of the function foo. Which core to use for variable d is left to
Vivado HLS.
int foo (int a, int b) {
int c, d;
#pragma HLS RESOURCE variable=c latency=2
c = a*b;
d = a*c;
return d;
}
Example 2
In the following example, the variable coeffs[128] is an argument to the top-level function
foo_top. This example specifies that coeffs be implemented with core RAM_1P from the
library:
#pragma HLS resource variable=coeffs core=RAM_1P
TIP: The ports created in the RTL to access the values of coeffs are defined in the RAM_1P core.
See Also
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS stream
Description
By default, array variables are implemented as RAM:
Top-level function array parameters are implemented as a RAM interface port.
General arrays are implemented as RAMs for read-write access.
In sub-functions involved in DATAFLOW optimizations, the array arguments are implemented
using a RAM ping-pong buffer channel.
Arrays involved in loop-based DATAFLOW optimizations are implemented as a RAM
ping-pong buffer channel.
If the data stored in the array is consumed or produced in a sequential manner, a more efficient
communication mechanism is to use streaming data as specified by the STREAM pragma, where
FIFOs are used instead of RAMs.
IMPORTANT!: When an argument of the top-level function is specified as INTERFACE type
ap_fifo, the array is automatically implemented as streaming.
Syntax
Place the pragma in the C source within the boundaries of the required location.
#pragma HLS stream variable=<variable> depth=<int> dim=<int> off
Where:
variable=<variable>: Specifies the name of the array to implement as a streaming
interface.
depth=<int>: Relevant only for array streaming in DATAFLOW channels. By default, the
depth of the FIFO implemented in the RTL is the same size as the array specified in the C
code. This option lets you modify the size of the FIFO and specify a different depth.
When the array is implemented in a DATAFLOW region, it is common to use the depth=
option to reduce the size of the FIFO. For example, in a DATAFLOW region when all loops and
functions are processing data at a rate of II=1, there is no need for a large FIFO because data
is produced and consumed in each clock cycle. In this case, the depth= option may be used
to reduce the FIFO size to 1 to substantially reduce the area of the RTL design.
TIP: The config_dataflow -depth command provides the ability to stream all arrays in a
DATAFLOW region. The depth= option specified here overrides the config_dataflow command
for the assigned variable.
dim=<int>: Specifies the dimension of the array to be streamed. The default is dimension 1.
Specified as an integer from 0 to N, for an array with N dimensions.
off: Disables streaming data. Relevant only for array streaming in DATAFLOW channels.
TIP: The config_dataflow -default_channel fifo command globally implies a STREAM
pragma on all arrays in the design. The off option specified here overrides the config_dataflow
command for the assigned variable, and restores the default of using a RAM ping-pong buffer based
channel.
Example 1
The following example species array A[10] to be streaming, and implemented as a FIFO:
#pragma HLS STREAM variable=A
Example 2
In this example array B is set to streaming with a FIFO depth of 12:
#pragma HLS STREAM variable=B depth=12
Example 3
Array C has streaming disabled. It is assumed to be enabled by config_dataflow in this
example:
#pragma HLS STREAM variable=C off
See Also
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS top
Description
Aaches a name to a funcon, which can then be used with the set_top command to
synthesize the funcon and any funcons called from the specied top-level. This is typically
used to synthesize member funcons of a class in C/C++.
Specify the pragma in an acve soluon, and then use the set_top command with the new
name.
Syntax
Place the pragma in the C source within the boundaries of the required location.
#pragma HLS top name=<string>
Where:
name=<string>: Specifies the name to be used by the set_top command.
Examples
Funcon foo_long_name is designated the top-level funcon, and renamed to DESIGN_TOP.
Aer the pragma is placed in the code, the set_top command must sll be issued from the Tcl
command line, or from the top-level specied in the GUI project sengs.
void foo_long_name () {
#pragma HLS top name=DESIGN_TOP
...
}
set_top DESIGN_TOP
See Also
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS unroll
Description
Unroll loops to create multiple independent operations rather than a single collection of
operations. The UNROLL pragma transforms loops by creating multiple copies of the loop body
in the RTL design, which allows some or all loop iterations to occur in parallel.
Loops in the C/C++ functions are kept rolled by default. When loops are rolled, synthesis creates
the logic for one iteration of the loop, and the RTL design executes this logic for each iteration of
the loop in sequence. A loop is executed for the number of iterations specified by the loop
induction variable. The number of iterations might also be impacted by logic inside the loop body
(for example, break conditions or modifications to a loop exit variable). Using the UNROLL
pragma you can unroll loops to increase data access and throughput.
The UNROLL pragma allows the loop to be fully or partially unrolled. Fully unrolling the loop
creates a copy of the loop body in the RTL for each loop iteration, so the entire loop can be run
concurrently. Partially unrolling a loop lets you specify a factor N, to create N copies of the loop
body and reduce the loop iterations accordingly. To unroll a loop completely, the loop bounds
must be known at compile time. This is not required for partial unrolling.
Partial loop unrolling does not require N to be an integer factor of the maximum loop iteration
count. Vivado HLS adds an exit check to ensure that partially unrolled loops are functionally
identical to the original loop. For example, given the following code:
for(int i = 0; i < X; i++) {
#pragma HLS unroll factor=2
a[i] = b[i] + c[i];
}
Loop unrolling by a factor of 2 effectively transforms the code to look like the following code,
where the break construct is used to ensure the functionality remains the same, and the loop
exits at the appropriate point:
for(int i = 0; i < X; i += 2) {
a[i] = b[i] + c[i];
if (i+1 >= X) break;
a[i+1] = b[i+1] + c[i+1];
}
Because the maximum iteraon count X is a variable, Vivado HLS may not be able to determine
its value and so adds an exit check and control logic to parally unrolled loops. However, if you
know that the specied unrolling factor, 2 in this example, is an integer factor of the maximum
iteraon count X, the skip_exit_check opon lets you remove the exit check and associated
logic. This helps minimize the area and simplify the control logic.
TIP: When the use of pragmas like DATA_PACK, ARRAY_PARTITION, or ARRAY_RESHAPE lets more
data be accessed in a single clock cycle, Vivado HLS automatically unrolls any loops consuming this
data, if doing so improves the throughput. The loop can be fully or partially unrolled to create enough
hardware to consume the additional data in a single clock cycle. This feature is controlled using the
config_unroll command. See config_unroll in the Vivado Design Suite User Guide: High-
Level Synthesis (UG902) for more information.
Syntax
Place the pragma in the C/C++ source within the body of the loop to unroll.
#pragma HLS unroll factor=<N> region skip_exit_check
Where:
factor=<N>: Specifies a non-zero integer indicating that partial unrolling is requested. The
loop body is repeated the specified number of times, and the iteration information is adjusted
accordingly. If factor= is not specified, the loop is fully unrolled.
region: An optional keyword that unrolls all loops within the body (region) of the specified
loop, without unrolling the enclosing loop itself.
skip_exit_check: An optional keyword that applies only if partial unrolling is specified
with factor=. The elimination of the exit check depends on whether the loop iteration
count is known or unknown:
Fixed (known) bounds: No exit condition check is performed if the iteration count is a
multiple of the factor. If the iteration count is not an integer multiple of the factor, the tool:
1. Prevents unrolling.
2. Issues a warning that the exit check must be performed to proceed.
Variable (unknown) bounds: The exit condition check is removed as requested. You must
ensure that:
1. The variable bound is an integer multiple of the specified unroll factor.
2. No exit check is in fact required.
Example 1
The following example fully unrolls loop_1 in function foo. Place the pragma in the body of
loop_1 as shown:
loop_1: for(int i = 0; i < N; i++) {
#pragma HLS unroll
a[i] = b[i] + c[i];
}
Example 2
This example species an unroll factor of 4 to parally unroll loop_2 of funcon foo, and
removes the exit check:
void foo (...) {
int8 array1[M];
int12 array2[N];
...
loop_2: for(i=0;i<M;i++) {
#pragma HLS unroll skip_exit_check factor=4
array1[i] = ...;
array2[i] = ...;
...
}
...
}
Example 3
The following example fully unrolls all loops inside loop_1 in function foo, but not loop_1
itself, due to the presence of the region keyword:
void foo(int data_in[N], int scale, int data_out1[N], int data_out2[N]) {
int temp1[N];
loop_1: for(int i = 0; i < N; i++) {
#pragma HLS unroll region
temp1[i] = data_in[i] * scale;
loop_2: for(int j = 0; j < N; j++) {
data_out1[j] = temp1[j] * 123;
}
loop_3: for(int k = 0; k < N; k++) {
data_out2[k] = temp1[k] * 456;
}
}
}
See Also
pragma HLS loop_aen
pragma HLS loop_merge
pragma HLS loop_tripcount
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
opencl_unroll_hint
SDAccel Environment Optimization Guide (UG1207)
Appendix C
Additional Resources and Legal
Notices
Xilinx Resources
For support resources such as Answers, Documentation, Downloads, and Forums, see Xilinx
Support.
Solution Centers
See the Xilinx Solution Centers for support on devices, software tools, and intellectual property
at all stages of the design cycle. Topics include design assistance, advisories, and troubleshooting
tips.
References
These documents provide supplemental material useful with this webhelp:
1. SDx Environments Release Notes, Installation, and Licensing Guide (UG1238)
2. SDSoC Environment User Guide (UG1027)
3. SDSoC Environment Optimization Guide (UG1235)
4. SDSoC Environment Tutorial: Introduction (UG1028)
5. SDSoC Environment Platform Development Guide (UG1146)
6. SDSoC Development Environment web page
7. UltraFast Embedded Design Methodology Guide (UG1046)
8. Zynq-7000 All Programmable SoC Software Developers Guide (UG821)
9. Zynq UltraScale+ MPSoC Software Developer Guide (UG1137)
10. ZC702 Evaluation Board for the Zynq-7000 XC7Z020 All Programmable SoC User Guide (UG850)
11. ZCU102 Evaluation Board User Guide (UG1182)
12. PetaLinux Tools Documentation: Workflow Tutorial (UG1156)
13. Vivado Design Suite User Guide: High-Level Synthesis (UG902)
14. Vivado Design Suite User Guide: Creating and Packaging Custom IP (UG1118)
15. Vivado® Design Suite Documentation
Please Read: Important Legal Notices
The informaon disclosed to you hereunder (the “Materials”) is provided solely for the selecon
and use of Xilinx products. To the maximum extent permied by applicable law: (1) Materials are
made available "AS IS" and with all faults, Xilinx hereby DISCLAIMS ALL WARRANTIES AND
CONDITIONS, EXPRESS, IMPLIED, OR STATUTORY, INCLUDING BUT NOT LIMITED TO
WARRANTIES OF MERCHANTABILITY, NON-INFRINGEMENT, OR FITNESS FOR ANY
PARTICULAR PURPOSE; and (2) Xilinx shall not be liable (whether in contract or tort, including
negligence, or under any other theory of liability) for any loss or damage of any kind or nature
related to, arising under, or in connecon with, the Materials (including your use of the
Materials), including for any direct, indirect, special, incidental, or consequenal loss or damage
(including loss of data, prots, goodwill, or any type of loss or damage suered as a result of any
acon brought by a third party) even if such damage or loss was reasonably foreseeable or Xilinx
had been advised of the possibility of the same. Xilinx assumes no obligaon to correct any
errors contained in the Materials or to nofy you of updates to the Materials or to product
specicaons. You may not reproduce, modify, distribute, or publicly display the Materials
without prior wrien consent. Certain products are subject to the terms and condions of
Xilinx’s limited warranty, please refer to Xilinx’s Terms of Sale which can be viewed at
www.xilinx.com/legal.htm#tos; IP cores may be subject to warranty and support terms contained
in a license issued to you by Xilinx. Xilinx products are not designed or intended to be fail-safe or
for use in any applicaon requiring fail-safe performance; you assume sole risk and liability for
use of Xilinx products in such crical applicaons, please refer to Xilinx’s Terms of Sale which can
be viewed at www.xilinx.com/legal.htm#tos.
AUTOMOTIVE APPLICATIONS DISCLAIMER
AUTOMOTIVE PRODUCTS (IDENTIFIED AS “XA” IN THE PART NUMBER) ARE NOT
WARRANTED FOR USE IN THE DEPLOYMENT OF AIRBAGS OR FOR USE IN APPLICATIONS
THAT AFFECT CONTROL OF A VEHICLE (“SAFETY APPLICATION”) UNLESS THERE IS A
SAFETY CONCEPT OR REDUNDANCY FEATURE CONSISTENT WITH THE ISO 26262
AUTOMOTIVE SAFETY STANDARD (“SAFETY DESIGN”). CUSTOMER SHALL, PRIOR TO
USING OR DISTRIBUTING ANY SYSTEMS THAT INCORPORATE PRODUCTS, THOROUGHLY
TEST SUCH SYSTEMS FOR SAFETY PURPOSES. USE OF PRODUCTS IN A SAFETY
APPLICATION WITHOUT A SAFETY DESIGN IS FULLY AT THE RISK OF CUSTOMER, SUBJECT
ONLY TO APPLICABLE LAWS AND REGULATIONS GOVERNING LIMITATIONS ON PRODUCT
LIABILITY.
© Copyright 2017 Xilinx, Inc. Xilinx, the Xilinx logo, Artix, ISE, Kintex, Spartan, Virtex, Vivado,
Zynq, and other designated brands included herein are trademarks of Xilinx in the United States
and other countries. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by
permission by Khronos. PCI, PCIe and PCI Express are trademarks of PCI-SIG and used under
license. All other trademarks are the property of their respective owners.