Vivado HLS Opmizaon
Methodology Guide
UG1270 (v2017.4) December 20, 2017
Revision History
The following table shows the revision history for this document.
Date Version Revision
12/20/2017
2017.4 Initial Xilinx release.
UG1270 (v2017.4) December 20, 2017 www.xilinx.com
Table of Contents

Revision History
Chapter 1: Introduction
    HLS Pragmas
    OpenCL Attributes
    Directives
Chapter 2: Optimizing the Hardware Function
    Hardware Function Optimization Methodology
    Baseline The Hardware Functions
    Optimization for Metrics
    Pipeline for Performance
Chapter 3: Optimize Structures for Performance
    Reducing Latency
    Reducing Area
    Design Optimization Workflow
Chapter 4: Data Access Patterns
    Algorithm with Poor Data Access Patterns
    Algorithm With Optimal Data Access Patterns
Chapter 5: Standard Horizontal Convolution
    Optimal Horizontal Convolution
    Optimal Vertical Convolution
    Optimal Border Pixel Convolution
    Optimal Data Access Patterns
Appendix A: OpenCL Attributes
    always_inline
    opencl_unroll_hint
    reqd_work_group_size
    vec_type_hint
    work_group_size_hint
    xcl_array_partition
    xcl_array_reshape
    xcl_data_pack
    xcl_dataflow
    xcl_dependence
    xcl_max_work_group_size
    xcl_pipeline_loop
    xcl_pipeline_workitems
    xcl_reqd_pipe_depth
    xcl_zero_global_work_offset
Appendix B: HLS Pragmas
    pragma HLS allocation
    pragma HLS array_map
    pragma HLS array_partition
    pragma HLS array_reshape
    pragma HLS clock
    pragma HLS data_pack
    pragma HLS dataflow
    pragma HLS dependence
    pragma HLS expression_balance
    pragma HLS function_instantiate
    pragma HLS inline
    pragma HLS interface
    pragma HLS latency
    pragma HLS loop_flatten
    pragma HLS loop_merge
    pragma HLS loop_tripcount
    pragma HLS occurrence
    pragma HLS pipeline
    pragma HLS protocol
    pragma HLS reset
    pragma HLS resource
    pragma HLS stream
    pragma HLS top
    pragma HLS unroll
Appendix C: Additional Resources and Legal Notices
    References
    Please Read: Important Legal Notices
Chapter 1
Introduction
This guide provides details on how to perform optimizations using Vivado HLS. The optimization process consists of directives, which specify which optimizations are performed, and a methodology, which shows how optimizations may be applied in a deterministic and efficient manner.
HLS Pragmas
Optimizations in Vivado HLS
In both SDAccel and SDSoC projects, the hardware kernel must be synthesized from the OpenCL, C, or C++ language into RTL that can be implemented into the programmable logic of a Xilinx device. Vivado HLS synthesizes the RTL from the OpenCL, C, and C++ language descriptions.
Vivado HLS is intended to work with your SDAccel or SDSoC Development Environment project without interaction. However, Vivado HLS also provides pragmas that can be used to optimize the design: reduce latency, improve throughput performance, and reduce area and device resource utilization of the resulting RTL code. These pragmas can be added directly to the source code for the kernel.
IMPORTANT!:
Although the SDSoC environment supports the use of HLS pragmas, it does not support pragmas applied to any argument of the function interface (interface, array_partition, or data_pack pragmas). Refer to "Optimizing the Hardware Function" in the SDSoC Environment Optimization Guide (UG1235) for more information.
The Vivado HLS pragmas include the optimization types specified below:
Table 1: Vivado HLS Pragmas by Type

Kernel Optimization:
    pragma HLS allocation
    pragma HLS clock
    pragma HLS expression_balance
    pragma HLS latency
    pragma HLS reset
    pragma HLS resource
    pragma HLS top
Function Inlining:
    pragma HLS inline
    pragma HLS function_instantiate
Interface Synthesis:
    pragma HLS interface
    pragma HLS protocol
Task-level Pipeline:
    pragma HLS dataflow
    pragma HLS stream
Pipeline:
    pragma HLS pipeline
    pragma HLS occurrence
Loop Unrolling:
    pragma HLS unroll
    pragma HLS dependence
Loop Optimization:
    pragma HLS loop_flatten
    pragma HLS loop_merge
    pragma HLS loop_tripcount
Array Optimization:
    pragma HLS array_map
    pragma HLS array_partition
    pragma HLS array_reshape
Structure Packing:
    pragma HLS data_pack
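To make the mechanism concrete, the following is a minimal sketch of how such pragmas are embedded in kernel source. The kernel, interface modes, and sizes are illustrative (not taken from this guide); a standard C++ compiler simply ignores the unknown HLS pragmas, so the same source remains portable.

```cpp
#include <cassert>

// Hypothetical kernel: element-wise add of two fixed-size vectors.
// The HLS pragmas are compile-time hints to Vivado HLS only.
void vadd(const int a[64], const int b[64], int out[64]) {
#pragma HLS INTERFACE ap_memory port=a
#pragma HLS INTERFACE ap_memory port=b
#pragma HLS INTERFACE ap_memory port=out
    Loop: for (int i = 0; i < 64; i++) {
#pragma HLS PIPELINE II=1
        // One multiply-free operation per iteration; with II=1 the
        // loop accepts a new iteration every clock cycle in hardware.
        out[i] = a[i] + b[i];
    }
}
```

Note that pragmas are placed inside the scope (function body or loop body) to which they apply.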
OpenCL Attributes
Optimizations in OpenCL
This secon describes OpenCL aributes that can be added to source code to assist system
opmizaon by the SDAccel compiler, xocc, the SDSoC system compilers, sdscc and sds++,
and Vivado HLS synthesis.
SDx provides OpenCL aributes to opmize your code for data movement and kernel
performance. The goal of data movement opmizaon is to maximize the system level data
throughput by maximizing interface bandwidth ulizaon and DDR bandwidth ulizaon. The
goal of kernel computaon opmizaon is to create processing logic that can consume all the
data as soon as they arrive at kernel interfaces. This is generally achieved by expanding the
processing code to match the data path with techniques such as funcon inlining and pipelining,
loop unrolling, array paroning, dataowing, etc.
The OpenCL aributes include the types specied below:
Table 2: OpenCL __attributes__ by Type

Kernel Size:
    reqd_work_group_size
    vec_type_hint
    work_group_size_hint
    xcl_max_work_group_size
    xcl_zero_global_work_offset
Function Inlining:
    always_inline
Task-level Pipeline:
    xcl_dataflow
    xcl_reqd_pipe_depth
Pipeline:
    xcl_pipeline_loop
    xcl_pipeline_workitems
Loop Unrolling:
    opencl_unroll_hint
Array Optimization:
    xcl_array_partition
    xcl_array_reshape

Note: Array variables only accept a single array optimization attribute.
TIP: The SDAccel and SDSoC compilers also support many of the standard attributes supported by gcc, such as always_inline, noinline, unroll, and nounroll.
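As a minimal sketch of the gcc-style attribute syntax mentioned in the tip (the functions themselves are illustrative, not from this guide), attributes are attached directly to the function declaration:

```cpp
#include <cassert>

// gcc-style attributes attach to the declaration; these two compile
// as-is with gcc/clang and are also honored by the SDx compilers.
__attribute__((always_inline)) inline int square(int x) {
    return x * x;  // forced inline at every call site
}

__attribute__((noinline)) int cube(int x) {
    return x * square(x);  // kept as a real function call
}
```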
Directives
To view details on the directives in the following table, see the Command Reference section in Vivado Design Suite User Guide: High-Level Synthesis (UG902).
Table 3: Vivado HLS Directives by Type

Kernel Optimization:
    set_directive_allocation
    set_directive_clock
    set_directive_expression_balance
    set_directive_latency
    set_directive_reset
    set_directive_resource
    set_directive_top
Function Inlining:
    set_directive_inline
    set_directive_function_instantiate
Interface Synthesis:
    set_directive_interface
    set_directive_protocol
Task-level Pipeline:
    set_directive_dataflow
    set_directive_stream
Pipeline:
    set_directive_pipeline
    set_directive_occurrence
Loop Unrolling:
    set_directive_unroll
    set_directive_dependence
Loop Optimization:
    set_directive_loop_flatten
    set_directive_loop_merge
    set_directive_loop_tripcount
Array Optimization:
    set_directive_array_map
    set_directive_array_partition
    set_directive_array_reshape
Structure Packing:
    set_directive_data_pack
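Unlike pragmas, directives live in a Tcl script kept with the solution rather than in the source code, which allows different optimizations to be tried without editing the C/C++. A hypothetical sketch (the function name foo, loop labels, and values are illustrative, not from this guide):

```tcl
# Hypothetical directives.tcl: the same optimizations as pragmas,
# applied externally to function foo and its labeled loops.
set_directive_pipeline -II 1 "foo/Loop2"
set_directive_array_partition -type cyclic -factor 4 "foo" in1
set_directive_loop_tripcount -min 32 -max 1024 "foo/Loop1"
```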
Chapter 2
Optimizing the Hardware Function
The SDSoC environment employs heterogeneous cross-compilation, with ARM CPU-specific cross-compilers for the Zynq-7000 SoC and Zynq UltraScale+ MPSoC CPUs, and Vivado HLS as a PL cross-compiler for hardware functions. This section explains the default behavior and optimization directives associated with the Vivado HLS cross-compiler.
The default behavior of Vivado HLS is to execute functions and loops in a sequential manner such that the hardware is an accurate reflection of the C/C++ code. Optimization directives can be used to enhance the performance of the hardware function, allowing pipelining, which substantially increases the performance of the functions. This chapter outlines a general methodology for optimizing your design for high performance.
There are many possible goals when trying to optimize a design using Vivado HLS. The methodology assumes you want to create a design with the highest possible performance, processing one sample of new input data every clock cycle, and so addresses those optimizations before the ones used for reducing latency or resources.
Detailed explanations of the optimizations discussed here are provided in Vivado Design Suite User Guide: High-Level Synthesis (UG902).
It is highly recommended to review the methodology and obtain a global perspective of hardware function optimization before reviewing the details of specific optimizations.
Hardware Function Optimization Methodology
Hardware functions are synthesized into hardware in the programmable logic (PL) by the Vivado HLS compiler. This compiler automatically translates C/C++ code into an FPGA hardware implementation, and as with all compilers, does so using compiler defaults. In addition to the compiler defaults, Vivado HLS provides a number of optimizations that are applied to the C/C++ code through the use of pragmas in the code. This chapter explains the optimizations that can be applied and a recommended methodology for applying them.
There are two flows for optimizing the hardware functions:
Top-down flow: In this flow, program decomposition into hardware functions proceeds top-down within the SDSoC environment, letting the system compiler create pipelines of functions that automatically operate in dataflow mode. The microarchitecture for each hardware function is optimized using Vivado HLS.
Bottom-up flow: In this flow, the hardware functions are optimized in isolation from the system using the Vivado HLS compiler provided in the Vivado Design Suite. The hardware functions are analyzed, optimization directives can be applied to create an implementation other than the default, and the resulting optimized hardware functions are then incorporated into the SDSoC environment.
The bottom-up flow is often used in organizations where the software and hardware are optimized by different teams, and can be used by software programmers who wish to take advantage of existing hardware implementations from within their organization or from partners. Both flows are supported, and the same optimization methodology is used in either case. Both workflows result in the same high-performance system. Xilinx sees the choice as a workflow decision made by individual teams and organizations and provides no recommendation on which flow to use. Examples of both flows are provided in the SDSoC Environment Optimization Guide (UG1235).
The opmizaon methodology for hardware funcons is shown in the gure below.
Figure: Hardware Function Optimization Methodology

Simulate Design: Validate the C function.
Synthesize Design: Baseline design.
1. Initial Optimizations: Define interfaces (and data packing); define loop trip counts.
2. Pipeline for Performance: Pipeline and dataflow.
3. Optimize Structures for Performance: Partition memories and ports; remove false dependencies.
4. Reduce Latency: Optionally specify latency requirements.
5. Improve Area: Optionally recover resources through sharing.
The gure above details all the steps in the methodology and the subsequent secons in this
chapter explain the opmizaons in detail.
IMPORTANT!: Designs will reach the opmum performance aer step 3.
Step 4 is used to minimize, or specically control, the latency through the design and is only
required for applicaons where this is of concern. Step 5 explains how to reduce the resources
required for hardware implementaon and is typically only applied when larger hardware
funcons fail to implement in the available resources. The FPGA has a xed number of resources,
and there is typically no benet in creang a smaller implementaon if the performance goals
have been met.
Baseline The Hardware Functions
Before seeking to perform any hardware function optimization, it is important to understand the performance achieved with the existing code and compiler defaults, and to appreciate how performance is measured. This is achieved by selecting the functions to implement in hardware and building the project.
After the project has been built, a report is available in the reports section of the IDE (and provided at <project name>/<build_config>/_sds/vhls/<hw_function>/solution/syn/report/<hw_function>.rpt). This report details the performance estimates and utilization estimates.
The key factors in the performance estimates are the timing, interval, and latency, in that order.
The timing summary shows the target and estimated clock frequency. If the estimated clock period is greater than the target, the hardware will not function at the target clock frequency. The clock frequency should be reduced by using the Data Motion Network Clock Frequency option in the Project Settings. Alternatively, because this is only an estimate at this point in the flow, it might be possible to proceed through the remainder of the flow if the estimate only exceeds the target by 20%. Further optimizations are applied when the bitstream is generated, and it might still be possible to satisfy the timing requirements. However, this is an indication that the hardware function is not guaranteed to meet timing.
The initiation interval (II) is the number of clock cycles before the function can accept new inputs and is generally the most critical performance metric in any system. In an ideal hardware function, the hardware processes data at the rate of one sample per clock cycle. If the largest data set passed into the hardware is size N (e.g., my_array[N]), the most optimal II is N + 1. This means the hardware function processes N data samples in N clock cycles and can accept new data one clock cycle after all N samples are processed. It is possible to create a hardware function with an II < N; however, this requires greater resources in the PL with typically little benefit. The hardware function will often be ideal as it consumes and produces data at a rate faster than the rest of the system.
The loop initiation interval is the number of clock cycles before the next iteration of a loop starts to process data. This metric becomes important as you delve deeper into the analysis to locate and remove performance bottlenecks.
The latency is the number of clock cycles required for the function to compute all output values. This is simply the lag from when data is applied until when it is ready. For most applications this is of little concern, especially when the latency of the hardware function vastly exceeds that of the software or system functions, such as DMA. It is, however, a performance metric that you should review and confirm is not an issue for your application.
The loop iteration latency is the number of clock cycles it takes to complete one iteration of a loop, and the loop latency is the number of cycles to execute all iterations of the loop.
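These loop metrics are related by a back-of-the-envelope calculation that is worth keeping in mind when reading reports. The formulas below are the commonly used approximations for sequential and pipelined loops, not figures taken verbatim from this guide; the numbers in the comments are illustrative.

```cpp
#include <cassert>

// A sequential (non-pipelined) loop takes roughly one full iteration
// latency per iteration.
int sequential_loop_latency(int iter_latency, int trip_count) {
    return iter_latency * trip_count;
}

// A pipelined loop overlaps iterations: after the first iteration
// fills the pipeline, a new iteration completes every II cycles.
int pipelined_loop_latency(int iter_latency, int ii, int trip_count) {
    return iter_latency + ii * (trip_count - 1);
}
```

For example, a loop with an iteration latency of 4 and a trip count of 512 takes roughly 2048 cycles sequentially, but only 515 cycles when pipelined with II = 1, which is why pipelining is the first performance optimization applied in this methodology.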
The Area Esmates secon of the report details how many resources are required in the PL to
implement the hardware funcon and how many are available on the device. The key metric here
is the Ulizaon (%). The Ulizaon (%) should not exceed 100% for any of the resources. A gure
greater than 100% means there are not enough resources to implement the hardware funcon,
and a larger FPGA device might be required. As with the ming, at this point in the ow, this is an
esmate. If the numbers are only slightly over 100%, it might be possible for the hardware to be
opmized during bitstream creaon.
You should already have an understanding of the required performance of your system and what
metrics are required from the hardware funcons. However, even if you are unfamiliar with
hardware concepts such as clock cycles, you are now aware that the highest performing
hardware funcons have an II = N + 1, where N is the largest data set processed by the funcon.
With an understanding of the current design performance and a set of baseline performance
metrics, you can now proceed to apply opmizaon direcves to the hardware funcons.
Optimization for Metrics
The following table shows the first directive you should think about adding to your design.

Table 4: Optimization Strategy Step 1: Optimization for Metrics

LOOP_TRIPCOUNT: Used for loops that have variable bounds. Provides an estimate for the loop iteration count. This has no impact on synthesis, only on reporting.
A common issue when hardware functions are first compiled is report files showing the latency and interval as a question mark "?" rather than as numerical values. If the design has loops with variable loop bounds, the compiler cannot determine the latency or II and uses the "?" to indicate this condition. Variable loop bounds are where the loop iteration limit cannot be resolved at compile time, as when the loop iteration limit is an input argument to the hardware function, such as a variable height, width, or depth parameter.
To resolve this condition, use the hardware function report to locate the lowest-level loop that fails to report a numerical value, and use the LOOP_TRIPCOUNT directive to apply an estimated tripcount. The tripcount is the minimum, average, and/or maximum number of expected iterations. This allows values for latency and interval to be reported and allows implementations with different optimizations to be compared.
Because the LOOP_TRIPCOUNT value is only used for reporting and has no impact on the resulting hardware implementation, any value can be used. However, an accurate expected value results in more useful reports.
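A minimal sketch of the pragma form of this directive is shown below. The function and bounds are illustrative, not from this guide; the pragma is ignored by a standard C++ compiler and, in Vivado HLS, affects only the report, not the generated hardware.

```cpp
#include <cassert>

// Hypothetical function with a variable loop bound: without a
// tripcount hint, the report would show "?" for latency and II.
int accumulate(const int *data, int len) {
    int sum = 0;
    Sum_Loop: for (int i = 0; i < len; i++) {
#pragma HLS LOOP_TRIPCOUNT min=32 max=1024 avg=512
        sum += data[i];
    }
    return sum;
}
```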
Pipeline for Performance
The next stage in creating a high-performance design is to pipeline the functions, loops, and operations. Pipelining results in the greatest level of concurrency and the highest level of performance. The following table shows the directives you can use for pipelining.

Table 5: Optimization Strategy Step 2: Pipeline for Performance

PIPELINE: Reduces the initiation interval by allowing the concurrent execution of operations within a loop or function.
DATAFLOW: Enables task-level pipelining, allowing functions and loops to execute concurrently. Used to minimize interval.
RESOURCE: Specifies pipelining on the hardware resource used to implement a variable (array, arithmetic operation).
Config Compile: Allows loops to be automatically pipelined based on their iteration count when using the bottom-up flow.
At this stage of the optimization process, you want to create as much concurrent operation as possible. You can apply the PIPELINE directive to functions and loops. You can use the DATAFLOW directive at the level that contains the functions and loops to make them work in parallel. Although rarely required, the RESOURCE directive can be used to squeeze out the highest levels of performance.
A recommended strategy is to work from the bottom up and be aware of the following:
Some functions and loops contain sub-functions. If the sub-function is not pipelined, the function above it might show limited improvement when it is pipelined. The non-pipelined sub-function will be the limiting factor.
Some functions and loops contain sub-loops. When you use the PIPELINE directive, the directive automatically unrolls all loops in the hierarchy below. This can create a great deal of logic. It might make more sense to pipeline the loops in the hierarchy below.
For cases where it does make sense to pipeline the upper hierarchy and unroll any loops lower in the hierarchy, loops with variable bounds cannot be unrolled, and any loops and functions in the hierarchy above these loops cannot be pipelined. To address this issue, pipeline these loops with variable bounds, and use the DATAFLOW optimization to ensure the pipelined loops operate concurrently to maximize the performance of the task that contains the loops. Alternatively, rewrite the loop to remove the variable bound: apply a maximum upper bound with a conditional break.
The basic strategy at this point in the optimization process is to pipeline the tasks (functions and loops) as much as possible. For detailed information on which functions and loops to pipeline, refer to Hardware Function Pipeline Strategies.
Although not commonly used, you can also apply pipelining at the operator level. For example, wire routing in the FPGA can introduce large and unanticipated delays that make it difficult for the design to be implemented at the required clock frequency. In this case, you can use the RESOURCE directive to pipeline specific operations such as multipliers, adders, and block RAM to add additional pipeline register stages at the logic level and allow the hardware function to process data at the highest possible performance level without the need for recursion.
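A sketch of this operator-level pipelining is shown below. The core name is illustrative; the set of available cores is device- and library-dependent and is listed in Vivado Design Suite User Guide: High-Level Synthesis (UG902). A standard C++ compiler ignores the pragma.

```cpp
#include <cassert>

// Request a pipelined multiplier core with extra register stages for
// the operation that produces r (core name illustrative; see UG902).
int mul_pipe(int a, int b) {
    int r = a * b;
#pragma HLS RESOURCE variable=r core=MulnS latency=3
    return r;
}
```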
Note: The Cong commands are used to change the opmizaon default sengs and are only available
from within Vivado HLS when using a boom-up ow. Refer to Vivado Design Suite User Guide: High-Level
Synthesis (UG902) for more details.
Hardware Function Pipeline Strategies
The key optimization directives for obtaining a high-performance design are the PIPELINE and DATAFLOW directives. This section discusses in detail how to apply these directives for various C code architectures.
Fundamentally, there are two types of C/C++ functions: those that are frame-based and those that are sample-based. No matter which coding style is used, the hardware function can be implemented with the same performance in both cases. The difference is only in how the optimization directives are applied.
Frame-Based C Code
The primary characteristic of a frame-based coding style is that the function processes multiple data samples (a frame of data), typically supplied as an array or pointer with data accessed through pointer arithmetic during each transaction (a transaction is considered to be one complete execution of the C function). In this coding style, the data is typically processed through a series of loops or nested loops.
An example outline of frame-based C code is shown below.

void foo(
    data_t in1[HEIGHT][WIDTH],
    data_t in2[HEIGHT][WIDTH],
    data_t out[HEIGHT][WIDTH]) {
  Loop1: for (int i = 0; i < HEIGHT; i++) {
    Loop2: for (int j = 0; j < WIDTH; j++) {
      out[i][j] = in1[i][j] * in2[i][j];
      Loop3: for (int k = 0; k < NUM_BITS; k++) {
        // ...
      }
    }
  }
}
When seeking to pipeline any C/C++ code for maximum performance in hardware, you want to place the pipeline optimization directive at the level where a sample of data is processed.
The above example is representative of code used to process an image or video frame and can be used to highlight how to effectively pipeline hardware functions. Two sets of input are provided as frames of data to the function, and the output is also a frame of data. There are multiple locations where this function can be pipelined:
At the level of function foo.
At the level of loop Loop1.
At the level of loop Loop2.
At the level of loop Loop3.
Reviewing the advantages and disadvantages of placing the PIPELINE directive at each of these locations helps explain the best location to place the pipeline directive for your code.
Funcon Level: The funcon accepts a frame of data as input (in1 and in2). If the funcon is
pipelined with II = 1—read a new set of inputs every clock cycle—this informs the compiler to
read all HEIGHT*WIDTH values of in1 and in2 in a single clock cycle. It is unlikely this is the
design you want.
If the PIPELINE direcve is applied to funcon foo, all loops in the hierarchy below this level
must be unrolled. This is a requirement for pipelining, namely, there cannot be sequenal logic
inside the pipeline. This would create HEIGHT*WIDTH*NUM_ELEMENT copies of the logic,
which would lead to a large design.
Because the data is accessed in a sequential manner, the arrays on the interface to the hardware
function can be implemented as multiple types of hardware interface:

• Block RAM interface
• AXI4 interface
• AXI4-Lite interface
• AXI4-Stream interface
• FIFO interface

A block RAM interface can be implemented as a dual-port interface supplying two samples per
clock. The other interface types can only supply one sample per clock. This would result in a
bottleneck: a large, highly parallel hardware design unable to process all the data in parallel,
which would waste hardware resources.
Loop1 Level: The logic in Loop1 processes an entire row of the two-dimensional matrix. Placing
the PIPELINE directive here would create a design which seeks to process one row in each clock
cycle. Again, this would unroll the loops below and create additional logic. However, the only way
to make use of the additional hardware would be to transfer an entire row of data each clock
cycle: WIDTH data words, with each word being <number of bits in data_t> bits wide.

Because it is unlikely the host code running on the PS can process such large data words, this
would again result in a case where there are many highly parallel hardware resources that cannot
operate in parallel due to bandwidth limitations.
Loop2 Level: The logic in Loop2 processes one sample from the arrays. In an image
algorithm, this is the level of a single pixel. This is the level to pipeline if the design is to process
one sample per clock cycle. This is also the rate at which the interfaces consume and produce
data to and from the PS.

This will cause Loop3 to be completely unrolled, so that the design still processes one sample per
clock. It is a requirement of pipelining that all the operations in Loop3 execute in parallel. In a
typical design, the logic in Loop3 is a shift register or is processing bits within a word. To execute
at one sample per clock, you want these processes to occur in parallel, and hence you want to
unroll the loop. The hardware function created by pipelining Loop2 processes one data sample
per clock and creates parallel logic only where needed to achieve the required level of data
throughput.
Loop3 Level: As stated above, given that Loop2 operates on each data sample or pixel, Loop3 will
typically be doing bit-level or data shifting tasks, so this level performs multiple operations per
pixel. Pipelining this level would mean performing each operation in the loop once per clock, and
thus NUM_BITS clocks per pixel: processing at the rate of multiple clocks per pixel or data
sample.

For example, Loop3 might contain a shift register holding the previous pixels required for a
windowing or convolution algorithm. Adding the PIPELINE directive at this level informs the
compiler to shift one data value every clock cycle. The design would only return to the logic in
Loop2 and read the next inputs after NUM_BITS iterations, resulting in a very slow data
processing rate.
The ideal location to pipeline in this example is Loop2.
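To make the preferred placement concrete, here is a minimal compilable sketch with the PIPELINE directive inside Loop2; the HEIGHT, WIDTH, and data_t definitions are illustrative assumptions, and Loop3 is omitted for brevity:

```cpp
#include <cassert>

const int HEIGHT = 4;
const int WIDTH = 4;
typedef int data_t;

// Illustrative sketch: PIPELINE placed inside Loop2, the per-sample loop,
// so one output is produced per clock. HEIGHT, WIDTH, and data_t are
// assumptions, and Loop3 is omitted for brevity.
void foo(data_t in1[HEIGHT][WIDTH],
         data_t in2[HEIGHT][WIDTH],
         data_t out[HEIGHT][WIDTH]) {
    Loop1: for (int i = 0; i < HEIGHT; i++) {
        Loop2: for (int j = 0; j < WIDTH; j++) {
#pragma HLS PIPELINE II=1
            out[i][j] = in1[i][j] * in2[i][j];
        }
    }
}
```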
When dealing with frame-based code, you will want to pipeline at the loop level, and typically to
pipeline the loop that operates at the level of a sample. If in doubt, place a print statement into
the C code to confirm this is the level you wish to execute on each clock cycle.
For cases where there are multiple loops at the same level of hierarchy (the example above
shows only a set of nested loops), the best location to place the PIPELINE directive can be
determined for each loop, and the DATAFLOW directive can then be applied to the function to
ensure each of the loops executes in a concurrent manner.
Sample-Based C Code

An example outline of sample-based C code is shown below. The primary characteristic of this
coding style is that the function processes a single data sample during each transaction.
void foo (data_t *in, data_t *out) {
  static data_t acc;
  Loop1: for (int i=N-1;i>=0;i--) {
    acc+= ..some calculation..;
  }
  *out=acc>>N;
}
Another characteristic of the sample-based coding style is that the function often contains a static
variable: a variable whose value must be remembered between invocations of the function, such
as an accumulator or sample counter.

With sample-based code, the location of the PIPELINE directive is clear: to achieve an II of 1
and process one data value each clock cycle, the function itself must be pipelined.

This unrolls any loops inside the function and creates additional hardware logic, but there is no
way around this. If only Loop1 is pipelined, it takes a minimum of N clock cycles to complete, and
only then can the function read the next input value.

When dealing with C code that processes at the sample level, the strategy is always to pipeline
the function.
In this type of coding style, the loops are typically operating on arrays, performing a shift
register or line buffer function. It is not uncommon to partition these arrays into individual
elements, as discussed in Chapter 3: Optimize Structures for Performance, to ensure all samples
are shifted in a single clock cycle. If the array is implemented in a block RAM, only a maximum of
two samples can be read or written in each clock cycle, creating a data processing bottleneck.
The solution here is to pipeline function foo. Doing so results in a design that processes one
sample per clock.
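Putting this together, a minimal compilable sketch of the pipelined sample-based function is shown below; N, data_t, and the simple accumulation standing in for the elided calculation are assumptions:

```cpp
#include <cassert>

const int N = 4;
typedef int data_t;

// Illustrative sketch: the PIPELINE directive is placed on the function
// itself, which unrolls Loop1. N, data_t, and the simple accumulation
// standing in for the elided calculation are assumptions.
void sample_foo(data_t *in, data_t *out) {
#pragma HLS PIPELINE II=1
    static data_t acc = 0;  // value persists between invocations
    Loop1: for (int i = N - 1; i >= 0; i--) {
        acc += *in;
    }
    *out = acc >> N;
}
```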
Chapter 3

Optimize Structures for Performance

C code can contain descriptions that prevent a function or loop from being pipelined with the
required performance. This is often implied by the structure of the C code or the default logic
structures used to implement the PL logic. In some cases, this might require a code modification,
but in most cases these issues can be addressed using additional optimization directives.
The following example shows a case where an optimization directive is used to improve the
structure of the implementation and the performance of pipelining. In this initial example, the
PIPELINE directive is added to a loop to improve its performance. The example code
shows a loop being used inside a function.
#include "bottleneck.h"
dout_t bottleneck(...) {
  ...
  SUM_LOOP: for(i=3;i<N;i=i+4) {
#pragma HLS PIPELINE
    sum += mem[i] + mem[i-1] + mem[i-2] + mem[i-3];
  }
  ...
}
When the code above is compiled into hardware, the following message appears as output:
INFO: [SCHED 61] Pipelining loop 'SUM_LOOP'.
WARNING: [SCHED 69] Unable to schedule 'load' operation ('mem_load_2',
bottleneck.c:62) on array 'mem' due to limited memory ports.
INFO: [SCHED 61] Pipelining result: Target II: 1, Final II: 2, Depth: 3.
The issue in this example is that arrays are implemented using the efficient block RAM resources
in the PL fabric. This results in a small, cost-efficient, fast design. The disadvantage of block RAM
is that, like other memories such as DDR or SRAM, it has a limited number of data ports,
typically a maximum of two.

In the code above, four data values from mem are required to compute the value of sum. Because
mem is an array and implemented in a block RAM that only has two data ports, only two values
can be read (or written) in each clock cycle. With this configuration, it is impossible to compute
the value of sum in one clock cycle and thus consume or produce data with an II of 1 (process
one data sample per clock).
The memory port limitation issue can be solved by using the ARRAY_PARTITION directive on the
mem array. This directive partitions arrays into smaller arrays, improving the data structure by
providing more data ports and allowing a higher performance pipeline.

With the additional directive shown below, array mem is partitioned into two dual-port memories
so that all four reads can occur in one clock cycle. There are multiple options for partitioning an
array. In this case, cyclic partitioning with a factor of two ensures the first partition contains
elements 0, 2, 4, etc., from the original array, and the second partition contains elements 1, 3, 5,
etc. Because the partitioning ensures there are now two dual-port block RAMs (with a total of
four data ports), elements 0, 1, 2, and 3 can all be read in a single clock cycle.
Note: The ARRAY_PARTITION directive cannot be used on arrays which are arguments of the function
selected as an accelerator.
#include "bottleneck.h"
dout_t bottleneck(...) {
#pragma HLS ARRAY_PARTITION variable=mem cyclic factor=2 dim=1
  ...
  SUM_LOOP: for(i=3;i<N;i=i+4) {
#pragma HLS PIPELINE
    sum += mem[i] + mem[i-1] + mem[i-2] + mem[i-3];
  }
  ...
}
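The even/odd split produced by cyclic partitioning with a factor of two can be modeled in plain C++ as a behavioral illustration (this models the element mapping only, not the generated hardware):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Behavioral model of cyclic partitioning with factor = 2 (illustration
// only, not HLS output): bank 0 receives elements 0, 2, 4, ... and bank 1
// receives elements 1, 3, 5, ..., so any four consecutive elements fall
// across the four ports of two dual-port memories.
void cyclic_partition(const std::vector<int> &mem,
                      std::vector<int> &bank0,
                      std::vector<int> &bank1) {
    for (std::size_t i = 0; i < mem.size(); i++) {
        if (i % 2 == 0) {
            bank0.push_back(mem[i]);
        } else {
            bank1.push_back(mem[i]);
        }
    }
}
```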
Other such issues might be encountered when trying to pipeline loops and functions. The
following table lists the directives that are likely to address these issues by helping to reduce
bottlenecks in data structures.
Table 6: Optimization Strategy Step 3: Optimize Structures for Performance
Directives and Configurations Description
ARRAY_PARTITION Partitions large arrays into multiple smaller arrays or into
individual registers to improve access to data and remove
block RAM bottlenecks.
DEPENDENCE Provides additional information that can overcome loop-
carry dependencies and allow loops to be pipelined (or
pipelined with lower intervals).
INLINE Inlines a function, removing all function hierarchy. Enables
logic optimization across function boundaries and improves
latency/interval by reducing function call overhead.
UNROLL Unrolls for-loops to create multiple independent operations
rather than a single collection of operations, allowing
greater hardware parallelism. This also allows for partial
unrolling of loops.
Config Array Partition This configuration determines how arrays are automatically
partitioned, including global arrays, and if the partitioning
impacts array ports.
Config Compile Controls synthesis specific optimizations such as the
automatic loop pipelining and floating point math
optimizations.
Config Schedule Determines the effort level to use during the synthesis
scheduling phase, the verbosity of the output messages,
and whether II should be relaxed in pipelined tasks to
achieve timing.
Config Unroll Allows all loops below the specified number of loop
iterations to be automatically unrolled.
In addition to the ARRAY_PARTITION directive, the configuration for array partitioning can be
used to automatically partition arrays.
The DEPENDENCE directive might be required to remove implied dependencies when pipelining
loops. Such dependencies are reported by message SCHED-68:
@W [SCHED-68] Target II not met due to carried dependence(s)
The INLINE directive removes function boundaries. This can be used to bring logic or loops up
one level of hierarchy. It might be more efficient to pipeline the logic in a function by including it
in the function above it, and to merge loops into the function above them, where the DATAFLOW
optimization can be used to execute all the loops concurrently without the overhead of the
intermediate sub-function call. This might lead to a higher performing design.
The UNROLL directive might be required for cases where a loop cannot be pipelined with the
required II. If a loop can only be pipelined with II = 4, it will constrain the other loops and
functions in the system to be limited to II = 4. In some cases, it might be worth unrolling or
partially unrolling the loop to create more logic and remove a potential bottleneck. If the loop
can only achieve II = 4, unrolling the loop by a factor of 4 creates logic that can process four
iterations of the loop in parallel and achieve II = 1.
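The effect can be sketched manually; the loop below is the hand-written equivalent of applying a 4x partial unroll to an accumulation loop, with the accumulator split four ways to break the loop-carried dependence (N and the data are assumptions):

```cpp
#include <cassert>

const int N = 16;

// Hand-written equivalent of a 4x partial unroll: four partial sums
// accumulate in parallel and are combined at the end. N is an
// illustrative assumption.
int sum16(const int mem[N]) {
    int part0 = 0, part1 = 0, part2 = 0, part3 = 0;
    Sum: for (int i = 0; i < N; i += 4) {
#pragma HLS PIPELINE II=1
        part0 += mem[i];
        part1 += mem[i + 1];
        part2 += mem[i + 2];
        part3 += mem[i + 3];
    }
    return part0 + part1 + part2 + part3;
}
```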
The Cong commands are used to change the opmizaon default sengs and are only available
from within Vivado HLS when using a boom-up ow. Refer to Vivado Design Suite User Guide:
High-Level Synthesis (UG902) for more details.
If opmizaon direcves cannot be used to improve the iniaon interval, it might require
changes to the code. Examples of this are discussed in Vivado Design Suite User Guide: High-Level
Synthesis (UG902).
Reducing Latency

When the compiler finishes minimizing the initiation interval (II), it automatically seeks to
minimize the latency. The optimization directives listed in the following table can help specify a
particular latency or inform the compiler to achieve a latency lower than the one produced; that
is, they instruct the compiler to satisfy the latency directive even if it results in a higher II. This
could result in a lower performance design.

Latency directives are generally not required, because most applications have a required
throughput but no required latency. When hardware functions are integrated with a processor,
the latency of the processor is generally the limiting factor in the system.

If the loops and functions are not pipelined, the throughput is limited by the latency, because the
task does not start reading the next set of inputs until the current task has completed.
Table 7: Optimization Strategy Step 4: Reduce Latency
Directive Description
LATENCY Allows a minimum and maximum latency constraint to be
specified.
LOOP_FLATTEN Allows nested loops to be collapsed into a single loop. This
removes the loop transition overhead and improves the
latency. Nested loops are automatically flattened when the
PIPELINE directive is applied.
LOOP_MERGE Merges consecutive loops to reduce overall latency, increase
logic resource sharing, and improve logic optimization.
The loop optimization directives can be used to flatten a loop hierarchy or merge consecutive
loops together. The benefit to the latency comes from the fact that it typically costs a clock cycle
in the control logic to enter and leave the logic created by a loop. The fewer the transitions
between loops, the fewer clock cycles a design takes to complete.
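The hand-written equivalent of LOOP_MERGE can be sketched as follows: two consecutive loops with the same trip count are combined into one body, removing one set of loop entry/exit transitions (the names and arithmetic are assumptions):

```cpp
#include <cassert>

const int N = 8;

// Illustrative sketch: the hand-written equivalent of LOOP_MERGE. Two
// consecutive loops over the same range become one body, saving one set
// of loop entry/exit clock cycles. Names and arithmetic are assumptions.
void merged(const int in[N], int outA[N], int outB[N]) {
    Merged: for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
        outA[i] = in[i] + 1;  // body of the first original loop
        outB[i] = in[i] * 2;  // body of the second original loop
    }
}
```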
Reducing Area

In hardware, the number of resources required to implement a logic function is referred to as the
design area. Design area also refers to how much of the fixed-size PL fabric the resources
consume. The area is of importance when the hardware is too large to be implemented in the
target device, and when the hardware function consumes a very high percentage (> 90%) of the
available area. This can result in difficulties when trying to wire the hardware logic together,
because the wires themselves require resources.
Aer meeng the required performance target (or II), the next step might be to reduce the area
while maintaining the same performance. This step can be opmal because there is nothing to be
gained by reducing the area if the hardware funcon is operang at the required performance
and no other hardware funcons are to be implemented in the remaining space in the PL.
The most common area opmizaon is the opmizaon of dataow memory channels to reduce
the number of block RAM resources required to implement the hardware funcon. Each device
has a limited number of block RAM resources.
If you used the DATAFLOW optimization and the compiler cannot determine whether the tasks
in the design are streaming data, it implements the memory channels between dataflow tasks
using ping-pong buffers. These require two block RAMs, each of size N, where N is the number of
samples to be transferred between the tasks (typically the size of the array passed between
tasks). If the design is pipelined and the data is in fact streaming from one task to the next, with
values produced and consumed in a sequential manner, you can greatly reduce the area by using
the STREAM directive to specify that the arrays are to be implemented in a streaming manner
that uses a simple FIFO, for which you can specify the depth. FIFOs with a small depth are
implemented using registers, and the PL fabric has many registers.

For most applications, the depth can be specified as 1, resulting in the memory channel being
implemented as a simple register. If, however, the algorithm implements data compression or
extrapolation, where some tasks consume more data than they produce or produce more data
than they consume, some arrays must be specified with a higher depth:
• For tasks which produce and consume data at the same rate, specify the array between them
to stream with a depth of 1.
• For tasks which reduce the data rate by a factor of X-to-1, specify arrays at the input of the
task to stream with a depth of X. All arrays prior to this in the function should also have a
depth of X to ensure the hardware function does not stall because the FIFOs are full.
• For tasks which increase the data rate by a factor of 1-to-Y, specify arrays at the output of the
task to stream with a depth of Y. All arrays after this in the function should also have a depth
of Y to ensure the hardware function does not stall because the FIFOs are full.
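The common first case, a producer and consumer running at the same rate connected by a depth-1 stream, can be sketched as below; the names, N, and the arithmetic are assumptions:

```cpp
#include <cassert>

const int N = 8;

// Illustrative sketch: a producer and a consumer at the same rate inside a
// DATAFLOW region, with the channel between them implemented as a depth-1
// stream. A task consuming X samples per output would need depth = X.
// N and the arithmetic are assumptions.
void chain(const int in[N], int out[N]) {
#pragma HLS DATAFLOW
    int link[N];
#pragma HLS STREAM variable=link depth=1
    Produce: for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
        link[i] = in[i] + 5;
    }
    Consume: for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
        out[i] = link[i] * 2;
    }
}
```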
Note: If the depth is set too small, the symptom is that the hardware function stalls (hangs) during
hardware emulation, resulting in lower performance or even deadlock in some cases, because full FIFOs
cause the rest of the system to wait.

The following table lists the other directives to consider when attempting to minimize the
resources used to implement the design.
Table 8: Optimization Strategy Step 5: Reduce Area
Directives and Configurations Description
ALLOCATION Specifies a limit for the number of operations, hardware
resources, or functions used. This can force the sharing of
hardware resources but might increase latency.
ARRAY_MAP Combines multiple smaller arrays into a single large array to
help reduce the number of block RAM resources.
ARRAY_RESHAPE Reshapes an array from one with many elements to one
with greater word width. Useful for improving block RAM
accesses without increasing the number of block RAM.
DATA_PACK Packs the data fields of an internal struct into a single scalar
with a wider word width, allowing a single control signal to
control all fields.
LOOP_MERGE Merges consecutive loops to reduce overall latency, increase
sharing, and improve logic optimization.
OCCURRENCE Used when pipelining functions or loops to specify that the
code in a location is executed at a lesser rate than the code
in the enclosing function or loop.
RESOURCE Specifies that a specific hardware resource (core) is used to
implement a variable (array, arithmetic operation).
STREAM Specifies that a specific memory channel is to be
implemented as a FIFO with an optional specific depth.
Config Bind Determines the effort level to use during the synthesis
binding phase and can be used to globally minimize the
number of operations used.
Config Dataflow This configuration specifies the default memory channel
and FIFO depth in dataflow optimization.
The ALLOCATION and RESOURCE directives are used to limit the number of operations and to
select which cores (hardware resources) are used to implement the operations. For example, you
could limit the function or loop to using only one multiplier, and specify that it be implemented
using a pipelined multiplier.
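That multiplier example might be sketched as below; the core name Mul and the exact pragma spelling are assumptions for this tool version, so check UG902 for the cores and syntax available in your environment:

```cpp
#include <cassert>

const int N = 4;

// Illustrative sketch: ALLOCATION limits this function to one multiplier
// operation, forcing the four multiplies to share it; RESOURCE requests a
// multiplier core for the product. The core name Mul is an assumption;
// the pragmas are ignored by an ordinary C++ compiler.
int dot4(const int a[N], const int b[N]) {
#pragma HLS ALLOCATION instances=mul limit=1 operation
    int acc = 0;
    Dot: for (int i = 0; i < N; i++) {
        int prod = a[i] * b[i];
#pragma HLS RESOURCE variable=prod core=Mul
        acc += prod;
    }
    return acc;
}
```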
If the ARRAY_PARTITION directive is used to improve the initiation interval, you might want to
consider using the ARRAY_RESHAPE directive instead. The ARRAY_RESHAPE optimization
performs a similar task to array partitioning; however, the reshape optimization recombines the
elements created by partitioning into a single block RAM with wider data ports. This might
prevent an increase in the number of block RAM resources required.
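A sketch of ARRAY_RESHAPE applied to an internal array is shown below; N and the arithmetic are assumptions:

```cpp
#include <cassert>

const int N = 16;

// Illustrative sketch: ARRAY_RESHAPE with a cyclic factor of 4 keeps one
// physical memory but widens its word so four elements arrive per access.
// N and the arithmetic are assumptions; the pragmas are ignored by an
// ordinary C++ compiler.
int sum_reshaped(const int in[N]) {
    int local[N];
#pragma HLS ARRAY_RESHAPE variable=local cyclic factor=4 dim=1
    Copy: for (int i = 0; i < N; i++) {
        local[i] = in[i];
    }
    int sum = 0;
    Sum: for (int i = 0; i < N; i += 4) {
#pragma HLS PIPELINE II=1
        sum += local[i] + local[i + 1] + local[i + 2] + local[i + 3];
    }
    return sum;
}
```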
If the C code contains a series of loops with similar indexing, merging the loops with the
LOOP_MERGE directive might allow some optimizations to occur. Finally, in cases where a
section of code in a pipeline region is only required to operate at an initiation interval lower than
the rest of the region, the OCCURRENCE directive is used to indicate that this logic can be
optimized to execute at a lower rate.

Note: The Config commands are used to change the optimization default settings and are only available
from within Vivado HLS when using a bottom-up flow. Refer to Vivado Design Suite User Guide: High-Level
Synthesis (UG902) for more details.
Design Optimization Workflow

Before performing any optimizations, it is recommended to create a new build configuration
within the project. Using different build configurations allows one set of results to be compared
against a different set of results. In addition to the standard Debug and Release configurations,
custom configurations with more useful names (e.g., Opt_ver1 and UnOpt_ver) might be created
in the Project Settings window using the Manage Build Configurations for the Project toolbar
button.

Different build configurations allow you to compare not only the results, but also the log files and
even the output RTL files used to implement the FPGA (the RTL files are only recommended for
users very familiar with hardware design).
The basic optimization strategy for a high-performance design is:

• Create an initial or baseline design.
• Pipeline the loops and functions. Apply the DATAFLOW optimization to execute loops and
functions concurrently.
• Address any issues that limit pipelining, such as array bottlenecks and loop dependencies
(with the ARRAY_PARTITION and DEPENDENCE directives).
• Specify a particular latency or reduce the size of the dataflow memory channels, and use the
ALLOCATION and RESOURCE directives to further reduce area.

Note: It might sometimes be necessary to make adjustments to the code to meet performance.

In summary, the goal is always to meet performance first, before reducing area. If the strategy is
to create a design with the fewest resources, simply omit the steps to improve performance,
although the baseline results might be very close to the smallest possible design.
Throughout the optimization process, it is highly recommended to review the console output (or
log file) after compilation. When the compiler cannot reach the specified performance goals of an
optimization, it automatically relaxes the goals (except the clock frequency) and creates a design
with the goals that can be satisfied. It is important to review the output from the compilation log
files and reports to understand what optimizations have been performed.
For specific details on applying optimizations, refer to Vivado Design Suite User Guide: High-Level
Synthesis (UG902).
Chapter 4

Data Access Patterns

An FPGA is selected to implement the C code due to the superior performance of the FPGA: the
massively parallel architecture of an FPGA allows it to perform operations much faster than the
inherently sequential operations of a processor, and users typically wish to take advantage of
that performance.

The focus here is on understanding the impact that the access patterns inherent in the C code
might have on the results. Although the access patterns of most concern are those into and out
of the hardware function, it is worth considering the access patterns within functions, as any
bottlenecks within the hardware function will negatively impact the transfer rate into and out of
the function.
To highlight how some data access patterns can negatively impact performance, and to
demonstrate how other patterns can be used to fully embrace the parallelism and high-
performance capabilities of an FPGA, this section reviews an image convolution algorithm.

• The first part reviews the algorithm and highlights the data access aspects that limit the
performance in an FPGA.
• The second part shows how the algorithm might be written to achieve the highest
performance possible.
Algorithm with Poor Data Access Patterns

A standard convolution function applied to an image is used here to demonstrate how the C
code can negatively impact the performance that is possible from an FPGA. In this example, a
horizontal and then a vertical convolution is performed on the data. Because the data at the edge
of the image lies outside the convolution windows, the final step is to address the data around
the border.

The algorithm structure can be summarized as follows:

• A horizontal convolution.
• Followed by a vertical convolution.
• Followed by a manipulation of the border pixels.
static void convolution_orig(
  int width,
  int height,
  const T *src,
  T *dst,
  const T *hcoeff,
  const T *vcoeff) {
  T local[MAX_IMG_ROWS*MAX_IMG_COLS];
  // Horizontal convolution
  HconvH:for(int col = 0; col < height; col++){
    HconvW:for(int row = border_width; row < width - border_width; row++){
      Hconv:for(int i = - border_width; i <= border_width; i++){
      }
    }
  }
  // Vertical convolution
  VconvH:for(int col = border_width; col < height - border_width; col++){
    VconvW:for(int row = 0; row < width; row++){
      Vconv:for(int i = - border_width; i <= border_width; i++){
      }
    }
  }
  // Border pixels
  Top_Border:for(int col = 0; col < border_width; col++){
  }
  Side_Border:for(int col = border_width; col < height - border_width; col++){
  }
  Bottom_Border:for(int col = height - border_width; col < height; col++){
  }
}
Standard Horizontal Convolution

The first step in this is to perform the convolution in the horizontal direction, as shown in the
following figure.
[Figure: Horizontal convolution. A window of K samples (Hsamp) from src is multiplied by the
coefficients in Hcoeff to produce the first, second, and final outputs, which are written to local.
X14296-121417]
The convolution is performed using K samples of data and K convolution coefficients. In the
figure above, K is shown as 5; however, the value of K is defined in the code. To perform the
convolution, a minimum of K data samples are required. The convolution window cannot start at
the first pixel, because the window would need to include pixels that are outside the image.

By performing a symmetric convolution, the first K data samples from input src can be
convolved with the horizontal coefficients and the first output calculated. To calculate the second
output, the next set of K data samples is used. This calculation proceeds along each row until the
final output is written.

The C code for performing this operation is shown below.
const int conv_size = K;
const int border_width = int(conv_size / 2);
#ifndef __SYNTHESIS__
T * const local = new T[MAX_IMG_ROWS*MAX_IMG_COLS];
#else // Static storage allocation for HLS, dynamic otherwise
T local[MAX_IMG_ROWS*MAX_IMG_COLS];
#endif
Clear_Local:for(int i = 0; i < height * width; i++){
  local[i]=0;
}
// Horizontal convolution
HconvH:for(int col = 0; col < height; col++){
  HconvW:for(int row = border_width; row < width - border_width; row++){
    int pixel = col * width + row;
    Hconv:for(int i = - border_width; i <= border_width; i++){
      local[pixel] += src[pixel + i] * hcoeff[i + border_width];
    }
  }
}
The code is straightforward and intuitive. There are, however, some issues with this C code that
will negatively impact the quality of the hardware results.

The first issue is the large storage requirement during C compilation. The intermediate results in
the algorithm are stored in an internal local array. This requires an array of HEIGHT*WIDTH,
which for a standard video image of 1920*1080 will hold 2,073,600 values.

For the cross-compilers targeting Zynq®-7000 All Programmable SoC or Zynq UltraScale+
MPSoC, as well as many host systems, this amount of local storage can lead to stack
overflows at run time (for example, when running on the target device, or when running co-sim
flows within Vivado HLS). The data for a local array is placed on the stack and not the heap, which
is managed by the OS. When cross-compiling with arm-linux-gnueabihf-g++, use the -
Wl,"-z stacksize=4194304" linker option to allocate sufficient stack space. (Note that
the syntax for this option varies for different linkers.) When a function will only be run in
hardware, a useful way to avoid such issues is to use the __SYNTHESIS__ macro. This macro is
automatically defined by the system compiler when the hardware function is synthesized into
hardware. The code shown above uses dynamic memory allocation during C simulation to
avoid any compilation issues and only uses static storage during synthesis. A downside of
using this macro is that the code verified by C simulation is not the same code that is synthesized.
In this case, however, the code is not complex, and the behavior will be the same.
The main issue with this local array is the quality of the FPGA implementation. Because this is
an array, it will be implemented using internal FPGA block RAM. This is a very large memory to
implement inside the FPGA and might require a larger and more costly FPGA device. The use of
block RAM can be minimized by using the DATAFLOW optimization and streaming the data
through small efficient FIFOs, but this would require the data to be used in a streaming,
sequential manner. There is currently no such requirement.
The next issue relates to performance: the initialization of the local array. The loop
Clear_Local is used to set the values in array local to zero. Even if this loop is pipelined in the
hardware to execute in a high-performance manner, this operation still requires approximately
two million clock cycles (HEIGHT*WIDTH) to implement. While this memory is being initialized,
the system cannot perform any image processing. This same initialization of the data could
instead be performed using a temporary variable inside loop Hconv to initialize the accumulation
before the write.
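That rewrite can be sketched as follows, showing only the horizontal pass; the scaled-down constant K, the type T, and the test data are assumptions:

```cpp
#include <cassert>

typedef int T;
const int K = 3;
const int border_width = K / 2;

// Illustrative sketch (horizontal pass only): a temporary accumulator acc
// is zeroed inside HconvW and written once per pixel, making the separate
// Clear_Local loop unnecessary. The scaled-down K and type T are
// assumptions.
void hconv(int width, int height, const T *src, T *local, const T *hcoeff) {
    HconvH: for (int col = 0; col < height; col++) {
        HconvW: for (int row = border_width; row < width - border_width; row++) {
            int pixel = col * width + row;
            T acc = 0;  // replaces the Clear_Local initialization pass
            Hconv: for (int i = -border_width; i <= border_width; i++) {
                acc += src[pixel + i] * hcoeff[i + border_width];
            }
            local[pixel] = acc;
        }
    }
}
```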
Finally, the throughput of the data, and thus the system performance, is fundamentally limited by
the data access pattern:

• To create the first convolved output, the first K values are read from the input.
• To calculate the second output, a new value is read, and then the same K-1 values are re-read.
One of the keys to a high-performance FPGA is to minimize the accesses to and from the PS. Each
access for data which has previously been fetched negatively impacts the performance of the
system. An FPGA is capable of performing many concurrent calculations at once and reaching
very high performance, but not while the flow of data is constantly interrupted by re-reading
values.
Note: To maximize performance, data should only be accessed once from the PS and small units of local
storage - small to medium sized arrays - should be used for data which must be reused.
With the code shown above, the data cannot be connuously streamed directly from the
processor using a DMA operaon because the data is required to be re-read me and again.
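The temporary-variable initialization suggested above can be sketched in plain C++. This is a minimal software model, not the document's HLS code: the function name conv_no_clear and the std::vector interface are illustrative, and the border pixels are simply left at zero, as the cleared local array would leave them.

```cpp
#include <cassert>
#include <vector>

// Illustrative sketch: reset the accumulation in a local variable
// instead of running a separate clearing pass over the whole output
// array before the convolution.
std::vector<int> conv_no_clear(const std::vector<int>& src,
                               const std::vector<int>& coeff,
                               int width, int height) {
    const int border = (int)coeff.size() / 2;
    std::vector<int> out(src.size(), 0);   // border pixels stay 0
    for (int col = 0; col < height; col++) {
        for (int row = border; row < width - border; row++) {
            int acc = 0;                   // on-the-fly reset, no Clear loop
            for (int i = -border; i <= border; i++)
                acc += src[col * width + row + i] * coeff[i + border];
            out[col * width + row] = acc;  // single write per pixel
        }
    }
    return out;
}
```

The accumulator costs one register in hardware rather than two million initialization cycles over block RAM.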
Chapter 4: Data Access Patterns
Vivado HLS Optimization Methodology Guide 37
UG1270 (v2017.4) December 20, 2017 www.xilinx.com
Standard Vertical Convolution
The next step is to perform the vertical convolution shown in the following figure.
[Figure X14299-110617: Vertical convolution - the Vsamp window convolves local with Vcoeff into dst at the first, second, and final outputs]
The process for the vertical convolution is similar to the horizontal convolution. A set of K data samples is required to convolve with the convolution coefficients, Vcoeff in this case. After the first output is created using the first K samples in the vertical direction, the next set of K values is used to create the second output. The process continues down through each column until the final output is created.
After the vertical convolution, the image is now smaller than the source image src due to both the horizontal and vertical border effect.
The code for performing these operations is shown below.
Clear_Dst:for(int i = 0; i < height * width; i++){
dst[i]=0;
}
// Vertical convolution
VconvH:for(int col = border_width; col < height - border_width; col++){
VconvW:for(int row = 0; row < width; row++){
int pixel = col * width + row;
Vconv:for(int i = - border_width; i <= border_width; i++){
int offset = i * width;
dst[pixel] += local[pixel + offset] * vcoeff[i + border_width];
}
}
}
This code highlights similar issues to those already discussed with the horizontal convolution code:
Many clock cycles are spent setting the values in the output image dst to zero: in this case, approximately another two million cycles for a 1920*1080 image size.
There are multiple accesses per pixel to re-read data stored in array local.
There are multiple writes per pixel to the output array/port dst.
The access patterns in the code above in fact create the requirement for such a large local array. The algorithm requires the data on row K to be available to perform the first calculation. Processing data down the rows before proceeding to the next column requires the entire image to be stored locally. This requires that all values be stored and results in large local storage on the FPGA.
In addition, when you reach the stage where you wish to use compiler directives to optimize the performance of the hardware function, the flow of data between the horizontal and vertical loop cannot be managed via a FIFO (a high-performance and low-resource unit) because the data is not streamed out of array local: a FIFO can only be used with sequential access patterns. Instead, this code, which requires arbitrary/random accesses, requires a ping-pong block RAM to improve performance. This doubles the memory requirements for the implementation of the local array to approximately four million data samples, which is too large for an FPGA.
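The four-million figure can be checked with simple arithmetic. A minimal sketch, with illustrative constant names:

```cpp
#include <cassert>
#include <cstddef>

// Back-of-the-envelope check: a ping-pong (double) buffer for a full
// 1920x1080 frame stores two complete copies of the image.
constexpr std::size_t WIDTH_PX         = 1920;
constexpr std::size_t HEIGHT_PX        = 1080;
constexpr std::size_t FRAME_SAMPLES    = WIDTH_PX * HEIGHT_PX;  // one frame
constexpr std::size_t PINGPONG_SAMPLES = 2 * FRAME_SAMPLES;     // double buffer
```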
Standard Border Pixel Convolution
The final step in performing the convolution is to create the data around the border. These pixels can be created by simply reusing the nearest pixel in the convolved output. The following figure shows how this is achieved.
[Figure X14294-121417: Border pixel replication into dst - top left, top row, top right, left and right edges, bottom left, bottom row, and bottom right regions]
The border region is populated with the nearest valid value. The following code performs the operations shown in the figure.
int border_width_offset = border_width * width;
int border_height_offset = (height - border_width - 1) * width;
// Border pixels
Top_Border:for(int col = 0; col < border_width; col++){
int offset = col * width;
for(int row = 0; row < border_width; row++){
int pixel = offset + row;
dst[pixel] = dst[border_width_offset + border_width];
}
for(int row = border_width; row < width - border_width; row++){
int pixel = offset + row;
dst[pixel] = dst[border_width_offset + row];
}
for(int row = width - border_width; row < width; row++){
int pixel = offset + row;
dst[pixel] = dst[border_width_offset + width - border_width - 1];
}
}
Side_Border:for(int col = border_width; col < height - border_width; col++)
{
int offset = col * width;
for(int row = 0; row < border_width; row++){
int pixel = offset + row;
dst[pixel] = dst[offset + border_width];
}
for(int row = width - border_width; row < width; row++){
int pixel = offset + row;
dst[pixel] = dst[offset + width - border_width - 1];
}
}
Bottom_Border:for(int col = height - border_width; col < height; col++){
int offset = col * width;
for(int row = 0; row < border_width; row++){
int pixel = offset + row;
dst[pixel] = dst[border_height_offset + border_width];
}
for(int row = border_width; row < width - border_width; row++){
int pixel = offset + row;
dst[pixel] = dst[border_height_offset + row];
}
for(int row = width - border_width; row < width; row++){
int pixel = offset + row;
dst[pixel] = dst[border_height_offset + width - border_width - 1];
}
}
The code suers from the same repeated access for data. The data stored outside the FPGA in
the array dst must now be available to be read as input data re-read mulple mes. Even in the
rst loop, dst[border_width_offset + border_width] is read mulple mes but the
values of border_width_offset and border_width do not change.
This code is very intuive to both read and write. When implemented with the SDSoC
environment it is approximately 120M clock cycles, which meets or slightly exceeds the
performance of a CPU. However, as shown in the next secon, opmal data access paerns
ensure this same algorithm can be implemented on the FPGA at a rate of one pixel per clock
cycle, or approximately 2M clock cycles.
The summary from this review is that the following poor data access paerns negavely impact
the performance and size of the FPGA implementaon:
Mulple accesses to read and then re-read data. Use local storage where possible.
Accessing data in an arbitrary or random access manner. This requires the data to be stored
locally in arrays and costs resources.
Seng default values in arrays costs clock cycles and performance.
Algorithm With Optimal Data Access Patterns
The key to implementing the convolution example reviewed in the previous section as a high-performance design with minimal resources is to:
Maximize the flow of data through the system. Refrain from using any coding techniques or algorithm behavior that inhibits the continuous flow of data.
Maximize the reuse of data. Use local caches to ensure there are no requirements to re-read data and the incoming data can keep flowing.
Embrace conditional branching. This is expensive on a CPU, GPU, or DSP but optimal in an FPGA.
The first step is to understand how data flows through the system into and out of the FPGA. The convolution algorithm is performed on an image. When data from an image is produced and consumed, it is transferred in a standard raster-scan manner as shown in the following figure.
[Figure X14298-121417: Raster-scan transfer order across the image Width and down its Height]
If the data is transferred to the FPGA in a streaming manner, the FPGA should process it in a streaming manner and transfer it back from the FPGA in the same manner.
The convoluon algorithm shown below embraces this style of coding. At this level of abstracon
a concise view of the code is shown. However, there are now intermediate buers, hconv and
vconv, between each loop. Because these are accessed in a streaming manner, they are
opmized into single registers in the nal implementaon.
template<typename T, int K>
static void convolution_strm(
int width,
int height,
T src[TEST_IMG_ROWS][TEST_IMG_COLS],
T dst[TEST_IMG_ROWS][TEST_IMG_COLS],
const T *hcoeff,
const T *vcoeff)
{
T hconv_buffer[MAX_IMG_COLS*MAX_IMG_ROWS];
T vconv_buffer[MAX_IMG_COLS*MAX_IMG_ROWS];
T *phconv, *pvconv;
// These assertions let HLS know the upper bounds of loops
assert(height < MAX_IMG_ROWS);
assert(width < MAX_IMG_COLS);
assert(vconv_xlim < MAX_IMG_COLS - (K - 1));
// Horizontal convolution
HConvH:for(int col = 0; col < height; col++) {
HConvW:for(int row = 0; row < width; row++) {
HConv:for(int i = 0; i < K; i++) {
}
}
}
// Vertical convolution
VConvH:for(int col = 0; col < height; col++) {
VConvW:for(int row = 0; row < vconv_xlim; row++) {
VConv:for(int i = 0; i < K; i++) {
}
}
}
Border:for (int i = 0; i < height; i++) {
for (int j = 0; j < width; j++) {
}
}
All three processing loops now embrace conditional branching to ensure the continuous processing of data.
Chapter 5
Standard Horizontal Convolution
The rst step in this is to perform the convoluon in the horizontal direcon as shown in the
following gure.
[Figure X14296-121417: Horizontal convolution - the Hsamp window convolves src with Hcoeff into local at the first, second, and final outputs]
The convoluon is performed using K samples of data and K convoluon coecients. In the
gure above, K is shown as 5, however, the value of K is dened in the code. To perform the
convoluon, a minimum of K data samples are required. The convoluon window cannot start at
the rst pixel because the window would need to include pixels that are outside the image.
By performing a symmetric convoluon, the rst K data samples from input src can be
convolved with the horizontal coecients and the rst output calculated. To calculate the second
output, the next set of K data samples is used. This calculaon proceeds along each row unl the
nal output is wrien.
The C code for performing this operaon is shown below.
const int conv_size = K;
const int border_width = int(conv_size / 2);
#ifndef __SYNTHESIS__
T * const local = new T[MAX_IMG_ROWS*MAX_IMG_COLS];
#else // Static storage allocation for HLS, dynamic otherwise
T local[MAX_IMG_ROWS*MAX_IMG_COLS];
#endif
Clear_Local:for(int i = 0; i < height * width; i++){
local[i]=0;
}
// Horizontal convolution
HconvH:for(int col = 0; col < height; col++){
HconvW:for(int row = border_width; row < width - border_width; row++){
int pixel = col * width + row;
Hconv:for(int i = - border_width; i <= border_width; i++){
local[pixel] += src[pixel + i] * hcoeff[i + border_width];
}
}
}
The code is straightforward and intuitive. There are, however, some issues with this C code that will negatively impact the quality of the hardware results.
The first issue is the large storage requirements during C compilation. The intermediate results in the algorithm are stored in an internal local array. This requires an array of HEIGHT*WIDTH, which for a standard video image of 1920*1080 will hold 2,073,600 values.
For the cross-compilers targeting Zynq®-7000 All Programmable SoC or Zynq UltraScale+ MPSoC, as well as many host systems, this amount of local storage can lead to stack overflows at run time (for example, running on the target device, or running co-sim flows within Vivado HLS). The data for a local array is placed on the stack and not the heap, which is managed by the OS. When cross-compiling with arm-linux-gnueabihf-g++, use the -Wl,"-z stacksize=4194304" linker option to allocate sufficient stack space. (Note that the syntax for this option varies for different linkers.) When a function will only be run in hardware, a useful way to avoid such issues is to use the __SYNTHESIS__ macro. This macro is
automacally dened by the system compiler when the hardware funcon is synthesized into
hardware. The code shown above uses dynamic memory allocaon during C simulaon to
avoid any compilaon issues and only uses stac storage during synthesis. A downside of
using this macro is the code veried by C simulaon is not the same code that is synthesized.
In this case, however, the code is not complex and the behavior will be the same.
The main issue with this local array is the quality of the FPGA implementaon. Because this is
an array it will be implemented using internal FPGA block RAM. This is a very large memory to
implement inside the FPGA. It might require a larger and more costly FPGA device. The use of
block RAM can be minimized by using the DATAFLOW opmizaon and streaming the data
through small ecient FIFOs, but this will require the data to be used in a streaming
sequenal manner. There is currently no such requirement.
The next issue relates to the performance: the inializaon for the local array. The loop
Clear_Local is used to set the values in array local to zero. Even if this loop is pipelined in the
hardware to execute in a high-performance manner, this operaon sll requires approximately
two million clock cycles (HEIGHT*WIDTH) to implement. While this memory is being inialized,
the system cannot perform any image processing. This same inializaon of the data could be
performed using a temporary variable inside loop HConv to inialize the accumulaon before the
write.
Finally, the throughput of the data, and thus the system performance, is fundamentally limited by
the data access paern.
To create the rst convolved output, the rst K values are read from the input.
To calculate the second output, a new value is read and then the same K-1 values are re-read.
One of the keys to a high-performance FPGA is to minimize the access to and from the PS. Each
access for data, which has previously been fetched, negavely impacts the performance of the
system. An FPGA is capable of performing many concurrent calculaons at once and reaching
very high performance, but not while the ow of data is constantly interrupted by re-reading
values.
Note: To maximize performance, data should only be accessed once from the PS and small units of local
storage - small to medium sized arrays - should be used for data which must be reused.
With the code shown above, the data cannot be connuously streamed directly from the
processor using a DMA operaon because the data is required to be re-read me and again.
Optimal Horizontal Convolution
To perform the calculation in a more efficient manner for FPGA implementation, the horizontal convolution is computed as shown in the following figure.
[Figure X14297-110617: Optimal horizontal convolution - samples from src are cached in the Hwin window and convolved into hconv at the first calculation, first output, and final output]
The algorithm must use the K previous samples to compute the convolution result. It therefore copies each sample into a temporary cache, hwin. This use of local storage means there is no need to re-read values from the PS and interrupt the flow of data. For the first calculation there are not enough values in hwin to compute a result, so, conditionally, no output values are written.
The algorithm keeps reading input samples and caching them into hwin. Each time it reads a new sample, it pushes an unneeded sample out of hwin. The first time an output value can be written is after the Kth input has been read. An output value can now be written. The algorithm proceeds in this manner along the rows until the final sample has been read. At that point, only the last K samples are stored in hwin: all that is required to compute the convolution.
As shown below, the code to perform these operations uses both local storage to prevent re-reads from the PS – the reads from local storage can be performed in parallel in the final implementation – and the extensive use of conditional branching to ensure each new data sample can be processed in a different manner.
// Horizontal convolution
phconv=hconv_buffer; // set / reset pointer to start of buffer
// These assertions let HLS know the upper bounds of loops
assert(height < MAX_IMG_ROWS);
assert(width < MAX_IMG_COLS);
assert(vconv_xlim < MAX_IMG_COLS - (K - 1));
T hwin[K]; // cache of the K most recent input samples
HConvH:for(int col = 0; col < height; col++) {
HConvW:for(int row = 0; row < width; row++) {
#pragma HLS PIPELINE
T in_val = *src++;
// Reset pixel value on-the-fly - eliminates an O(height*width) loop
T out_val = 0;
HConv:for(int i = 0; i < K; i++) {
hwin[i] = i < K - 1 ? hwin[i + 1] : in_val;
out_val += hwin[i] * hcoeff[i];
}
if (row >= K - 1) {
*phconv++=out_val;
}
}
}
An interesng point to note in the code above is the use of the temporary variable out_val to
perform the convoluon calculaon. This variable is set to zero before the calculaon is
performed, negang the need to spend two million clock cycles to reset the values, as in the
previous example.
Throughout the enre process, the samples in the src input are processed in a raster-streaming
manner. Every sample is read in turn. The outputs from the task are either discarded or used, but
the task keeps constantly compung. This represents a dierence from code wrien to perform
on a CPU.
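The behavior of the streaming loop above can be modeled in ordinary C++ to check the window arithmetic. This is a software sketch, not the synthesizable code: hconv_stream and its std::vector interface are illustrative, and, as in the code above, outputs are suppressed until the window is full.

```cpp
#include <cassert>
#include <vector>

// Software model of the streaming horizontal convolution: every input
// sample is read exactly once, a K-element window (hwin) shifts by one
// sample per iteration, and an output is emitted only once the window
// holds K valid samples.
template <typename T, int K>
std::vector<T> hconv_stream(const std::vector<T>& src,
                            const T (&hcoeff)[K],
                            int width, int height) {
    std::vector<T> out;
    T hwin[K] = {};
    for (int col = 0; col < height; col++) {
        for (int row = 0; row < width; row++) {
            T in_val = src[col * width + row];
            T out_val = 0;                    // on-the-fly reset
            for (int i = 0; i < K; i++) {
                hwin[i] = (i < K - 1) ? hwin[i + 1] : in_val; // shift window
                out_val += hwin[i] * hcoeff[i];
            }
            if (row >= K - 1)                 // window full: emit
                out.push_back(out_val);
        }
    }
    return out;
}
```

Because the window shifts every iteration and the conditional only gates the write, the loop body is identical for every sample, which is what allows the HLS pipeline to run without stalls.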
Optimal Vertical Convolution
The vertical convolution represents a challenge to the streaming data model preferred by an FPGA. The data must be accessed by column, but you do not wish to store the entire image. The solution is to use line buffers, as shown in the following figure.
[Figure X14300-110617: Optimal vertical convolution - the Vconv window reads from hconv and the line buffers at the first calculation, first output, and final output, writing into vconv]
Once again, the samples are read in a streaming manner, this time from the local buffer hconv. The algorithm requires at least K-1 lines of data before it can process the first sample. All the calculations performed before this are discarded through the use of conditionals.
A line buffer allows K-1 lines of data to be stored. Each time a new sample is read, another sample is pushed out of the line buffer. An interesting point to note here is that the newest sample is used in the calculation, and then the sample is stored into the line buffer and the old sample ejected. This ensures that only K-1 lines are required to be cached, rather than an unknown number of lines, and minimizes the use of local storage. Although a line buffer does require multiple lines to be stored locally, the convolution kernel size K is always much less than the 1080 lines in a full video image.
The rst calculaon can be performed when the rst sample on the Kth line is read. The
algorithm then proceeds to output values unl the nal pixel is read.
// Vertical convolution
phconv=hconv_buffer; // set/reset pointer to start of buffer
pvconv=vconv_buffer; // set/reset pointer to start of buffer
T linebuf[K - 1][MAX_IMG_COLS]; // K-1 cached lines of the image
VConvH:for(int col = 0; col < height; col++) {
VConvW:for(int row = 0; row < vconv_xlim; row++) {
#pragma HLS DEPENDENCE variable=linebuf inter false
#pragma HLS PIPELINE
T in_val = *phconv++;
// Reset pixel value on-the-fly - eliminates an O(height*width) loop
T out_val = 0;
VConv:for(int i = 0; i < K; i++) {
T vwin_val = i < K - 1 ? linebuf[i][row] : in_val;
out_val += vwin_val * vcoeff[i];
if (i > 0)
linebuf[i - 1][row] = vwin_val;
}
if (col >= K - 1) {
*pvconv++ = out_val;
}
}
}
The code above once again processes all the samples in the design in a streaming manner. The task is constantly running. Following a coding style where you minimize the number of re-reads (or re-writes) forces you to cache the data locally. This is an ideal strategy when targeting an FPGA.
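The use-then-store line buffer update can likewise be modeled in plain C++. This is an illustrative sketch (vconv_linebuf and its container types are not from the document); it mirrors the VConv loop above, caching K-1 lines and emitting outputs only once K-1 full lines have been seen.

```cpp
#include <cassert>
#include <vector>

// Software model of the K-1 line buffer used by the vertical
// convolution: the newest sample is used in the calculation first,
// then stored into the buffer while the oldest sample is pushed out.
template <typename T, int K>
std::vector<T> vconv_linebuf(const std::vector<T>& in,
                             const T (&vcoeff)[K],
                             int width, int height) {
    std::vector<T> out;
    std::vector<std::vector<T>> linebuf(K - 1, std::vector<T>(width, 0));
    for (int col = 0; col < height; col++) {
        for (int row = 0; row < width; row++) {
            T in_val = in[col * width + row];
            T out_val = 0;
            for (int i = 0; i < K; i++) {
                T vwin_val = (i < K - 1) ? linebuf[i][row] : in_val;
                out_val += vwin_val * vcoeff[i];
                if (i > 0)
                    linebuf[i - 1][row] = vwin_val; // use, then store
            }
            if (col >= K - 1)          // K-1 lines cached: emit
                out.push_back(out_val);
        }
    }
    return out;
}
```

The storage cost is (K-1)*width samples rather than a full frame, which is why a line buffer fits comfortably in FPGA block RAM.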
Optimal Border Pixel Convolution
The final step in the algorithm is to replicate the edge pixels into the border region. To ensure the constant flow of data and data reuse, the algorithm makes use of local caching. The following figure shows how the border samples are aligned into the image:
Each sample is read from the vconv output from the vertical convolution.
The sample is then cached as one of four possible pixel types.
The sample is then written to the output stream.
[Figure X14295-110617: Border sample alignment into dst - at the first, middle, and final outputs, each pixel from vconv is handled as a left edge, right edge, border, or raw pixel]
The code for determining the location of the border pixels is shown here.
// Border pixels
pvconv=vconv_buffer; // set/reset pointer to start of buffer
T borderbuf[MAX_IMG_COLS]; // cache of the most recent valid line
Border:for (int i = 0; i < height; i++) {
for (int j = 0; j < width; j++) {
T pix_in, l_edge_pix, r_edge_pix, pix_out;
#pragma HLS PIPELINE
if (i == 0 || (i > border_width && i < height - border_width)) {
// read a pixel out of the video stream and cache it for
// immediate use and later replication purposes
if (j < width - (K - 1)) {
pix_in = *pvconv++;
borderbuf[j] = pix_in;
}
if (j == 0) {
l_edge_pix = pix_in;
}
if (j == width - K) {
r_edge_pix = pix_in;
}
}
// Select output value from the appropriate cache resource
if (j <= border_width) {
pix_out = l_edge_pix;
} else if (j >= width - border_width - 1) {
pix_out = r_edge_pix;
} else {
pix_out = borderbuf[j - border_width];
}
*dst++=pix_out;
}
}
A notable dierence with this new code is the extensive use of condionals inside the tasks. This
allows the task, aer it is pipelined, to connuously process data. The result of the condionals
does not impact the execuon of the pipeline. The result will impact the output values, but the
pipeline with keep processing as long as input samples are available.
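The select-don't-stall pattern described above can be reduced to a minimal 1-D sketch. This is an illustrative model, not the document's border code: replicate_edges takes the already-convolved core pixels of one row and writes one output per iteration, with the conditionals only choosing the source of each value.

```cpp
#include <cassert>
#include <vector>

// Minimal 1-D sketch of border replication: the loop body runs
// identically for every output position; conditionals select whether
// the value comes from the left-edge cache, the right-edge cache, or
// the stream of core pixels. bw is an illustrative border width.
std::vector<int> replicate_edges(const std::vector<int>& core,
                                 int out_width, int bw) {
    std::vector<int> out;
    int l_edge = core.front();
    int r_edge = core.back();
    for (int j = 0; j < out_width; j++) {
        int pix;
        if (j < bw)
            pix = l_edge;                 // replicate left edge
        else if (j >= out_width - bw)
            pix = r_edge;                 // replicate right edge
        else
            pix = core[j - bw];           // pass-through
        out.push_back(pix);               // one write on every iteration
    }
    return out;
}
```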
Optimal Data Access Patterns
The following summarizes how to ensure your data access patterns result in the most optimal performance on an FPGA:
Minimize data input reads. After data has been read into the block, it can easily feed many parallel paths, but the inputs to the hardware function can be bottlenecks to performance. Read data once and use a local cache if the data must be reused.
Minimize accesses to arrays, especially large arrays. Arrays are implemented in block RAM, which, like I/O ports, only has a limited number of ports and can be a bottleneck to performance. Arrays can be partitioned into smaller arrays and even individual registers, but partitioning large arrays will result in many registers being used. Use small localized caches to hold results such as accumulations and then write the final result to the array.
Seek to perform conditional branching inside pipelined tasks rather than conditionally executing tasks, even pipelined tasks. Conditionals are implemented as separate paths in the pipeline. Allowing the data from one task to flow into the next task, with the conditional performed inside the next task, will result in a higher performing system.
Minimize output writes for the same reason as input reads, namely, that ports are bottlenecks. Replicating additional accesses only pushes the issue further back into the system.
For C code which processes data in a streaming manner, consider employing a coding style that promotes read-once/write-once to function arguments, because this ensures the function can be efficiently implemented in an FPGA. It is much more productive to design an algorithm in C that results in a high-performance FPGA implementation than to debug why the FPGA is not operating at the required performance.
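The read-once/write-once style can be made concrete with a small, checkable sketch. The Counted wrapper and scale_stream function are illustrative, not part of any Xilinx API; the counters exist only to demonstrate that each element is read and written exactly once, which is the property that lets function arguments map onto streams or FIFOs.

```cpp
#include <cassert>
#include <vector>

// Instrumented storage: counts every element access so the
// read-once/write-once property can be verified in software.
struct Counted {
    std::vector<int> data;
    int reads;
    int writes;
    int read(int i) { reads++; return data[i]; }
    void write(int i, int v) { writes++; data[i] = v; }
};

// Each input element is read exactly once and each output element is
// written exactly once, in order - a stream-friendly access pattern.
void scale_stream(Counted& src, Counted& dst, int n, int gain) {
    for (int i = 0; i < n; i++) {
        int v = src.read(i);       // one read per element
        dst.write(i, v * gain);    // one write per element
    }
}
```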
Appendix A
OpenCL Attributes
Optimizations in OpenCL
This section describes OpenCL attributes that can be added to source code to assist system optimization by the SDAccel compiler, xocc, the SDSoC system compilers, sdscc and sds++, and Vivado HLS synthesis.
SDx provides OpenCL attributes to optimize your code for data movement and kernel performance. The goal of data movement optimization is to maximize the system-level data throughput by maximizing interface bandwidth utilization and DDR bandwidth utilization. The goal of kernel computation optimization is to create processing logic that can consume all the data as soon as it arrives at the kernel interfaces. This is generally achieved by expanding the processing code to match the data path, with techniques such as function inlining and pipelining, loop unrolling, array partitioning, and dataflow.
The OpenCL attributes include the types specified below:
Table 9: OpenCL __attributes__ by Type
Type Attributes
Kernel Size reqd_work_group_size
vec_type_hint
work_group_size_hint
xcl_max_work_group_size
xcl_zero_global_work_offset
Function Inlining always_inline
Task-level Pipeline xcl_dataflow
xcl_reqd_pipe_depth
Pipeline xcl_pipeline_loop
xcl_pipeline_workitems
Loop Unrolling opencl_unroll_hint
Array Optimization xcl_array_partition
xcl_array_reshape
Note: Array variables only accept a single array optimization attribute.
TIP: The SDAccel and SDSoC compilers also support many of the standard attributes supported by gcc, such as always_inline, noinline, unroll, and nounroll.
always_inline
Description
The always_inline attribute indicates that a function must be inlined. This attribute is a standard feature of GCC and of the SDx compilers.
This attribute enables a compiler optimization to have a function inlined into the calling function. The inlined function is dissolved and no longer appears as a separate level of hierarchy in the RTL.
In some cases, inlining a function allows operations within the function to be shared and optimized more effectively with surrounding operations in the calling function. However, an inlined function can no longer be shared with other functions, so the logic may be duplicated between the inlined function and a separate instance of the function which could otherwise be more broadly shared. While this can improve performance, it will also increase the area required for implementing the RTL.
In some cases the compiler may choose to ignore the always_inline attribute and not inline a function.
By default, inlining is only performed on the next level of function hierarchy, not sub-functions.
Syntax
Place the attribute in the OpenCL source before the function definition to always have it inlined whenever the function is called.
__attribute__((always_inline))
Examples
This example adds the always_inline attribute to function foo:
__attribute__((always_inline))
void foo (int a, int b, int c, int d) {
...
}
See Also
https://gcc.gnu.org
SDAccel Environment Optimization Guide (UG1207)
opencl_unroll_hint
Description
IMPORTANT!: This is a compiler hint which the compiler may ignore.
Loop unrolling is the first optimization technique available in SDAccel. The purpose of the loop unroll optimization is to expose concurrency to the compiler. This newly exposed concurrency reduces latency and improves performance, but also consumes more FPGA fabric resources.
The opencl_unroll_hint attribute is part of the OpenCL Language Specification, and specifies that loops (for, while, do) can be unrolled by the OpenCL compiler. See "Unrolling Loops" in SDAccel Environment Optimization Guide (UG1207) for more information.
The opencl_unroll_hint attribute qualifier must appear immediately before the loop to be affected. You can use this attribute to specify full unrolling of the loop, partial unrolling by a specified amount, or to disable unrolling of the loop.
Syntax
Place the attribute in the OpenCL source before the loop definition:
__attribute__((opencl_unroll_hint(n)))
Where:
n is an optional loop unrolling factor and must be a positive integer, or a compile-time constant expression. An unroll factor of 1 disables unrolling.
TIP: If n is not specified, the compiler automatically determines the unrolling factor for the loop.
Example 1
The following example unrolls the for loop by a factor of 2. This results in two parallel loop iterations instead of four sequential iterations for the compute unit to complete the operation.
__attribute__((opencl_unroll_hint(2)))
for(int i = 0; i < LENGTH; i++) {
bufc[i] = bufa[i] * bufb[i];
}
Conceptually, the compiler transforms the loop above into the code below:
for(int i = 0; i < LENGTH; i+=2) {
bufc[i] = bufa[i] * bufb[i];
bufc[i+1] = bufa[i+1] * bufb[i+1];
}
See Also
SDAccel Environment Optimization Guide (UG1207)
https://www.khronos.org/
The OpenCL C Specification
reqd_work_group_size
Description
When OpenCL kernels are submitted for execution on an OpenCL device, they execute within an index space, called an ND range, which can have 1, 2, or 3 dimensions. This is called the global size in the OpenCL API. The work-group size defines the amount of the ND range that can be processed by a single invocation of a kernel compute unit. The work-group size is also called the local size in the OpenCL API. The OpenCL compiler can determine the work-group size based on the properties of the kernel and selected device. Once the work-group size (local size) has been determined, the ND range (global size) is divided automatically into work-groups, and the work-groups are scheduled for execution on the device.
Although the OpenCL compiler can define the work-group size, specifying the reqd_work_group_size attribute on the kernel to define the work-group size is highly recommended for FPGA implementations of the kernel. The attribute is recommended for performance optimization during the generation of the custom logic for a kernel. See "OpenCL Execution Model" in SDAccel Environment Optimization Guide (UG1207) for more information.
TIP: In the case of an FPGA implementation, the specification of the reqd_work_group_size attribute is highly recommended as it can be used for performance optimization during the generation of the custom logic for a kernel.
OpenCL kernel funcons are executed exactly one me for each point in the ND range index
space. This unit of work for each point in the ND range is called a work-item. Work-items are
organized into work-groups, which are the unit of work scheduled onto compute units. The
oponal reqd_work_group_size denes the work-group size of a compute unit that must be
used as the local_work_size argument to clEnqueueNDRangeKernel. This allows the
compiler to opmize the generated code appropriately for this kernel.
Syntax
Place this aribute before the kernel denion, or before the primary funcon specied for the
kernel:
__attribute__((reqd_work_group_size(
X
,
Y
,
Z
)))
Where:
X, Y, Z: Species the ND range of the kernel. This represents each dimension of a three
dimensional matrix specifying the size of the work-group for the kernel.
Examples
The following OpenCL API C kernel code shows a vector addition design where two arrays of
data are summed into a third array. The required size of the work-group is 16x1x1. This kernel
will execute 16 times to produce a valid result.
#include <clc.h>
// For VHLS OpenCL C kernels, the full work group is synthesized
__attribute__ ((reqd_work_group_size(16, 1, 1)))
__kernel void
vadd(__global int* a,
__global int* b,
__global int* c)
{
int idx = get_global_id(0);
c[idx] = a[idx] + b[idx];
}
See Also
SDAccel Environment Opmizaon Guide (UG1207)
hps://www.khronos.org/
The OpenCL C Specicaon
vec_type_hint
Description
IMPORTANT!: This is a compiler hint which the compiler may ignore.
The oponal __attribute__((vec_type_hint(
<type>
))) is part of the OpenCL
Language Specicaon, and is a hint to the OpenCL compiler represenng the computaonal
width of the kernel, providing a basis for calculang processor bandwidth ulizaon when the
compiler is looking to autovectorize the code.
By default, the kernel is assumed to have the __attribute__((vec_type_hint(int)))
qualier. This lets you specify a dierent vectorizaon type.
Implicit in autovectorizaon is the assumpon that any libraries called from the kernel must be
re-compilable at run me to handle cases where the compiler decides to merge or separate
workitems. This probably means that such libraries can never be hard coded binaries or that hard
coded binaries must be accompanied either by source or some re-targetable intermediate
representaon. This may be a code security queson for some.
Syntax
Place this aribute before the kernel denion, or before the primary funcon specied for the
kernel:
__attribute__((vec_type_hint(
<type>
)))
Where:
<type>: is one of the built-in vector types listed in the following table, or the constuent scalar
element types.
Note: When not specied, the kernel is assumed to have an INT type.
Table 10: Vector Types
Type Description
charn: A vector of n 8-bit signed two's complement integer values.
ucharn: A vector of n 8-bit unsigned integer values.
shortn: A vector of n 16-bit signed two's complement integer values.
ushortn: A vector of n 16-bit unsigned integer values.
intn: A vector of n 32-bit signed two's complement integer values.
uintn: A vector of n 32-bit unsigned integer values.
longn: A vector of n 64-bit signed two's complement integer values.
ulongn: A vector of n 64-bit unsigned integer values.
floatn: A vector of n 32-bit floating-point values.
doublen: A vector of n 64-bit floating-point values.
Note: n is assumed to be 1 when not specified. The vector data type names defined above where
n is any value other than 2, 3, 4, 8, and 16 are also reserved. That is to say, n can only be
specified as 2, 3, 4, 8, or 16.
Examples
The following example autovectorizes assuming double as the basic computation width:
#include <clc.h>
// For VHLS OpenCL C kernels, the full work group is synthesized
__attribute__((vec_type_hint(double)))
__attribute__ ((reqd_work_group_size(16, 1, 1)))
__kernel void
...
See Also
SDAccel Environment Optimization Guide (UG1207)
https://www.khronos.org/
The OpenCL C Specification
work_group_size_hint
Description
IMPORTANT!: This is a compiler hint which the compiler may ignore.
The work-group size in the OpenCL standard defines the size of the ND range space that can be
handled by a single invocation of a kernel compute unit. When OpenCL kernels are submitted for
execution on an OpenCL device, they execute within an index space, called an ND range, which
can have 1, 2, or 3 dimensions. See "OpenCL Execution Model" in SDAccel Environment
Optimization Guide (UG1207) for more information.
OpenCL kernel funcons are executed exactly one me for each point in the ND range index
space. This unit of work for each point in the ND range is called a work-item. Unlike for loops in
C, where loop iteraons are executed sequenally and in-order, an OpenCL runme and device is
free to execute work-items in parallel and in any order.
Work-items are organized into work-groups, which are the unit of work scheduled onto compute
units. The oponal work_group_size_hint aribute is part of the OpenCL Language
Specicaon, and is a hint to the compiler that indicates the work-group size value most likely to
be specied by the local_work_size argument to clEnqueueNDRangeKernel. This allows
the compiler to opmize the generated code according to the expected value.
TIP: In the case of an FPGA implementaon, the specicaon of the
reqd_work_group_size
aribute instead of the
work_group_size_hint
is highly recommended as it can be used for
performance opmizaon during the generaon of the custom logic for a kernel.
Syntax
Place this aribute before the kernel denion, or before the primary funcon specied for the
kernel:
__attribute__((work_group_size_hint(
X
,
Y
,
Z
)))
Where:
X, Y, Z: Species the ND range of the kernel. This represents each dimension of a three
dimensional matrix specifying the size of the work-group for the kernel.
Examples
The following example is a hint to the compiler that the kernel will most likely be executed with a
work-group size of 1:
__attribute__((work_group_size_hint(1, 1, 1)))
__kernel void
...
See Also
SDAccel Environment Optimization Guide (UG1207)
https://www.khronos.org/
The OpenCL C Specification
xcl_array_partition
Description
IMPORTANT!: Array variables only accept one attribute. While xcl_array_partition does
support multi-dimensional arrays, you can only partition one dimension of the array with a single
attribute.
One of the advantages of the FPGA over other compute devices for OpenCL programs is the
ability for the application programmer to customize the memory architecture throughout the
system and into the compute unit. By default, the SDAccel compiler generates a memory
architecture within the compute unit that maximizes local and private memory bandwidth based
on static code analysis of the kernel code. Further optimization of these memories is possible
based on attributes in the kernel source code, which can be used to specify physical layouts and
implementations of local and private memories. The attribute in the SDAccel compiler to control
the physical layout of memories in a compute unit is array_partition.
For one-dimensional arrays, the array_partition attribute implements an array declared
within kernel code as multiple physical memories instead of a single physical memory. The
selection of which partitioning scheme to use depends on the specific application and its
performance goals. The array partitioning schemes available in the SDAccel compiler are
cyclic, block, and complete.
Syntax
Place the aribute with the denion of the array variable:
__attribute__((xcl_array_partition(
<type>
,
<factor>
,
<dimension>
)))
Where:
<type>: Species one of the following paron types:
cyclic: Cyclic paroning is the implementaon of an array as a set of smaller physical
memories that can be accessed simultaneously by the logic in the compute unit. The array
is paroned cyclically by pung one element into each memory before coming back to
the rst memory to repeat the cycle unl the array is fully paroned.
block: Block paroning is the physical implementaon of an array as a set of smaller
memories that can be accessed simultaneously by the logic inside of the compute unit. In
this case, each memory block is lled with elements from the array before moving on to the
next memory.
complete: Complete paroning decomposes the array into individual elements. For a
one-dimensional array, this corresponds to resolving a memory into individual registers.
The default type is complete.
<factor>: For cyclic type partitioning, the factor specifies how many physical memories to
partition the original array into in the kernel code. For block type partitioning, the factor
specifies the number of elements from the original array to store in each physical memory.
IMPORTANT!: For complete type partitioning, the factor is not specified.
<dimension>: Specifies which array dimension to partition. Specified as an integer from 1 to N.
SDAccel supports arrays of N dimensions and can partition the array on any single dimension.
Example 1
For example, consider the following array declaration:
int buffer[16];
The integer array, named buer, stores 16 values that are 32-bits wide each. Cyclic paroning
can be applied to this array with the following declaraon:
int buffer[16] __attribute__((xcl_array_partition(cyclic,4,1)));
In this example, the cyclic paron_type aribute tells SDAccel to distribute the contents of the
array among four physical memories. This aribute increases the immediate memory bandwidth
for operaons accessing the array buer by a factor of four.
All arrays inside of a compute unit in the context of SDAccel are capable of sustaining a
maximum of two concurrent accesses. By dividing the original array in the code into four physical
memories, the resulng compute unit can sustain a maximum of eight concurrent accesses to the
array buer.
Example 2
Using the same integer array as found in Example 1, block partitioning can be applied to the array
with the following declaration:
int buffer[16] __attribute__((xcl_array_partition(block,4,1)));
Since the size of the block is four, SDAccel will generate four physical memories, sequentially
filling each memory with data from the array.
Example 3
Using the same integer array as found in Example 1, complete partitioning can be applied to the
array with the following declaration:
int buffer[16] __attribute__((xcl_array_partition(complete, 1)));
In this example, the array is completely partitioned into distributed RAM, or 16 independent
registers in the programmable logic of the kernel. Because complete is the default, the same
effect can also be accomplished with the following declaration:
int buffer[16] __attribute__((xcl_array_partition));
While this creates an implementation with the highest possible memory bandwidth, it is not
suited to all applications. The way in which data is accessed by the kernel code through either
constant or data-dependent indexes affects the amount of supporting logic that SDx has to build
around each register to ensure functional equivalence with the usage in the original code. As a
general best practice guideline for SDx, the complete partitioning attribute is best suited for
arrays in which at least one dimension of the array is accessed through the use of constant
indexes.
See Also
xcl_array_reshape
pragma HLS array_paron
SDAccel Environment Opmizaon Guide (UG1207)
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
xcl_array_reshape
Description
IMPORTANT!: Array variables only accept one attribute. While xcl_array_reshape does support
multi-dimensional arrays, you can only reshape one dimension of the array with a single attribute.
Combines array paroning with vercal array mapping.
The ARRAY_RESHAPE aribute combines the eect of ARRAY_PARTITION, breaking an array
into smaller arrays, and concatenang elements of arrays by increasing bit-widths. This reduces
the number of block RAM consumed while providing parallel access to the data. This aribute
creates a new array with fewer elements but with greater bit-width, allowing more data to be
accessed in a single clock cycle.
Given the following code:
void foo (...) {
int array1[N] __attribute__((xcl_array_reshape(block, 2, 1)));
int array2[N] __attribute__((xcl_array_reshape(cyclic, 2, 1)));
int array3[N] __attribute__((xcl_array_reshape(complete, 1)));
...
}
The ARRAY_RESHAPE aribute transforms the arrays into the form shown in the following
gure:
Figure 1: ARRAY_RESHAPE
[Figure: array1[N] is reshaped by block into array4[N/2], with elements 0..(N/2-1) in the LSBs
and elements N/2..N-1 in the MSBs of each word; array2[N] is reshaped by cyclic into
array5[N/2], with even elements in the LSBs and odd elements in the MSBs; array3[N] is
reshaped by complete into the single wide word array6[1], with element 0 at the LSB and
element N-1 at the MSB.]
Syntax
Place the aribute with the denion of the array variable:
__attribute__((xcl_array_reshape(
<type>
,
<factor>
,
<dimension>
)))
Where:
<type>: Species one of the following paron types:
cyclic: Cyclic paroning is the implementaon of an array as a set of smaller physical
memories that can be accessed simultaneously by the logic in the compute unit. The array
is paroned cyclically by pung one element into each memory before coming back to
the rst memory to repeat the cycle unl the array is fully paroned.
block: Block paroning is the physical implementaon of an array as a set of smaller
memories that can be accessed simultaneously by the logic inside of the compute unit. In
this case, each memory block is lled with elements from the array before moving on to the
next memory.
complete: Complete paroning decomposes the array into individual elements. For a
one-dimensional array, this corresponds to resolving a memory into individual registers.
The default type is complete.
<factor>: For cyclic type paroning, the factor species how many physical memories to
paron the original array into in the kernel code. For Block type paroning, the factor
species the number of elements from the original array to store in each physical memory.
IMPORTANT!: For
complete
type paroning, the factor should not be specied.
<dimension>: Species which array dimension to paron. Specied as an integer from 1 to N.
SDAccel supports arrays of N dimensions and can paron the array on any single dimension.
Example 1
Reshapes (paron and maps) an 8-bit array with 17 elements, AB[17], into a new 32-bit array
with ve elements using block mapping.
char AB[17] __attribute__((xcl_array_reshape(block,4,1)));
TIP: A factor of 4 indicates that the array should be divided into four. So 17 elements are reshaped
into an array of 5 elements, with four times the bit-width. In this case, the last element, AB[16], is
mapped to the lower eight bits of the fifth element, and the rest of the fifth element is empty.
Example 2
Reshapes the two-dimensional array AB[6][4] into a new array of dimension [6][2], in which
dimension 2 has twice the bit-width:
int AB[6][4] __attribute__((xcl_array_reshape(block,2,2)));
Example 3
Reshapes the three-dimensional 8-bit array, AB[4][2][2], in function foo into a new single-
element array (a register), 128 bits wide (4*2*2*8):
char AB[4][2][2] __attribute__((xcl_array_reshape(complete,0)));
TIP: A dimension of 0 means to reshape all dimensions of the array.
See Also
xcl_array_paron
pragma HLS array_reshape
SDAccel Environment Opmizaon Guide (UG1207)
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
xcl_data_pack
Description
Packs the data elds of a struct into a single scalar with a wider word width.
The xcl_data_pack aribute is used for packing all the elements of a struct into a single
wide vector to reduce the memory required for the variable. This allows all members of the
struct to be read and wrien to simultaneously. The bit alignment of the resulng new wide-
word can be inferred from the declaraon order of the struct elds. The rst eld takes the
LSB of the vector, and the nal element of the struct is aligned with the MSB of the vector.
TIP: Any arrays declared inside the struct are completely partitioned and reshaped into a wide
scalar and packed with other scalar fields.
If a struct contains arrays, those arrays can be optimized using the xcl_array_partition
attribute to partition the array. The xcl_data_pack attribute performs a similar operation as
the complete partitioning of the xcl_array_partition attribute, reshaping the elements in
the struct to a single wide vector.
A struct cannot be optimized with xcl_data_pack and also partitioned. The
xcl_data_pack and xcl_array_partition attributes are mutually exclusive.
You should exercise some caution when using the xcl_data_pack optimization on structs with
large arrays. If an array has 4096 elements of type int, this will result in a vector (and port) of
width 4096*32=131072 bits. SDx can create this RTL design; however, it is very unlikely logic
synthesis will be able to route this during the FPGA implementation.
Syntax
Place within the region where the struct variable is defined:
__attribute__((xcl_data_pack(<variable>, <name>)))
Where:
<variable>: Specifies the variable to be packed.
<name>: Specifies the name of the resultant variable after packing. If no <name> is specified,
the input <variable> is used.
Example 1
Packs the struct array AB[17], with three 8-bit fields (typedef struct {unsigned char R, G, B;}
pixel), in function foo into a new 17-element array of 24 bits.
typedef struct{
unsigned char R, G, B;
} pixel;
pixel AB[17] __attribute__((xcl_data_pack(AB)));
See Also
pragma HLS data_pack
SDAccel Environment Opmizaon Guide (UG1207)
xcl_dataflow
Description
The xcl_dataflow aribute enables task-level pipelining, allowing funcons and loops to
overlap in their operaon, increasing the concurrency of the RTL implementaon, and increasing
the overall throughput of the design.
All operaons are performed sequenally in a C descripon. In the absence of any direcves that
limit resources (such as pragma HLS allocation), Vivado HLS seeks to minimize latency and
improve concurrency. However, data dependencies can limit this. For example, funcons or loops
that access arrays must nish all read/write accesses to the arrays before they complete. This
prevents the next funcon or loop that consumes the data from starng operaon. The dataow
opmizaon enables the operaons in a funcon or loop to start operaon before the previous
funcon or loop completes all its operaons.
When dataow opmizaon is specied, Vivado HLS analyzes the dataow between sequenal
funcons or loops and create channels (based on pingpong RAMs or FIFOs) that allow consumer
funcons or loops to start operaon before the producer funcons or loops have completed.
This allows funcons or loops to operate in parallel, which decreases latency and improves the
throughput of the RTL.
If no iniaon interval (number of cycles between the start of one funcon or loop and the next)
is specied, Vivado HLS aempts to minimize the iniaon interval and start operaon as soon
as data is available.
TIP: Vivado HLS provides dataflow configuration settings. The config_dataflow command
specifies the default memory channel and FIFO depth used in dataflow optimization. Refer to the
Vivado Design Suite User Guide: High-Level Synthesis (UG902) for more information.
For the DATAFLOW optimization to work, the data must flow through the design from one task
to the next. The following coding styles prevent Vivado HLS from performing the DATAFLOW
optimization; refer to UG902 for more information:
Single-producer-consumer violations
Bypassing tasks
Feedback between tasks
Conditional execution of tasks
Loops with multiple exit conditions
IMPORTANT!: If any of these coding styles are present, Vivado HLS issues a message and does not
perform DATAFLOW optimization.
Finally, the DATAFLOW optimization has no hierarchical implementation. If a sub-function or
loop contains additional tasks that might benefit from the DATAFLOW optimization, you must
apply the optimization to the loop, the sub-function, or inline the sub-function.
Syntax
Assign the dataflow aribute before the funcon denion or the loop denion:
__attribute__((xcl_dataflow))
Examples
Species dataow opmizaon within funcon foo.
#pragma HLS dataflow
See Also
pragma HLS dataow
SDAccel Environment Opmizaon Guide (UG1207)
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
xcl_dependence
Description
The xcl_dependence aribute is used to provide addional informaon that can overcome
loop-carry dependencies and allow loops to be pipelined (or pipelined with lower intervals).
Vivado HLS automacally detects dependencies:
Within loops (loop-independent dependence), or
Between dierent iteraons of a loop (loop-carry dependence).
These dependencies impact when operaons can be scheduled, especially during funcon and
loop pipelining.
Loop-independent dependence: The same element is accessed in the same loop iteration.
for (i=0;i<N;i++) {
A[i]=x;
y=A[i];
}
Loop-carry dependence: The same element is accessed in a different loop iteration.
for (i=0;i<N;i++) {
A[i]=A[i-1]*2;
}
Under certain complex scenarios, automatic dependence analysis can be too conservative and fail
to filter out false dependencies. Under certain circumstances, such as variable-dependent array
indexing, or when an external requirement needs to be enforced (for example, two inputs are
never the same index), the dependence analysis might be too conservative. The
xcl_dependence attribute allows you to explicitly specify the dependence and resolve a false
dependence.
IMPORTANT!: Specifying a false dependency, when in fact the dependency is not false, can result in
incorrect hardware. Be sure dependencies are correct (true or false) before specifying them.
Syntax
This aribute must be assigned at the declaraon of the variable:
__attribute__((xcl_dependence(
<class>
<type>
<direction>
distance=
<int>
<dependent>
)))
Where:
<class>: Species a class of variables in which the dependence needs claricaon. Valid values
include array or pointer.
TIP: <class> is mutually exclusive with
variable=
as you can either specify a variable or a class of
variables.
<type>: Valid values include intra or inter. Species whether the dependence is:
intra: dependence within the same loop iteraon. When dependence <type> is specied
as intra, and <dependent> is false, Vivado HLS may move operaons freely within a loop,
increasing their mobility and potenally improving performance or area. When
<dependent> is specied as true, the operaons must be performed in the order specied.
inter: dependence between dierent loop iteraons. This is the default <type>. If
dependence <type> is specied as inter, and <dependent> is false, it allows Vivado HLS
to perform operaons in parallel if the funcon or loop is pipelined, or the loop is unrolled,
or parally unrolled, and prevents such concurrent operaon when <dependent> is
specied as true.
<direcon>: Valid values include RAW, WAR, or WAW. This is relevant for loop-carry
dependencies only, and species the direcon for a dependence:
RAW (Read-Aer-Write - true dependence) The write instrucon uses a value used by the
read instrucon.
WAR (Write-Aer-Read - an dependence) The read instrucon gets a value that is
overwrien by the write instrucon.
WAW (Write-Aer-Write - output dependence) Two write instrucons write to the same
locaon, in a certain order.
distance=
<int>
: Species the inter-iteraon distance for array access. Relevant only for
loop-carry dependencies where dependence is set to true.
<dependent>: Species whether a dependence needs to be enforced (true) or removed
(false). The default is true.
Example 1
In the following example, Vivado HLS does not have any knowledge about the value of cols and
conservatively assumes that there is always a dependence between the write to buff_A[1][col]
and the read from buff_A[1][col]. In an algorithm such as this, it is unlikely cols will
ever be zero, but Vivado HLS cannot make assumptions about data dependencies. To overcome
this deficiency, you can use the xcl_dependence attribute to state that there is no
dependence between loop iterations (in this case, for both buff_A and buff_B).
void foo(int rows, int cols, ...)
for (row = 0; row < rows + 1; row++) {
for (col = 0; col < cols + 1; col++)
__attribute__((xcl_pipeline_loop(II=1)))
{
if (col < cols) {
buff_A[2][col] = buff_A[1][col] __attribute__((xcl_dependence(inter false))); // read from buff_A
buff_A[1][col] = buff_A[0][col]; // write to buff_A
buff_B[1][col] = buff_B[0][col] __attribute__((xcl_dependence(inter false)));
temp = buff_A[0][col];
}
Example 2
Removes the dependence between Var1 in the same iterations of loop_1 in function foo.
__attribute__((xcl_dependence(intra false)));
Example 3
Denes the dependence on all arrays in loop_2 of funcon foo to inform Vivado HLS that all
reads must happen aer writes (RAW) in the same loop iteraon.
__attribute__((xcl_dependence(array intra RAW true)));
See Also
pragma HLS dependence
SDAccel Environment Optimization Guide (UG1207)
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
xcl_max_work_group_size
Description
Use this aribute instead of reqd_work_group_size when you need to specify a larger kernel
than the 4K size.
Extends the default maximum work group size supported in SDx by the
reqd_work_group_size aribute. SDx supports work size larger than 4096 with the Xilinx
aribute xcl_max_work_group_size.
Note: The actual workgroup size limit is dependent on the Xilinx device selected for the plaorm.
Syntax
Place this aribute before the kernel denion, or before the primary funcon specied for the
kernel:
__attribute__((xcl_max_work_group_size(
X
,
Y
,
Z
)))
Where:
X, Y, Z: Species the ND range of the kernel. This represents each dimension of a three
dimensional matrix specifying the size of the work-group for the kernel.
Example 1
Below is the kernel source code for an unoptimized adder. No attributes were specified for this
design, other than the work size equal to the size of the matrices (i.e., 64x64). That is, iterating
over an entire work-group will fully add the input matrices a and b and output the result to
output. All three are global integer pointers, which means each value in the matrices is four bytes
and is stored in off-chip DDR global memory.
#define RANK 64
__kernel __attribute__ ((reqd_work_group_size(RANK, RANK, 1)))
void madd(__global int* a, __global int* b, __global int* output) {
int index = get_local_id(1)*get_local_size(0) + get_local_id(0);
output[index] = a[index] + b[index];
}
This local work size of (64, 64, 1) is the same as the global work size. It should be noted that this
setting creates a total work size of 4096.
Note: This is the largest work size that SDAccel supports with the standard OpenCL attribute
reqd_work_group_size. SDAccel supports work sizes larger than 4096 with the Xilinx attribute
xcl_max_work_group_size.
Any matrix larger than 64x64 would need to use only one dimension to define the work size.
That is, a 128x128 matrix could be operated on by a kernel with a work size of (128, 1, 1), where
each invocation operates on an entire row or column of data.
See Also
SDAccel Environment Optimization Guide (UG1207)
https://www.khronos.org/
The OpenCL C Specification
xcl_pipeline_loop
Description
Pipeline a loop to improve latency and throughput. Although loop unrolling exposes concurrency,
it does not address the issue of keeping all elements in a kernel data path busy at all times. This is
necessary for maximizing kernel throughput and performance. Even in an unrolled case, loop
control dependencies can lead to sequential behavior. The sequential behavior of operations
results in idle hardware and a loss of performance.
Xilinx addresses this issue by introducing a vendor extension on top of the OpenCL 2.0
specification for loop pipelining. The Xilinx attribute for loop pipelining is xcl_pipeline_loop.
By default, the SDAccel compiler automatically applies this attribute on the innermost loop
with a trip count of more than 64, or on its parent loop when its trip count is less than or equal to 64.
Syntax
Place the aribute in the OpenCL source before the loop denion:
__attribute__((xcl_pipeline_loop))
Examples
The following example pipelines LOOP_1 of function vaccum to improve performance:
__kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void vaccum(__global const int* a, __global const int* b, __global int*
result)
{
int tmp = 0;
__attribute__((xcl_pipeline_loop))
LOOP_1: for (int i=0; i < 32; i++) {
tmp += a[i] * b[i];
}
result[0] = tmp;
}
See Also
pragma HLS pipeline
SDAccel Environment Optimization Guide (UG1207)
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
xcl_pipeline_workitems
Description
Pipeline a work-item to improve latency and throughput. Work-item pipelining is the extension of
loop pipelining to the kernel work-group. This is necessary for maximizing kernel throughput and
performance.
Syntax
Place the aribute in the OpenCL source before the elements to pipeline:
__attribute__((xcl_pipeline_workitems))
Example 1
In order to handle the reqd_work_group_size attribute in the following example, SDAccel
automatically inserts a loop nest to handle the three-dimensional characteristics of the ND range
(3,1,1). As a result of the added loop nest, the execution profile of this kernel is like an
unpipelined loop. Adding the xcl_pipeline_workitems attribute adds concurrency and
improves the throughput of the code.
kernel
__attribute__ ((reqd_work_group_size(3,1,1)))
void foo(...)
{
...
__attribute__((xcl_pipeline_workitems)) {
int tid = get_global_id(0);
op_Read(tid);
op_Compute(tid);
op_Write(tid);
}
...
}
Example 2
The following example adds the work-item pipeline to the appropriate elements of the kernel:
__kernel __attribute__ ((reqd_work_group_size(8, 8, 1)))
void madd(__global int* a, __global int* b, __global int* output)
{
int rank = get_local_size(0);
__local unsigned int bufa[64];
__local unsigned int bufb[64];
__attribute__((xcl_pipeline_workitems)) {
int x = get_local_id(0);
int y = get_local_id(1);
bufa[x*rank + y] = a[x*rank + y];
bufb[x*rank + y] = b[x*rank + y];
}
barrier(CLK_LOCAL_MEM_FENCE);
__attribute__((xcl_pipeline_workitems)) {
int index = get_local_id(1)*rank + get_local_id(0);
output[index] = bufa[index] + bufb[index];
}
}
See Also
pragma HLS pipeline
SDAccel Environment Opmizaon Guide (UG1207)
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
xcl_reqd_pipe_depth
Description
IMPORTANT! Pipes must be declared in lower-case alphanumerics. In addition, printf() is not supported with variables used in pipes.
The OpenCL 2.0 specicaon introduces a new memory object called pipe. A pipe stores data
organized as a FIFO. Pipes can be used to stream data from one kernel to another inside the
FPGA device without having to use the external memory, which greatly improves the overall
system latency.
In the SDAccel development environment, pipes must be stacally dened outside of all kernel
funcons:. The depth of a pipe must be specied by using the xcl_reqd_pipe_depth
aribute in the pipe declaraon:
pipe int p0 __attribute__((xcl_reqd_pipe_depth(512)));
Pipes can only be accessed using the standard OpenCL read_pipe() and write_pipe() built-in functions in non-blocking mode, or using the Xilinx extended read_pipe_block() and write_pipe_block() functions in blocking mode.
IMPORTANT! A given pipe can have one and only one producer and consumer, in different kernels.
Pipe objects are not accessible from the host CPU. The status of pipes can be queried using the OpenCL get_pipe_num_packets() and get_pipe_max_packets() built-in functions. See The OpenCL C Specification from the Khronos OpenCL Working Group for more details on these built-in functions.
Appendix A: OpenCL Attributes
Vivado HLS Optimization Methodology Guide 77
UG1270 (v2017.4) December 20, 2017 www.xilinx.com [placeholder text]
Syntax
This aribute must be assigned at the declaraon of the pipe object:
pipe int
id
__attribute__((xcl_reqd_pipe_depth(
n
)));
Where:
id: Species an idener for the pipe, which must consist of lower-case alphanumerics. For
example infifo1 not inFifo1.
n: Species the depth of the pipe. Valid depth values are 16, 32, 64, 128, 256, 512, 1024,
2048, 4096, 8192, 16384, 32768.
Examples
The following is the dataflow_pipes_ocl example from the Xilinx GitHub repository, which uses pipes to pass data from one processing stage to the next using the blocking read_pipe_block() and write_pipe_block() functions:
pipe int p0 __attribute__((xcl_reqd_pipe_depth(32)));
pipe int p1 __attribute__((xcl_reqd_pipe_depth(32)));
// Input Stage Kernel : Read Data from Global Memory and write into Pipe P0
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void input_stage(__global int *input, int size)
{
__attribute__((xcl_pipeline_loop))
mem_rd: for (int i = 0 ; i < size ; i++)
{
//blocking Write command to pipe P0
write_pipe_block(p0, &input[i]);
}
}
// Adder Stage Kernel: Read Input data from Pipe P0 and write the result
// into Pipe P1
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void adder_stage(int inc, int size)
{
__attribute__((xcl_pipeline_loop))
execute: for(int i = 0 ; i < size ; i++)
{
int input_data, output_data;
//blocking read command to Pipe P0
read_pipe_block(p0, &input_data);
output_data = input_data + inc;
//blocking write command to Pipe P1
write_pipe_block(p1, &output_data);
}
}
// Output Stage Kernel: Read result from Pipe P1 and write the result to
// Global Memory
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void output_stage(__global int *output, int size)
{
__attribute__((xcl_pipeline_loop))
mem_wr: for (int i = 0 ; i < size ; i++)
{
//blocking read command to Pipe P1
read_pipe_block(p1, &output[i]);
}
}
See Also
SDAccel Environment Proling and Opmizaon Guide (UG1207)
hps://www.khronos.org/
The OpenCL C Specicaon
xcl_zero_global_work_offset
Description
If you use clEnqueueNDRangeKernel with the global_work_offset set to NULL or all zeros, you can use this attribute to tell the compiler that the global_work_offset is always zero.
This aribute can improve memory performance when you have memory accesses like:
A[get_global_id(x)] = ...;
Note: You can specify reqd_work_group_size, vec_type_hint, and
xcl_zero_global_work_offset together to maximize performance.
Syntax
Place this aribute before the kernel denion, or before the primary funcon specied for the
kernel:
__kernel __attribute__((xcl_zero_global_work_offset))
void test (__global short *input, __global short *output, __constant short
*constants) { }
See Also
reqd_work_group_size
vec_type_hint
clEnqueueNDRangeKernel
SDAccel Environment Proling and Opmizaon Guide (UG1207)
Appendix B
HLS Pragmas
Optimizations in Vivado HLS
In both SDAccel and SDSoC projects, the hardware kernel must be synthesized from the OpenCL, C, or C++ language into RTL that can be implemented into the programmable logic of a Xilinx device. Vivado HLS synthesizes the RTL from the OpenCL, C, and C++ language descriptions.
Vivado HLS is intended to work with your SDAccel or SDSoC Development Environment project without interaction. However, Vivado HLS also provides pragmas that can be used to optimize the design: reduce latency, improve throughput performance, and reduce area and device resource utilization of the resulting RTL code. These pragmas can be added directly to the source code for the kernel.
IMPORTANT! Although the SDSoC environment supports the use of HLS pragmas, it does not support pragmas applied to any argument of the function interface (the interface, array_partition, or data_pack pragmas). Refer to "Optimizing the Hardware Function" in the SDSoC Environment Optimization Guide (UG1235) for more information.
The Vivado HLS pragmas include the optimization types specified below:
Table 11: Vivado HLS Pragmas by Type
Kernel Optimization: pragma HLS allocation, pragma HLS clock, pragma HLS expression_balance, pragma HLS latency, pragma HLS reset, pragma HLS resource, pragma HLS top
Function Inlining: pragma HLS inline, pragma HLS function_instantiate
Interface Synthesis: pragma HLS interface, pragma HLS protocol
Task-level Pipeline: pragma HLS dataflow, pragma HLS stream
Pipeline: pragma HLS pipeline, pragma HLS occurrence
Loop Unrolling: pragma HLS unroll, pragma HLS dependence
Loop Optimization: pragma HLS loop_flatten, pragma HLS loop_merge, pragma HLS loop_tripcount
Array Optimization: pragma HLS array_map, pragma HLS array_partition, pragma HLS array_reshape
Structure Packing: pragma HLS data_pack
pragma HLS allocation
Description
Species instance restricons to limit resource allocaon in the implemented kernel. This denes,
and can limit, the number of RTL instances and hardware resources used to implement specic
funcons, loops, operaons or cores. The ALLOCATION pragma is specied inside the body of a
funcon, a loop, or a region of code.
For example, if the C source has four instances of a funcon foo_sub, the ALLOCATION pragma
can ensure that there is only one instance of foo_sub in the nal RTL. All four instances of the C
funcon are implemented using the same RTL block. This reduces resources ulized by the
funcon, but negavely impacts performance.
The operaons in the C code, such as addions, mulplicaons, array reads, and writes, can be
limited by the ALLOCATION pragma. Cores, which operators are mapped to during synthesis, can
be limited in the same manner as the operators. Instead of liming the total number of
mulplicaon operaons, you can choose to limit the number of combinaonal mulplier cores,
forcing any remaining mulplicaons to be performed using pipelined mulpliers (or vice versa).
The ALLOCATION pragma applies to the scope it is specied within: a funcon, a loop, or a
region of code. However, you can use the -min_op argument of the config_bind command
to globally minimize operators throughout the design.
TIP: For more informaon refer to "Controlling Hardware Resources" and
config_bind
in Vivado
Design Suite User Guide: High-Level Synthesis (UG902).
Syntax
Place the pragma inside the body of the function, loop, or region where it will apply.
#pragma HLS allocation instances=<list> limit=<value> <type>
Where:
• instances=<list>: Specifies the names of functions, operators, or cores.
• limit=<value>: Optionally specifies the limit of instances to be used in the kernel.
• <type>: Specifies that the allocation applies to a function, an operation, or a core (hardware component) used to create the design (such as adders, multipliers, pipelined multipliers, and block RAM). The type is specified as one of the following:
  • function: Specifies that the allocation applies to the functions listed in the instances= list. The function can be any function in the original C or C++ code that has NOT been:
    - Inlined by the pragma HLS inline, or the set_directive_inline command, or
    - Inlined automatically by Vivado HLS.
  • operation: Specifies that the allocation applies to the operations listed in the instances= list. Refer to Vivado Design Suite User Guide: High-Level Synthesis (UG902) for a complete list of the operations that can be limited using the ALLOCATION pragma.
  • core: Specifies that the ALLOCATION applies to the cores, which are the specific hardware components used to create the design (such as adders, multipliers, pipelined multipliers, and block RAM). The actual core to use is specified in the instances= option. In the case of cores, you can specify which core the tool should use, or you can define a limit for the specified core.
Example 1
Given a design with mulple instances of funcon foo, this example limits the number of
instances of foo in the RTL for the hardware kernel to 2.
#pragma HLS allocation instances=foo limit=2 function
Example 2
Limits the number of mulplier operaons used in the implementaon of the funcon my_func
to 1. This limit does not apply to any mulpliers outside of my_func, or mulpliers that might
reside in sub-funcons of my_func.
TIP: To limit the mulpliers used in the implementaon of any sub-funcons, specify an allocaon
direcve on the sub-funcons or inline the sub-funcon into funcon
my_func
.
void my_func(data_t angle) {
#pragma HLS allocation instances=mul limit=1 operation
...
}
See Also
pragma HLS funcon_instanate
pragma HLS inline
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS array_map
Description
Combines mulple smaller arrays into a single large array to help reduce block RAM resources.
Designers typically use the pragma HLS array_map command (with the same instance= target) to combine multiple smaller arrays into a single larger array. This larger array can then be targeted to a single larger memory (RAM or FIFO) resource.
Each array is mapped into a block RAM or UltraRAM, when supported by the device. The basic block RAM unit provided in an FPGA is 18K. If many small arrays do not use the full 18K, a better use of the block RAM resources is to map many small arrays into a single larger array.
TIP: If a block RAM is larger than 18K, it is automatically mapped into multiple 18K units.
The ARRAY_MAP pragma supports two ways of mapping small arrays into a larger one:
• Horizontal mapping: this corresponds to creating a new array by concatenating the original arrays. Physically, this gets implemented as a single array with more elements.
• Vertical mapping: this corresponds to creating a new array by concatenating the original words in the array. Physically, this gets implemented as a single array with a larger bit-width.
The arrays are concatenated in the order that the pragmas are specified, starting at:
• Target element zero for horizontal mapping, or
• Bit zero for vertical mapping.
Syntax
Place the pragma in the C source within the boundaries of the function where the array variable is defined.
#pragma HLS array_map variable=<name> instance=<instance> <mode> offset=<int>
Where:
• variable=<name>: A required argument that specifies the array variable to be mapped into the new target array <instance>.
• instance=<instance>: Specifies the name of the new array to merge arrays into.
• <mode>: Optionally specifies the array map as being either horizontal or vertical. Horizontal mapping is the default <mode>, and concatenates the arrays to form a new array with more elements. Vertical mapping concatenates the arrays to form a new array with longer words.
• offset=<int>: Applies to horizontal type array mapping only. The offset specifies an integer value offset to apply before mapping the array into the new array <instance>. For example:
  • Element 0 of the array variable maps to element <int> of the new target.
  • Other elements map to <int+1>, <int+2>... of the new target.
IMPORTANT! If an offset is not specified, Vivado HLS calculates the required offset automatically to avoid overlapping array elements.
Example 1
Arrays array1 and array2 in function foo are mapped into a single array, specified as array3 in the following example:
void foo (...) {
int8 array1[M];
int12 array2[N];
#pragma HLS ARRAY_MAP variable=array1 instance=array3 horizontal
#pragma HLS ARRAY_MAP variable=array2 instance=array3 horizontal
...
loop_1: for(i=0;i<M;i++) {
array1[i] = ...;
array2[i] = ...;
...
}
...
}
Example 2
This example provides a horizontal mapping of array A[10] and array B[15] in function foo into a single new array AB[25].
• Element AB[0] will be the same as A[0].
• Element AB[10] will be the same as B[0] because no offset= option is specified.
• The bit-width of array AB[25] will be the maximum bit-width of either A[10] or B[15].
#pragma HLS array_map variable=A instance=AB horizontal
#pragma HLS array_map variable=B instance=AB horizontal
Example 3
The following example performs a vertical concatenation of arrays C and D into a new array CD, with the bit-width of C and D combined. The number of elements in CD is the maximum of the original arrays, C or D:
#pragma HLS array_map variable=C instance=CD vertical
#pragma HLS array_map variable=D instance=CD vertical
See Also
pragma HLS array_paron
pragma HLS array_reshape
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS array_partition
Description
Parons an array into smaller arrays or individual elements.
This paroning:
Results in RTL with mulple small memories or mulple registers instead of one large memory.
Eecvely increases the amount of read and write ports for the storage.
Potenally improves the throughput of the design.
Requires more memory instances or registers.
Syntax
Place the pragma in the C source within the boundaries of the function where the array variable is defined.
#pragma HLS array_partition variable=<name> <type> factor=<int> dim=<int>
Where:
• variable=<name>: A required argument that specifies the array variable to be partitioned.
• <type>: Optionally specifies the partition type. The default type is complete. The following types are supported:
  • cyclic: Cyclic partitioning creates smaller arrays by interleaving elements from the original array. The array is partitioned cyclically by putting one element into each new array before coming back to the first array to repeat the cycle until the array is fully partitioned. For example, if factor=3 is used:
    - Element 0 is assigned to the first new array.
    - Element 1 is assigned to the second new array.
    - Element 2 is assigned to the third new array.
    - Element 3 is assigned to the first new array again.
  • block: Block partitioning creates smaller arrays from consecutive blocks of the original array. This effectively splits the array into N equal blocks, where N is the integer defined by the factor= argument.
  • complete: Complete partitioning decomposes the array into individual elements. For a one-dimensional array, this corresponds to resolving a memory into individual registers. This is the default <type>.
• factor=<int>: Specifies the number of smaller arrays that are to be created.
IMPORTANT! For complete type partitioning, the factor is not specified. For block and cyclic partitioning, the factor= is required.
• dim=<int>: Specifies which dimension of a multi-dimensional array to partition. Specified as an integer from 0 to N, for an array with N dimensions:
  • If a value of 0 is used, all dimensions of a multi-dimensional array are partitioned with the specified type and factor options.
  • Any non-zero value partitions only the specified dimension. For example, if a value of 1 is used, only the first dimension is partitioned.
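The block and cyclic placements described above can be sketched as a small software model. This is illustrative only, not a Xilinx API, and it assumes the factor divides the array size evenly:

```c
#include <assert.h>

/* Illustrative model of ARRAY_PARTITION element placement (not a
   Xilinx API). Assumes 'factor' divides 'size' evenly. */

/* block: consecutive chunks of size/factor elements per sub-array */
static int block_subarray(int i, int size, int factor) {
    return i / (size / factor);
}

/* cyclic: elements dealt round-robin across the sub-arrays */
static int cyclic_subarray(int i, int factor) {
    return i % factor;
}
```

For an 8-element array with factor=2, element 5 lands in the second block sub-array, while cyclic partitioning alternates elements between the two sub-arrays.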
Example 1
This example parons the 13 element array, AB[13], into four arrays using block paroning:
#pragma HLS array_partition variable=AB block factor=4
TIP: Because four is not an integer factor of 13:
• Three of the new arrays have three elements each.
• One array has four elements (AB[9:12]).
Example 2
This example parons dimension two of the two-dimensional array, AB[6][4] into two new
arrays of dimension [6][2]:
#pragma HLS array_partition variable=AB block factor=2 dim=2
Example 3
This example parons the second dimension of the two-dimensional in_local array into
individual elements.
int in_local[MAX_SIZE][MAX_DIM];
#pragma HLS ARRAY_PARTITION variable=in_local complete dim=2
See Also
pragma HLS array_map
pragma HLS array_reshape
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
xcl_array_paron
SDAccel Environment Opmizaon Guide (UG1207)
pragma HLS array_reshape
Description
Combines array paroning with vercal array mapping.
The ARRAY_RESHAPE pragma combines the eect of ARRAY_PARTITION, breaking an array
into smaller arrays, with the eect of the vercal type of ARRAY_MAP, concatenang elements of
arrays by increasing bit-widths. This reduces the number of block RAM consumed while
providing the primary benet of paroning: parallel access to the data. This pragma creates a
new array with fewer elements but with greater bit-width, allowing more data to be accessed in a
single clock cycle.
Given the following code:
void foo (...) {
int array1[N];
int array2[N];
int array3[N];
#pragma HLS ARRAY_RESHAPE variable=array1 block factor=2 dim=1
#pragma HLS ARRAY_RESHAPE variable=array2 cyclic factor=2 dim=1
#pragma HLS ARRAY_RESHAPE variable=array3 complete dim=1
...
}
The ARRAY_RESHAPE pragma transforms the arrays into the form shown in the following figure:
Figure 2: ARRAY_RESHAPE Pragma
[Figure: array1[N] is reshaped with block into array4[N/2], array2[N] with cyclic into array5[N/2], and array3[N] with complete into the single wide-word register array6[1]. In each case the relocated elements are concatenated vertically, MSB to LSB, into wider words.]
Syntax
Place the pragma in the C source within the region of a function where the array variable is defined.
#pragma HLS array_reshape variable=<name> <type> factor=<int> dim=<int>
Where:
• <name>: A required argument that specifies the array variable to be reshaped.
• <type>: Optionally specifies the partition type. The default type is complete. The following types are supported:
  • cyclic: Cyclic reshaping creates smaller arrays by interleaving elements from the original array. For example, if factor=3 is used, element 0 is assigned to the first new array, element 1 to the second new array, element 2 is assigned to the third new array, and then element 3 is assigned to the first new array again. The final array is a vertical concatenation (word concatenation, to create longer words) of the new arrays into a single array.
  • block: Block reshaping creates smaller arrays from consecutive blocks of the original array. This effectively splits the array into N equal blocks, where N is the integer defined by factor=, and then combines the N blocks into a single array with word-width*N.
  • complete: Complete reshaping decomposes the array into temporary individual elements and then recombines them into an array with a wider word. For a one-dimensional array this is equivalent to creating a very-wide register (if the original array was N elements of M bits, the result is a register with N*M bits). This is the default type of array reshaping.
• factor=<int>: Specifies the amount to divide the current array by (or the number of temporary arrays to create). A factor of 2 splits the array in half, while doubling the bit-width. A factor of 3 divides the array into three, with triple the bit-width.
IMPORTANT! For complete type partitioning, the factor is not specified. For block and cyclic reshaping, the factor= is required.
• dim=<int>: Specifies which dimension of a multi-dimensional array to partition. Specified as an integer from 0 to N, for an array with N dimensions:
  • If a value of 0 is used, all dimensions of a multi-dimensional array are partitioned with the specified type and factor options.
  • Any non-zero value partitions only the specified dimension. For example, if a value of 1 is used, only the first dimension is partitioned.
• object: A keyword relevant for container arrays only. When the keyword is specified, the ARRAY_RESHAPE pragma applies to the objects in the container, reshaping all dimensions of the objects within the container, but preserving all dimensions of the container itself. When the keyword is not specified, the pragma applies to the container array and not the objects.
Example 1
Reshapes (paron and maps) an 8-bit array with 17 elements, AB[17], into a new 32-bit array
with ve elements using block mapping.
#pragma HLS array_reshape variable=AB block factor=4
TIP: factor=4 indicates that the array should be divided into four. So 17 elements are reshaped into an array of 5 elements, with four times the bit-width. In this case, the last element, AB[16], is mapped to the lower eight bits of the fifth element, and the rest of the fifth element is empty.
Example 2
Reshapes the two-dimensional array AB[6][4] into a new array of dimension [6][2], in which
dimension 2 has twice the bit-width:
#pragma HLS array_reshape variable=AB block factor=2 dim=2
Example 3
Reshapes the three-dimensional 8-bit array, AB[4][2][2] in function foo, into a new single-element array (a register), 128 bits wide (4*2*2*8):
#pragma HLS array_reshape variable=AB complete dim=0
TIP: dim=0 means to reshape all dimensions of the array.
See Also
pragma HLS array_map
pragma HLS array_paron
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
SDAccel Environment Proling and Opmizaon Guide (UG1207)
pragma HLS clock
Description
Applies the named clock to the specified function.
C and C++ designs support only a single clock. The clock period specified by create_clock is applied to all functions in the design.
SystemC designs support multiple clocks. Multiple named clocks can be specified using the create_clock command and applied to individual SC_MODULEs using pragma HLS clock. Each SC_MODULE is synthesized using a single clock.
Syntax
Place the pragma in the C source within the body of the function.
#pragma HLS clock domain=<clock>
Where:
• domain=<clock>: Specifies the clock name.
IMPORTANT!: The specied clock must already exist by the
create_clock
command. There is no
pragma equivalent of the create_clock command. See the Vivado Design Suite User Guide: High-Level
Synthesis (UG902) for more informaon.
Example 1
Assume a SystemC design in which the top-level, foo_top, has clock ports fast_clock and slow_clock. However, foo_top uses only fast_clock within its function. A sub-block, foo_sub, uses only slow_clock.
In this example, the following create_clock commands are specified in the script.tcl file, which is specified when the Vivado HLS tool is launched:
create_clock -period 15 fast_clock
create_clock -period 60 slow_clock
Then the following pragmas are specified in the C source file to assign the clock to the specified functions, foo_sub and foo_top:
void foo_sub (p, q) {
#pragma HLS clock domain=slow_clock
...
}
void foo_top (a, b, c, d) {
#pragma HLS clock domain=fast_clock
...
}
See Also
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS data_pack
Description
Packs the data elds of a struct into a single scalar with a wider word width.
The DATA_PACK pragma is used for packing all the elements of a struct into a single wide vector to reduce the memory required for the variable, while allowing all members of the struct to be read and written simultaneously. The bit alignment of the resulting new wide-word can be inferred from the declaration order of the struct fields. The first field takes the LSB of the vector, and the final element of the struct is aligned with the MSB of the vector.
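For a struct with three unsigned char fields R, G, and B, this alignment can be modeled in plain C as follows. This is an illustrative software model of the packing order, not generated hardware:

```c
#include <stdint.h>

/* Software model of DATA_PACK bit alignment for
   struct { unsigned char R, G, B; }: the first field (R) occupies the
   LSBs of the packed 24-bit word, and the last field (B) the MSBs. */
static uint32_t pack_pixel(uint8_t r, uint8_t g, uint8_t b) {
    return (uint32_t)r | ((uint32_t)g << 8) | ((uint32_t)b << 16);
}
```

Packing R=0x11, G=0x22, B=0x33 yields the word 0x332211, with R in the low byte.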
If the struct contains arrays, the DATA_PACK pragma performs a similar operation as the ARRAY_RESHAPE pragma and combines the reshaped array with the other elements in the struct. Any arrays declared inside the struct are completely partitioned and reshaped into a wide scalar and packed with other scalar fields. However, a struct cannot be optimized with DATA_PACK and ARRAY_PARTITION or ARRAY_RESHAPE, as those pragmas are mutually exclusive.
IMPORTANT! You should exercise some caution when using the DATA_PACK optimization on struct objects with large arrays. If an array has 4096 elements of type int, this will result in a vector (and port) of width 4096*32 = 131072 bits. Vivado HLS can create this RTL design; however, it is very unlikely that logic synthesis will be able to route this during the FPGA implementation.
In general, Xilinx recommends that you use arbitrary precision (or bit-accurate) data types. Standard C types are based on 8-bit boundaries (8-bit, 16-bit, 32-bit, 64-bit); however, using arbitrary precision data types in a design lets you specify the exact bit-sizes in the C code prior to synthesis. The bit-accurate widths result in hardware operators that are smaller and faster. This allows more logic to be placed in the FPGA and for the logic to execute at higher clock frequencies. However, the DATA_PACK pragma also lets you align data in the packed struct along 8-bit boundaries if needed.
If a struct port is to be implemented with an AXI4 interface, you should consider using the DATA_PACK <byte_pad> option to automatically align member elements of the struct to 8-bit boundaries. The AXI4-Stream protocol requires that TDATA ports of the IP have a width in multiples of 8. It is a specification violation to define an AXI4-Stream IP with a TDATA port width that is not a multiple of 8; therefore, it is a requirement to round up TDATA widths to byte multiples. Refer to "Interface Synthesis and Structs" in Vivado Design Suite User Guide: High-Level Synthesis (UG902) for more information.
Syntax
Place the pragma near the definition of the struct variable to pack:
#pragma HLS data_pack variable=<variable> instance=<name> <byte_pad>
Where:
• variable=<variable>: Specifies the variable to be packed.
• instance=<name>: Specifies the name of the resultant variable after packing. If no <name> is specified, the input <variable> is used.
• <byte_pad>: Optionally specifies whether to pack data on an 8-bit boundary (8-bit, 16-bit, 24-bit...). The two supported values for this option are:
  • struct_level: Pack the whole struct first, then pad it upward to the next 8-bit boundary.
  • field_level: First pad each individual element (field) of the struct on an 8-bit boundary, then pack the struct.
TIP: Deciding whether multiple fields of data should be concatenated together before (field_level) or after (struct_level) alignment to byte boundaries is generally determined by considering how atomic the data is. Atomic information is data that can be interpreted on its own, whereas non-atomic information is incomplete for the purpose of interpreting the data. For example, atomic data can consist of all the bits of information in a floating point number. However, the exponent bits in the floating point number alone would not be atomic. When packing information into TDATA, generally non-atomic bits of data are concatenated together (regardless of bit width) until they form atomic units. The atomic units are then aligned to byte boundaries using pad bits where necessary.
Example 1
Packs the struct array AB[17] with three 8-bit fields (R, G, B) into a new 17-element array of 24 bits.
typedef struct{
unsigned char R, G, B;
} pixel;
pixel AB[17];
#pragma HLS data_pack variable=AB
Example 2
Packs the struct pointer AB with three 8-bit fields (typedef struct {unsigned char R, G, B;} pixel) in function foo into a new 24-bit pointer.
typedef struct{
unsigned char R, G, B;
} pixel;
pixel AB;
#pragma HLS data_pack variable=AB
Example 3
In this example the DATA_PACK pragma is specified for the in and out arguments of the rgb_to_hsv function to instruct the compiler to pack the structure on an 8-bit boundary to improve memory access:
void rgb_to_hsv(RGBcolor* in,  // Access global memory as RGBcolor struct-wise
                HSVcolor* out, // Access global memory as HSVcolor struct-wise
                int size) {
#pragma HLS data_pack variable=in struct_level
#pragma HLS data_pack variable=out struct_level
...
}
See Also
pragma HLS array_paron
pragma HLS array_reshape
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS dataflow
Description
The DATAFLOW pragma enables task-level pipelining, allowing functions and loops to overlap in their operation, increasing the concurrency of the RTL implementation, and increasing the overall throughput of the design.
All operaons are performed sequenally in a C descripon. In the absence of any direcves that
limit resources (such as pragma HLS allocation), Vivado HLS seeks to minimize latency and
improve concurrency. However, data dependencies can limit this. For example, funcons or loops
that access arrays must nish all read/write accesses to the arrays before they complete. This
prevents the next funcon or loop that consumes the data from starng operaon. The
DATAFLOW opmizaon enables the operaons in a funcon or loop to start operaon before
the previous funcon or loop completes all its operaons.
Figure 3: DATAFLOW Pragma
void top (a,b,c,d) {
  ...
  func_A(a,b,i1);
  func_B(c,i1,i2);
  func_C(i2,d);
  return d;
}
(A) Without dataflow pipelining, func_A, func_B, and func_C execute sequentially, completing in 8 cycles. (B) With dataflow pipelining, the three functions overlap: a new execution starts every 3 cycles, and each completes in 5 cycles. (X14266-110217)
When the DATAFLOW pragma is specified, Vivado HLS analyzes the dataflow between sequential
functions or loops and creates channels (based on ping-pong RAMs or FIFOs) that allow consumer
functions or loops to start operation before the producer functions or loops have completed.
This allows functions or loops to operate in parallel, which decreases latency and improves the
throughput of the RTL.
If no initiation interval (number of cycles between the start of one function or loop and the next)
is specified, Vivado HLS attempts to minimize the initiation interval and start operation as soon
as data is available.
TIP: The config_dataflow command specifies the default memory channel and FIFO depth used
in dataflow optimization. Refer to the config_dataflow command in the Vivado Design Suite User
Guide: High-Level Synthesis (UG902) for more information.
For the DATAFLOW opmizaon to work, the data must ow through the design from one task to
the next. The following coding styles prevent Vivado HLS from performing the DATAFLOW
opmizaon:
Single-producer-consumer violations
Bypassing tasks
Feedback between tasks
Conditional execution of tasks
Loops with multiple exit conditions
IMPORTANT!: If any of these coding styles are present, Vivado HLS issues a message and does not
perform DATAFLOW optimization.
Finally, the DATAFLOW optimization has no hierarchical implementation. If a sub-function or loop
contains additional tasks that might benefit from the DATAFLOW optimization, you must apply
the optimization to the loop or the sub-function, or inline the sub-function.
Syntax
Place the pragma in the C source within the boundaries of the region, function, or loop.
#pragma HLS dataflow
Example 1
Species DATAFLOW opmizaon within the loop wr_loop_j.
wr_loop_j: for (int j = 0; j < TILE_PER_ROW; ++j) {
#pragma HLS DATAFLOW
wr_buf_loop_m: for (int m = 0; m < TILE_HEIGHT; ++m) {
wr_buf_loop_n: for (int n = 0; n < TILE_WIDTH; ++n) {
#pragma HLS PIPELINE
// should burst TILE_WIDTH in WORD beat
outFifo >> tile[m][n];
}
}
wr_loop_m: for (int m = 0; m < TILE_HEIGHT; ++m) {
wr_loop_n: for (int n = 0; n < TILE_WIDTH; ++n) {
#pragma HLS PIPELINE
outx[TILE_HEIGHT*TILE_PER_ROW*TILE_WIDTH*i
+TILE_PER_ROW*TILE_WIDTH*m+TILE_WIDTH*j+n] = tile[m][n];
}
  }
}
See Also
pragma HLS allocaon
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
xcl_dataow
SDAccel Environment Opmizaon Guide (UG1207)
pragma HLS dependence
Description
The DEPENDENCE pragma is used to provide additional information that can overcome loop-
carry dependencies and allow loops to be pipelined (or pipelined with lower intervals).
Vivado HLS automatically detects dependencies:
Within loops (loop-independent dependence), or
Between different iterations of a loop (loop-carry dependence).
These dependencies impact when operations can be scheduled, especially during function and
loop pipelining.
Loop-independent dependence: The same element is accessed in the same loop iteration.
for (i=0;i<N;i++) {
A[i]=x;
y=A[i];
}
Loop-carry dependence: The same element is accessed in a different loop iteration.
for (i=0;i<N;i++) {
A[i]=A[i-1]*2;
}
In certain complex scenarios, automatic dependence analysis can be too conservative and fail
to filter out false dependencies, such as with variable-dependent array indexing, or when an
external requirement needs to be enforced (for example, two inputs are never the same index).
The DEPENDENCE pragma allows you to explicitly specify the dependence and resolve a false
dependence.
IMPORTANT!: Specifying a false dependency, when in fact the dependency is not false, can result in
incorrect hardware. Be sure dependencies are correct (true or false) before specifying them.
Syntax
Place the pragma within the boundaries of the function where the dependence is defined.
#pragma HLS dependence variable=<variable> <class> \
    <type> <direction> distance=<int> <dependent>
Where:
variable=<variable>: Optionally specifies the variable to consider for the dependence.
<class>: Oponally species a class of variables in which the dependence needs claricaon.
Valid values include array or pointer.
TIP: <class> and
variable=
do not need to be specied together as you can either specify a variable
or a class of variables within a funcon.
<type>: Valid values include intra or inter. Specifies whether the dependence is:
intra: dependence within the same loop iteration. When dependence <type> is specified
as intra, and <dependent> is false, Vivado HLS may move operations freely within a loop,
increasing their mobility and potentially improving performance or area. When
<dependent> is specified as true, the operations must be performed in the order specified.
inter: dependence between different loop iterations. This is the default <type>. If
dependence <type> is specified as inter, and <dependent> is false, it allows Vivado HLS
to perform operations in parallel if the function or loop is pipelined, or the loop is unrolled,
or partially unrolled, and prevents such concurrent operation when <dependent> is
specified as true.
<direcon>: Valid values include RAW, WAR, or WAW. This is relevant for loop-carry
dependencies only, and species the direcon for a dependence:
RAW (Read-Aer-Write - true dependence) The write instrucon uses a value used by the
read instrucon.
WAR (Write-Aer-Read - an dependence) The read instrucon gets a value that is
overwrien by the write instrucon.
WAW (Write-Aer-Write - output dependence) Two write instrucons write to the same
locaon, in a certain order.
distance=<int>: Specifies the inter-iteration distance for array access. Relevant only for
loop-carry dependencies where dependence is set to true.
<dependent>: Specifies whether a dependence needs to be enforced (true) or removed
(false). The default is true.
Example 1
In the following example, Vivado HLS does not have any knowledge about the value of cols and
conservatively assumes that there is always a dependence between the write to buff_A[1]
[col] and the read from buff_A[1][col]. In an algorithm such as this, it is unlikely cols will
ever be zero, but Vivado HLS cannot make assumptions about data dependencies. To overcome
this deficiency, you can use the DEPENDENCE pragma to state that there is no dependence
between loop iterations (in this case, for both buff_A and buff_B).
void foo(int rows, int cols, ...)
for (row = 0; row < rows + 1; row++) {
for (col = 0; col < cols + 1; col++) {
#pragma HLS PIPELINE II=1
#pragma HLS dependence variable=buff_A inter false
#pragma HLS dependence variable=buff_B inter false
if (col < cols) {
buff_A[2][col] = buff_A[1][col]; // read from buff_A[1][col]
buff_A[1][col] = buff_A[0][col]; // write to buff_A[1][col]
buff_B[1][col] = buff_B[0][col];
temp = buff_A[0][col];
}
Example 2
Removes the dependence on Var1 within the same iterations of loop_1 in function foo.
#pragma HLS dependence variable=Var1 intra false
Example 3
Denes the dependence on all arrays in loop_2 of funcon foo to inform Vivado HLS that all
reads must happen aer writes (RAW) in the same loop iteraon.
#pragma HLS dependence array intra RAW true
See Also
pragma HLS pipeline
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
xcl_pipeline_loop
SDAccel Environment Optimization Guide (UG1207)
pragma HLS expression_balance
Description
Somemes a C-based specicaon is wrien with a sequence of operaons resulng in a long
chain of operaons in RTL. With a small clock period, this can increase the latency in the design.
By default, Vivado HLS rearranges the operaons using associave and commutave properes.
This rearrangement creates a balanced tree that can shorten the chain, potenally reducing
latency in the design at the cost of extra hardware.
The EXPRESSION_BALANCE pragma allows this expression balancing to be disabled, or to be
expressly enabled, within a specied scope.
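To picture what balancing does, the two plain C++ functions below (illustrative only, not HLS-specific code) compute the same sum. The first implies a linear adder chain in RTL whose depth grows with the number of operands; the second is the balanced-tree form Vivado HLS would create by default:

```cpp
#include <cassert>

// Sequential chain: ((((a+b)+c)+d)+e) implies a depth-4 adder chain.
int chain_sum(int a, int b, int c, int d, int e) {
    return ((((a + b) + c) + d) + e);
}

// Balanced tree: ((a+b)+(c+d))+e implies a depth-2 tree plus one adder,
// shortening the critical path at the cost of no extra adders here
// (larger expressions may need extra hardware).
int balanced_sum(int a, int b, int c, int d, int e) {
    return ((a + b) + (c + d)) + e;
}
```

Because addition is associative and commutative, both forms produce identical results, which is what makes the rearrangement safe.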
Syntax
Place the pragma in the C source within the boundaries of the required location.
#pragma HLS expression_balance off
Where:
off: Turns off expression balancing at this location.
TIP: Leaving this option out of the pragma enables expression balancing, which is the default mode.
Example 1
This example explicitly enables expression balancing in function my_func:
void my_func(char inval, char incr) {
#pragma HLS expression_balance
Example 2
Disables expression balancing within function my_func:
void my_func(char inval, char incr) {
#pragma HLS expression_balance off
See Also
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS function_instantiate
Description
The FUNCTION_INSTANTIATE pragma is an optimization technique that has the area benefits
of maintaining the function hierarchy but provides an additional powerful option: performing
targeted local optimizations on specific instances of a function. This can simplify the control logic
around the function call and potentially improve latency and throughput.
By default:
Funcons remain as separate hierarchy blocks in the RTL.
All instances of a funcon, at the same level of hierarchy, make use of a single RTL
implementaon (block).
The FUNCTION_INSTANTIATE pragma is used to create a unique RTL implementation for each
instance of a function, allowing each instance to be locally optimized according to the function
call. This pragma exploits the fact that some inputs to a function may be a constant value when
the function is called, and uses this to both simplify the surrounding control structures and
produce smaller, more optimized function blocks.
Without the FUNCTION_INSTANTIATE pragma, the following code results in a single RTL
implementation of function foo_sub for all three instances of the function in foo. Each
instance of function foo_sub is implemented in an identical manner. This is fine for function
reuse and reducing the area required for each instance call of a function, but means that the
control logic inside the function must be more complex to account for the variation in each call of
foo_sub.
char foo_sub(char inval, char incr) {
#pragma HLS function_instantiate variable=incr
return inval + incr;
}
void foo(char inval1, char inval2, char inval3,
char *outval1, char *outval2, char * outval3)
{
*outval1 = foo_sub(inval1, 1);
*outval2 = foo_sub(inval2, 2);
*outval3 = foo_sub(inval3, 3);
}
In the code sample above, the FUNCTION_INSTANTIATE pragma results in three different
implementations of function foo_sub, each independently optimized for the incr argument,
reducing the area and improving the performance of the function. After
FUNCTION_INSTANTIATE optimization, foo_sub is effectively transformed into three
separate functions, each optimized for the specified values of incr.
Syntax
Place the pragma in the C source within the boundaries of the required location.
#pragma HLS function_instantiate variable=<variable>
Where:
variable=<variable>: A required argument that defines the function argument to use as
a constant.
Example 1
In the following example, the FUNCTION_INSTANTIATE pragma placed in function swInt
allows each instance of function swInt to be independently optimized with respect to the maxv
function argument:
void swInt(unsigned int *readRefPacked, short *maxr, short *maxc, short
*maxv){
#pragma HLS function_instantiate variable=maxv
uint2_t d2bit[MAXCOL];
uint2_t q2bit[MAXROW];
#pragma HLS array_partition variable=d2bit,q2bit cyclic factor=FACTOR
intTo2bit<MAXCOL/16>((readRefPacked + MAXROW/16), d2bit);
intTo2bit<MAXROW/16>(readRefPacked, q2bit);
sw(d2bit, q2bit, maxr, maxc, maxv);
}
See Also
pragma HLS allocaon
pragma HLS inline
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS inline
Description
Removes a funcon as a separate enty in the hierarchy. Aer inlining, the funcon is dissolved
into the calling funcon and no longer appears as a separate level of hierarchy in the RTL. In
some cases, inlining a funcon allows operaons within the funcon to be shared and opmized
more eecvely with surrounding operaons. An inlined funcon cannot be shared. This can
increase area required for implemenng the RTL.
The INLINE pragma applies differently to the scope it is defined in depending on how it is
specified:
INLINE: Without arguments, the pragma means that the function it is specified in should be
inlined upward into any calling functions or regions.
INLINE OFF: Specifies that the function it is specified in should NOT be inlined upward into
any calling functions or regions. This disables the inline of a specific function that may be
automatically inlined, or inlined as part of a region or recursion.
INLINE REGION: This applies the pragma to the region or the body of the function it is
assigned in. It applies downward, inlining the contents of the region or function, but not
inlining recursively through the hierarchy.
INLINE RECURSIVE: This applies the pragma to the region or the body of the function it is
assigned in. It applies downward, recursively inlining the contents of the region or function.
By default, inlining is only performed on the next level of function hierarchy, not sub-functions.
However, the recursive option lets you specify inlining through levels of the hierarchy.
Syntax
Place the pragma in the C source within the body of the function or region of code.
#pragma HLS inline <region | recursive | off>
Where:
region: Oponally species that all funcons in the specied region (or contained within the
body of the funcon) are to be inlined, applies to the scope of the region.
recursive: By default, only one level of funcon inlining is performed, and funcons within
the specied funcon are not inlined. The recursive opon inlines all funcons recursively
within the specied funcon or region.
off: Disables funcon inlining to prevent specied funcons from being inlined. For example,
if recursive is specied in a funcon, this opon can prevent a parcular called funcon
from being inlined when all others are.
TIP: Vivado HLS automacally inlines small funcons and using the INLINE pragma with the
off
opon may be used to prevent this automac inlining.
Example 1
This example inlines all funcons within the region it is specied in, in this case the body of
foo_top, but does not inline any lower level funcons within those funcons.
void foo_top(int a, int b, int c, int d) {
#pragma HLS inline region
...
Example 2
The following example inlines all functions within the body of foo_top, inlining recursively
down through the function hierarchy, except function foo_sub, which is not inlined. The
recursive pragma is placed in function foo_top. The pragma to disable inlining is placed in
function foo_sub:
foo_sub (p, q) {
#pragma HLS inline off
int q1 = q + 10;
foo(p1,q);// foo_3
...
}
void foo_top(int a, int b, int c, int d) {
#pragma HLS inline region recursive
...
foo(a,b);//foo_1
foo(a,c);//foo_2
foo_sub(a,d);
...
}
Note: Noce in this example, that INLINE applies downward to the contents of funcon foo_top, but
applies upward to the code calling foo_sub.
Example 3
This example inlines the copy_output function into any functions or regions calling
copy_output.
void copy_output(int *out, int out_lcl[OSize * OSize], int output) {
#pragma HLS INLINE
// Calculate each work_item's result update location
int stride = output * OSize * OSize;
// Work_item updates output filter/image in DDR
writeOut: for(int itr = 0; itr < OSize * OSize; itr++) {
#pragma HLS PIPELINE
out[stride + itr] = out_lcl[itr];
  }
}
See Also
pragma HLS allocaon
pragma HLS funcon_instanate
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS interface
Description
In C based design, all input and output operations are performed, in zero time, through formal
function arguments. In an RTL design these same input and output operations must be
performed through a port in the design interface and typically operate using a specific I/O (input-
output) protocol. For more information, refer to "Managing Interfaces" in the Vivado Design Suite
User Guide: High-Level Synthesis (UG902).
The INTERFACE pragma species how RTL ports are created from the funcon denion during
interface synthesis.
The ports in the RTL implementaon are derived from:
Any funcon-level protocol that is specied.
Funcon arguments.
Global variables accessed by the top-level funcon and dened outside its scope.
Funcon-level protocols, also called block-level I/O protocols, provide signals to control when
the funcon starts operaon, and indicate when funcon operaon ends, is idle, and is ready for
new inputs. The implementaon of a funcon-level protocol:
Is specied by the <mode> values ap_ctrl_none, ap_ctrl_hs or ap_ctrl_chain. The
ap_ctrl_hs block-level I/O protocol is the default.
Are associated with the funcon name.
Each funcon argument can be specied to have its own port-level (I/O) interface protocol, such
as valid handshake (ap_vld) or acknowledge handshake (ap_ack). Port Level interface protocols
are created for each argument in the top-level funcon and the funcon return, if the funcon
returns a value. The default I/O protocol created depends on the type of C argument. Aer the
block-level protocol has been used to start the operaon of the block, the port-level IO protocols
are used to sequence data into and out of the block.
If a global variable is accessed, but all read and write operations are local to the design, the
resource is created in the design. There is no need for an I/O port in the RTL. If the global
variable is expected to be an external source or destination, specify its interface in a similar
manner as standard function arguments. See the examples below.
When the INTERFACE pragma is used on sub-functions, only the register option can be used.
The <mode> option is not supported on sub-functions.
TIP: Vivado HLS automacally determines the I/O protocol used by any sub-funcons. You cannot
control these ports except to specify whether the port is registered.
Syntax
Place the pragma within the boundaries of the function.
#pragma HLS interface <mode> port=<name> bundle=<string> \
    register register_mode=<mode> depth=<int> offset=<string> \
    clock=<string> name=<string> \
    num_read_outstanding=<int> num_write_outstanding=<int> \
    max_read_burst_length=<int> max_write_burst_length=<int>
Where:
<mode>: Species the interface protocol mode for funcon arguments, global variables used
by the funcon, or the block-level control protocols. For detailed descripons of these
dierent modes see "Interface Synthesis Reference" in the Vivado Design Suite User Guide:
High-Level Synthesis (UG902). The mode can be specied as one of the following:
ap_none: No protocol. The interface is a data port.
ap_stable: No protocol. The interface is a data port. Vivado HLS assumes the data port is
always stable after reset, which allows internal optimizations to remove unnecessary
registers.
ap_vld: Implements the data port with an associated valid port to indicate when the
data is valid for reading or writing.
ap_ack: Implements the data port with an associated acknowledge port to acknowledge
that the data was read or written.
ap_hs: Implements the data port with associated valid and acknowledge ports to
provide a two-way handshake to indicate when the data is valid for reading and writing and
to acknowledge that the data was read or written.
ap_ovld: Implements the output data port with an associated valid port to indicate
when the data is valid for reading or writing.
IMPORTANT!: Vivado HLS implements the input argument or the input half of any read/write
arguments with mode ap_none.
ap_fifo: Implements the port with a standard FIFO interface using data input and output
ports with associated active-Low FIFO empty and full ports.
Note: You can only use this interface on read arguments or write arguments. The ap_fifo mode does
not support bidirectional read/write arguments.
ap_bus: Implements pointer and pass-by-reference ports as a bus interface.
ap_memory: Implements array arguments as a standard RAM interface. If you use the RTL
design in Vivado IP integrator, the memory interface appears as discrete ports.
bram: Implements array arguments as a standard RAM interface. If you use the RTL design
in Vivado IP integrator, the memory interface appears as a single port.
axis: Implements all ports as an AXI4-Stream interface.
s_axilite: Implements all ports as an AXI4-Lite interface. Vivado HLS produces an
associated set of C driver files during the Export RTL process.
m_axi: Implements all ports as an AXI4 interface. You can use the config_interface
command to specify either 32-bit (default) or 64-bit address ports and to control any
address offset.
ap_ctrl_none: No block-level I/O protocol.
Note: Using the ap_ctrl_none mode might prevent the design from being verified using the C/RTL co-
simulation feature.
ap_ctrl_hs: Implements a set of block-level control ports to start the design operation
and to indicate when the design is idle, done, and ready for new input data.
Note: The ap_ctrl_hs mode is the default block-level I/O protocol.
ap_ctrl_chain: Implements a set of block-level control ports to start the design
operation, continue operation, and indicate when the design is idle, done, and ready
for new input data.
Note: The ap_ctrl_chain interface mode is similar to ap_ctrl_hs but provides an additional input
signal ap_continue to apply back pressure. Xilinx recommends using the ap_ctrl_chain block-
level I/O protocol when chaining Vivado HLS blocks together.
port=<name>: Specifies the name of the function argument, function return, or global
variable which the INTERFACE pragma applies to.
TIP: Block-level I/O protocols (ap_ctrl_none, ap_ctrl_hs, or ap_ctrl_chain) can be
assigned to a port for the function return value.
bundle=<string>: Groups function arguments into AXI interface ports. By default, Vivado
HLS groups all function arguments specified as an AXI4-Lite (s_axilite) interface into a
single AXI4-Lite port. Similarly, all function arguments specified as an AXI4 (m_axi) interface
are grouped into a single AXI4 port. This option explicitly groups all interface ports with the
same bundle=<string> into the same AXI interface port and names the RTL port the value
specified by <string>.
register: An oponal keyword to register the signal and any relevant protocol signals, and
causes the signals to persist unl at least the last cycle of the funcon execuon. This opon
applies to the following interface modes:
ap_none
ap_ack
ap_vld
ap_ovld
ap_hs
ap_stable
axis
s_axilite
TIP: The -register_io option of the config_interface command globally controls registering
all inputs/outputs on the top function. Refer to Vivado Design Suite User Guide: High-Level Synthesis
(UG902) for more information.
register_mode=<forward|reverse|both|off>: Used with the register keyword,
this option specifies if registers are placed on the forward path (TDATA and TVALID), the
reverse path (TREADY), on both paths (TDATA, TVALID, and TREADY), or if none of the
port signals are to be registered (off). The default register_mode is both. AXI-Stream
(axis) side-channel signals are considered to be data signals and are registered whenever
TDATA is registered.
depth=<int>: Specifies the maximum number of samples for the test bench to process. This
setting indicates the maximum size of the FIFO needed in the verification adapter that Vivado
HLS creates for RTL co-simulation.
TIP: While depth is usually an option, it is required for m_axi interfaces.
offset=<string>: Controls the address offset in AXI4-Lite (s_axilite) and AXI4
(m_axi) interfaces.
For the s_axilite interface, <string> specifies the address in the register map.
For the m_axi interface, <string> specifies one of the following values:
direct: Generate a scalar input offset port.
slave: Generate an offset port and automatically map it to an AXI4-Lite slave interface.
off: Do not generate an offset port.
TIP: The -m_axi_offset option of the config_interface command globally controls the
offset ports of all M_AXI interfaces in the design.
clock=<name>: Optionally specified only for interface mode s_axilite. This defines the
clock signal to use for the interface. By default, the AXI-Lite interface clock is the same clock
as the system clock. This option is used to specify a separate clock for the AXI-Lite
(s_axilite) interface.
TIP: If the bundle option is used to group multiple top-level function arguments into a single AXI-Lite
interface, the clock option need only be specified on one of the bundle members.
num_read_outstanding=<int>: For AXI4 (m_axi) interfaces, this option specifies how
many read requests can be made to the AXI4 bus, without a response, before the design stalls.
This implies internal storage in the design, a FIFO of size:
num_read_outstanding*max_read_burst_length*word_size.
num_write_outstanding=<int>: For AXI4 (m_axi) interfaces, this option specifies how
many write requests can be made to the AXI4 bus, without a response, before the design
stalls. This implies internal storage in the design, a FIFO of size:
num_write_outstanding*max_write_burst_length*word_size.
max_read_burst_length=<int>: For AXI4 (m_axi) interfaces, this option specifies the
maximum number of data values read during a burst transfer.
max_write_burst_length=<int>: For AXI4 (m_axi) interfaces, this option specifies the
maximum number of data values written during a burst transfer.
name=<string>: This option is used to rename the port based on your own specification.
The generated RTL port will use this name.
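The FIFO sizing formula given for the outstanding-transaction options can be checked with a small helper (illustrative only, not a Vivado HLS API):

```cpp
#include <cassert>

// Illustrative sizing helper: internal FIFO storage implied by the
// outstanding-transaction options, per the formula
// num_outstanding * max_burst_length * word_size.
unsigned fifo_bytes(unsigned num_outstanding,
                    unsigned max_burst_length,
                    unsigned word_size_bytes) {
    return num_outstanding * max_burst_length * word_size_bytes;
}
```

For example, num_read_outstanding=8 with max_read_burst_length=16 and 4-byte words implies 8 * 16 * 4 = 512 bytes of internal read storage.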
Example 1
In this example, both funcon arguments are implemented using an AXI4-Stream interface:
void example(int A[50], int B[50]) {
//Set the HLS native interface types
#pragma HLS INTERFACE axis port=A
#pragma HLS INTERFACE axis port=B
int i;
for(i = 0; i < 50; i++){
B[i] = A[i] + 5;
}
}
Example 2
The following turns off block-level I/O protocols, and is assigned to the function return value:
#pragma HLS interface ap_ctrl_none port=return
The funcon argument InData is specied to use the ap_vld interface, and also indicates the
input should be registered:
#pragma HLS interface ap_vld register port=InData
This exposes the global variable lookup_table as a port on the RTL design, with an
ap_memory interface:
#pragma HLS interface ap_memory port=lookup_table
Example 3
This example denes the INTERFACE standards for the ports of the top-level transpose
funcon. Noce the use of the bundle= opon to group signals.
// TOP LEVEL - TRANSPOSE
void transpose(int* input, int* output) {
#pragma HLS INTERFACE m_axi port=input offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=output offset=slave bundle=gmem1
#pragma HLS INTERFACE s_axilite port=input bundle=control
#pragma HLS INTERFACE s_axilite port=output bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
#pragma HLS dataflow
See Also
pragma HLS protocol
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS latency
Description
Species a minimum or maximum latency value, or both, for the compleon of funcons, loops,
and regions. Latency is dened as the number of clock cycles required to produce an output.
Funcon latency is the number of clock cycles required for the funcon to compute all output
values, and return. Loop latency is the number of cycles to execute all iteraons of the loop. See
"Performance Metrics Example" of Vivado Design Suite User Guide: High-Level Synthesis (UG902).
Vivado HLS always tries to minimize latency in the design. When the LATENCY pragma is
specified, the tool behavior is as follows:
Latency is greater than the minimum, or less than the maximum: The constraint is satisfied. No
further optimizations are performed.
Latency is less than the minimum: If Vivado HLS can achieve less than the minimum specified
latency, it extends the latency to the specified value, potentially increasing sharing.
Latency is greater than the maximum: If Vivado HLS cannot schedule within the maximum
limit, it increases effort to achieve the specified constraint. If it still fails to meet the maximum
latency, it issues a warning, and produces a design with the smallest achievable latency in
excess of the maximum.
TIP: You can also use the LATENCY pragma to limit the efforts of the tool to find an optimum solution.
Specifying latency constraints for scopes within the code (loops, functions, or regions) reduces the
possible solutions within that scope and improves tool runtime. Refer to "Improving Run Time and
Capacity" of Vivado Design Suite User Guide: High-Level Synthesis (UG902) for more information.
Syntax
Place the pragma within the boundary of a function, loop, or region of code where the latency
must be managed.
#pragma HLS latency min=<int> max=<int>
Where:
min=<int>: Optionally specifies the minimum latency for the function, loop, or region of
code.
max=<int>: Optionally specifies the maximum latency for the function, loop, or region of
code.
Note: Although both min and max are described as optional, one must be specified.
Example 1
Funcon foo is specied to have a minimum latency of 4 and a maximum latency of 8:
int foo(char x, char a, char b, char c) {
#pragma HLS latency min=4 max=8
char y;
y = x*a+b+c;
return y;
}
Example 2
In the following example, loop_1 is specified to have a maximum latency of 12. Place the pragma
in the loop body as shown:
void foo (num_samples, ...) {
int i;
...
loop_1: for(i=0;i< num_samples;i++) {
#pragma HLS latency max=12
...
result = a + b;
}
}
Example 3
The following example creates a code region and groups signals that need to change in the same
clock cycle by specifying zero latency:
// create a region { } with a latency = 0
{
#pragma HLS LATENCY max=0 min=0
*data = 0xFF;
*data_vld = 1;
}
See Also
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS loop_flatten
Description
Allows nested loops to be flattened into a single loop hierarchy with improved latency.
In the RTL implementation, it requires one clock cycle to move from an outer loop to an inner
loop, and from an inner loop to an outer loop. Flattening nested loops allows them to be
optimized as a single loop. This saves clock cycles, potentially allowing for greater optimization of
the loop body logic.
Apply the LOOP_FLATTEN pragma to the loop body of the inner-most loop in the loop hierarchy.
Only perfect and semi-perfect loops can be flattened in this manner:
Perfect loop nests:
Only the innermost loop has loop body content.
There is no logic specified between the loop statements.
All loop bounds are constant.
Semi-perfect loop nests:
Only the innermost loop has loop body content.
There is no logic specified between the loop statements.
The outermost loop bound can be a variable.
Imperfect loop nests: When the inner loop has variable bounds (or the loop body is not
exclusively inside the inner loop), try to restructure the code, or unroll the loops in the loop
body to create a perfect loop nest.
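To make the classification concrete, the following is a minimal sketch of a perfect loop nest (the function name, bounds, and data are illustrative, not from the guide): only the innermost loop has body content, nothing sits between the loop statements, and both bounds are constant, so the pair qualifies for flattening into one 32-iteration loop.

```c
#include <assert.h>

#define ROWS 4
#define COLS 8

/* A perfect loop nest: flattening removes the clock cycle otherwise
 * spent re-entering the inner loop on each new row. The pragma is
 * placed in the body of the innermost loop, per the rule above. */
void scale_matrix(int in[ROWS][COLS], int out[ROWS][COLS], int scale) {
    row: for (int i = 0; i < ROWS; i++) {
        col: for (int j = 0; j < COLS; j++) {
            #pragma HLS loop_flatten
            out[i][j] = in[i][j] * scale;
        }
    }
}
```

If the outer bound ROWS were a function argument instead of a constant, the nest would still be semi-perfect and remain flattenable.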
Appendix B: HLS Pragmas
Vivado HLS Optimization Methodology Guide 113
UG1270 (v2017.4) December 20, 2017 www.xilinx.com [placeholder text]
Syntax
Place the pragma in the C source within the boundaries of the nested loop.
#pragma HLS loop_flatten off
Where:
off: An optional keyword that prevents flattening from taking place. It can prevent some
loops from being flattened while all others in the specified location are flattened.
Note: The presence of the LOOP_FLATTEN pragma enables the optimization.
Example 1
Flattens loop_1 in function foo and all (perfect or semi-perfect) loops above it in the loop
hierarchy, into a single loop. Place the pragma in the body of loop_1.
void foo (num_samples, ...) {
int i;
...
loop_1: for(i=0;i< num_samples;i++) {
#pragma HLS loop_flatten
...
result = a + b;
}
}
Example 2
Prevents loop aening in loop_1:
loop_1: for(i=0;i< num_samples;i++) {
#pragma HLS loop_flatten off
...
See Also
pragma HLS loop_merge
pragma HLS loop_tripcount
pragma HLS unroll
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS loop_merge
Description
Merge consecuve loops into a single loop to reduce overall latency, increase sharing, and
improve logic opmizaon. Merging loops:
Reduces the number of clock cycles required in the RTL to transion between the loop-body
implementaons.
Allows the loops be implemented in parallel (if possible).
The LOOP_MERGE pragma will seek to merge all loops within the scope it is placed. For example,
if you apply a LOOP_MERGE pragma in the body of a loop, Vivado HLS applies the pragma to any
sub-loops within the loop but not to the loop itself.
The rules for merging loops are:
If the loop bounds are variables, they must have the same value (number of iterations).
If the loop bounds are constants, the maximum constant value is used as the bound of the
merged loop.
Loops with both variable bounds and constant bounds cannot be merged.
The code between loops to be merged cannot have side effects. Multiple execution of this
code should generate the same results (a=b is allowed, a=a+1 is not).
Loops cannot be merged when they contain FIFO reads. Merging changes the order of the
reads. Reads from a FIFO or FIFO interface must always be in sequence.
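The rules above can be sketched with a hypothetical pair of loops (function and label names are ours): both loops share the same constant bound, and no code sits between them, so they are merge candidates for a pragma placed in the enclosing function scope.

```c
#include <assert.h>

#define N 8

/* Two consecutive loops with identical constant bounds and no
 * side-effecting code between them: these satisfy the merge rules,
 * so the LOOP_MERGE pragma lets Vivado HLS fuse them into a single
 * 8-iteration loop, saving the inter-loop transition cycles. */
void add_and_scale(const int a[N], const int b[N], int sum[N], int scaled[N]) {
    #pragma HLS loop_merge
    add:   for (int i = 0; i < N; i++)
        sum[i] = a[i] + b[i];
    scale: for (int j = 0; j < N; j++)
        scaled[j] = 2 * a[j];
}
```

Had either loop read from a FIFO interface, merging would reorder the reads and the pragma would be rejected (or require force, at your own risk).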
Syntax
Place the pragma in the C source within the required scope or region of code:
#pragma HLS loop_merge force
where
force: An optional keyword to force loops to be merged even when Vivado HLS issues a
warning.
IMPORTANT!: In this case, you must manually ensure that the merged loop will function correctly.
Examples
Merges all consecuve loops in funcon foo into a single loop.
void foo (num_samples, ...) {
#pragma HLS loop_merge
int i;
...
loop_1: for(i=0;i< num_samples;i++) {
...
All loops inside loop_2 (but not loop_2 itself) are merged by using the force option. Place
the pragma in the body of loop_2.
loop_2: for(i=0;i< num_samples;i++) {
#pragma HLS loop_merge force
...
See Also
pragma HLS loop_aen
pragma HLS loop_tripcount
pragma HLS unroll
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS loop_tripcount
Description
The TRIPCOUNT pragma can be applied to a loop to manually specify the total number of
iterations performed by a loop.
IMPORTANT!: The TRIPCOUNT pragma is for analysis only, and does not impact the results of
synthesis.
Vivado HLS reports the total latency of each loop, which is the number of clock cycles to execute
all iterations of the loop. The loop latency is therefore a function of the number of loop
iterations, or tripcount.
The tripcount can be a constant value. It may depend on the value of variables used in the loop
expression (for example, x<y), or depend on control statements used inside the loop. In some
cases Vivado HLS cannot determine the tripcount, and the latency is unknown. This includes
cases in which the variables used to determine the tripcount are:
Input arguments, or
Variables calculated by dynamic operation.
In cases where the loop latency is unknown or cannot be calculated, the TRIPCOUNT pragma lets
you specify minimum and maximum iterations for a loop. This lets the tool analyze how the loop
latency contributes to the total design latency in the reports, and helps you determine
appropriate optimizations for the design.
Syntax
Place the pragma in the C source within the body of the loop:
#pragma HLS loop_tripcount min=<int> max=<int> avg=<int>
Where:
max=<int>: Specifies the maximum number of loop iterations.
min=<int>: Specifies the minimum number of loop iterations.
avg=<int>: Specifies the average number of loop iterations.
Examples
In this example, loop_1 in function foo is specified to have a minimum tripcount of 12 and a
maximum tripcount of 16:
void foo (num_samples, ...) {
int i;
...
loop_1: for(i=0;i< num_samples;i++) {
#pragma HLS loop_tripcount min=12 max=16
...
result = a + b;
}
}
See Also
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS occurrence
Description
When pipelining funcons or loops, the OCCURRENCE pragma species that the code in a region
is executed less frequently than the code in the enclosing funcon or loop. This allows the code
that is executed less oen to be pipelined at a slower rate, and potenally shared within the top-
level pipeline. To determine the OCCURRENCE:
A loop iterates N mes.
However, part of the loop body is enabled by a condional statement, and as a result only
executes M mes, where N is an integer mulple of M.
The condional code has an occurrence that is N/M mes slower than the rest of the loop
body.
For example, in a loop that executes 10 mes, a condional statement within the loop only
executes 2 mes has an occurrence of 5 (or 10/2).
Idenfying a region with the OCCURRENCE pragma allows the funcons and loops in that region
to be pipelined with a higher iniaon interval that is slower than the enclosing funcon or loop.
Syntax
Place the pragma in the C source within a region of code.
#pragma HLS occurrence cycle=<int>
Where:
cycle=<int>: Specifies the occurrence N/M, where:
N is the number of times the enclosing function or loop is executed.
M is the number of times the conditional region is executed.
IMPORTANT!: N must be an integer multiple of M.
Examples
In this example, the region Cond_Region has an occurrence of 4 (it executes at a rate four times
less often than the surrounding code that contains it):
Cond_Region: {
#pragma HLS occurrence cycle=4
...
}
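A fuller sketch of the 10/2 example described above follows (function, label, and data names are illustrative, not from the guide): the guarded region executes on 2 of the 10 iterations, so its occurrence is 5.

```c
#include <assert.h>

/* The loop runs N = 10 times; the guarded region executes only when
 * (i % 5) == 0, i.e. M = 2 times, for an occurrence of N/M = 5. The
 * pragma tells the scheduler the region can be pipelined at one
 * fifth of the enclosing loop's rate and shared accordingly. */
int accumulate(const int data[10]) {
    int acc = 0;
    main_loop: for (int i = 0; i < 10; i++) {
        if (i % 5 == 0) {
            Cond_Region: {
                #pragma HLS occurrence cycle=5
                acc += 100;          /* infrequent work */
            }
        }
        acc += data[i];              /* every-iteration work */
    }
    return acc;
}
```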
See Also
pragma HLS pipeline
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS pipeline
Description
The PIPELINE pragma reduces the initiation interval for a function or loop by allowing the
concurrent execution of operations.
A pipelined function or loop can process new inputs every N clock cycles, where N is the
initiation interval (II) of the loop or function. The default initiation interval for the PIPELINE
pragma is 1, which processes a new input every clock cycle. You can also specify the initiation
interval through the use of the II option for the pragma.
Pipelining a loop allows the operations of the loop to be implemented in a concurrent manner as
shown in the following figure. In this figure, (A) shows the default sequential operation where
there are 3 clock cycles between each input read (II=3), and it requires 8 clock cycles before the
last output write is performed.
Figure 4: Loop Pipeline
void func(m,n,o) {
  for (i=2;i>=0;i--) {
    op_Read;
    op_Compute;
    op_Write;
  }
}
[Figure: (A) Without loop pipelining, each iteration's RD, CMP, and WR operations execute in
sequence: 3 cycles between input reads (II=3) and 8 cycles to the last write. (B) With loop
pipelining, iterations overlap: a new read every cycle (II=1) and 4 cycles to the last write.]
IMPORTANT!: Loop pipelining can be prevented by loop-carry dependencies. You can use the
DEPENDENCE pragma to provide additional information that can overcome loop-carry dependencies
and allow loops to be pipelined (or pipelined with lower intervals).
If Vivado HLS cannot create a design with the specified II, it:
Issues a warning.
Creates a design with the lowest possible II.
You can then analyze this design with the warning message to determine what steps must be
taken to create a design that satisfies the required initiation interval.
Syntax
Place the pragma in the C source within the body of the function or loop.
#pragma HLS pipeline II=<int> enable_flush rewind
Where:
II=<int>: Specifies the desired initiation interval for the pipeline. Vivado HLS tries to meet
this request. Based on data dependencies, the actual result might have a larger initiation
interval. The default II is 1.
enable_flush: An optional keyword which implements a pipeline that will flush and empty
if the data valid at the input of the pipeline goes inactive.
TIP: This feature is only supported for pipelined functions: it is not supported for pipelined loops.
rewind: An optional keyword that enables rewinding, or continuous loop pipelining, with no
pause between one loop iteration ending and the next iteration starting. Rewinding is
effective only if there is one single loop (or a perfect loop nest) inside the top-level function.
The code segment before the loop:
Is considered as initialization.
Is executed only once in the pipeline.
Cannot contain any conditional operations (if-else).
TIP: This feature is only supported for pipelined loops: it is not supported for pipelined functions.
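The rewind constraints above can be sketched as follows (function and array names are ours, not from the guide): a single loop inside the top-level function, with one unconditional initialization statement before it, so successive executions of the loop can overlap with no pause.

```c
#include <assert.h>

#define N 16

/* Candidate for continuous (rewind) pipelining: a single loop inside
 * the top-level function. The statement before the loop is pure
 * initialization, runs once, and contains no if-else, so it satisfies
 * the rewind preconditions. */
void running_sum(const int in[N], int out[N]) {
    int acc = 0;                     /* initialization, executed once */
    acc_loop: for (int i = 0; i < N; i++) {
        #pragma HLS pipeline II=1 rewind
        acc += in[i];
        out[i] = acc;
    }
}
```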
Example 1
In this example funcon foo is pipelined with an iniaon interval of 1:
void foo { a, b, c, d} {
#pragma HLS pipeline II=1
...
}
Note: The default value for II is 1, so II=1 is not required in this example.
See Also
pragma HLS dependence
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
xcl_pipeline_loop
SDAccel Environment Optimization Guide (UG1207)
pragma HLS protocol
Description
The PROTOCOL pragma specifies a region of the code to be a protocol region, in which no clock
operations are inserted by Vivado HLS unless explicitly specified in the code. A protocol region
can be used to manually specify an interface protocol to ensure the final design can be
connected to other hardware blocks with the same I/O protocol.
Note: See "Specifying Manual Interface" in the Vivado Design Suite User Guide: High-Level Synthesis (UG902)
for more information.
Vivado HLS does not insert any clocks between the operations, including those that read from, or
write to, function arguments, unless explicitly specified in the code. The order of reads and writes
is therefore obeyed in the RTL.
A clock operation may be specified:
In C, by using an ap_wait() statement (include ap_utils.h).
In C++ and SystemC designs, by using the wait() statement (include systemc.h).
The ap_wait and wait statements have no effect on the simulation of C and C++ designs
respectively. They are only interpreted by Vivado HLS.
To create a region of C code:
1. Enclose the region in braces, {}.
2. Optionally name it to provide an identifier.
For example, the following defines a region called io_section:
io_section:{
...
}
Syntax
Place the pragma inside the boundaries of a region to define the protocol for the region.
#pragma HLS protocol <floating | fixed>
Where:
floating: Protocol mode that allows statements outside the protocol region to overlap with
the statements inside the protocol region in the final RTL. The code within the protocol region
remains cycle accurate, but other operations can occur at the same time. This is the default
protocol mode.
fixed: Protocol mode that ensures that there is no overlap of statements inside or outside
the protocol region.
IMPORTANT!: If no protocol mode is specified, the default of floating is assumed.
Example 1
This example denes region io_section as a xed protocol region. Place the pragma inside
region:
io_section:{
#pragma HLS protocol fixed
...
}
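A fuller sketch of a fixed protocol region follows (the function name and signals are illustrative): a data word and its valid flag are grouped so the RTL drives them in the exact cycle order written, with no overlap from statements outside the region. In plain C the pragma is inert and the function simply performs the two writes.

```c
#include <assert.h>

/* A named region with a fixed protocol: inside it, Vivado HLS inserts
 * no clock cycles beyond those written in the code, so *data and
 * *data_vld follow the statement order cycle-accurately in the RTL. */
void send_word(volatile int *data, volatile int *data_vld) {
    io_section: {
        #pragma HLS protocol fixed
        *data = 0xFF;
        *data_vld = 1;
    }
}
```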
See Also
pragma HLS array_map
pragma HLS array_reshape
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
xcl_array_partition
SDAccel Environment Optimization Guide (UG1207)
pragma HLS reset
Description
Adds or removes resets for specific state variables (global or static).
The reset port is used in an FPGA to restore the registers and block RAM connected to the reset
port to an initial value any time the reset signal is applied. The presence and behavior of the RTL
reset port is controlled using the config_rtl configuration file. The reset settings include the
ability to set the polarity of the reset, and specify whether the reset is synchronous or
asynchronous, but more importantly it controls, through the reset option, which registers are
reset when the reset signal is applied. See Clock, Reset, and RTL Output in the Vivado Design
Suite User Guide: High-Level Synthesis (UG902) for more information.
Greater control over reset is provided through the RESET pragma. If a variable is static or
global, the RESET pragma is used to explicitly add a reset, or the variable can be removed from
the reset by turning off the pragma. This can be particularly useful when static or global arrays
are present in the design.
Syntax
Place the pragma in the C source within the boundaries of the variable life cycle.
#pragma HLS reset variable=<a> off
Where:
variable=<a>: Specifies the variable to which the pragma is applied.
off: Indicates that reset is not generated for the specified variable.
Example 1
This example adds reset to the variable a in function foo even when the global reset setting is
none or control:
void foo(int in[3], char a, char b, char c, int out[3]) {
#pragma HLS reset variable=a
...
}
Example 2
Removes reset from variable a in function foo even when the global reset setting is state or
all:
void foo(int in[3], char a, char b, char c, int out[3]) {
#pragma HLS reset variable=a off
...
}
See Also
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS resource
Description
Specify that a specic library resource (core) is used to implement a variable (array, arithmec
operaon or funcon argument) in the RTL. If the RESOURCE pragma is not specied, Vivado
HLS determines the resource to use.
Vivado HLS implements the operations in the code using hardware cores. When multiple cores in
the library can implement the operation, you can specify which core to use with the RESOURCE
pragma. To generate a list of available cores, use the list_core command.
TIP: The list_core command is used to obtain details on the cores available in the library. The
list_core command can only be used in the Vivado HLS Tcl command interface, and a Xilinx device
must be specified using the set_part command. If a device has not been selected, the list_core
command does not have any effect.
For example, to specify which memory element in the library to use to implement an array, use
the RESOURCE pragma. This lets you control whether the array is implemented as a single or a
dual-port RAM. This usage is important for arrays on the top-level function interface, because
the memory type associated with the array determines the ports needed in the RTL.
You can use the latency= option to specify the latency of the core. For block RAMs on the
interface, the latency= option allows you to model off-chip, non-standard SRAMs at the
interface, for example supporting an SRAM with a latency of 2 or 3. See Arrays on the Interface
in the Vivado Design Suite User Guide: High-Level Synthesis (UG902) for more information. For
internal operations, the latency= option allows the operation to be implemented using more
pipelined stages. These additional pipeline stages can help resolve timing issues during RTL
synthesis.
IMPORTANT!: To use the latency= option, the operation must have an available multi-stage core.
Vivado HLS provides a multi-stage core for all basic arithmetic operations (add, subtract, multiply and
divide), all floating-point operations, and all block RAMs.
For best results, Xilinx recommends that you use -std=c99 for C and -fno-builtin for C and
C++. To specify the C compile options, such as -std=c99, use the Tcl command add_files
with the -cflags option. Alternatively, use the Edit CFLAGs button in the Project Settings
dialog box. See Creating a New Synthesis Project in the Vivado Design Suite User Guide: High-Level
Synthesis (UG902).
Syntax
Place the pragma in the C source within the body of the function where the variable is defined.
#pragma HLS resource variable=<variable> core=<core> latency=<int>
Where:
variable=<variable>: A required argument that specifies the array, arithmetic operation,
or function argument to assign the RESOURCE pragma to.
core=<core>: A required argument that specifies the core, as defined in the technology
library.
latency=<int>: Specifies the latency of the core.
Example 1
In the following example, a 2-stage pipelined multiplier is specified to implement the
multiplication for variable c of the function foo. Which core to use for variable d is left to
Vivado HLS.
int foo (int a, int b) {
int c, d;
#pragma HLS RESOURCE variable=c latency=2
c = a*b;
d = a*c;
return d;
}
Example 2
In the following example, the variable coeffs[128] is an argument to the top-level function
foo_top. This example specifies that coeffs be implemented with core RAM_1P from the
library:
#pragma HLS resource variable=coeffs core=RAM_1P
TIP: The ports created in the RTL to access the values of coeffs are defined in the RAM_1P core.
See Also
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS stream
Description
By default, array variables are implemented as RAM:
Top-level function array parameters are implemented as a RAM interface port.
General arrays are implemented as RAMs for read-write access.
In sub-functions involved in DATAFLOW optimizations, the array arguments are implemented
using a RAM ping-pong buffer channel.
Arrays involved in loop-based DATAFLOW optimizations are implemented as a RAM
ping-pong buffer channel.
If the data stored in the array is consumed or produced in a sequential manner, a more efficient
communication mechanism is to use streaming data as specified by the STREAM pragma, where
FIFOs are used instead of RAMs.
IMPORTANT!: When an argument of the top-level function is specified as INTERFACE type
ap_fifo, the array is automatically implemented as streaming.
Syntax
Place the pragma in the C source within the boundaries of the required location.
#pragma HLS stream variable=<variable> depth=<int> dim=<int> off
Where:
variable=<variable>: Specifies the name of the array to implement as a streaming
interface.
depth=<int>: Relevant only for array streaming in DATAFLOW channels. By default, the
depth of the FIFO implemented in the RTL is the same size as the array specified in the C
code. This option lets you modify the size of the FIFO and specify a different depth.
When the array is implemented in a DATAFLOW region, it is common to use the depth=
option to reduce the size of the FIFO. For example, in a DATAFLOW region when all loops and
functions are processing data at a rate of II=1, there is no need for a large FIFO because data
is produced and consumed in each clock cycle. In this case, the depth= option may be used
to reduce the FIFO size to 1 to substantially reduce the area of the RTL design.
TIP: The config_dataflow -depth command provides the ability to stream all arrays in a
DATAFLOW region. The depth= option specified here overrides the config_dataflow command
for the assigned variable.
dim=<int>: Specifies the dimension of the array to be streamed. The default is dimension 1.
Specified as an integer from 0 to N, for an array with N dimensions.
off: Disables streaming data. Relevant only for array streaming in DATAFLOW channels.
TIP: The config_dataflow -default_channel fifo command globally implies a STREAM
pragma on all arrays in the design. The off option specified here overrides the config_dataflow
command for the assigned variable, and restores the default of using a RAM ping-pong buffer based
channel.
Example 1
The following example species array A[10] to be streaming, and implemented as a FIFO:
#pragma HLS STREAM variable=A
Example 2
In this example array B is set to streaming with a FIFO depth of 12:
#pragma HLS STREAM variable=B depth=12
Example 3
Array C has streaming disabled. It is assumed to be enabled by config_dataflow in this
example:
#pragma HLS STREAM variable=C off
See Also
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS top
Description
Aaches a name to a funcon, which can then be used with the set_top command to
synthesize the funcon and any funcons called from the specied top-level. This is typically
used to synthesize member funcons of a class in C/C++.
Specify the pragma in an acve soluon, and then use the set_top command with the new
name.
Syntax
Place the pragma in the C source within the boundaries of the required location.
#pragma HLS top name=<string>
Where:
name=<string>: Specifies the name to be used by the set_top command.
Examples
Funcon foo_long_name is designated the top-level funcon, and renamed to DESIGN_TOP.
Aer the pragma is placed in the code, the set_top command must sll be issued from the Tcl
command line, or from the top-level specied in the GUI project sengs.
void foo_long_name () {
#pragma HLS top name=DESIGN_TOP
...
}
set_top DESIGN_TOP
See Also
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS unroll
Description
Unroll loops to create multiple independent operations rather than a single collection of
operations. The UNROLL pragma transforms loops by creating multiple copies of the loop body
in the RTL design, which allows some or all loop iterations to occur in parallel.
Loops in the C/C++ functions are kept rolled by default. When loops are rolled, synthesis creates
the logic for one iteration of the loop, and the RTL design executes this logic for each iteration of
the loop in sequence. A loop is executed for the number of iterations specified by the loop
induction variable. The number of iterations might also be impacted by logic inside the loop body
(for example, break conditions or modifications to a loop exit variable). Using the UNROLL
pragma you can unroll loops to increase data access and throughput.
The UNROLL pragma allows the loop to be fully or partially unrolled. Fully unrolling the loop
creates a copy of the loop body in the RTL for each loop iteration, so the entire loop can be run
concurrently. Partially unrolling a loop lets you specify a factor N, to create N copies of the loop
body and reduce the loop iterations accordingly. To unroll a loop completely, the loop bounds
must be known at compile time. This is not required for partial unrolling.
Partial loop unrolling does not require N to be an integer factor of the maximum loop iteration
count. Vivado HLS adds an exit check to ensure that partially unrolled loops are functionally
identical to the original loop. For example, given the following code:
for(int i = 0; i < X; i++) {
#pragma HLS unroll factor=2
a[i] = b[i] + c[i];
}
Loop unrolling by a factor of 2 effectively transforms the code to look like the following code,
where the break construct is used to ensure the functionality remains the same, and the loop
exits at the appropriate point:
for(int i = 0; i < X; i += 2) {
a[i] = b[i] + c[i];
if (i+1 >= X) break;
a[i+1] = b[i+1] + c[i+1];
}
Because the maximum iteraon count X is a variable, Vivado HLS may not be able to determine
its value and so adds an exit check and control logic to parally unrolled loops. However, if you
know that the specied unrolling factor, 2 in this example, is an integer factor of the maximum
iteraon count X, the skip_exit_check opon lets you remove the exit check and associated
logic. This helps minimize the area and simplify the control logic.
TIP: When the use of pragmas like DATA_PACK, ARRAY_PARTITION, or ARRAY_RESHAPE lets more
data be accessed in a single clock cycle, Vivado HLS automatically unrolls any loops consuming this
data, if doing so improves the throughput. The loop can be fully or partially unrolled to create enough
hardware to consume the additional data in a single clock cycle. This feature is controlled using the
config_unroll command. See config_unroll in the Vivado Design Suite User Guide: High-
Level Synthesis (UG902) for more information.
Syntax
Place the pragma in the C/C++ source within the body of the loop to unroll.
#pragma HLS unroll factor=<N> region skip_exit_check
Where:
factor=<N>: Specifies a non-zero integer indicating that partial unrolling is requested. The
loop body is repeated the specified number of times, and the iteration information is adjusted
accordingly. If factor= is not specified, the loop is fully unrolled.
region: An optional keyword that unrolls all loops within the body (region) of the specified
loop, without unrolling the enclosing loop itself.
skip_exit_check: An optional keyword that applies only if partial unrolling is specified
with factor=. The elimination of the exit check depends on whether the loop iteration
count is known or unknown:
Fixed (known) bounds: No exit condition check is performed if the iteration count is a
multiple of the factor. If the iteration count is not an integer multiple of the factor, the tool:
1. Prevents unrolling.
2. Issues a warning that the exit check must be performed to proceed.
Variable (unknown) bounds: The exit condition check is removed as requested. You must
ensure that:
1. The variable bound is an integer multiple of the specified unroll factor.
2. No exit check is in fact required.
Example 1
The following example fully unrolls loop_1 in function foo. Place the pragma in the body of
loop_1 as shown:
loop_1: for(int i = 0; i < N; i++) {
#pragma HLS unroll
a[i] = b[i] + c[i];
}
Example 2
This example species an unroll factor of 4 to parally unroll loop_2 of funcon foo, and
removes the exit check:
void foo (...) {
int8 array1[M];
int12 array2[N];
...
loop_2: for(i=0;i<M;i++) {
#pragma HLS unroll skip_exit_check factor=4
array1[i] = ...;
array2[i] = ...;
...
}
...
}
Example 3
The following example fully unrolls all loops inside loop_1 in function foo, but not loop_1
itself, due to the presence of the region keyword:
void foo(int data_in[N], int scale, int data_out1[N], int data_out2[N]) {
int temp1[N];
loop_1: for(int i = 0; i < N; i++) {
#pragma HLS unroll region
temp1[i] = data_in[i] * scale;
loop_2: for(int j = 0; j < N; j++) {
data_out1[j] = temp1[j] * 123;
}
loop_3: for(int k = 0; k < N; k++) {
data_out2[k] = temp1[k] * 456;
}
}
}
See Also
pragma HLS loop_aen
pragma HLS loop_merge
pragma HLS loop_tripcount
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
opencl_unroll_hint
SDAccel Environment Optimization Guide (UG1207)
Appendix C
Additional Resources and Legal
Notices
Xilinx Resources
For support resources such as Answers, Documentation, Downloads, and Forums, see Xilinx
Support.
Solution Centers
See the Xilinx Solution Centers for support on devices, software tools, and intellectual property
at all stages of the design cycle. Topics include design assistance, advisories, and troubleshooting
tips.
References
These documents provide supplemental material useful with this webhelp:
1. SDx Environments Release Notes, Installation, and Licensing Guide (UG1238)
2. SDSoC Environment User Guide (UG1027)
3. SDSoC Environment Optimization Guide (UG1235)
4. SDSoC Environment Tutorial: Introduction (UG1028)
5. SDSoC Environment Platform Development Guide (UG1146)
6. SDSoC Development Environment web page
7. UltraFast Embedded Design Methodology Guide (UG1046)
8. Zynq-7000 All Programmable SoC Software Developers Guide (UG821)
9. Zynq UltraScale+ MPSoC Software Developer Guide (UG1137)
10. ZC702 Evaluation Board for the Zynq-7000 XC7Z020 All Programmable SoC User Guide (UG850)
11. ZCU102 Evaluation Board User Guide (UG1182)
12. PetaLinux Tools Documentation: Workflow Tutorial (UG1156)
13. Vivado Design Suite User Guide: High-Level Synthesis (UG902)
14. Vivado Design Suite User Guide: Creating and Packaging Custom IP (UG1118)
15. Vivado® Design Suite Documentation
Please Read: Important Legal Notices
The informaon disclosed to you hereunder (the “Materials”) is provided solely for the selecon
and use of Xilinx products. To the maximum extent permied by applicable law: (1) Materials are
made available "AS IS" and with all faults, Xilinx hereby DISCLAIMS ALL WARRANTIES AND
CONDITIONS, EXPRESS, IMPLIED, OR STATUTORY, INCLUDING BUT NOT LIMITED TO
WARRANTIES OF MERCHANTABILITY, NON-INFRINGEMENT, OR FITNESS FOR ANY
PARTICULAR PURPOSE; and (2) Xilinx shall not be liable (whether in contract or tort, including
negligence, or under any other theory of liability) for any loss or damage of any kind or nature
related to, arising under, or in connecon with, the Materials (including your use of the
Materials), including for any direct, indirect, special, incidental, or consequenal loss or damage
(including loss of data, prots, goodwill, or any type of loss or damage suered as a result of any
acon brought by a third party) even if such damage or loss was reasonably foreseeable or Xilinx
had been advised of the possibility of the same. Xilinx assumes no obligaon to correct any
errors contained in the Materials or to nofy you of updates to the Materials or to product
specicaons. You may not reproduce, modify, distribute, or publicly display the Materials
without prior wrien consent. Certain products are subject to the terms and condions of
Xilinx’s limited warranty, please refer to Xilinx’s Terms of Sale which can be viewed at
www.xilinx.com/legal.htm#tos; IP cores may be subject to warranty and support terms contained
in a license issued to you by Xilinx. Xilinx products are not designed or intended to be fail-safe or
for use in any applicaon requiring fail-safe performance; you assume sole risk and liability for
use of Xilinx products in such crical applicaons, please refer to Xilinx’s Terms of Sale which can
be viewed at www.xilinx.com/legal.htm#tos.
AUTOMOTIVE APPLICATIONS DISCLAIMER
AUTOMOTIVE PRODUCTS (IDENTIFIED AS “XA” IN THE PART NUMBER) ARE NOT
WARRANTED FOR USE IN THE DEPLOYMENT OF AIRBAGS OR FOR USE IN APPLICATIONS
THAT AFFECT CONTROL OF A VEHICLE (“SAFETY APPLICATION”) UNLESS THERE IS A
SAFETY CONCEPT OR REDUNDANCY FEATURE CONSISTENT WITH THE ISO 26262
AUTOMOTIVE SAFETY STANDARD (“SAFETY DESIGN”). CUSTOMER SHALL, PRIOR TO
USING OR DISTRIBUTING ANY SYSTEMS THAT INCORPORATE PRODUCTS, THOROUGHLY
TEST SUCH SYSTEMS FOR SAFETY PURPOSES. USE OF PRODUCTS IN A SAFETY
APPLICATION WITHOUT A SAFETY DESIGN IS FULLY AT THE RISK OF CUSTOMER, SUBJECT
ONLY TO APPLICABLE LAWS AND REGULATIONS GOVERNING LIMITATIONS ON PRODUCT
LIABILITY.
© Copyright 2017 Xilinx, Inc. Xilinx, the Xilinx logo, Artix, ISE, Kintex, Spartan, Virtex, Vivado,
Zynq, and other designated brands included herein are trademarks of Xilinx in the United States
and other countries. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by
permission by Khronos. PCI, PCIe and PCI Express are trademarks of PCI-SIG and used under
license. All other trademarks are the property of their respective owners.