Vivado Design Suite User Guide: High Level Synthesis (UG902) Xilinx HLS Guide
Page Count: 672
Vivado Design Suite
User Guide
High-Level Synthesis
UG902 (v2017.1) April 5, 2017
Revision History
The following table shows the revision history for this document.
Date: 04/05/2017
Version: 2017.1
Revision:
Added new section HLS Math Library in Chapter 2.
Updated code examples in Pointers, apint_print(), Invert Bit, Dependencies with Vivado HLS, and Cholesky Inverse and QR Inverse.
Removed -avg option for TRIPCOUNT throughout document.
Updated Specifying Arrays as Block RAM or FIFOs and set_directive_stream with information about -depth.
Clarified C/RTL co-simulation halting conditions in Interface Synthesis Requirements.
Updated Half-Precision Floating-Point Data Types.
Added Off mode information to AXI4-Stream Interfaces.
Updated AXI4-Lite Interface.
Updated C Modeling and RTL Implementation.
Updated Non-Blocking Reads and Writes.
Removed Table 3-2 (Floating Point Cores and Device Support) from Standard Types.
Added support information for Function Pointers to Pointer Limitations.
Updated -register_mode in set_directive_interface.
High-Level Synthesis
UG902 (v2017.1) April 5, 2017
www.xilinx.com
Send Feedback
2
Table of Contents
Revision History
Chapter 1: High-Level Synthesis
    Introduction to C-Based FPGA Design
    Understanding Vivado HLS
    Using Vivado HLS
    Data Types for Efficient Hardware
    Managing Interfaces
    Optimizing the Design
    Verifying the RTL
    Exporting the RTL Design
Chapter 2: High-Level Synthesis C Libraries
    Introduction to the Vivado HLS C Libraries
    Arbitrary Precision Data Types Library
    HLS Stream Library
    HLS Math Library
    HLS Video Library
    HLS IP Libraries
    HLS Linear Algebra Library
    HLS DSP Library
Chapter 3: High-Level Synthesis Coding Styles
    Introduction to Coding Styles
    Unsupported C Constructs
    C Test Bench
    Functions
    Loops
    Arrays
    Data Types
    C Builtin Functions
    Hardware Efficient C Code
    C++ Classes and Templates
    Assertions
    SystemC Synthesis
Chapter 4: High-Level Synthesis Reference Guide
    Command Reference
    GUI Reference
    Interface Synthesis Reference
    AXI4-Lite Slave C Driver Reference
    HLS Video Functions Library
    HLS Linear Algebra Library Functions
    HLS DSP Library Functions
    C Arbitrary Precision Types
    C++ Arbitrary Precision Types
    C++ Arbitrary Precision Fixed-Point Types
    Comparison of SystemC and Vivado HLS Types
Appendix A: Additional Resources and Legal Notices
    Xilinx Resources
    Solution Centers
    Documentation Navigator and Design Hubs
    References
    Training Resources
    Please Read: Important Legal Notices
Chapter 1
High-Level Synthesis
Introduction to C-Based FPGA Design
The Xilinx® Vivado® High-Level Synthesis (HLS) tool transforms a C specification into a
register transfer level (RTL) implementation that you can synthesize into a Xilinx field
programmable gate array (FPGA). You can write C specifications in C, C++, SystemC, or as
an Open Computing Language (OpenCL™) API C kernel. The FPGA provides a massively
parallel architecture with benefits in performance, cost, and power over traditional
processors. This chapter provides an overview of high-level synthesis.
Note: For more information on FPGA architectures and Vivado HLS basic concepts, see the
Introduction to FPGA Design with Vivado High-Level Synthesis (UG998) [Ref 1].
High-Level Synthesis Benefits
High-level synthesis bridges hardware and software domains, providing the following
primary benefits:
•
Improved productivity for hardware designers
Hardware designers can work at a higher level of abstraction while creating
high-performance hardware.
•
Improved system performance for software designers
Software developers can accelerate the computationally intensive parts of their
algorithms on a new compilation target, the FPGA.
Using a high-level synthesis design methodology allows you to:
•
Develop algorithms at the C-level
Work at a level that is abstract from the implementation details, which consume
development time.
•
Verify at the C-level
Validate the functional correctness of the design more quickly than with traditional
hardware description languages.
•
Control the C synthesis process through optimization directives
Create specific high-performance hardware implementations.
•
Create multiple implementations from the C source code using optimization directives
Explore the design space, which increases the likelihood of finding an optimal
implementation.
•
Create readable and portable C source code
Retarget the C source into different devices as well as incorporate the C source into new
projects.
High-Level Synthesis Basics
High-level synthesis includes the following phases:
•
Scheduling
Determines which operations occur during each clock cycle based on:
°
Length of the clock cycle or clock frequency
°
Time it takes for the operation to complete, as defined by the target device
°
User-specified optimization directives
If the clock period is longer or a faster FPGA is targeted, more operations are completed
within a single clock cycle, and all operations might complete in one clock cycle.
Conversely, if the clock period is shorter or a slower FPGA is targeted, high-level
synthesis automatically schedules the operations over more clock cycles, and some
operations might need to be implemented as multicycle resources.
•
Binding
Determines which hardware resource implements each scheduled operation. To
implement the optimal solution, high-level synthesis uses information about the target
device.
•
Control logic extraction
Extracts the control logic to create a finite state machine (FSM) that sequences the
operations in the RTL design.
High-level synthesis synthesizes the C code as follows:
•
Top-level function arguments synthesize into RTL I/O ports
•
C functions synthesize into blocks in the RTL hierarchy
If the C code includes a hierarchy of sub-functions, the final RTL design includes a
hierarchy of modules or entities that have a one-to-one correspondence with the
original C function hierarchy. All instances of a function use the same RTL
implementation or block.
•
Loops in the C functions are kept rolled by default
When loops are rolled, synthesis creates the logic for one iteration of the loop, and the
RTL design executes this logic for each iteration of the loop in sequence. Using
optimization directives, you can unroll loops, which allows all iterations to occur in
parallel.
•
Arrays in the C code synthesize into block RAM or UltraRAM in the final FPGA design
If the array is on the top-level function interface, high-level synthesis implements the
array as ports to access a block RAM outside the design.
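As a minimal sketch of the loop behavior described above, the following C function contains a loop that stays rolled unless a directive says otherwise. The UNROLL pragma shown is one way to request unrolling from within the source. The function name scale_array and the size N are illustrative assumptions, not part of this guide's examples. A standard C compiler ignores the pragma, so the C behavior is the same with or without it.

```c
#define N 4  /* illustrative array size (assumption) */

/* Multiplies each element of in[] by k and writes the result to out[]. */
void scale_array(const int in[N], int k, int out[N]) {
    int i;
    for (i = 0; i < N; i++) {
/* Requests that all N iterations be implemented in parallel.
   Without this pragma, the loop stays rolled and one copy of the
   loop-body logic is reused for each iteration in sequence. */
#pragma HLS UNROLL
        out[i] = in[i] * k;
    }
}
```

Because in and out are array arguments of the top-level function, they would be implemented as ports to memory outside the design, as described in the last item above.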
High-level synthesis creates the optimal implementation based on default behavior,
constraints, and any optimization directives you specify. You can use optimization directives
to modify and control the default behavior of the internal logic and I/O ports. This allows
you to generate variations of the hardware implementation from the same C code.
To determine if the design meets your requirements, you can review the performance
metrics in the synthesis report generated by high-level synthesis. After analyzing the
report, you can use optimization directives to refine the implementation. The synthesis
report contains information on the following performance metrics:
•
Area: Amount of hardware resources required to implement the design based on the
resources available in the FPGA, including look-up tables (LUT), registers, block RAMs,
and DSP48s.
•
Latency: Number of clock cycles required for the function to compute all output values.
•
Initiation interval (II): Number of clock cycles before the function can accept new input
data.
•
Loop iteration latency: Number of clock cycles it takes to complete one iteration of the
loop.
•
Loop initiation interval: Number of clock cycles before the next iteration of the loop
starts to process data.
•
Loop latency: Number of cycles to execute all iterations of the loop.
Scheduling and Binding Example
The following figure shows an example of the scheduling and binding phases for this code
example:
int foo(char x, char a, char b, char c) {
    char y;
    y = x*a+b+c;   /* one multiplication and two additions */
    return y;
}
[Figure 1-1 shows two clock cycles across three rows. Scheduling Phase: inputs a, x, b, and c, and output y. Initial Binding Phase: Mul, AddSub, AddSub. Target Binding Phase: DSP48, AddSub.]
Figure 1-1: Scheduling and Binding Example
In the scheduling phase of this example, high-level synthesis schedules the following
operations to occur during each clock cycle:
•
First clock cycle: Multiplication and the first addition
•
Second clock cycle: Second addition and output generation
Note: In the preceding figure, the square between the first and second clock cycles indicates when
an internal register stores a variable. In this example, high-level synthesis only requires that the
output of the addition is registered across a clock cycle. The first cycle reads x, a, and b data ports.
The second cycle reads data port c and generates output y.
In the final hardware implementation, high-level synthesis implements the arguments to
the top-level function as input and output (I/O) ports. In this example, the arguments are
simple data ports. Because each input variable is a char type, the input data ports are all
8 bits wide. The function return is a 32-bit int data type, so the output data port is
32 bits wide.
IMPORTANT: The advantage of implementing the C code in hardware is that all operations finish
in fewer clock cycles. In this example, the operations complete in only two clock cycles.
In a central processing unit (CPU), even this simple code example takes more clock cycles to complete.
In the initial binding phase of this example, high-level synthesis implements the multiplier
operation using a combinational multiplier (Mul) and implements both add operations
using a combinational adder/subtractor (AddSub).
In the target binding phase, high-level synthesis implements both the multiplier and one of
the addition operations using a DSP48 resource. The DSP48 resource is a computational
block available in the FPGA architecture that balances high performance with an
efficient implementation.
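The two-cycle schedule described in this example can be modeled in plain C as a rough sketch. The function foo_scheduled and the variable reg are hypothetical names introduced here. The variable reg stands in for the internal register that holds the result of the first addition between the two clock cycles, and it is declared as int only for simplicity.

```c
/* Reference behavior, as in the code example for this section. */
int foo(char x, char a, char b, char c) {
    char y;
    y = x * a + b + c;
    return y;
}

/* Hypothetical cycle-by-cycle model of the same computation. */
int foo_scheduled(char x, char a, char b, char c) {
    int reg;   /* models the register between cycle 1 and cycle 2 */
    char y;

    /* Clock cycle 1: read x, a, and b; multiplication and first addition. */
    reg = x * a + b;

    /* Clock cycle 2: read c; second addition and output generation. */
    y = (char)(reg + c);
    return y;
}
```

Both functions return the same value for the same inputs. The model only makes the cycle boundary visible and does not describe how the tool binds the operations to hardware.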
Extracting Control Logic and Implementing I/O Ports Example
The following figure shows the extraction of control logic and implementation of I/O ports
for this code example:
void foo(int in[3], char a, char b, char c, int out[3]) {
    int x,y;
    for(int i = 0; i < 3; i++) {
        x = in[i];
        y = a*x + b + c;
        out[i] = y;
    }
}
[Figure 1-2 shows the clock; the datapath for a, b, c, x, and y; the ports in_data, in_addr, in_ce, out_data, out_addr, out_ce, and out_we; and a Finite State Machine (FSM) with four states.]
Figure 1-2: Control Logic Extraction and I/O Port Implementation Example
This code example performs the same operations as the previous example. However, it
performs the operations inside a for-loop, and two of the function arguments are arrays.
The resulting design executes the logic inside the for-loop three times when the code is
scheduled. High-level synthesis automatically extracts the control logic from the C code
and creates an FSM in the RTL design to sequence these operations. High-level synthesis
implements the top-level function arguments as ports in the final RTL design. Each scalar
argument of type char (a, b, and c) maps to a standard 8-bit data bus port. Array arguments,
such as in and out, contain an entire collection of data.
In high-level synthesis, arrays are synthesized into block RAM by default, but other options
are possible, such as FIFOs, distributed RAM, and individual registers. When using arrays as
arguments in the top-level function, high-level synthesis assumes that the block RAM is
outside the top-level function and automatically creates ports to access a block RAM
outside the design, such as data ports, address ports, and any required chip-enable or
write-enable signals.
The FSM controls when the registers store data and controls the state of any I/O control
signals. The FSM starts in the state C0. On the next clock, it enters state C1, then state C2,
and then state C3. It returns to state C1 (and C2, C3) a total of three times before returning
to state C0.
Note: This closely resembles the control structure in the C code for-loop. The full sequence of states
is: C0, {C1, C2, C3}, {C1, C2, C3}, {C1, C2, C3}, and return to C0.
The design requires the addition of b and c only one time. High-level synthesis moves the
operation outside the for-loop and into state C0. Each time the design enters state C3, it
reuses the result of the addition.
The design reads the data from in and stores the data in x. The FSM generates the address
for the first element in state C1. In addition, in state C1, an adder increments to keep track
of how many times the design must iterate around states C1, C2, and C3. In state C2, the
block RAM returns the data for in and stores it as variable x.
High-level synthesis reads the data from port a with other values to perform the calculation
and generates the first y output. The FSM ensures that the correct address and control
signals are generated to store this value outside the block. The design then returns to state
C1 to read the next value from the array/block RAM in. This process continues until all
output is written. The design then returns to state C0 to read the next values of b and c to
start the process again.
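The state sequence described above can be sketched as a small software model. This is an illustration under stated assumptions, not the RTL that the tool generates. The names foo_fsm, sum_bc, count, and addr are hypothetical. The model runs one transaction and returns the number of states visited instead of looping back to C0.

```c
enum fsm_state { C0, C1, C2, C3 };

/* Hypothetical model of one transaction of the FSM.
   Returns the number of states visited. */
int foo_fsm(const int in[3], char a, char b, char c, int out[3]) {
    enum fsm_state state = C0;
    int states_visited = 0;
    int sum_bc = 0;  /* b + c, computed once outside the loop */
    int count = 0;   /* iteration counter incremented in C1 */
    int addr = 0;    /* address issued to the in and out memories */
    int x = 0;
    int done = 0;

    while (!done) {
        states_visited++;
        switch (state) {
        case C0:  /* read b and c and add them one time */
            sum_bc = b + c;
            count = 0;
            state = C1;
            break;
        case C1:  /* generate the address and increment the counter */
            addr = count;
            count++;
            state = C2;
            break;
        case C2:  /* the memory returns the data, stored as x */
            x = in[addr];
            state = C3;
            break;
        case C3:  /* calculate y, reusing b + c, and write the output */
            out[addr] = a * x + sum_bc;
            if (count < 3) {
                state = C1;
            } else {
                done = 1;  /* the real design returns to C0 here */
            }
            break;
        }
    }
    return states_visited;
}
```

Calling foo_fsm with in = {1, 2, 3}, a = 2, b = 3, and c = 4 visits 10 states (C0 once, then C1, C2, and C3 three times) and writes 9, 11, and 13 to out.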
Performance Metrics Example
The following figure shows the complete cycle-by-cycle execution for the code in the
Extracting Control Logic and Implementing I/O Ports Example, including the states for each
clock cycle, read operations, computation operations, and write operations.
[Figure 1-3 shows the state for each clock cycle, starting and ending in C0, with the operations performed: a read of b and c, followed for each loop iteration by an address for in, a read of in, and a calculation of out. It also marks the Function Latency, Function Initiation Interval, Loop Iteration Latency, Loop Iteration Interval, and Loop Latency.]
Figure 1-3: Latency and Initiation Interval Example
Following are the performance metrics for this example:
•
Latency: It takes the function 9 clock cycles to output all values.
Note: When the output is an array, the latency is measured to the last array value output.
•
II: The II is 10, which means it takes 10 clock cycles before the function can initiate a
new set of input reads and start to process the next set of input data.
Note: The time to perform one complete execution of a function is referred to as one
transaction. In this example, it takes 11 clock cycles before the function can accept data for the
next transaction.
•
Loop iteration latency: The latency of each loop iteration is 3 clock cycles.
•
Loop II: The interval is 3.
•
Loop latency: The latency is 9 clock cycles.
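The loop figures above follow from simple arithmetic, sketched here under the assumption that the loop is rolled and not pipelined, as in this example. The function names are hypothetical and exist only to make the calculation explicit.

```c
/* Latency of a rolled, non-pipelined loop: each iteration completes
   before the next one starts. */
int loop_latency(int trip_count, int iteration_latency) {
    return trip_count * iteration_latency;
}

/* Number of FSM states visited in one transaction of this example:
   the states before the loop (here only C0) plus the loop itself. */
int states_per_transaction(int entry_states, int trip_count,
                           int iteration_latency) {
    return entry_states + loop_latency(trip_count, iteration_latency);
}
```

With a trip count of 3 and an iteration latency of 3, loop_latency returns 9. Adding the single C0 state gives 10 states per transaction, which matches the II of 10 listed above.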
Understanding Vivado HLS
The Xilinx Vivado HLS tool synthesizes a C function into an IP block that you can integrate
into a hardware system. It is tightly integrated with the rest of the Xilinx design tools and
provides comprehensive language support and features for creating the optimal
implementation for your C algorithm.
Following is the Vivado HLS design flow:
1. Compile, execute (simulate), and debug the C algorithm.
Note: In high-level synthesis, running the compiled C program is referred to as C simulation.
Executing the C algorithm simulates the function to validate that the algorithm is functionally
correct.
2. Synthesize the C algorithm into an RTL implementation, optionally using user
optimization directives.
3. Generate comprehensive reports and analyze the design.
4. Verify the RTL implementation using a pushbutton flow.
5. Package the RTL implementation into a selection of IP formats.
Inputs and Outputs
Following are the inputs to Vivado HLS:
•
C function written in C, C++, SystemC, or an OpenCL API C kernel
This is the primary input to Vivado HLS. The function can contain a hierarchy of
sub-functions.
•
Constraints
Constraints are required and include the clock period, clock uncertainty, and FPGA
target. The clock uncertainty defaults to 12.5% of the clock period if not specified.
•
Directives
Directives are optional and direct the synthesis process to implement a specific
behavior or optimization.
•
C test bench and any associated files
Vivado HLS uses the C test bench to simulate the C function prior to synthesis and to
verify the RTL output using C/RTL co-simulation.
You can add the C input files, directives, and constraints to a Vivado HLS project
interactively using the Vivado HLS graphical user interface (GUI) or using Tcl commands at
the command prompt. You can also create a Tcl file and execute the commands in batch
mode.
Following are the outputs from Vivado HLS:
•
RTL implementation files in hardware description language (HDL) formats
This is the primary output from Vivado HLS. Using Vivado synthesis, you can synthesize
the RTL into a gate-level implementation and an FPGA bitstream file. The RTL is available
in the following industry standard formats:
°
VHDL (IEEE 1076-2000)
°
Verilog (IEEE 1364-2001)
Vivado HLS packages the implementation files as an IP block for use with other tools in
the Xilinx design flow. Using logic synthesis, you can synthesize the packaged IP into an
FPGA bitstream.
•
Report files
This output is the result of synthesis, C/RTL co-simulation, and IP packaging.
The following figure shows an overview of the Vivado HLS input and output files.
[Figure 1-4 shows blocks labeled Test Bench, Constraints/Directives, and C, C++, SystemC, OpenCL API C as inputs; C Simulation and C Synthesis within Vivado HLS; RTL Adapter, VHDL, Verilog, RTL Simulation, and Packaged IP as outputs; and Vivado Design Suite, System Generator, and Xilinx Platform Studio as downstream tools.]
Figure 1-4: Vivado HLS Design Flow
Test Bench, Language Support, and C Libraries
In any C program, the top-level function is called main(). In the Vivado HLS design flow,
you can specify any sub-function below main() as the top-level function for synthesis. You
cannot synthesize the top-level function main(). Following are additional rules:
•
Only one function is allowed as the top-level function for synthesis.
•
Any sub-functions in the hierarchy under the top-level function for synthesis are also
synthesized.
•
If you want to synthesize functions that are not in the hierarchy under the top-level
function for synthesis, you must merge the functions into a single top-level function for
synthesis.
•
The verification flow for OpenCL API C kernels requires special handling in the Vivado
HLS flow. For more information, see OpenCL API C Test Benches in Chapter 3.
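As a minimal sketch of the rule about merging functions, the following code wraps two otherwise unrelated functions in a single top-level function for synthesis. All names here (scale, offset, and top) are hypothetical and chosen only for illustration.

```c
/* Two functions that are not in each other's hierarchy. */
static int scale(int v)  { return 2 * v; }
static int offset(int v) { return v + 1; }

/* Hypothetical single top-level function for synthesis. Because it
   calls both functions, both fall in the hierarchy under it and are
   synthesized with it. */
void top(int in_a, int in_b, int *out_a, int *out_b) {
    *out_a = scale(in_a);
    *out_b = offset(in_b);
}
```

A test bench main() would call top and check out_a and out_b; main() itself is not synthesized.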
Test Bench
When using the Vivado HLS design flow, it is time consuming to synthesize a functionally
incorrect C function and then analyze the implementation details to determine why the
function does not perform as expected. To improve productivity, use a test bench to
validate that the C function is functionally correct prior to synthesis.
The C test bench includes the function main() and any sub-functions that are not in the
hierarchy under the top-level function for synthesis. These functions verify that the
top-level function for synthesis is functionally correct by providing stimuli to the function
for synthesis and by consuming its output.
Vivado HLS uses the test bench to compile and execute the C simulation. During the
compilation process, you can select the Launch Debugger option to open a full C-debug
environment, which enables you to analyze the C simulation. For more information on test
benches, see C Test Bench in Chapter 3.
RECOMMENDED: Because Vivado HLS uses the test bench to both verify the C function prior to
synthesis and to automatically verify the RTL output, using a test bench is highly recommended.
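The following sketch shows the shape of a simple self-checking test bench. The function add_offset is a hypothetical top-level function for synthesis, and run_test_bench is a hypothetical stand-in for the body of main(). The sketch assumes the convention that a return value of 0 signals a pass and any other value signals a failure.

```c
#include <stdio.h>

/* Hypothetical top-level function for synthesis. */
void add_offset(const int in[4], int offset, int out[4]) {
    int i;
    for (i = 0; i < 4; i++) {
        out[i] = in[i] + offset;
    }
}

/* Test bench logic. In a real test bench this would be the body of
   main(), and main() would return the mismatch count. */
int run_test_bench(void) {
    const int stimulus[4] = {1, 2, 3, 4};
    const int expected[4] = {11, 12, 13, 14};
    int result[4];
    int i;
    int mismatches = 0;

    /* Provide stimuli to the function for synthesis. */
    add_offset(stimulus, 10, result);

    /* Consume the output and compare it with known-good values. */
    for (i = 0; i < 4; i++) {
        if (result[i] != expected[i]) {
            printf("Mismatch at %d: got %d, expected %d\n",
                   i, result[i], expected[i]);
            mismatches++;
        }
    }
    return mismatches;
}
```

Because the checks are built into the test bench, the same code can validate the C function before synthesis and the RTL output afterward.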
Language Support
Vivado HLS supports the following standards for C compilation/simulation:
•
ANSI-C (GCC 4.6)
•
C++ (G++ 4.6)
•
OpenCL API (1.0 embedded profile)
•
SystemC (IEEE 1666-2006, version 2.2)
C, C++, and SystemC Language Constructs
Vivado HLS supports many C, C++, and SystemC language constructs and all native data
types for each language, including float and double types. However, synthesis is not
supported for some constructs, including:
•
Dynamic memory allocation
An FPGA has a fixed set of resources, and the dynamic creation and freeing of memory
resources is not supported.
•
Operating system (OS) operations
All data to and from the FPGA must be read from the input ports or written to output
ports. OS operations, such as file read/write or OS queries like time and date, are not
supported. Instead, the C test bench can perform these operations and pass the data
into the function for synthesis as function arguments.
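The split between the two unsupported constructs above and the test bench can be sketched as follows. The names SAMPLES, sum_samples, and load_and_sum are assumptions made for this illustration. The function for synthesis uses only fixed-size storage and its arguments, while the test bench side performs the file access.

```c
#include <stdio.h>

#define SAMPLES 8  /* fixed size known at compile time (assumption) */

/* Hypothetical function for synthesis: no malloc() and no file
   access. All data arrives through the function arguments. */
int sum_samples(const int data[SAMPLES]) {
    int scratch[SAMPLES];  /* fixed-size array instead of malloc() */
    int i;
    int total = 0;
    for (i = 0; i < SAMPLES; i++) {
        scratch[i] = data[i];
        total += scratch[i];
    }
    return total;
}

/* Test bench side: OS operations such as file reads belong here.
   Reads up to SAMPLES integers from path; entries that cannot be
   read stay 0. The data is then passed in as a function argument. */
int load_and_sum(const char *path) {
    int data[SAMPLES] = {0};
    FILE *fp = fopen(path, "r");
    int i;
    if (fp != NULL) {
        for (i = 0; i < SAMPLES; i++) {
            if (fscanf(fp, "%d", &data[i]) != 1) {
                break;
            }
        }
        fclose(fp);
    }
    return sum_samples(data);
}
```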
For details on the supported and unsupported C constructs and examples of each of the
main constructs, see Chapter 3, High-Level Synthesis Coding Styles.
OpenCL API C Language Constructs
Vivado HLS supports the OpenCL API C language constructs and built-in functions from the
OpenCL API C 1.0 embedded profile.
C Libraries
C libraries contain functions and constructs that are optimized for implementation in an
FPGA. Using these libraries helps to ensure high quality of results (QoR), that is, the final
output is a high-performance design that makes optimal use of the resources. Because the
libraries are provided in C, C++, OpenCL API C, or SystemC, you can incorporate the
libraries into the C function and simulate them to verify the functional correctness before
synthesis.
Vivado HLS provides the following C libraries to extend the standard C languages:
•
Arbitrary precision data types
•
Half-precision (16-bit) floating-point data types
•
Math operations
•
Video functions
•
Xilinx IP functions, including fast Fourier transform (FFT) and finite impulse response
(FIR)
•
FPGA resource functions to help maximize the use of shift register LUT (SRL) resources
For more information on the C libraries provided by Vivado HLS, see Chapter 2, High-Level
Synthesis C Libraries.
C Library Example
C libraries ensure a higher QoR than standard C types. Standard C types are based on 8-bit
boundaries (8-bit, 16-bit, 32-bit, 64-bit). However, when targeting a hardware platform, it is
often more efficient to use data types of a specific width.
For example, a design with a filter function for a communications protocol requires 10-bit
input data and 18-bit output data to satisfy the data transmission requirements. Using
standard C data types, the input data must be at least 16 bits and the output data must be
at least 32 bits. In the final hardware, this creates a datapath between the input and output
that is wider than necessary, uses more resources, has longer delays (for example, a 32-bit
by 32-bit multiplication takes longer than an 18-bit by 18-bit multiplication), and requires
more clock cycles to complete.
Using an arbitrary precision data type in this design instead, you can specify the exact
bit sizes in the C code prior to synthesis, simulate the updated C code, and
verify the quality of the output using C simulation. Arbitrary precision
data types are provided for C and C++ and allow you to model data types of any width from
1 to 1024 bits. Some C++ types can be modeled up to 32768 bits. For more
information on arbitrary precision data types, see Data Types for Efficient Hardware.
Note: Arbitrary precision types are only required on the function boundaries, because Vivado HLS
optimizes the internal logic and removes data bits and logic that do not fan out to the output ports.
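The width mismatch in this example can be made concrete with a short portable sketch. It does not use the Vivado HLS arbitrary precision types; it only emulates an exact bit width with standard C types. Both function names are hypothetical.

```c
#include <stdint.h>

/* Smallest standard C integer width (8, 16, 32, or 64 bits) that can
   hold a value of the requested width. Returns 0 if none fits. */
unsigned smallest_standard_width(unsigned bits) {
    if (bits == 0 || bits > 64) return 0;
    if (bits <= 8)  return 8;
    if (bits <= 16) return 16;
    if (bits <= 32) return 32;
    return 64;
}

/* Emulates a signed value of exactly `bits` bits (1 to 31) by keeping
   the low bits of v and sign-extending the result. */
int32_t wrap_signed(int32_t v, unsigned bits) {
    uint32_t mask = (1u << bits) - 1u;
    uint32_t sign = 1u << (bits - 1u);
    uint32_t u = (uint32_t)v & mask;
    return (int32_t)(u ^ sign) - (int32_t)sign;
}
```

For the filter above, smallest_standard_width(10) returns 16 and smallest_standard_width(18) returns 32, which is the extra width that the standard types carry through the datapath. A true 10-bit signed value ranges from -512 to 511, so wrap_signed(512, 10) returns -512.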
Synthesis, Optimization, and Analysis
Vivado HLS is project based. Each project holds one set of C code and can contain multiple
solutions. Each solution can have different constraints and optimization directives. You can
analyze and compare the results from each solution in the Vivado HLS GUI.
Following are the synthesis, optimization, and analysis steps in the Vivado HLS design
process:
1. Create a project with an initial solution.
2. Verify the C simulation executes without error.
3. Run synthesis to obtain a set of results.
4. Analyze the results.
After analyzing the results, you can create a new solution for the project with different
constraints and optimization directives and synthesize the new solution. You can repeat this
process until the design has the desired performance characteristics. Using multiple
solutions allows you to proceed with development while still retaining the previous results.
Optimization
Using Vivado HLS, you can apply different optimization directives to the design, including:
•
Instruct a task to execute in a pipeline, allowing the next execution of the task to begin
before the current execution is complete.
•
Specify a latency for the completion of functions, loops, and regions.
•
Specify a limit on the number of resources used.
•
Override the inherent or implied dependencies in the code and permit specified
operations. For example, if it is acceptable to discard or ignore the initial data values,
such as in a video stream, allow a memory read before write if it results in better
performance.
•
Select the I/O protocol to ensure the final design can be connected to other hardware
blocks with the same I/O protocol.
Note: Vivado HLS automatically determines the I/O protocol used by any sub-functions. You
cannot control these ports except to specify whether the port is registered. For more information
on working with I/O interfaces, see Managing Interfaces.
You can use the Vivado HLS GUI to place optimization directives directly into the source
code. Alternatively, you can use Tcl commands to apply optimization directives. For more
information on the various optimizations, see Optimizing the Design.
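As an illustration of a directive placed directly in the source code, the following sketch applies the PIPELINE directive to a loop as a pragma. The function scale, the label scale_loop, and the array length N are hypothetical; a standard C compiler ignores the pragma, so the code still compiles and simulates as ordinary C.

```c
#define N 8  /* hypothetical array length */

/* Hypothetical top-level function: scales each sample by a constant. */
void scale(const int in[N], int out[N], int gain)
{
    int i;
scale_loop:
    for (i = 0; i < N; i++) {
/* Request a pipelined loop; a standard C compiler ignores this pragma. */
#pragma HLS PIPELINE II=1
        out[i] = in[i] * gain;
    }
}
```

Whether the requested initiation interval is achieved depends on the design and the target device, so treat the II value here as a request rather than a guarantee.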
Analysis
When synthesis completes, Vivado HLS automatically creates synthesis reports to help you
understand the performance of the implementation. In the Vivado HLS GUI, the Analysis
Perspective includes the Performance tab, which allows you to interactively analyze the
results in detail. The following figure shows the Performance tab for the Extracting Control
Logic and Implementing I/O Ports Example.
Figure 1-5: Vivado HLS Analysis Example
The Performance tab shows the following for each state:
•
C0: The first state includes read operations on ports a, b, and c and the addition
operation.
•
C1 and C2: The design enters a loop and checks the loop increment counter and exit
condition. The design then reads data into variable x, which requires two clock cycles.
Two clock cycles are required, because the design is accessing a block RAM, requiring
an address in one cycle and a data read in the next.
•
C3: The design performs the calculations and writes output to port y. Then, the loop
returns to the start.
OpenCL API C Kernel Synthesis
IMPORTANT: For OpenCL API C kernels, Vivado HLS always synthesizes logic for the entire work group.
You cannot apply the standard Vivado HLS interface directives to an OpenCL API C kernel.
The following OpenCL API C kernel code shows a vector addition design where two arrays
of data are summed into a third. The required size of the work group is 16, that is, this kernel
must execute a minimum of 16 times to produce a valid result.
#include 
// For VHLS OpenCL C kernels, the full work group is synthesized
__kernel void __attribute__ ((reqd_work_group_size(16, 1, 1)))
vadd(__global int* a,
__global int* b,
__global int* c)
{
int idx = get_global_id(0);
c[idx] = a[idx] + b[idx];
}
Vivado HLS synthesizes this design into hardware that performs the following:
•
16 reads from interface a and b
•
16 additions and 16 writes to output interface c
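One way to picture the synthesized behavior is as a plain C loop over the work group, where each iteration stands in for one work-item. The sketch below is only a conceptual model under that assumption, not the code Vivado HLS generates; the name vadd_work_group and the WORK_GROUP_SIZE macro are hypothetical.

```c
#define WORK_GROUP_SIZE 16  /* matches reqd_work_group_size(16, 1, 1) */

/* Plain-C model of one work group: each loop iteration stands in for
   one work-item, with idx playing the role of get_global_id(0). */
void vadd_work_group(const int *a, const int *b, int *c)
{
    int idx;
    for (idx = 0; idx < WORK_GROUP_SIZE; idx++) {
        c[idx] = a[idx] + b[idx];
    }
}
```

Each call performs 16 reads from a and from b, 16 additions, and 16 writes to c, which corresponds to the operations listed above.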
RTL Verification
If you added a C test bench to the project, you can use it to verify that the RTL is functionally
identical to the original C. The C test bench verifies the output from the top-level function
for synthesis and returns zero to the top-level function main() if the RTL is functionally
identical. Vivado HLS uses this return value for both C simulation and C/RTL co-simulation
to determine if the results are correct. If the C test bench returns a non-zero value, Vivado
HLS reports that the simulation failed.
IMPORTANT: Even if the output data is correct and valid, Vivado HLS reports a simulation failure if the
test bench does not return the value zero to function main().
TIP: For test bench examples that you can use for reference, see Design Examples and References.
Vivado HLS automatically creates the infrastructure to perform the C/RTL co-simulation and
automatically executes the simulation using one of the following supported RTL simulators:
•
Vivado Simulator (XSim)
•
ModelSim simulator
•
VCS
•
NCSim
•
Riviera
If you select Verilog or VHDL HDL for simulation, Vivado HLS uses the HDL simulator you
specify. The Xilinx design tools include Vivado Simulator. Third-party HDL simulators
require a license from the third-party vendor. The VCS and NCSim simulators are only
supported on the Linux operating system. For more information, see Using C/RTL
Co-Simulation.
RTL Export
Using Vivado HLS, you can export the RTL and package the final RTL output files as IP in any
of the following Xilinx IP formats:
•
Vivado IP Catalog
Import into the Vivado IP catalog for use in the Vivado Design Suite.
•
System Generator for DSP
Import the HLS design into System Generator.
•
Synthesized Checkpoint (.dcp)
Import directly into the Vivado Design Suite the same way you import any Vivado
Design Suite checkpoint.
Note: The synthesized checkpoint format invokes logic synthesis and compiles the RTL
implementation into a gate-level implementation, which is included in the IP package.
For all IP formats except the synthesized checkpoint, you can optionally execute logic
synthesis from within Vivado HLS to evaluate the results of RTL synthesis or
implementation. This optional step allows you to confirm the estimates provided by Vivado
HLS for timing and area before handing off the IP package. These gate-level results are not
included in the packaged IP.
Note: Vivado HLS estimates the timing and area resources based on built-in libraries for each FPGA.
When you use logic synthesis to compile the RTL into a gate-level implementation, perform physical
placement of the gates in the FPGA, and perform routing of the inter-connections between gates,
logic synthesis might make additional optimizations that change the Vivado HLS estimates.
For more information, see Exporting the RTL Design.
Using Vivado HLS
To invoke Vivado HLS on a Windows platform, double-click the desktop button as shown in
the following figure.
Figure 1-6: Vivado HLS GUI Button
To invoke Vivado HLS on a Linux platform (or from the Vivado HLS Command Prompt on
Windows) execute the following command at the command prompt.
$ vivado_hls
The Vivado HLS GUI opens as shown in the following figure.
Figure 1-7: Vivado HLS GUI Welcome Page
You can use the Quick Start options to perform the following tasks:
•
Create New Project: Launch the project setup wizard.
•
Open Project: Navigate to an existing project or select from a list of recent projects.
•
Open Example Project: Open Vivado HLS examples. For details on these examples, see
Design Examples and References.
You can use the Documentation options to perform the following tasks:
•
Tutorials: Opens the Vivado Design Suite Tutorial: High-Level Synthesis (UG871) [Ref 2].
For details on the tutorial examples, see Design Examples and References.
•
User Guide: Opens this document, the Vivado Design Suite User Guide: High-Level
Synthesis (UG902).
•
Release Notes Guide: Opens the Vivado Design Suite User Guide: Release Notes,
Installation, and Licensing (UG973) [Ref 3] for the latest software version.
The primary controls for using Vivado HLS are shown in the toolbar in the following figure.
Project control ensures only commands that can be currently executed are highlighted. For
example, synthesis must be performed before C/RTL co-simulation can be executed. The
C/RTL co-simulation toolbar buttons remain gray until synthesis completes.
Figure 1-8: Vivado HLS Controls
In the Project Management section, the buttons are (from left to right):
•
Create New Project opens the new project wizard.
•
Project Settings allows the current project settings to be modified.
•
New Solution opens the new solution dialog box.
•
Solution Settings allows the current solution settings to be modified.
The next group of toolbar buttons control the tool operation (from left to right):
•
Index C Source refreshes the annotations in the C source.
•
Run C Simulation opens the C Simulation dialog box.
•
C Synthesis starts synthesis of the C source code in Vivado HLS.
•
Run C/RTL Cosimulation verifies the RTL output.
•
Export RTL packages the RTL into the desired IP output format.
The final group of toolbar buttons are for design analysis (from left to right):
•
Open Report opens the C synthesis report or drops down to open other reports.
•
Compare Reports allows the reports from different solutions to be compared.
Each of the buttons on the toolbar has an equivalent command in the menus. In addition,
the Vivado HLS GUI provides three perspectives. When you select a perspective, the windows
automatically adjust to a more suitable layout for the selected task.
•
The Debug perspective opens the C debugger.
•
The Synthesis perspective is the default perspective and arranges the windows for
performing synthesis.
•
The Analysis perspective is used after synthesis completes to analyze the design in
detail. This perspective provides considerably more detail than the synthesis report.
Changing between perspectives can be done at any time by selecting the desired
perspective button.
The remainder of this chapter discusses how to use Vivado HLS. The following topics are
discussed:
•
How to create a Vivado HLS synthesis project.
•
How to simulate and debug the C code.
•
How to synthesize the design, create new solutions and add optimizations.
•
How to perform design analysis.
•
How to verify and package the RTL output.
•
How to use the Vivado HLS Tcl commands and batch mode.
This chapter ends with a review of the design examples, tutorials, and resources for more
information.
Creating a New Synthesis Project
To create a new project, click the Create New Project link on the Welcome page shown in
Figure 1-7, or select the File > New Project menu command. This opens the project wizard
shown in Figure 1-9, which allows you to specify the following:
•
Project Name: Specifies the project name, which is also the name of the directory in
which the project details are stored.
•
Location: Specifies where to store the project.
CAUTION! The Windows operating system has a 260-character limit for path lengths, which can affect
the Vivado tools. To avoid this issue, use the shortest possible names and directory locations when
creating projects, defining IP or managed IP projects, and creating block designs.
Figure 1-9: Project Specification
Selecting the Next > button moves the wizard to the second screen where you can enter
details of the project C source files (Figure 1-10).
•
Top Function: Specifies the name of the top-level function to be synthesized. If you
add the C files first, you can use the Browse button to review the C hierarchy, and then
select the top-level function for synthesis. The Browse button remains grayed out until
you add the source files.
Note: This step is not required when the project is specified as SystemC, because Vivado HLS
automatically identifies the top-level functions.
Use the Add Files button to add the source code files to the project.
IMPORTANT: Do not add header files (with the .h suffix) to the project using the Add Files button (or
with the associated add_files Tcl command).
Vivado HLS automatically adds the following directories to the search path:
•
Working directory
Note: The working directory contains the Vivado HLS project directory.
•
Any directory that contains C files added to the project
Header files that reside in these directories are automatically included in the project. You
must specify the path to all other header files using the Edit CFLAGS button.
The Edit CFLAGS button specifies the C compiler flags options required to compile the C
code. These compiler flag options are the same as those used in gcc or g++. C compiler flags include
the path name to header files, macro specifications, and compiler directives, as shown in
the following examples:
•
-I/project/source/headers: Provides the search path to associated header files
Note: You must specify relative path names in relation to the working directory not the project
directory.
•
-DMACRO_1: Defines macro MACRO_1 during compilation
•
-fnested-functions: Defines directives required for any design that contains nested
functions
TIP: For a complete list of supported Edit CFLAGS options, see the Option Summary page
(gcc.gnu.org/onlinedocs/gcc/Option-Summary.html) on the GNU Compiler Collection (GCC) website.
TIP: You can use $::env(MY_ENV_VAR) to specify environment variables in CFLAGS.
For example, to include the directory $MY_ENV_VAR/include for compilation, you can specify
-I$::env(MY_ENV_VAR)/include in CFLAGS.
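As a sketch of how a macro supplied through CFLAGS can affect the compiled code, the following function changes behavior depending on whether MACRO_1 is defined on the command line. MACRO_1 is the macro name used in the examples above; the LIMIT macro and the accumulate function are hypothetical and only illustrate the mechanism.

```c
/* MACRO_1 is the macro name used in the text; everything else is hypothetical.
   Compile with -DMACRO_1 to select the saturating variant. */
#ifndef LIMIT
#define LIMIT 255   /* default used when -DLIMIT=<n> is not supplied */
#endif

int accumulate(int acc, int sample)
{
    int sum = acc + sample;
#ifdef MACRO_1
    /* Saturate at LIMIT when MACRO_1 is defined on the command line. */
    if (sum > LIMIT) {
        sum = LIMIT;
    }
#endif
    return sum;
}
```

Because the same flags apply to C simulation and synthesis, a macro defined this way selects the same variant of the code in both steps.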
Figure 1-10: Project Source Files
The next window in the project wizard allows you to add the files associated with the test
bench to the project.
Note: For SystemC designs with header files associated with the test bench but not the design file,
you must use the Add Files button to add the header files to the project.
In most of the example designs provided with Vivado HLS, the test bench is in a separate
file from the design. Having the test bench and the function to be synthesized in separate
files keeps a clean separation between the process of simulation and synthesis. If the test
bench is in the same file as the function to be synthesized, the file should be added as a
source file and, as shown in the next step, a test bench file.
Figure 1-11: Project Test Bench Files
As with the C source files, click the Add Files button to add the C test bench and the Edit
CFLAGS button to include any C compiler options.
In addition to the C source files, all files read by the test bench must be added to the
project. In the example shown in Figure 1-11, the test bench opens file in.dat to supply
input stimuli to the design and file out.golden.dat to read the expected results.
Because the test bench accesses these files, both files must be included in the project.
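The following is a minimal sketch of how a test bench might check results against a golden file such as out.golden.dat. It assumes that the test bench first writes the design output to a text file of integers, which the text above does not state; the helper name compare_with_golden is hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: compare two text files of integers, value by value.
   Returns 0 when they match, non-zero otherwise (including on I/O errors). */
int compare_with_golden(const char *result_path, const char *golden_path)
{
    FILE *res = fopen(result_path, "r");
    FILE *gold = fopen(golden_path, "r");
    int mismatches = 0;
    int r, g, nr, ng;

    if (res == NULL || gold == NULL) {
        if (res) fclose(res);
        if (gold) fclose(gold);
        return -1;
    }
    for (;;) {
        nr = fscanf(res, "%d", &r);
        ng = fscanf(gold, "%d", &g);
        if (nr != 1 || ng != 1) {
            /* Both files must run out of values at the same time. */
            if (nr != ng) mismatches++;
            break;
        }
        if (r != g) mismatches++;
    }
    fclose(res);
    fclose(gold);
    return mismatches;
}
```

A test bench could return the result of this comparison from main(), so that a mismatch is reported as a simulation failure.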
If the test bench files exist in a directory, the entire directory can be added to the project,
rather than the individual files, using the Add Folders button.
If there is no C test bench, there is no requirement to enter any information here and the
Next > button opens the final window of the project wizard, which allows you to specify the
details for the first solution, as shown in the following figure.
Figure 1-12: Initial Solution Settings
The final window in the new project wizard allows you to specify the details of the first
solution:
•
Solution Name: Vivado HLS provides the initial default name solution1, but you can
specify any name for the solution.
•
Clock Period: The clock period specified in units of ns or a frequency value specified
with the MHz suffix (for example, 150MHz).
•
Uncertainty: The clock period used for synthesis is the clock period minus the clock
uncertainty. Vivado HLS uses internal models to estimate the delay of the operations
for each FPGA. The clock uncertainty value provides a controllable margin to account
for any increases in net delays due to RTL logic synthesis, place, and route. If not
specified in nanoseconds (ns) or a percentage, the clock uncertainty defaults to 12.5%
of the clock period.
•
Part: Click to select the appropriate technology, as shown in the following figure.
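As a worked example of the Clock Period and Uncertainty settings described above, the following sketch computes the period used for synthesis. The function names are hypothetical, and using a negative argument to mean "uncertainty not specified" is a convention chosen only for this illustration.

```c
/* Convert a clock frequency in MHz to a period in ns. */
double period_ns_from_mhz(double freq_mhz)
{
    return 1000.0 / freq_mhz;
}

/* Period used for synthesis: clock period minus clock uncertainty.
   A negative uncertainty_ns selects the 12.5% default described above. */
double synthesis_period_ns(double period_ns, double uncertainty_ns)
{
    if (uncertainty_ns < 0.0) {
        uncertainty_ns = 0.125 * period_ns;
    }
    return period_ns - uncertainty_ns;
}
```

For a 10 ns clock with no uncertainty specified, the default uncertainty is 1.25 ns, so the period used for synthesis is 8.75 ns. A 150MHz clock corresponds to a period of roughly 6.67 ns.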
Figure 1-13: Part Selection
Select the FPGA to be targeted. You can use the filter to reduce the number of devices in the
device list. If the target is a board, specify boards in the top-left corner and the device list
is replaced by a list of the supported boards (and Vivado HLS automatically selects the
correct target device).
Clicking Finish opens the project as shown in the following figure.
Figure 1-14: New Project in the Vivado HLS GUI
The Vivado HLS GUI consists of four panes:
•
On the left hand side, the Explorer pane lets you navigate through the project
hierarchy. A similar hierarchy exists in the project directory on the disk.
•
In the center, the Information pane displays files. Files can be opened by
double-clicking on them in the Explorer Pane.
•
On the right, the Auxiliary pane shows information relevant to whatever file is open in
the Information pane.
•
At the bottom, the Console Pane displays the output when Vivado HLS is running.
Simulating the C Code
Verification in the Vivado HLS flow can be separated into two distinct processes.
•
Pre-synthesis validation that validates the C program correctly implements the required
functionality.
•
Post-synthesis verification that verifies the RTL is correct.
Both processes are referred to as simulation: C simulation and C/RTL co-simulation.
Before synthesis, the function to be synthesized should be validated with a test bench using
C simulation. A C test bench includes a top-level function main() and the function to be
synthesized. It might include other functions. An ideal test bench has the following
attributes:
•
The test bench is self-checking and verifies the results from the function to be
synthesized are correct.
•
If the results are correct, the test bench returns a value of 0 to main(). Otherwise, the
test bench should return a non-zero value.
Vivado HLS synthesizes an OpenCL API C kernel. To simulate an OpenCL API C kernel, you
must use a standard C test bench. You cannot use the OpenCL API C host code as the C test
bench. For more information on test benches, see C Test Bench in Chapter 3.
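The attributes listed above can be sketched as follows. The function dut, its behavior, and the expected values are hypothetical. The checking logic is written as run_test_bench() so that it can be called from other code; in an actual project this logic would be in main(), and its return value is what Vivado HLS inspects.

```c
#include <stdio.h>

#define N 4  /* hypothetical number of samples */

/* Hypothetical function to be synthesized: doubles each input sample. */
void dut(const int in[N], int out[N])
{
    int i;
    for (i = 0; i < N; i++) {
        out[i] = 2 * in[i];
    }
}

/* Self-checking test bench body. In a real project this logic would be
   in main(), and its return value is what Vivado HLS inspects. */
int run_test_bench(void)
{
    const int in[N] = {1, 2, 3, 4};
    const int expected[N] = {2, 4, 6, 8};
    int out[N];
    int i, errors = 0;

    dut(in, out);
    for (i = 0; i < N; i++) {
        if (out[i] != expected[i]) {
            printf("Mismatch at %d: got %d, expected %d\n", i, out[i], expected[i]);
            errors++;
        }
    }
    if (errors == 0) {
        printf("Test Passed!\n");
    }
    return errors == 0 ? 0 : 1;  /* 0 = pass, non-zero = fail */
}
```

Because the pass or fail decision is carried by the return value rather than by the printed text, the same test bench can be reused unchanged for C/RTL co-simulation.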
Clicking the Run C Simulation toolbar button opens the C Simulation dialog box,
shown in the following figure.
Figure 1-15: C Simulation Dialog Box
If no option is selected in the dialog box, the C code is compiled and the C simulation is
automatically executed. The results are shown in the following figure. When the C code
simulates successfully, the console window displays a message, as shown in the following
figure. Any printf output from the test bench is echoed to the console. In this example,
the test bench prints the message “Test Passed!”
Figure 1-16: C Compiled with Build
The other options in the C Simulation dialog box are:
•
Launch Debugger: This compiles the C code and automatically opens the debug
perspective. From within the debug perspective the Synthesis perspective button (top
left) can be used to return the windows to synthesis perspective.
•
Build Only: The C code compiles, but the simulation does not run. Details on executing
the C simulation are covered in Reviewing the Output of C Simulation.
•
Clean Build: Remove any existing executable and object files from the project before
compiling the code.
•
Optimized Compile: By default the design is compiled with debug information,
allowing the compilation to be analyzed in the debug perspective. This option uses a
higher level of optimization effort when compiling the design but removes all
information required by the debugger. This increases the compile time but should
reduce the simulation run time.
•
Compiler: Allows you to select between using gcc/g++ or clang to compile the code.
Using clang to compile the code automatically invokes additional code checking
(including the gcc/g++ equivalent -Wall option) and optionally allows out-of-range
memory-access and undefined behavior checking through the -clang_sanitizer
option. Use of the sanitizer option increases the memory required to compile the code.
Note: The Compiler option is Linux only and not shown above in Figure 1-15, which displays
the Windows dialog box.
If you select the Launch Debugger option, the windows automatically switch to the debug
perspective and the debug environment opens as shown in the following figure. This is a
full featured C debug environment. The step buttons (red box in the following figure) allow
you to step through the code. You can also set breakpoints and directly view the values of
variables.
Figure 1-17: C Debug Environment
TIP: Click the Synthesis perspective button to return to the standard synthesis windows.
Reviewing the Output of C Simulation
When C simulation completes, a folder csim is created inside the solution folder, as shown
in the following figure.
Figure 1-18: C Simulation Output Files
The folder csim/build is the primary location for all files related to the C simulation.
•
Any files read by the test bench are copied to this folder.
•
The C executable file csim.exe is created and run in this folder.
•
Any files written by the test bench are created in this folder.
If the Build Only option is selected in the C simulation dialog box, the file csim.exe is
created in this folder but the file is not executed. The C simulation is run manually by
executing this file from a command shell. On Windows the Vivado HLS command shell is
available through the start menu.
The folder csim/report contains a log file of the C simulation.
The next step in the Vivado HLS design flow is to execute synthesis.
Synthesizing the C Code
The following topics are discussed in this section:
•
Creating an Initial Solution.
•
Reviewing the Output of C Synthesis.
•
Analyzing the Results of Synthesis.
•
Creating a New Solution.
•
Applying Optimization Directives.
Creating an Initial Solution
Use the C Synthesis toolbar button or the menu Solution > Run C Synthesis to
synthesize the design to an RTL implementation. During the synthesis process, messages are
echoed to the console window.
The messages include information messages showing how the synthesis process is
proceeding:
INFO: [HLS 200-10] Opening and resetting project
'C:/Vivado_HLS/My_First_Project/proj_dct'.
INFO: [HLS 200-10] Adding design file 'dct.cpp' to the project
INFO: [HLS 200-10] Adding test bench file 'dct_test.cpp' to the project
INFO: [HLS 200-10] Adding test bench file 'in.dat' to the project
INFO: [HLS 200-10] Adding test bench file 'out.golden.dat' to the project
INFO: [HLS 200-10] Opening and resetting solution
'C:/Vivado_HLS/My_First_Project/proj_dct/solution1'.
INFO: [HLS 200-10] Cleaning up the solution database.
INFO: [HLS 200-10] Setting target device to 'xc7k160tfbg484-1'
INFO: [SYN 201-201] Setting up clock 'default' with a period of 4ns.
Within the GUI, some messages may contain links to enhanced information. In the following
example, message XFORM 203-602 is underlined indicating the presence of a hyperlink.
Clicking on this message provides more details on why the message was issued and
possible resolutions. In this case, Vivado HLS automatically inlines small functions. You can
use the INLINE directive with the -off option to prevent this automatic inlining.
INFO: [XFORM 203-602] Inlining function 'read_data' into 'dct' (dct.cpp:85) automatically.
INFO: [XFORM 203-602] Inlining function 'write_data' into 'dct' (dct.cpp:90) automatically.
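As a sketch of how the automatic inlining reported above might be prevented, the following applies the INLINE directive with the off option as a pragma inside the function body. read_data is the function name from the log messages; its signature and body here are hypothetical, because the source of dct.cpp is not shown in this section.

```c
/* read_data is the function name from the log above; its body and
   signature here are hypothetical. */
void read_data(const int in[4], int buf[4])
{
    int i;
/* Ask Vivado HLS to keep this function as a separate level of hierarchy.
   A standard C compiler ignores the pragma. */
#pragma HLS INLINE off
    for (i = 0; i < 4; i++) {
        buf[i] = in[i];
    }
}
```

With inlining disabled, the function is expected to remain a separate block in the RTL hierarchy and to receive its own synthesis report.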
When synthesis completes, the synthesis report for the top-level function opens
automatically in the information pane as shown in the following figure.
Figure 1-19: Synthesis Report
Reviewing the Output of C Synthesis
When synthesis completes, the folder syn is now available in the solution folder.
Figure 1-20: C Synthesis Output Files
The syn folder contains four sub-folders: a report folder and one folder for each of the RTL
output formats.
The report folder contains a report file for the top-level function and one for every
sub-function in the design, provided the function was not inlined using the INLINE directive
or inlined automatically by Vivado HLS. The report for the top-level function provides
details on the entire design.
The verilog, vhdl, and systemc folders contain the output RTL files. Figure 1-20 shows
the verilog folder expanded. The top-level file has the same name as the top-level
function for synthesis. In the C design there is one RTL file for each function (not inlined).
There might be additional RTL files to implement sub-blocks (block RAM, pipelined
multipliers, etc).
IMPORTANT: Xilinx does not recommend using these files for RTL synthesis. Instead, Xilinx
recommends using the packaged IP output files discussed later in this design flow. Carefully read the
text that immediately follows this note.
In cases where Vivado HLS uses Xilinx IP in the design, such as with floating point designs,
the RTL directory includes a script to create the IP during RTL synthesis. If the files in the
syn folder are used for RTL synthesis, it is your responsibility to correctly use any script files
present in those folders. If the packaged IP is used, this process is performed automatically
by the Xilinx design tools.
Analyzing the Results of C Synthesis
The two primary features provided to analyze the RTL design are:
•
Synthesis reports
•
Analysis Perspective
In addition, if you are more comfortable working in an RTL environment, Vivado HLS creates
two projects during the IP packaging process:
•
Vivado Design Suite project
•
Vivado IP Integrator project
The RTL projects are discussed in Reviewing the Output of IP Packaging.
Synthesis Reports
When synthesis completes, the synthesis report for the top-level function opens
automatically in the information pane (Figure 1-19). The report provides details on both the
performance and area of the RTL design. The outline tab on the right-hand side can be used
to navigate through the report.
The following table explains the categories in the synthesis report.
Table 1-1: Synthesis Report Categories
Category: General Information
Description: Details on when the results were generated, the version of the software
used, the project name, the solution name, and the technology details.
Category: Performance Estimates > Timing
Description: The target clock frequency, clock uncertainty, and the estimate of the
fastest achievable clock frequency.
Category: Performance Estimates > Latency > Summary
Description: Reports the latency and initiation interval for this block and any sub-blocks
instantiated in this block.
Each sub-function called at this level in the C source is an instance in this
RTL block, unless it was inlined.
The latency is the number of cycles it takes to produce the output. The
initiation interval is the number of clock cycles before new inputs can be
applied.
In the absence of any PIPELINE directives, the latency is one cycle less than
the initiation interval (the next input is read when the final output is
written).
Category: Performance Estimates > Latency > Detail
Description: The latency and initiation interval for the instances (sub-functions) and
loops in this block. If any loops contain sub-loops, the loop hierarchy is
shown.
The min and max latency values indicate the latency to execute all iterations
of the loop. The presence of conditional branches in the code might make
the min and max different.
The Iteration Latency is the latency for a single iteration of the loop.
If the loop has a variable latency, the latency values cannot be determined
and are shown as a question mark (?). See the text after this table.
Any specified target initiation interval is shown beside the actual initiation
interval achieved.
The tripcount shows the total number of loop iterations.
Category: Utilization Estimates > Summary
Description: This part of the report shows the resources (LUTs, Flip-Flops, DSP48s) used
to implement the design.
Category: Utilization Estimates > Details > Instance
Description: The resources specified here are used by the sub-blocks instantiated at this
level of the hierarchy.
If the design has no RTL hierarchy, there are no instances reported.
If any instances are present, clicking on the name of the instance opens the
synthesis report for that instance.
Category: Utilization Estimates > Details > Memory
Description: The resources listed here are those used in the implementation of
memories at this level of the hierarchy.
Vivado HLS reports a single-port BRAM as using one bank of memory and
reports a dual-port BRAM as using two banks of memory.
Category: Utilization Estimates > Details > FIFO
Description: The resources listed here are those used in the implementation of any FIFOs
implemented at this level of the hierarchy.
Category: Utilization Estimates > Details > Shift Register
Description: A summary of all shift registers mapped into Xilinx SRL components.
Additional mapping into SRL components can occur during RTL synthesis.
Category: Utilization Estimates > Details > Expressions
Description: This category shows the resources used by any expressions such as
multipliers, adders, and comparators at the current level of hierarchy.
The bit-widths of the input ports to the expressions are shown.
Category: Utilization Estimates > Details > Multiplexors
Description: This section of the report shows the resources used to implement
multiplexors at this level of hierarchy.
The input widths of the multiplexors are shown.
Category: Utilization Estimates > Details > Register
Description: A list of all registers at this level of hierarchy is shown here. The report
includes the register bit-widths.
Category: Interface Summary > Interface
Description: This section shows how the function arguments have been synthesized into
RTL ports.
The RTL port names are grouped with their protocol and source object:
these are the RTL ports created when that source object is synthesized with
the stated I/O protocol.
Certain Xilinx devices use stacked silicon interconnect (SSI) technology. In these devices, the
total available resources are divided over multiple super logic regions (SLRs). When you
select an SSI technology device as the target technology, the utilization report includes
details on both the SLR usage and the total device usage.
IMPORTANT: When using SSI technology devices, it is important to ensure that the logic created by
Vivado HLS fits within a single SLR. For information on using SSI technology devices, see Managing
Interfaces with SSI Technology Devices.
A common issue for new users of Vivado HLS is seeing a synthesis report similar to the
following figure. The latency values are all shown as a “?” (question mark).
Figure 1-21: Synthesis Report
Vivado HLS performs analysis to determine the number of iterations of each loop. If the loop
iteration limit is a variable, Vivado HLS cannot determine the maximum number of iterations.
In the following example, the maximum iteration of the for-loop is determined by the value
of input num_samples. The value of num_samples is not defined in the C function, but
comes into the function from the outside.
void foo (char num_samples, ...);
void foo (char num_samples, ...) {
  int i;
  ...
  loop_1: for(i=0;i< num_samples;i++) {
    ...
    result = a + b;
  }
}
If the latency or throughput of the design depends on a loop with a variable bound,
Vivado HLS reports the latency of the loop as being unknown (represented in the reports by
a question mark “?”).
The TRIPCOUNT directive can be applied to the loop to manually specify the number of
loop iterations and ensure the report contains useful numbers. The -max option specifies
the maximum number of iterations the loop performs, and the -min option specifies the
minimum number of iterations performed.
Note: The TRIPCOUNT directive does not impact the results of synthesis.
The tripcount values are used only for reporting, to ensure the reports generated by Vivado
HLS show meaningful ranges for latency and interval. This also allows a meaningful
comparison between different solutions.
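As a minimal sketch of the pragma form of this directive, the following function bounds a variable-limit loop for reporting purposes. The function name accumulate, its arguments, and the range of 1 to 64 are illustrative assumptions, not part of the original example. A standard C compiler ignores the HLS pragma, so the code still runs in C simulation.

```c
/* Hypothetical example: sums the first num_samples entries of data.
 * The caller is assumed to pass num_samples <= 64. */
int accumulate(const int data[64], unsigned char num_samples) {
    int sum = 0;
    int i;
    loop_1: for (i = 0; i < num_samples; i++) {
/* Reporting hint only: assume between 1 and 64 iterations. */
#pragma HLS loop_tripcount min=1 max=64
        sum += data[i];
    }
    return sum;
}
```

Because the values are used only for reporting, changing min or max here would change the latency range shown in the report but not the generated hardware.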
If the C assert macro is used in the code, Vivado HLS can use it to both determine the loop
limits automatically and create hardware that is exactly sized to these limits. See Assertions
in Chapter 3 for more information.
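The following sketch shows how an assert might be placed ahead of a variable-bound loop. The function name sum_samples and the limit of 32 are assumptions made for illustration; the Assertions section in Chapter 3 remains the reference for the exact rules.

```c
#include <assert.h>

/* Hypothetical example: the bound of 32 is an illustrative assumption. */
int sum_samples(const int data[32], unsigned char num_samples) {
    int sum = 0;
    int i;
    /* States that the loop below never runs more than 32 times. */
    assert(num_samples <= 32);
    loop_1: for (i = 0; i < num_samples; i++) {
        sum += data[i];
    }
    return sum;
}
```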
Analysis Perspective
In addition to the synthesis report, you can use the Analysis Perspective to analyze the
results. To open the Analysis Perspective, click the Analysis button as shown in the
following figure.
Figure 1-22: Analysis Perspective
The Analysis Perspective provides both a tabular and graphical view of the design
performance and resources and supports cross-referencing between both views. The
following figure shows the default window configuration when the Analysis Perspective is
first opened.
The Module Hierarchy pane provides an overview of the entire RTL design.
•
This view can navigate throughout the design hierarchy.
•
The Module Hierarchy pane shows the resources and latency contribution for each
block in the RTL hierarchy.
The following figure shows the dct design uses 6 block RAMs, approximately 300 LUTs and
has a latency of around 3000 clock cycles. Sub-block dct_2d contributes 4 block RAMs,
approximately 250 LUTs and about 2600 cycles of latency to the total. It is immediately clear
that most of the resources and latency in this design are due to sub-block dct_2d and this
block should be analyzed first.
Figure 1-23: Analysis Perspective in the Vivado HLS GUI
The Performance Profile pane provides details on the performance of the block currently
selected in the Module Hierarchy pane, in this case, the dct block highlighted in the Module
Hierarchy pane.
•
The performance of the block is a function of the sub-blocks it contains and any logic
within this level of hierarchy. The Performance Profile pane shows items at this level of
hierarchy that contribute to the overall performance.
•
Performance is measured in terms of latency and the initiation interval. This pane also
includes details on whether the block was pipelined or not.
•
In this example, you can see that two loops (RD_Loop_Row and WR_Loop_Row) are
implemented as logic at this level of hierarchy. Both contain sub-loops and each
contributes 144 clock cycles to the latency. Add the latency of both loops to the latency
of dct_2d, which is also inside dct, and you get the total latency for the dct block.
The Schedule View pane shows how the operations in this particular block are scheduled
into clock cycles. The default view is the Performance view.
•
The left-hand column lists the resources.
°
Sub-blocks are green.
°
Operations resulting from loops in the source are colored yellow.
°
Standard operations are purple.
•
The dct has three main resources:
°
A loop called RD_Loop_Row. In Figure 1-23 the loop hierarchy for loop
RD_Loop_Row has been expanded.
°
A sub-block called dct_2d.
°
A loop called WR_Loop_Row. The plus symbol “+” indicates this loop has hierarchy
and the loop can be expanded to view it.
•
The top row lists the control states in the design. Control states are the internal states
used by Vivado HLS to schedule operations into clock cycles. There is a close
correlation between the control states and the final states in the RTL FSM, but there is
no one-to-one mapping.
The information presented in the Schedule View is explained here by reviewing the first set
of resources to be executed: the RD_Loop_Row loop.
•
The design starts in the C0 state.
•
It then starts to execute the logic in loop RD_Loop_Row.
Note: In the first state of the loop, the exit condition is checked and there is an add operation.
•
The loop executes over 3 states: C1, C2, and C3.
•
The Performance Profile pane shows this loop has a tripcount of 8: it therefore iterates
around these 3 states 8 times.
•
The Performance Profile pane shows loop RD_Loop_Row takes 144 clock cycles to
execute.
°
One cycle at the start of loop RD_Loop_Row.
°
The Performance Profile pane indicates it takes 16 clock cycles to execute all
operations of loop RD_Loop_Col.
°
Plus a clock cycle to return to the start of loop RD_Loop_Row for a total of 18 cycles
per loop iteration.
°
8 iterations of 18 cycles is why it takes 144 clock cycles to complete.
•
Within loop RD_Loop_Col you can see there are some adders, a 2 cycle read operation
and a write operation.
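The cycle count above can be reproduced with a short calculation. The helper below is only a sketch of that arithmetic; the function name and parameter names are hypothetical and it is not how Vivado HLS computes latency internally.

```c
/* Latency, in clock cycles, of an outer loop whose body is one inner loop.
 * entry_cycles: cycles spent at the start of each outer iteration.
 * inner_cycles: cycles needed to execute the whole inner loop.
 * exit_cycles:  cycles spent returning to the start of the outer loop. */
unsigned outer_loop_latency(unsigned tripcount, unsigned entry_cycles,
                            unsigned inner_cycles, unsigned exit_cycles) {
    unsigned cycles_per_iteration = entry_cycles + inner_cycles + exit_cycles;
    return tripcount * cycles_per_iteration;
}
```

With the values from this example, outer_loop_latency(8, 1, 16, 1) evaluates 8 iterations of 1 + 16 + 1 = 18 cycles, which gives 144.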
The following figure shows that you can select an operation and right-click the mouse to
open the associated variable in the source code view. You can see that the write operation
is implementing the writing of data into the buf array from the input array variable.
Figure 1-24: C Source Code Correlation
The Analysis Perspective also allows you to analyze resource usage. The following figure
shows the resource profile and the resource panes.
Figure 1-25: Analysis Perspective with Resource Profile
The Resource Profile pane shows the resources used at this level of hierarchy. In this
example, you can see that most of the resources are due to the instances: blocks that are
instantiated inside this block.
You can see by expanding the Expressions that most of the resources at this level of
hierarchy are used to implement adders.
The Resource pane shows the control state of the operations used. In this example, each
adder operation is associated with a different adder resource. There is no sharing of the
adders. More than one add operation on a horizontal line would indicate that the same
resource is used multiple times in different states or clock cycles.
The adders are used in the same cycles in which the memories are accessed, and each adder
is dedicated to one memory. Cross correlation with the C code can be used to confirm this.
If the DATAFLOW directive has been applied to a function, the Analysis Perspective provides
a dataflow viewer which shows the structure of the design. This may be used to ensure data
flows from one task to the next.
In Figure 1-26, the icon beside the dct function indicates a dataflow view is available.
Right-click the function to open the dataflow view.
Figure 1-26: Dataflow View
The Analysis Perspective is a highly interactive feature. More information on the Analysis
Perspective can be found in the Design Analysis section of the Vivado Design Suite Tutorial:
High-Level Synthesis (UG871) [Ref 2].
TIP: Remember, even if a Tcl flow is used to create designs, the project can still be opened in the GUI
and the Analysis Perspective used to analyze the design.
Use the Synthesis perspective button to return to the synthesis view.
Generally after design analysis you can create a new solution to apply optimization
directives. Using a new solution for this allows the different solutions to be compared.
Creating a New Solution
The most typical use of Vivado HLS is to create an initial design, then perform optimizations
to meet the desired area and performance goals. Solutions offer a convenient way to ensure
the results from earlier synthesis runs can be both preserved and compared.
Use the New Solution toolbar button or the menu Project > New Solution to create
a new solution. This opens the Solution Wizard as shown in the following figure.
Figure 1-27: New Solution Wizard
The Solution Wizard has the same options as the final window in the New Project wizard
(Figure 1-12) plus an additional option that allows any directives and custom constraints
applied to an existing solution to be conveniently copied to the new solution, where they
can be modified or removed.
After the new solution has been created, optimization directives can be added (or modified
if they were copied from the previous solution). The next section explains how directives
can be added to solutions. Custom constraints are applied using the configuration options
and are discussed in Optimizing the Design.
Applying Optimization Directives
The first step in adding optimization directives is to open the source code in the
Information pane. As shown in the following figure, expand the Source container located at
the top of the Explorer pane, and double-click the source file to open it for editing in the
Information pane.
Figure 1-28: Source and Directive
With the source code active in the Information pane, select the Directives tab on the right
to display and modify directives for the file. The Directives tab contains all the objects and
scopes in the currently opened source code to which you can apply directives.
Note: To apply directives to objects in other C files, you must open the file and make it active in the
Information pane.
Although you can select objects in the Vivado HLS GUI and apply directives, Vivado HLS
applies all directives to the scope that contains the object. For example, you can apply an
INTERFACE directive to an interface object in the Vivado HLS GUI. Vivado HLS applies the
directive to the top-level function (scope), and the interface port (object) is identified in the
directive. In the following example, port data_in on function foo is specified as an
AXI4-Lite interface:
set_directive_interface -mode s_axilite "foo" data_in
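The same directive can also be embedded in the source as a pragma placed inside the top-level function. The sketch below assumes a hypothetical body for foo; only the function name, the port name data_in, and the s_axilite mode come from the text above.

```c
/* Hypothetical top-level function; the body is illustrative only. */
int foo(int data_in) {
/* Pragma counterpart of the Tcl command: the directive sits in the scope
 * (function foo) and names the object (port data_in). */
#pragma HLS INTERFACE s_axilite port=data_in
    return data_in + 1;
}
```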
You can apply optimization directives to the following objects and scopes:
•
Interfaces
When you apply directives to an interface, Vivado HLS applies the directive to the
top-level function, because the top-level function is the scope that contains the
interface.
•
Functions
When you apply directives to functions, Vivado HLS applies the directive to all objects
within the scope of the function. The effect of any directive stops at the next level of
function hierarchy. The only exception is a directive that supports or uses a recursive
option, such as the PIPELINE directive that recursively unrolls all loops in the hierarchy.
•
Loops
When you apply directives to loops, Vivado HLS applies the directive to all objects
within the scope of the loop. For example, if you apply a LOOP_MERGE directive to a
loop, Vivado HLS applies the directive to any sub-loops within the loop but not to the
loop itself.
Note: The loop to which the directive is applied is not merged with siblings at the same level of
hierarchy.
•
Arrays
When you apply directives to arrays, Vivado HLS applies the directive to the scope that
contains the array.
•
Regions
When you apply directives to regions, Vivado HLS applies the directive to the entire
scope of the region. A region is any area enclosed within two braces. For example:
{
the scope between these braces is a region
}
Note: You can apply directives to a region in the same way you apply directives to functions and
loops.
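The scopes listed above can be sketched in a single function using pragmas. Everything in this example (the function name, array size, labels, and the choice of directives) is an assumption made for illustration; it shows only where a directive is placed for an array, a loop, and a region.

```c
#define N 8  /* illustrative array size */

/* Hypothetical function showing where pragmas sit for each kind of scope. */
void scale_and_offset(const int in[N], int out[N]) {
    int buf[N];
    int i;
/* Array: the directive is placed in the function scope that contains buf. */
#pragma HLS ARRAY_PARTITION variable=buf complete dim=1

    Scale_Loop: for (i = 0; i < N; i++) {
/* Loop: the directive is placed inside the body of the loop. */
#pragma HLS PIPELINE II=1
        buf[i] = in[i] * 2;
    }

    Merge_Region: {
/* Region: the directive applies to the scope between these braces. */
#pragma HLS LOOP_MERGE
        Offset_Loop: for (i = 0; i < N; i++) {
            buf[i] = buf[i] + 1;
        }
        Copy_Loop: for (i = 0; i < N; i++) {
            out[i] = buf[i];
        }
    }
}
```

A standard C compiler ignores the HLS pragmas, so the function behaves the same in C simulation with or without them.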
To apply a directive, select an object in the Directives tab, right-click, and select Insert
Directive to open the Directives Editor dialog box. From the drop-down menu, select the
appropriate directive. The drop-down menu only shows directives that you can add to the
selected object or scope. For example, if you select an array object, the drop-down menu
does not show the PIPELINE directive, because an array cannot be pipelined. The following
figure shows the addition of the DATAFLOW directive to the DCT function.
Figure 1-29: Adding Directives
Using Tcl Commands or Embedded Pragmas
In the Vivado HLS Directive Editor dialog box, you can specify either of the following
Destination settings:
•
Directive File: Vivado HLS inserts the directive as a Tcl command into the file
directives.tcl in the solution directory.
•
Source File: Vivado HLS inserts the directive directly into the C source file as a pragma.
The following table describes the advantages and disadvantages of both approaches.
Table 1-2: Tcl Commands Versus Pragmas
Directive Format: Directives file (Tcl Command)
Advantages:
• Each solution has independent directives. This approach is ideal for design exploration.
• If any solution is re-synthesized, only the directives specified in that solution are applied.
Disadvantages:
• If the C source files are transferred to a third-party or archived, the directives.tcl file
must be included.
• The directives.tcl file is required if the results are to be re-created.
Directive Format: Source Code (Pragma)
Advantages:
• The optimization directives are embedded into the C source code.
• Ideal when the C source files are shipped to a third-party as C IP. No other files are
required to recreate the same results.
• Useful approach for directives that are unlikely to change, such as TRIPCOUNT and
INTERFACE.
Disadvantages:
• If the optimization directives are embedded in the code, they are automatically applied
to every solution when re-synthesized.
The following figure shows the DATAFLOW directive being added to the Directive File. The
directives.tcl file is located in the solution constraints folder and opened in the
Information pane using the resulting Tcl command.
Figure 1-30: Adding Tcl Directives
When directives are applied as a Tcl command, the Tcl command specifies the scope or the
scope and object within that scope. In the case of loops and regions, the Tcl command
requires that these scopes be labeled. If the loop or region does not currently have a label,
a pop-up dialog box asks for a label and suggests a default name for it.
The following shows examples of labeled and unlabeled loops and regions.
// Example of a loop with no label
for(i=0; i<3;i++) {
  printf("This is loop WITHOUT a label \n");
}
// Example of a loop with a label
My_For_Loop:for(i=0; i<3;i++) {
  printf("This loop has the label My_For_Loop \n");
}
// Example of a region with no label
{
  printf("The scope between these braces has NO label");
}
// Example of a NAMED region
My_Region:{
  printf("The scope between these braces HAS the label My_Region");
}
TIP: Named loops allow the synthesis report to be easily read. An auto-generated label is assigned to
loops without a label.
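The label is what lets a Tcl directive locate a loop. In the sketch below, the function sum3 and the choice of the UNROLL directive are hypothetical; the comment shows how a Tcl command would reference the loop as function/label.

```c
/* Hypothetical function; only the label My_For_Loop comes from the text. */
int sum3(const int data[3]) {
    int total = 0;
    int i;
    /* A Tcl directive would reference this loop as function/label, e.g.:
     *   set_directive_unroll "sum3/My_For_Loop"
     */
    My_For_Loop: for (i = 0; i < 3; i++) {
        total += data[i];
    }
    return total;
}
```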
The following figure shows the DATAFLOW directive added to the Source File and the
resultant source code open in the information pane. The source code now contains a
pragma which specifies the optimization directive.
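As a rough illustration of what source code with an embedded DATAFLOW pragma might look like, consider the following sketch. It is not the dct source shown in the figure; the function names, the array length, and the two stages are assumptions.

```c
#define LEN 4  /* illustrative array length */

/* First hypothetical task: doubles every element. */
static void stage_double(const int in[LEN], int out[LEN]) {
    int i;
    for (i = 0; i < LEN; i++) out[i] = in[i] * 2;
}

/* Second hypothetical task: adds one to every element. */
static void stage_increment(const int in[LEN], int out[LEN]) {
    int i;
    for (i = 0; i < LEN; i++) out[i] = in[i] + 1;
}

/* Hypothetical top-level function with the directive embedded as a pragma. */
void two_stage(const int in[LEN], int out[LEN]) {
#pragma HLS DATAFLOW
    int tmp[LEN];
    stage_double(in, tmp);
    stage_increment(tmp, out);
}
```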
Figure 1-31: Adding Pragma Directives
In both cases, the directive is applied and the optimization performed when synthesis is
executed. If the code was modified, either by inserting a label or pragma, a pop-up dialog
box reminds you to save the code before synthesis.
A complete list of all directives and custom constraints can be found in Optimizing the
Design. For information on directives and custom constraints, see Chapter 4, High-Level
Synthesis Reference Guide.
Applying Optimization Directives to Global Variables
Directives can only be applied to scopes or objects within a scope. As such, they cannot be
directly applied to global variables which are declared outside the scope of any function.
To apply a directive to a global variable, apply the directive to the scope (function, loop or
region) where the global variable is used. Open the directives tab on a scope where the
variable is used, apply the directive and enter the variable name manually in Directives
Editor.
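When pragmas are used instead of the Directives Editor, the same idea might look like the sketch below. The names coeff_table and weighted_sum and the choice of ARRAY_PARTITION are assumptions for illustration; whether a particular directive accepts a global variable should be checked in the reference guide.

```c
/* Global array: declared outside the scope of any function. */
int coeff_table[4] = {1, 2, 3, 4};

/* Hypothetical function that uses the global array. */
int weighted_sum(const int x[4]) {
/* The directive is placed in the scope that uses the global variable and
 * names the variable explicitly. */
#pragma HLS ARRAY_PARTITION variable=coeff_table complete dim=1
    int acc = 0;
    int i;
    for (i = 0; i < 4; i++) {
        acc += coeff_table[i] * x[i];
    }
    return acc;
}
```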
Applying Optimization Directives to Class Objects
Optimization directives can also be applied to objects or scopes defined in a class. The
difference is typically that classes are defined in a header file. Use one of the following
actions to open the header file:
•
From the Explorer pane, open the Includes folder, navigate to the header file, and
double-click the file to open it.
•
From within the C source, place the cursor over the header file (the #include
statement), hold down the Ctrl key, and click the header file to open it.
The directives tab is then populated with the objects in the header file and directives can be
applied.
CAUTION! Care should be taken when applying directives as pragmas to a header file. The file might be
used by other people or used in other projects. Any directives added as a pragma are applied each time
the header file is included in a design.
Applying Optimization Directives to Templates
To apply optimization directives manually on templates when using Tcl commands, specify
the template arguments and class when referring to class methods. For example, given the
following C++ code:
template 
void DES10::calcRUN() {…}
The following Tcl command is used to specify the INLINE directive on the function:
set_directive_inline DES10::calcRUN
Using #Define with Pragma Directives
Pragma directives do not natively support the use of values specified by the define
statement. The following code seeks to specify the depth of a stream using the define
statement and will not compile.
TIP: Specify the depth argument with an explicit value.
#include 
using namespace hls;
#define STREAM_IN_DEPTH 8
void foo (stream &InStream, stream &OutStream) {
// Illegal pragma
#pragma HLS stream depth=STREAM_IN_DEPTH variable=InStream
// Legal pragma
#pragma HLS stream depth=8 variable=OutStream
}
You can use macros in the C code to implement this functionality. The key to using macros
is to use a level of hierarchy in the macro. This allows the expansion to be correctly
performed. The code can be made to compile as follows:
#include 
using namespace hls;
#define PRAGMA_SUB(x) _Pragma (#x)
#define PRAGMA_HLS(x) PRAGMA_SUB(x)
#define STREAM_IN_DEPTH 8
void foo (stream &InStream, stream &OutStream) {
// Legal pragmas
PRAGMA_HLS(HLS stream depth=STREAM_IN_DEPTH variable=InStream)
#pragma HLS stream depth=8 variable=OutStream
}
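The reason the extra level of macro hierarchy matters can be seen with plain string literals. The sketch below uses hypothetical helper macros (STR_DIRECT and STR_EXPANDED) that mirror the roles of PRAGMA_SUB and PRAGMA_HLS: the # operator stringifies its argument without expanding it, so an outer macro is needed to expand STREAM_IN_DEPTH first.

```c
#define STREAM_IN_DEPTH 8

/* One level: # stringifies the argument before it is macro-expanded. */
#define STR_DIRECT(x) #x
/* Two levels: the outer macro expands x first, then the inner macro
 * stringifies the expanded result. */
#define STR_EXPANDED(x) STR_DIRECT(x)

/* Yields the macro name itself as text. */
const char *direct_text(void) { return STR_DIRECT(STREAM_IN_DEPTH); }

/* Yields the macro value as text. */
const char *expanded_text(void) { return STR_EXPANDED(STREAM_IN_DEPTH); }
```

Here direct_text() returns "STREAM_IN_DEPTH" while expanded_text() returns "8", which is why the pragma text must be built through two macro levels.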
Failure to Satisfy Optimization Directives
When optimization directives are applied, Vivado HLS outputs information to the console
(and log file) detailing the progress. In the following example, the PIPELINE directive was
applied to the C function with an II=1 (initiation interval of 1), but synthesis failed to
satisfy this objective.
INFO: [SCHED 11] Starting scheduling ...
INFO: [SCHED 61] Pipelining function 'array_RAM'.
WARNING: [SCHED 63] Unable to schedule the whole 2 cycles 'load' operation
('d_i_load', array_RAM.c:98) on array 'd_i' within the first cycle (II = 1).
WARNING: [SCHED 63] Please consider increasing the target initiation interval of the
pipeline.
WARNING: [SCHED 69] Unable to schedule 'load' operation ('idx_load_2',
array_RAM.c:98) on array 'idx' due to limited memory ports.
INFO: [SCHED 61] Pipelining result: Target II: 1, Final II: 4, Depth: 6.
INFO: [SCHED 11] Finished scheduling.
IMPORTANT: If Vivado HLS fails to satisfy an optimization directive, it automatically relaxes the
optimization target and seeks to create a design with a lower performance target. If it cannot relax the
target, it will halt with an error.
By seeking to create a design which satisfies a lower optimization target, Vivado HLS is able
to provide three important types of information:
•
What target performance can be achieved with the current C code and optimization
directives.
•
A list of the reasons why it was unable to satisfy the higher performance target.
•
A design which can be analyzed to provide more insight and help understand the
reason for the failure.
In message SCHED-69, the reason given for failing to reach the target II is limited memory
ports. The design must access a block RAM, and a block RAM has a maximum of two
ports.
The next step after a failure such as this is to analyze what the issue is. In this example,
analyze line 98 of the code (the location reported in the messages) and/or use the Analysis
perspective to determine the bottleneck, and then determine whether the requirement for
more than two ports can be reduced or how the number of ports can be increased. More
details on how to optimize designs for higher performance are provided in Optimizing the
Design.
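One possible way to increase the number of ports is to partition the array so that each part is implemented as its own memory. The following sketch is hypothetical: it is not the array_RAM source behind the messages above, and the function name, array size, and partition factor are assumptions.

```c
/* Hypothetical function: reads four elements of mem in every iteration. */
int sum_groups(const int mem[16]) {
/* Assumption: splitting mem into 4 cyclic parts places mem[i], mem[i+1],
 * mem[i+2] and mem[i+3] in four separate memories, so the four reads below
 * no longer compete for the two ports of a single block RAM. */
#pragma HLS ARRAY_PARTITION variable=mem cyclic factor=4 dim=1
    int total = 0;
    int i;
    Sum_Loop: for (i = 0; i < 16; i += 4) {
#pragma HLS PIPELINE II=1
        total += mem[i] + mem[i + 1] + mem[i + 2] + mem[i + 3];
    }
    return total;
}
```

Without the partition, four reads per iteration from one two-port memory could not all be scheduled in a single cycle, which is the kind of limit message SCHED-69 reports.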
After the design is optimized and the desired performance achieved, the RTL can be verified
and the results of synthesis packaged as IP.
Verifying the RTL is Correct
Use the C/RTL cosimulation toolbar button or the menu Solution > Run C/RTL
cosimulation to verify the RTL results.
The C/RTL co-simulation dialog box shown in the following figure allows you to select
which type of RTL output to use for verification (Verilog or VHDL) and which HDL simulator
to use for the simulation.
A complete description of all C/RTL co-simulation options is provided in Verifying the RTL.
Figure 1-32: C/RTL Co-Simulation Dialog Box
When verification completes, the console displays message COSIM-1000 to confirm the
verification was successful. The output of any printf commands in the C test bench is
echoed to the console.
INFO: [COSIM 316] Starting C post checking ...
Test passed !
INFO: [COSIM 1000] *** C/RTL co-simulation finished: PASS ***
The simulation report opens automatically in the Information pane, showing the pass or fail
status and the measured statistics on latency and II.
IMPORTANT: The C/RTL co-simulation only passes if the C test bench returns a value of zero.
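A self-checking test bench is one way to meet this requirement. The sketch below is an assumption-laden illustration: add_one stands in for the function being synthesized and run_checks is a hypothetical helper whose result the test bench main() would return.

```c
/* Hypothetical design function standing in for the synthesized top level. */
int add_one(int x) { return x + 1; }

/* Returns the number of mismatches. A test bench main() would return this
 * value so that zero signals a pass, for example:
 *   int main(void) { return run_checks(); }
 */
int run_checks(void) {
    static const int inputs[3]   = {0, 5, -1};
    static const int expected[3] = {1, 6, 0};
    int errors = 0;
    int i;
    for (i = 0; i < 3; i++) {
        if (add_one(inputs[i]) != expected[i]) {
            errors++;
        }
    }
    return errors;
}
```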
Reviewing the Output of C/RTL Co-Simulation
A sim directory is created in the solution folder when RTL verification completes. The
following figure shows the sub-folders created.
•
The report folder contains the report and log file for each type of RTL simulated.
•
A verification folder is created for each type of RTL which is verified. The verification
folder is named verilog or vhdl. If an RTL format is not verified, no folder is created.
•
The RTL files used for simulation are stored in the verification folder.
•
The RTL simulation is executed in the verification folder.
•
Any outputs, such as trace files, are written to the verification folder.
•
Folders autowrap, tv, wrap and wrap_pc are work folders used by Vivado HLS. There
are no user files in these folders.
If the Setup Only option was selected in the C/RTL Co-Simulation dialog box, an
executable is created in the verification folder but the simulation is not run. The simulation
can be manually run by executing the simulation executable at the command prompt.
Note: For more information on the RTL verification process, see Verifying the RTL.
Figure 1-33: RTL Verification Output
Packaging the IP
The final step in the Vivado HLS design flow is to package the RTL output as IP. Use the
Export RTL toolbar button or the menu Solution > Export RTL to open the Export RTL
dialog box shown in the following figure.
Figure 1-34: RTL Export Dialog Box
The selections available in the drop-down Format Selection menu depend on the FPGA
device targeted for synthesis. More details on the IP packaging options are provided in
Exporting the RTL Design.
Reviewing the Output of IP Packaging
The folder impl is created in the solution folder when the Export RTL process completes.
Figure 1-35: Export RTL Output
In all cases the output includes:
•
The report folder. If the flow option is selected, the report for Verilog and VHDL
synthesis or implementation is placed in this folder.
•
The verilog folder. This contains the Verilog format RTL output files. If the flow
option is selected, RTL synthesis or implementation is performed in this folder.
•
The vhdl folder. This contains the VHDL format RTL output files. If the flow option is
selected, RTL synthesis or implementation is performed in this folder.
IMPORTANT: Xilinx does not recommend directly using the files in the verilog or vhdl folders for
your own RTL synthesis project. Instead, Xilinx recommends using the packaged IP output files
discussed next. Please carefully read the text that immediately follows this note.
In cases where Vivado HLS uses Xilinx IP in the design, such as with floating point designs,
the RTL directory includes a script to create the IP during RTL synthesis. If the files in the
verilog or vhdl folders are copied out and used for RTL synthesis, it is your responsibility
to correctly use any script files present in those folders. If the packaged IP is used, this
process is performed automatically by the Xilinx design tools.
The Format Selection drop-down determines which other folders are created. The
following formats are provided: IP Catalog, System Generator for DSP, and Synthesized
Checkpoint (.dcp). For more details, see Exporting the RTL Design.
Table 1-3: RTL Export Selections
Format Selection: IP Catalog
Sub-Folder: ip
Comments: Contains a ZIP file which can be added to the Vivado IP Catalog. The ip folder
also contains the contents of the ZIP file (unzipped). This option is not available for FPGA
devices older than 7 series or Zynq-7000 AP SoC.
Format Selection: System Generator for DSP
Sub-Folder: sysgen
Comments: This output can be added to the Vivado edition of System Generator for DSP.
This option is not available for FPGA devices older than 7 series or Zynq-7000 AP SoC.
Format Selection: Synthesized Checkpoint (.dcp)
Sub-Folder: ip
Comments: This option creates Vivado checkpoint files which can be added directly into a
design in the Vivado Design Suite. This option requires RTL synthesis to be performed. When
this option is selected, the flow option and setting syn is automatically selected. The output
includes an HDL wrapper you can use to instantiate the IP into an HDL file.
Example Vivado RTL Project
The Export RTL process automatically creates a Vivado RTL project. For hardware designers
more familiar with RTL design and working in the Vivado RTL environment, this provides a
convenient way to analyze the RTL.
As shown in Figure 1-35 a project.xpr file is created in the verilog and vhdl folders. This
file can be used to directly open the RTL output inside the Vivado Design Suite.
If C/RTL co-simulation has been executed in Vivado HLS, the Vivado project contains an RTL
test bench and the design can be simulated.
Note: The Vivado RTL project has the RTL output from Vivado HLS as the top-level design. Typically,
this design should be incorporated as IP into a larger Vivado RTL project. This Vivado project is
provided solely as a means for design analysis and is not intended as a path to implementation.
Example IP Integrator Project
If IP Catalog is selected as the output format, the output folder impl/ip/example is
created. This folder contains an executable (ipi_example.bat or ipi_example.csh) which can
be used to create a project for IP Integrator.
To create the IP Integrator project, execute the ipi_example.* file at the command
prompt then open the Vivado IPI project file which is created.
Archiving the Project
To archive the Vivado HLS project to an industry-standard ZIP file, select File > Archive.
Use the Archive Name option to name the specified ZIP file. You can modify the default
settings as follows:
•
By default, only the current active solution is archived. To ensure all solutions are
archived, deselect the Active Solution Only option.
•
By default, the archive contains all of the output results from the archived solutions. If
you want to archive the input files only, deselect the Include Run Results option.
Using the Command Prompt and Tcl Interface
On Windows the Vivado HLS Command Prompt can be invoked from the start menu:
Xilinx Design Tools > Vivado 2017.x > Vivado HLS > Vivado HLS 2017.x
Command Prompt.
On Windows and Linux, using the -i option with the vivado_hls command opens Vivado
HLS in interactive mode. Vivado HLS then waits for Tcl commands to be entered.
$ vivado_hls -i [-l <log_file>]
vivado_hls>
By default, Vivado HLS creates a vivado_hls.log file in the current directory. To specify
a different name for the log file, the -l <log_file> option can be used.
The help command is used to access documentation on the commands. A complete list of
all commands is provided using:
vivado_hls> help
Help on any individual command is provided by using the command name.
vivado_hls> help <command>
Any command or command option can be completed using the auto-complete feature.
After a single character has been specified, pressing the tab key causes Vivado HLS to list
the possible options to complete the command or command option. Entering more
characters improves the filtering of the possible options. For example, pressing the tab key
after typing “open” lists all commands that start with “open”.
vivado_hls> open <press tab key>
open
open_project
open_solution
Pressing the Tab key after typing open_p auto-completes the command open_project,
because there are no other possible options.
Type the exit command to quit interactive mode and return to the shell prompt:
vivado_hls> exit
Additional options for Vivado HLS are:
• vivado_hls -p: open the specified project
• vivado_hls -nosplash: open the GUI without the Vivado HLS splash screen
• vivado_hls -r: return the path to the installation root directory
• vivado_hls -s: return the type of system (for example: Linux, Win)
• vivado_hls -v: return the release version number
Commands embedded in a Tcl script are executed in batch mode with the -f <script_file>
option.
$ vivado_hls -f script.tcl
All the Tcl commands for creating a project in the GUI are stored in the script.tcl file
within the solution. If you wish to develop Tcl batch scripts, the script.tcl file is an ideal
starting point.
Understanding the Windows Command Prompt
On Windows, the Vivado HLS Command Prompt is implemented using the Minimalist GNU
for Windows (minGW) environment, which allows both standard Windows DOS commands
and a subset of Linux commands to be used.
The following figure shows that either the Linux ls command or the DOS dir command
can be used to list the contents of a directory.
Figure 1-36: Vivado HLS Command Prompt
Be aware that not all Linux commands and behaviors are supported in the minGW
environment. The following represent some known common differences in support:
• The Linux which command is not supported.
• Linux paths in a Makefile expand into minGW paths. In all Makefile files, replace any
  Linux style path name assignments such as FOO := :/ with versions in which the path
  name is quoted such as FOO := “:/” to prevent any path substitutions.
Improving Run Time and Capacity
If the issue is with C/RTL co-simulation, refer to the reduce_diskspace option discussed
in Verifying the RTL. The remainder of this section reviews issues with synthesis run time.
Vivado HLS schedules operations hierarchically. The operations within a loop are scheduled,
then the loop, then the sub-functions and the operations within a function. Run time for
Vivado HLS increases when:
• There are more objects to schedule.
• There is more freedom and more possibilities to explore.
Vivado HLS schedules objects. Whether the object is a floating-point multiply operation or
a single register, it is still an object to be scheduled. The floating-point multiply may take
multiple cycles to complete and use many resources to implement but at the level of
scheduling it is still one object.
Unrolling loops and partitioning arrays creates more objects to schedule and potentially
increases the run time. Inlining functions creates more objects to schedule at this level of
hierarchy and also increases run time. These optimizations may be required to meet
performance but be very careful about simply partitioning all arrays, unrolling all loops and
inlining all functions: you can expect a run time increase. Use the optimization strategies
provided earlier and judiciously apply these optimizations.
If the arrays must be partitioned to achieve performance, consider using the
throughput_driven option for config_array_partition to only partition the arrays
based on throughput requirements.
If the loops must be unrolled, or if the use of the PIPELINE directive in the hierarchy above
has automatically unrolled the loops, consider capturing the loop body as a separate
function. This will capture all the logic into one function instead of creating multiple copies
of the logic when the loop is unrolled: one set of objects in a defined hierarchy will be
scheduled faster. Remember to pipeline this function if the unrolled loop is used in a
pipelined region.
The degrees of freedom in the code can also impact run time. Consider Vivado HLS to be an
expert designer who by default is given the task of finding the design with the highest
throughput, lowest latency and minimum area. The more constrained Vivado HLS is, the
fewer options it has to explore and the faster it will run. Consider using latency constraints
over scopes within the code: loops, functions or regions. Setting a LATENCY directive with
the same minimum and maximum values reduces the possible optimization searches within
that scope.
Finally, the config_schedule configuration controls the effort level used during
scheduling. This generally has less impact than the techniques mentioned above, but it is
worth considering. The default strategy is set to Medium.
If this setting is set to Low, Vivado HLS will reduce the amount of time it spends on trying
to improve on the initial result. In some cases, especially if there are many operations and
hence combinations to explore, it may be worth using the low setting. The design may not
be ideal but it may satisfy the requirements and be very close to the ideal. You can proceed
to make progress with the low setting and then use the default setting before you create
your final result.
With a run strategy set to High, Vivado HLS uses additional CPU cycles and memory, even
after satisfying the constraints, to determine if it can create an even smaller or faster design.
This exploration may, or may not, result in a better quality design but it does take more time
and memory to complete. For designs that are just failing to meet their goals or for designs
where many different optimization combinations are possible, this could be a useful
strategy. In general, it is a better practice to leave the run strategies at the Medium default
setting.
Design Examples and References
Vivado HLS provides many tutorials and design examples.
Tutorials
Tutorials are available in the Vivado Design Suite Tutorial: High-Level Synthesis (UG871)
[Ref 2]. The following table shows a list of the tutorial exercises.
Table 1-4: Vivado HLS Tutorial Exercises
| Tutorial Exercise | Description |
|---|---|
| Vivado HLS Introductory Tutorial | An introduction to the operation and primary features of Vivado HLS using an FIR design. |
| C Validation | Uses a Hamming window design to explain C simulation and how to use the C debug environment to validate your C algorithm. |
| Interface Synthesis | Exercises on how to create various types of RTL interface ports using interface synthesis. |
| Arbitrary Precision Types | Shows how a floating-point windowing function is implemented using fixed-point arbitrary precision types to produce more optimal hardware. |
| Design Analysis | Shows how the Analysis perspective is used to improve the performance of a DCT block. |
| Design Optimization | Uses a matrix multiplication example to show how an algorithm is optimized. This tutorial demonstrates how changes to the initial code might be required for a specific hardware implementation. |
| RTL Verification | How to use the RTL verification features and analyze the RTL signal waveforms. |
| Using HLS IP in IP Integrator | Shows how two HLS pre- and post-processing blocks for an FFT can be connected to an FFT IP block using IP integrator. |
| Using HLS IP in a Zynq-7000 AP SoC Processor Design | Shows how the CPU can be used to control a Vivado HLS block through the AXI4-Lite interface and DMA streaming data from DDR memory to and from a Vivado HLS block. Includes the CPU source code and required steps in SDK. |
| Using HLS IP in System Generator for DSP | A tutorial on how to use an HLS block inside a System Generator for DSP design. |
Design Examples
To open the Vivado HLS design examples from the Welcome Page, click Open Example
Project. In the Examples wizard, select a design from the Design Examples folder.
Note: The Welcome Page appears when you invoke the Vivado HLS GUI. You can access it at any
time by selecting Help > Welcome.
You can also open the design examples directly from the Vivado Design Suite installation
area: Vivado_HLS\2017.x\examples\design.
The following table provides a description for each design example.
Table 1-5: Vivado HLS Design Examples
| Design Example | Description |
|---|---|
| 2D_convolution_with_linebuffer | 2D convolution implemented using hls::streams and a line buffer to conserve resources. |
| FFT > fft_ifft | Inverse FFT using FFT IP. |
| FFT > fft_single | Single 1024 point forward FFT with pipelined streaming I/O. |
| FIR > fir_2ch_int | FIR filter with 2 interleaved channels. |
| FIR > fir_3stage | FIR chain with 3 FIRs connected in series: Half band FIR to Half band FIR to a square root raise cosine (SRRC) FIR. |
| FIR > fir_config | FIR filter with coefficients updated using the FIR CONFIG channel. |
| FIR > fir_srrc | SRRC FIR filter. |
| __builtin_ctz | Priority encoder (32- and 64-bit versions) implemented using gcc built-in ‘count trailing zero’ function. |
| axi_lite | AXI4-Lite interface. |
| axi_master | AXI4 master interface. |
| axi_stream_no_side_channel_data | AXI4-Stream interface with no side-channel data in the C code. |
| axi_stream_side_channel_data | AXI4-Stream interfaces using side-channel data. |
| dds > dds_mode_fixed | DDS IP created with both phase offset and phase increment used in fixed mode. |
| dds > dds_mode_none | DDS IP created with phase offset in fixed mode and no phase increment (mode=none). |
| dsp > atan2 | arctan function from the HLS DSP library. |
| dsp > awgn | Additive white Gaussian noise (awgn) function from the HLS DSP library. |
| dsp > cmpy_complex | Fixed-point complex multiplier using complex data types. |
| dsp > cmpy_scalar | Fixed-point complex multiplier using separate scalar data types for the real and imaginary components. |
| dsp > convolution_encoder | Convolution_encoder function from the HLS DSP library, which performs convolutional encoding of an input data stream based on user-defined convolution codes and constraint length. |
| dsp > nco | Numerically controlled oscillator (NCO) function from the HLS DSP library. |
| dsp > qam_demod | QAM demodulator function from the HLS DSP library. |
| dsp > qam_mod | QAM modulator function from the HLS DSP library. |
| dsp > sqrt | Fixed-point coordinate rotation digital computer (CORDIC) implementation of the square root function from the HLS DSP library. |
| dsp > viterbi_decoder | Viterbi decoder from the HLS DSP library. |
| fp_mul_pow2 | Efficient (area and timing) floating point multiplication implementation using power-of-two, which uses a small adder and some optional limit checks instead of a floating-point core and DSP resources. |
| fxp_sqrt | Square-root implementation for ap_fixed types implemented in a bit-serial, fully pipelineable manner. |
| hls_stream | Multirate dataflow (8-bit I/O, 32-bit data processing and decimation) design using hls::stream. |
| linear_algebra > cholesky | Parameterized Cholesky function. |
| linear_algebra > cholesky_alt | Alternative Cholesky implementation. |
| linear_algebra > cholesky_alt_inverse | Cholesky function with a customized trait class to select different implementations. |
| linear_algebra > cholesky_complex | Cholesky function with a complex data type. |
| linear_algebra > cholesky_inverse | Parameterized Cholesky Inverse function. |
| linear_algebra > implementation_targets | Implementation target examples. Note: For details, see Optimizing the Linear Algebra Functions in Chapter 2. |
| linear_algebra > matrix_multiply | Parameterized matrix multiply function. |
| linear_algebra > matrix_multiply_alt | Alternative matrix multiply function. |
| linear_algebra > qr_inverse | Parameterized QR Inverse function. |
| linear_algebra > qrf | Parameterized QRF function. |
| linear_algebra > qrf_alt | Alternative parameterized QRF function. |
| linear_algebra > svd | Parameterized SVD function. |
| linear_algebra > svd_pairs | Parameterized SVD function with alternative “pairs” SVD implementation. |
| loop_labels > loop_label | Loop with a label. |
| loop_labels > no_loop_label | Loop without a label. |
| memory_porting_and_ii | Initiation interval improved using array partitioning directives. |
| perfect_loop > perfect | Perfect loop. |
| perfect_loop > semi_perfect | Semi-perfect loop. |
| rom_init_c | Array coded using a sub-function to guarantee a ROM implementation. |
| window_fn_float | Single-precision floating point windowing function. C++ template class example with compile time selection between Rectangular (none), Hann, Hamming, or Gaussian windows. |
| window_fn_fxpt | Fixed-point windowing function. C++ template class example with compile time selection between Rectangular (none), Hann, Hamming, or Gaussian windows. |
Coding Examples
The Vivado HLS coding examples provide examples of various coding techniques. These are
small examples intended to highlight the results of Vivado HLS synthesis on various C, C++,
and SystemC constructs.
To open the Vivado HLS coding examples from the Welcome Page, click Open Example
Project. In the Examples wizard, select a design from the Coding Style Examples folder.
Note: The Welcome Page appears when you invoke the Vivado HLS GUI. You can access it at any
time by selecting Help > Welcome.
You can also open the coding examples directly from the Vivado Design Suite installation
area: Vivado_HLS\2017.x\examples\coding.
The following table provides a description for each coding example.
Table 1-6: Vivado HLS Coding Examples
| Coding Example | Description |
|---|---|
| apint_arith | Using C ap_cint types. |
| apint_promotion | Highlights the casting required to avoid integer promotion issues with C ap_cint types. |
| array_arith | Using arithmetic in interface arrays. |
| array_FIFO | Implementing a FIFO interface. |
| array_mem_bottleneck | Demonstrates how access to arrays can create a performance bottleneck. |
| array_mem_perform | A solution for the performance bottleneck shown by example array_mem_bottleneck. |
| array_RAM | Implementing a block RAM interface. |
| array_ROM | Example demonstrating how a ROM is automatically inferred. |
| array_ROM_math_init | Example demonstrating how to infer a ROM in more complex cases. |
| cpp_ap_fixed | Using C++ ap_fixed types. |
| cpp_ap_int_arith | Using C++ ap_int types for arithmetic. |
| cpp_FIR | An example C++ design using an object-oriented coding style. |
| cpp_math | An example floating point math design that shows how to use a tolerance in the test bench when comparing results for operations that are not IEEE exact. |
| cpp_template | C++ template example. |
| func_sized | Fixing the size of operation by defining the data widths at the interface. |
| hier_func | An example of adding files as test bench and design files. |
| hier_func2 | An example of adding files as test bench and design files. An example of synthesizing a lower-level block in the hierarchy. |
| hier_func3 | An example of combining test bench and design functions into the same file. |
| hier_func4 | Using the pre-defined macro __SYNTHESIS__ to prevent code being synthesized. Note: Only use the __SYNTHESIS__ macro in the code to be synthesized. Do not use this macro in the test bench, because it is not obeyed by C simulation or C/RTL co-simulation. |
| loop_functions | Converting loops into functions for parallel execution. |
| loop_imperfect | An imperfect loop example. |
| loop_max_bounds | Using a maximum bounds to allow loops to be unrolled. |
| loop_perfect | A perfect loop example. |
| loop_pipeline | Example of loop pipelining. |
| loop_sequential | Sequential loops. |
| loop_sequential_assert | Using assert statements. |
| loop_var | A loop with variable bounds. |
| malloc_removed | Example on removing mallocs from the code. |
| opencl_kernel | Example of synthesizing an OpenCL API C kernel using Vivado HLS, including the implementation of the test bench for verification. |
| pointer_arith | Pointer arithmetic example. |
| pointer_array | An array of pointers. |
| pointer_basic | Basic pointer example. |
| pointer_cast_native | Pointer casting between native C types. |
| pointer_double | Pointer-to-Pointer example. |
| pointer_multi | An example of using multiple pointer targets. |
| pointer_stream_better | Example showing how the volatile keyword is used on interfaces. |
| pointer_stream_good | Multi-read pointer example using explicit pointer arithmetic. |
| sc_combo_method | SystemC combinational design example. |
| sc_FIFO_port | SystemC FIFO port example. |
| sc_multi_clock | SystemC example with multiple clocks. |
| sc_RAM_port | SystemC block RAM port example. |
| sc_sequ_cthread | SystemC sequential design example. |
| struct_port | Using structs on the interface. |
| sum_io | Example of top-level interface ports. |
| types_composite | Composite types. |
| types_float_double | Float types to double type conversion. |
| types_global | Using global variables. |
| types_standard | Example with standard C types. |
| types_union | Example with unions. |
Data Types for Efficient Hardware
C-based native data types are all on 8-bit boundaries (8, 16, 32, 64 bits). RTL buses
(corresponding to hardware) support arbitrary data lengths. Using the standard C data
types can result in inefficient hardware. For example, the basic multiplication unit in an
FPGA is the DSP48 macro, which provides an 18x18-bit multiplier. If a 17-bit multiplication
is required, you should not be forced to implement it with a 32-bit C data type: this would
require 3 DSP48 macros to implement a multiplier when only 1 is required.
The advantage of arbitrary precision data types is that they allow the C code to be updated
to use variables with smaller bit-widths and the C simulation to be re-executed to validate
that the functionality remains identical or acceptable. The smaller bit-widths result in
hardware operators that are in turn smaller and faster. This in turn allows more logic to
be placed in the FPGA and the logic to execute at higher clock frequencies.
Advantages of Hardware Efficient Data Types
The following code performs some basic arithmetic operations:
#include "types.h"

void apint_arith(dinA_t inA, dinB_t inB, dinC_t inC, dinD_t inD,
                 dout1_t *out1, dout2_t *out2, dout3_t *out3, dout4_t *out4
) {
  // Basic arithmetic operations
  *out1 = inA * inB;
  *out2 = inB + inA;
  *out3 = inC / inA;
  *out4 = inD % inA;
}
The data types dinA_t, dinB_t etc. are defined in the header file types.h. It is highly
recommended to use a project wide header file such as types.h as this allows for the easy
migration from standard C types to arbitrary precision types and helps in refining the
arbitrary precision types to the optimal size.
If the data types in the above example are defined as:
typedef char dinA_t;
typedef short dinB_t;
typedef int dinC_t;
typedef long long dinD_t;
typedef int dout1_t;
typedef unsigned int dout2_t;
typedef int32_t dout3_t;
typedef int64_t dout4_t;
The design gives the following results after synthesis:
+ Timing (ns):
    * Summary:
    +---------+-------+----------+------------+
    |  Clock  | Target| Estimated| Uncertainty|
    +---------+-------+----------+------------+
    |default  |   4.00|      3.85|        0.50|
    +---------+-------+----------+------------+

+ Latency (clock cycles):
    * Summary:
    +-----+-----+-----+-----+---------+
    |  Latency  |  Interval | Pipeline|
    | min | max | min | max |   Type  |
    +-----+-----+-----+-----+---------+
    |   66|   66|   67|   67|   none  |
    +-----+-----+-----+-----+---------+

* Summary:
+-----------------+---------+-------+--------+--------+
|       Name      | BRAM_18K| DSP48E|   FF   |   LUT  |
+-----------------+---------+-------+--------+--------+
|Expression       |        -|      -|       0|      17|
|FIFO             |        -|      -|       -|       -|
|Instance         |        -|      1|   17920|   17152|
|Memory           |        -|      -|       -|       -|
|Multiplexer      |        -|      -|       -|       -|
|Register         |        -|      -|       7|       -|
+-----------------+---------+-------+--------+--------+
|Total            |        0|      1|   17927|   17169|
+-----------------+---------+-------+--------+--------+
|Available        |      650|    600|  202800|  101400|
+-----------------+---------+-------+--------+--------+
|Utilization (%)  |        0|    ~0 |       8|      16|
+-----------------+---------+-------+--------+--------+
If the data does not need the full width of a standard C type, but only some smaller width
that is still greater than the next smallest standard C type, the types can instead be
defined as follows:
typedef int6 dinA_t;
typedef int12 dinB_t;
typedef int22 dinC_t;
typedef int33 dinD_t;
typedef int18 dout1_t;
typedef uint13 dout2_t;
typedef int22 dout3_t;
typedef int6 dout4_t;
The results after synthesis show an improvement to the maximum clock frequency and the
latency, and a significant reduction in area of approximately 75%.
+ Timing (ns):
    * Summary:
    +---------+-------+----------+------------+
    |  Clock  | Target| Estimated| Uncertainty|
    +---------+-------+----------+------------+
    |default  |   4.00|      3.49|        0.50|
    +---------+-------+----------+------------+

+ Latency (clock cycles):
    * Summary:
    +-----+-----+-----+-----+---------+
    |  Latency  |  Interval | Pipeline|
    | min | max | min | max |   Type  |
    +-----+-----+-----+-----+---------+
    |   35|   35|   36|   36|   none  |
    +-----+-----+-----+-----+---------+

* Summary:
+-----------------+---------+-------+--------+--------+
|       Name      | BRAM_18K| DSP48E|   FF   |   LUT  |
+-----------------+---------+-------+--------+--------+
|Expression       |        -|      -|       0|      13|
|FIFO             |        -|      -|       -|       -|
|Instance         |        -|      1|    4764|    4560|
|Memory           |        -|      -|       -|       -|
|Multiplexer      |        -|      -|       -|       -|
|Register         |        -|      -|       6|       -|
+-----------------+---------+-------+--------+--------+
|Total            |        0|      1|    4770|    4573|
+-----------------+---------+-------+--------+--------+
|Available        |      650|    600|  202800|  101400|
+-----------------+---------+-------+--------+--------+
|Utilization (%)  |        0|    ~0 |       2|       4|
+-----------------+---------+-------+--------+--------+
The large difference in latency between the two designs is due to the division and
remainder operations, which take multiple cycles to complete. Using accurate data types,
rather than force fitting the design into standard C data types, results in a higher quality
FPGA implementation: the same accuracy, running faster with fewer resources.
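Before narrowing a type, it can help to confirm that the smaller width still holds every possible result. The following is a minimal sketch in standard C++ (it does not use the Vivado HLS headers) that exhaustively checks whether the product of two signed operands fits a given output width. The helper names signed_min, signed_max, fits_signed, and product_fits are hypothetical and are not part of any Xilinx library.

```cpp
#include <cstdint>

// Smallest value of a signed two's-complement integer with the given width (1..63 bits).
constexpr std::int64_t signed_min(unsigned bits) {
    return -(std::int64_t{1} << (bits - 1));
}

// Largest value of a signed two's-complement integer with the given width (1..63 bits).
constexpr std::int64_t signed_max(unsigned bits) {
    return (std::int64_t{1} << (bits - 1)) - 1;
}

// True when v is representable in a signed integer of the given width.
constexpr bool fits_signed(std::int64_t v, unsigned bits) {
    return v >= signed_min(bits) && v <= signed_max(bits);
}

// Exhaustively checks every product of an a_bits-wide and a b_bits-wide signed
// operand against an out_bits-wide signed result. Intended for small widths only,
// because the loop visits 2^(a_bits + b_bits) operand pairs.
bool product_fits(unsigned a_bits, unsigned b_bits, unsigned out_bits) {
    for (std::int64_t a = signed_min(a_bits); a <= signed_max(a_bits); ++a) {
        for (std::int64_t b = signed_min(b_bits); b <= signed_max(b_bits); ++b) {
            if (!fits_signed(a * b, out_bits)) {
                return false;
            }
        }
    }
    return true;
}
```

For the widths used above, product_fits(6, 12, 18) holds, so an 18-bit result is wide enough for a 6-bit by 12-bit signed multiplication, while 17 bits would miss the single case (-32) * (-2048) = 65536.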
Overview of Arbitrary Precision Integer Data Types
Vivado HLS provides integer and fixed-point arbitrary precision data types for C and C++,
and supports the arbitrary precision data types that are part of SystemC.
Table 1-7: Arbitrary Precision Data Types
| Language | Integer Data Type | Required Header |
|---|---|---|
| C | [u]int (1024 bits) | #include "ap_cint.h" |
| C++ | ap_[u]int (1024 bits). Can be extended to 32K bits wide. | #include "ap_int.h" |
| C++ | ap_[u]fixed | #include "ap_fixed.h" |
| System C | sc_[u]int (64 bits); sc_[u]bigint (512 bits) | #include "systemc.h" |
| System C | sc_[u]fixed | #define SC_INCLUDE_FX; [#define SC_FX_EXCLUDE_OTHER]; #include "systemc.h" |
The header files which define the arbitrary precision types are also provided with Vivado
HLS as a standalone package with the rights to use them in your own source code. The
package, xilinx_hls_lib_.tgz is provided in the include
directory in the Vivado HLS installation area. The package does not include the C arbitrary
precision types defined in ap_cint.h. These types cannot be used with standard C
compilers - only with Vivado HLS.
Arbitrary Precision Integer Types with C
For the C language, the header file ap_cint.h defines the arbitrary precision integer data
types [u]int. To use arbitrary precision integer data types in a C function:
• Add header file ap_cint.h to the source code.
• Change the bit types to intN or uintN, where N is a bit-size from 1 to 1024.
Arbitrary Precision Types with C++
For the C++ language, the header file ap_int.h defines the arbitrary precision integer
data types ap_[u]int. To use arbitrary precision integer data types in a C++ function:
• Add header file ap_int.h to the source code.
• Change the bit types to ap_int<N> or ap_uint<N>, where N is a bit-size from 1 to
  1024.
The following example shows how the header file is added and two variables implemented
to use 9-bit integer and 10-bit unsigned integer types:
#include "ap_int.h"

void foo_top (…) {
  ap_int<9> var1;     // 9-bit
  ap_uint<10> var2;   // 10-bit unsigned
The default maximum width allowed for ap_[u]int data types is 1024 bits. This default
may be overridden by defining the macro AP_INT_MAX_W with a positive integer value less
than or equal to 32768 before inclusion of the ap_int.h header file.
CAUTION! Setting the value of AP_INT_MAX_W too high may cause slow software compile and run
times.
Following is an example of overriding AP_INT_MAX_W:
#define AP_INT_MAX_W 4096 // Must be defined before next line
#include "ap_int.h"

ap_int<4096> very_wide_var;
Arbitrary Precision Types with SystemC
The arbitrary precision types used by SystemC are defined in the systemc.h header file
that is required to be included in all SystemC designs. The header file includes the SystemC
sc_int<>, sc_uint<>, sc_bigint<> and sc_biguint<> types.
Overview of Arbitrary Precision Fixed-Point Data Types
Fixed-point data types model the data as integer and fraction bits. In this example, the
Vivado HLS ap_fixed type is used to define an 18-bit variable with 6 bits representing the
value above the binary point and 12 bits representing the value below the binary point.
The variable is specified as signed, and the quantization mode is set to round to plus
infinity. Because the overflow mode is not specified, the default wrap-around mode is used
for overflow.
#include <ap_fixed.h>
...
ap_fixed<18,6,AP_RND > my_type;
...
When performing calculations where the variables have different number of bits or different
precision, the binary point is automatically aligned.
The behavior of the C++/SystemC simulations performed using fixed-point matches the
resulting hardware. This allows you to analyze the bit-accurate, quantization, and overflow
behaviors using fast C-level simulation.
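The arithmetic behind the ap_fixed<18,6,AP_RND> example above can be sketched in standard C++ without the Vivado HLS headers. This is an illustrative model only, not the ap_fixed implementation: it assumes a signed type, round to plus infinity (ties round up), and wrap-around on overflow. The function names quantize_rnd_wrap and to_real are hypothetical.

```cpp
#include <cmath>
#include <cstdint>

// Returns the raw integer that a signed W-bit fixed-point value with I integer
// bits would hold for x, assuming round-to-plus-infinity quantization and
// wrap-around overflow. Valid for 1 <= W <= 62.
std::int64_t quantize_rnd_wrap(double x, int W, int I) {
    const int frac_bits = W - I;
    // Scale so that one LSB of the fixed-point type equals 1.0.
    const double scaled = std::ldexp(x, frac_bits);
    // Round to nearest, with ties going toward plus infinity.
    const std::int64_t rounded = static_cast<std::int64_t>(std::floor(scaled + 0.5));
    // Wrap into the W-bit two's-complement range.
    const std::int64_t modulus = std::int64_t{1} << W;
    std::int64_t wrapped = ((rounded % modulus) + modulus) % modulus;
    if (wrapped >= modulus / 2) {
        wrapped -= modulus;
    }
    return wrapped;
}

// Converts the raw integer back to the real value it represents.
double to_real(std::int64_t raw, int W, int I) {
    return std::ldexp(static_cast<double>(raw), -(W - I));
}
```

With W=18 and I=6 there are 12 fraction bits, so 0.1 is stored as the raw value 410 (410/4096), and an input of 32.0 exceeds the representable range and wraps to -32.0.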
Fixed-point types are a useful replacement for floating-point types, which require many
clock cycles to complete. Unless the entire range of the floating-point type is required, the
same accuracy can often be implemented with a fixed-point type, resulting in smaller and
faster hardware.
A summary of the ap_fixed type identifiers is provided in the following table.
Table 1-8: Fixed-Point Identifier Summary
| Identifier | Description |
|---|---|
| W | Word length in bits |
| I | The number of bits used to represent the integer value (the number of bits above the binary point) |
| Q | Quantization mode. This dictates the behavior when greater precision is generated than can be defined by the smallest fractional bit in the variable used to store the result. |
| O | Overflow mode. This dictates the behavior when the result of an operation exceeds the maximum (or minimum in the case of negative numbers) value which can be stored in the result variable. |
| N | This defines the number of saturation bits in the overflow wrap modes. |

Quantization modes (identifier Q):
| SystemC Types | ap_fixed Types | Description |
|---|---|---|
| SC_RND | AP_RND | Round to plus infinity |
| SC_RND_ZERO | AP_RND_ZERO | Round to zero |
| SC_RND_MIN_INF | AP_RND_MIN_INF | Round to minus infinity |
| SC_RND_INF | AP_RND_INF | Round to infinity |
| SC_RND_CONV | AP_RND_CONV | Convergent rounding |
| SC_TRN | AP_TRN | Truncation to minus infinity (default) |
| SC_TRN_ZERO | AP_TRN_ZERO | Truncation to zero |

Overflow modes (identifier O):
| SystemC Types | ap_fixed Types | Description |
|---|---|---|
| SC_SAT | AP_SAT | Saturation |
| SC_SAT_ZERO | AP_SAT_ZERO | Saturation to zero |
| SC_SAT_SYM | AP_SAT_SYM | Symmetrical saturation |
| SC_WRAP | AP_WRAP | Wrap around (default) |
| SC_WRAP_SM | AP_WRAP_SM | Sign magnitude wrap around |
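The difference between the overflow modes listed above can be illustrated on plain integers. The following standard C++ sketch is a simplified model, not the ap_fixed implementation: it works on the raw integer value of a signed W-bit type, omits the sign-magnitude wrap mode and the N saturation bits, and does not model edge cases such as how the most negative in-range value is treated under symmetrical saturation. The enum Overflow and the function apply_overflow are hypothetical names.

```cpp
#include <cstdint>

// Simplified stand-ins for the AP_WRAP, AP_SAT, AP_SAT_ZERO and AP_SAT_SYM modes.
enum class Overflow { Wrap, Sat, SatZero, SatSym };

// Maps v into the range of a signed W-bit integer (1 <= W <= 62) according
// to the selected overflow mode. In-range values are returned unchanged.
std::int64_t apply_overflow(std::int64_t v, int W, Overflow mode) {
    const std::int64_t hi = (std::int64_t{1} << (W - 1)) - 1;
    const std::int64_t lo = -hi - 1;
    if (v >= lo && v <= hi) {
        return v;
    }
    switch (mode) {
    case Overflow::Sat:      // clamp to the maximum or minimum value
        return v > hi ? hi : lo;
    case Overflow::SatZero:  // force the result to zero on overflow
        return 0;
    case Overflow::SatSym:   // clamp symmetrically to +max or -max
        return v > hi ? hi : -hi;
    case Overflow::Wrap:     // handled below: discard the upper bits
        break;
    }
    const std::int64_t modulus = std::int64_t{1} << W;
    std::int64_t wrapped = ((v % modulus) + modulus) % modulus;
    if (wrapped > hi) {
        wrapped -= modulus;
    }
    return wrapped;
}
```

For an 8-bit signed value, an out-of-range result of 200 becomes -56 with wrap-around, 127 with saturation, and 0 with saturation to zero.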
The default maximum width allowed for ap_[u]fixed data types is 1024 bits. This default
may be overridden by defining the macro AP_INT_MAX_W with a positive integer value less
than or equal to 32768 before inclusion of the ap_int.h header file.
CAUTION! Setting the value of AP_INT_MAX_W too high may cause slow software compile and run
times.
Following is an example of overriding AP_INT_MAX_W:
#define AP_INT_MAX_W 4096 // Must be defined before next line
#include "ap_fixed.h"

ap_fixed<4096> very_wide_var;
Arbitrary precision data types are highly recommended when using Vivado HLS. As shown in
the earlier example, they typically have a significant positive benefit on the quality of the
hardware implementation. Complete details on the Vivado HLS arbitrary precision data
types are provided in Chapter 4, High-Level Synthesis Reference Guide.
Half-Precision Floating-Point Data Types
Vivado HLS provides a half-precision (16-bit) floating-point data type. This data type
provides many of the advantages of standard C float types but uses fewer hardware
resources when synthesized. The half-precision floating-point data type provides a smaller
dynamic range than the standard 32-bit float type. From the MSB to the LSB, the
half-precision floating-point data type provides the following:
• 1 sign bit
• 5 exponent bits
• 10 mantissa bits
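The bit layout listed above can be made concrete with a small standard C++ sketch that splits a raw 16-bit pattern into its three fields. It does not use hls_half.h; the struct HalfFields and the function split_half_bits are hypothetical names used for illustration only.

```cpp
#include <cstdint>

// The three fields of a half-precision value, from MSB to LSB.
struct HalfFields {
    unsigned sign;      // 1 bit  (bit 15)
    unsigned exponent;  // 5 bits (bits 14..10), stored with a bias
    unsigned mantissa;  // 10 bits (bits 9..0)
};

// Splits a raw 16-bit pattern into sign, exponent, and mantissa fields.
constexpr HalfFields split_half_bits(std::uint16_t raw) {
    return HalfFields{(raw >> 15) & 0x1u, (raw >> 10) & 0x1Fu, raw & 0x3FFu};
}
```

For example, the pattern 0x3C00, which encodes 1.0 in half precision, splits into sign 0, exponent 15, and mantissa 0.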
The following example shows how Vivado HLS uses the half-precision floating-point data
type:
// Include half-float header file
#include "hls_half.h"

// Use data-type "half"
typedef half data_t;

// Use typedef or "half" on arrays and pointers
void top( data_t in[SIZE], half &out_sum);
Vivado HLS supports the following arithmetic operations for the half-precision
floating-point data type:
• Addition
• Division
• Multiplication
• Subtraction
For details on the arithmetic operations supported for half-precision floating-point data
types, refer to Chapter 2, High-Level Synthesis C Libraries.
Often with Vivado HLS designs, unions are used to convert the raw bits from one data type
to another data type. Generally, this raw bit conversion is needed when using floating-point
values at the top-level port interface. For example:
typedef float T;
unsigned int value; // the "input" of the conversion
T myhalfvalue; // the "output" of the conversion
union
{
  unsigned int as_uint32;
  T as_floatingpoint;
} my_converter;
my_converter.as_uint32 = value;
myhalfvalue = my_converter.as_floatingpoint;
This type of code is fine for float C data types and with modification, it is also fine for
double data types. Changing the typedef and the int to short will not work for half data
types, however, because half is a class and cannot be used in a union. Instead, the following
code can be used:
typedef half T;
short value;
T myhalfvalue = static_cast<T>(value);
Similarly, the conversion the other way around uses value=static_cast<ap_uint<16>
>(myhalfvalue) or static_cast< unsigned short >(myhalfvalue).
Another method is to use the helper class fp_struct to make conversions using
the methods data() or to_int(). Use the header file hls/utils/x_hls_utils.h.
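For the float case discussed above, a raw bit conversion can also be written in plain C++ with std::memcpy, which avoids reading an inactive union member. This is a minimal sketch intended for use in a C++ test bench, assuming a 32-bit IEEE 754 float; the function names are hypothetical, and whether a given Vivado HLS release synthesizes this form is not covered here.

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret the raw bits of a 32-bit unsigned value as a float.
inline float bits_to_float(std::uint32_t value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    float result;
    std::memcpy(&result, &value, sizeof(result));
    return result;
}

// Reinterpret the raw bits of a float as a 32-bit unsigned value.
inline std::uint32_t float_to_bits(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    std::uint32_t result;
    std::memcpy(&result, &value, sizeof(result));
    return result;
}
```

The two helpers are inverses of each other, so a test bench can round-trip a value through float_to_bits and bits_to_float to check that no bits are altered.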
Managing Interfaces
In C-based design, all input and output operations are performed, in zero time, through
formal function arguments. In an RTL design these same input and output operations must
be performed through a port in the design interface and typically operate using a specific
I/O (input-output) protocol.
Vivado HLS supports two solutions for specifying the type of I/O protocol used:
•
Interface Synthesis, where the port interface is created based on efficient industry
standard interfaces.
•
Manual interface specification where the interface behavior is explicitly described in
the input source code. This allows any arbitrary I/O protocol to be used.
°
This solution is provided through SystemC designs, where the I/O control signals
are specified in the interface declaration and their behavior specified in the code.
°
Vivado HLS also supports this mode of interface specification for C and C++
designs.
Interface Synthesis
When the top-level function is synthesized, the arguments (or parameters) to the function
are synthesized into RTL ports. This process is called interface synthesis.
Interface Synthesis Overview
The following code provides a comprehensive overview of interface synthesis.
#include "sum_io.h"
dout_t sum_io(din_t in1, din_t in2, dio_t *sum) {
  dout_t temp;
  *sum = in1 + in2 + *sum;
  temp = in1 + in2;
  return temp;
}
The example above includes:
•
Two pass-by-value inputs in1 and in2.
•
A pointer sum that is both read from and written to.
•
A function return, the value of temp.
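One C-level transaction of this function can be exercised as follows. This is a minimal sketch: the header sum_io.h is not shown in this guide, so the three typedefs below are hypothetical stand-ins for the types it declares.

```cpp
// Hypothetical stand-ins for the types declared in sum_io.h.
typedef int din_t;
typedef int dout_t;
typedef int dio_t;

// Same body as the example above: sum is read and then written,
// and the function also returns a value.
dout_t sum_io(din_t in1, din_t in2, dio_t *sum) {
    dout_t temp;
    *sum = in1 + in2 + *sum;  // read-modify-write through the pointer
    temp = in1 + in2;         // value delivered through the function return
    return temp;
}
```

Calling sum_io(2, 3, &s) with s initially 10 returns 5 and leaves s equal to 15, which shows the pointer argument acting as both an input and an output within a single call.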
With the default interface synthesis settings, the design is synthesized into an RTL block
with the ports shown in the following figure.
Figure 1-37: RTL Ports After Default Interface Synthesis
Vivado HLS creates three types of ports on the RTL design:
•
Clock and Reset ports: ap_clk and ap_rst.
•
Block-Level interface protocol. These are shown expanded in the preceding figure:
ap_start, ap_done, ap_ready, and ap_idle.
•
Port Level interface protocols. These are created for each argument in the top-level
function and the function return (if the function returns a value). In this example, these
ports are: in1, in2, sum_i, sum_o, sum_o_ap_vld, and ap_return.
Clock and Reset Ports
Clock and reset ports (ap_clk and ap_rst) are added if the design takes more than 1 cycle
to complete operation.
A chip-enable port can optionally be added to the entire block using Solution > Solution
Settings > General and config_interface configuration.
The operation of the reset is controlled by the config_rtl configuration. More details on the
reset configuration are provided in Clock, Reset, and RTL Output.
Block-Level Interface Protocol
By default, a block-level interface protocol is added to the design. These signals control the
block, independently of any port-level I/O protocols. These ports control when the block
can start processing data (ap_start), indicate when it is ready to accept new inputs
(ap_ready) and indicate if the design is idle (ap_idle) or has completed operation
(ap_done).
Port-Level Interface Protocol
The final group of signals are the data ports. The I/O protocol created depends on the type
of C argument and on the default interface mode. A complete list of all possible I/O protocols
is shown in Figure 1-39. After the block-level protocol has been used to start the operation of
the block, the port-level I/O protocols are used to sequence data into and out of the block.
By default, input pass-by-value arguments and pointers are implemented as simple wire
ports with no associated handshaking signal. In the above example, the input ports are
therefore implemented without an I/O protocol, only a data port. If the port has no I/O
protocol (by default or by design), the input data must be held stable until it is read.
By default output pointers are implemented with an associated output valid signal to
indicate when the output data is valid. In the above example, the output port is
implemented with an associated output valid port (sum_o_ap_vld) which indicates when the
data on the port is valid and can be read. If there is no I/O protocol associated with the
output port, it is difficult to know when to read the data. It is always a good idea to use an
I/O protocol on an output.
Function arguments which are both read from and written to are split into separate input and
output ports. In the above example, sum is implemented as input port sum_i and output
port sum_o with associated I/O protocol port sum_o_ap_vld.
If the function has a return value, an output port ap_return is implemented to provide the
return value. When the design completes one transaction (equivalent to one execution of
the C function), the block-level protocols indicate the function is complete with the
ap_done signal. This also indicates the data on port ap_return is valid and can be read.
Note: The return value to the top-level function cannot be a pointer.
For the example code shown the timing behavior is shown in the following figure (assuming
that the target technology and clock frequency allow a single addition per clock cycle).
Figure 1-38: RTL Port Timing with Default Synthesis
•
The design starts when ap_start is asserted High.
•
The ap_idle signal is asserted Low to indicate the design is operating.
•
The input data is read at any clock after the first cycle. Vivado HLS schedules when the
reads occur. The ap_ready signal is asserted High when all inputs have been read.
•
When output sum is calculated, the associated output handshake (sum_o_ap_vld)
indicates that the data is valid.
•
When the function completes, ap_done is asserted. This also indicates that the data on
ap_return is valid.
•
Port ap_idle is asserted High to indicate that the design is waiting to start again.
Interface Synthesis and OpenCL API C
During synthesis, Vivado HLS groups all interfaces in OpenCL API C as follows:
•
All scalar interfaces and the block-level interface into a single AXI4-Lite interface
•
All arrays and pointers into a single AXI4 interface
Note: No other interface specifications are allowed for OpenCL API C kernels.
Interface Synthesis I/O Protocols
The type of interface created by interface synthesis depends on the type of C
argument, the default interface mode, and the INTERFACE optimization directive. The
following figure shows the interface protocol modes you can specify on each type of C
argument. This figure uses the following abbreviations:
•
D: Default interface mode for each type.
Note: If you specify an illegal interface, Vivado HLS issues a message and implements the
default interface mode.
•
I: Input arguments, which are only read.
•
O: Output arguments, which are only written to.
•
I/O: Input/Output arguments, which are both read and written.
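Figure 1-39: Data Type and Interface Synthesis Support
Argument types (columns): Scalar (Input, Return), Array (I, I/O, O), Pointer or Reference
(I, I/O, O), and hls::stream (I and O).
Interface modes (rows): ap_ctrl_none, ap_ctrl_hs, ap_ctrl_chain, axis, s_axilite, m_axi,
ap_none, ap_stable, ap_ack, ap_vld, ap_ovld, ap_hs, ap_memory, bram, ap_fifo, and ap_bus.
Default (D) interface modes:
Argument Type                  Default Mode
Scalar input                   ap_none
Function return                ap_ctrl_hs
Array (I, I/O, O)              ap_memory
Pointer or Reference (I)       ap_none
Pointer or Reference (I/O)     ap_ovld
Pointer or Reference (O)       ap_vld
hls::stream (I and O)          ap_fifo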
Full details on the interface protocols, including waveform diagrams, are included in
Interface Synthesis Reference in Chapter 4. The following provides an overview of each
interface mode.
Block-Level Interface Protocols
The block-level interface protocols are ap_ctrl_none, ap_ctrl_hs, and
ap_ctrl_chain. These are specified, and can only be specified, on the function or the
function return. When the directive is specified in the GUI it will apply these protocols to the
function return. Even if the function does not use a return value, the block-level protocol
may be specified on the function return.
The ap_ctrl_hs mode described in the previous example is the default protocol. The
ap_ctrl_chain protocol is similar to ap_ctrl_hs but has an additional input port
ap_continue which provides back pressure from blocks consuming the data from this
block. If the ap_continue port is logic 0 when the function completes, the block will halt
operation and the next transaction will not proceed. The next transaction will only proceed
when the ap_continue is asserted to logic 1.
The ap_ctrl_none mode implements the design without any block-level I/O protocol.
If the function return is also specified as an AXI4-Lite interface (s_axilite) all the ports in
the block-level interface are grouped into the AXI4-Lite interface. This is a common practice
when another device, such as a CPU, is used to configure and control when this block starts
and stops operation.
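As described above, a block-level protocol is specified on the function return. The following minimal sketch shows the pragma form of the INTERFACE directive with port=return; the function names and bodies are hypothetical and serve only to give the pragmas somewhere to live. A standard C++ compiler ignores the HLS pragmas, so the functions can still be run in a C test bench.

```cpp
// Hypothetical top-level function using the ap_ctrl_chain block-level
// protocol, which adds the ap_continue input described above.
int add_chain(int a, int b) {
#pragma HLS INTERFACE ap_ctrl_chain port=return
    return a + b;
}

// Hypothetical top-level function whose block-level ports are grouped
// into an AXI4-Lite interface, for example for control by a CPU.
int add_axilite(int a, int b) {
#pragma HLS INTERFACE s_axilite port=return
    return a + b;
}
```

Neither pragma changes the C behavior of the function; they only affect the ports created on the RTL block.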
Port-Level Interface Protocols: AXI4 Interfaces
The AXI4 interfaces supported by Vivado HLS include the AXI4-Stream (axis), AXI4-Lite
(s_axilite), and AXI4 master (m_axi) interfaces, which you can specify as follows:
•
AXI4-Stream interface: Specify on input arguments or output arguments only, not on
input/output arguments.
•
AXI4-Lite interface: Specify on any type of argument except arrays. You can group
multiple arguments into the same AXI4-Lite interface.
•
AXI4 master interface: Specify on arrays and pointers (and references in C++) only. You
can group multiple arguments into the same AXI4 interface.
For information on additional functionality provided by the AXI4 interface, see Using AXI4
Interfaces.
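The rules above can be illustrated with a small sketch in which a pointer argument uses an AXI4 master interface and the scalar arguments are grouped into one AXI4-Lite interface. The function name, the bundle name ctrl, and the depth value are assumptions chosen for illustration, not values taken from this guide.

```cpp
// Hypothetical top-level function: scales n elements of buf in place.
void scale_buffer(int *buf, int n, int factor) {
// Pointer argument implemented as an AXI4 master interface.
#pragma HLS INTERFACE m_axi depth=64 port=buf
// Scalar arguments and the block-level ports grouped into a single
// AXI4-Lite interface by giving them the same bundle name.
#pragma HLS INTERFACE s_axilite port=n bundle=ctrl
#pragma HLS INTERFACE s_axilite port=factor bundle=ctrl
#pragma HLS INTERFACE s_axilite port=return bundle=ctrl
    for (int i = 0; i < n; ++i) {
        buf[i] = buf[i] * factor;
    }
}
```

In a C test bench the function behaves like any other C++ function, since the pragmas are ignored by a standard compiler.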
Port-Level Interface Protocols: No I/O Protocol
The ap_none and ap_stable modes specify that no I/O protocol be added to the port.
When these modes are specified the argument is implemented as a data port with no other
associated signals. The ap_none mode is the default for scalar inputs. The ap_stable
mode is intended for configuration inputs which only change when the device is in reset
mode.
Port-Level Interface Protocols: Wire Handshakes
Interface mode ap_hs includes a two-way handshake signal with the data port. The
handshake is an industry standard valid and acknowledge handshake. Mode ap_vld is the
same but only has a valid port and ap_ack only has an acknowledge port.
Mode ap_ovld is for use with in-out arguments. When the in-out is split into separate
input and output ports, mode ap_none is applied to the input port and ap_vld applied to
the output port. This is the default for pointer arguments which are both read and written.
The ap_hs mode can be applied to arrays which are read or written in sequential order. If
Vivado HLS can determine the read or write accesses are not sequential it will halt synthesis
with an error. If the access order cannot be determined Vivado HLS will issue a warning.
Port-Level Interface Protocols: Memory Interfaces
Array arguments are implemented by default as an ap_memory interface. This is a standard
block RAM interface with data, address, chip-enable and write-enable ports.
An ap_memory interface may be implemented as a single-port or dual-port interface. If
Vivado HLS can determine that using a dual-port interface will reduce the initiation interval,
it automatically implements a dual-port interface. The RESOURCE directive is used to
specify the memory resource. If this directive is specified on the array with a single-port
block RAM, a single-port interface is implemented. Conversely, if a dual-port interface
is specified using the RESOURCE directive and Vivado HLS determines this interface
provides no benefit, it automatically implements a single-port interface.
The bram interface mode is functionally identical to the ap_memory interface. The only
difference is how the ports are implemented when the design is used in Vivado IP
Integrator:
•
An ap_memory interface is displayed as multiple and separate ports.
•
A bram interface is displayed as a single grouped port which can be connected to a
Xilinx block RAM using a single point-to-point connection.
If the array is accessed in a sequential manner, an ap_fifo interface can be used. As with
the ap_hs interface, Vivado HLS halts if it determines the data access is not sequential,
reports a warning if it cannot determine whether the access is sequential, and issues no
message if it determines the access is sequential. The ap_fifo interface can only be used for
reading or writing, not both.
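The sequential-access requirement can be illustrated with a short sketch. The function name and the array size N are hypothetical; each array is only read or only written, and every element is accessed once in index order.

```cpp
const int N = 8;  // hypothetical array size

// Both arrays are accessed strictly in index order, so the access
// pattern matches the ap_fifo interface mode requested below.
void add_offset(const int in[N], int out[N], int offset) {
#pragma HLS INTERFACE ap_fifo port=in
#pragma HLS INTERFACE ap_fifo port=out
    for (int i = 0; i < N; ++i) {
        out[i] = in[i] + offset;  // one read and one write per iteration
    }
}
```

Replacing in[i] with in[N - 1 - i] would make the reads non-sequential, which is the case in which the text above says synthesis halts.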
The ap_bus interface can communicate with a bus bridge. The interface does not adhere to
any specific bus standard but is generic enough to be used with a bus bridge that in-turn
arbitrates with the system bus. The bus bridge must be able to cache all burst writes.
Interface Synthesis and Structs
Structs on the interface are by default de-composed into their member elements and ports
are implemented separately for each member element. Each member element of the struct
will be implemented, in the absence of any INTERFACE directive, as shown in Figure 1-39.
Arrays of structs are implemented as multiple arrays, with a separate array for each member
of the struct.
The DATA_PACK optimization directive is used for packing all the elements of a struct into a
single wide vector. This allows all members of the struct to be read and written to
simultaneously. The member elements of the struct are placed into the vector in the order
they appear in the C code: the first element of the struct is aligned on the LSB of the vector
and the final element of the struct is aligned with the MSB of the vector. Any arrays in the
struct are partitioned into individual array elements and placed in the vector from lowest to
highest, in order.
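The ordering rule above can be modeled in plain C++. The following is a minimal sketch, not the Vivado HLS implementation: it uses a hypothetical struct small enough to fit in 64 bits (a 12-bit member, two 18-bit array elements, and a 6-bit member) and places the first member at the LSB.

```cpp
#include <cstdint>

// Hypothetical struct; only the low bits noted below are significant.
struct SmallData {
    std::uint16_t A;     // 12 significant bits
    std::uint32_t B[2];  // 18 significant bits per element
    std::uint8_t  C;     // 6 significant bits
};

// Pack the members into one 54-bit vector: A at the LSB, then B[0],
// then B[1], and finally C at the MSB end.
inline std::uint64_t pack_small_data(const SmallData &d) {
    std::uint64_t vec = 0;
    unsigned lsb = 0;
    vec |= static_cast<std::uint64_t>(d.A & 0xFFFu) << lsb;
    lsb += 12;
    for (int i = 0; i < 2; ++i) {  // array elements from lowest to highest
        vec |= static_cast<std::uint64_t>(d.B[i] & 0x3FFFFu) << lsb;
        lsb += 18;
    }
    vec |= static_cast<std::uint64_t>(d.C & 0x3Fu) << lsb;
    return vec;
}
```

In this model A occupies bits 11 to 0, B[0] bits 29 to 12, B[1] bits 47 to 30, and C bits 53 to 48.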
Care should be taken when using the DATA_PACK optimization on structs with large arrays.
If an array has 4096 elements of type int, this will result in a vector (and port) of width
4096*32=131072 bits. Vivado HLS can create this RTL design; however, it is very unlikely
that such a wide port can be routed during FPGA implementation.
The single wide-vector created by using the DATA_PACK directive allows more data to be
accessed in a single clock cycle. This is the case when the struct contains an array. When
data can be accessed in a single clock cycle, Vivado HLS automatically unrolls any loops
consuming this data, if doing so improves the throughput. The loop can be fully or partially
unrolled to create enough hardware to consume the additional data in a single clock cycle.
This feature is controlled using the config_unroll command and the option
tripcount_threshold. In the following example, any loops with a tripcount of less than
16 will be automatically unrolled if doing so improves the throughput.
config_unroll -tripcount_threshold 16
If a struct port using DATA_PACK is to be implemented with an AXI4 interface you may wish
to consider using the DATA_PACK byte_pad option. The byte_pad option is used to
automatically align the member elements to 8-bit boundaries. This alignment is sometimes
required by Xilinx IP. If an AXI4 port using DATA_PACK is to be implemented, refer to the
documentation for the Xilinx IP it will connect to and determine if byte alignment is
required.
For the following example code, the options for implementing a struct port are shown in
the following figure.
typedef struct{
  int12 A;
  int18 B[4];
  int6 C;
} my_data;
void foo(my_data *a );
•
By default, the members are implemented as individual ports. The array has multiple
ports (data, addr, etc.)
•
Using DATA_PACK results in a single wide port.
•
Using DATA_PACK with struct_level byte padding aligns the entire struct to the next
8-bit boundary.
•
Using DATA_PACK with field_level byte padding aligns each struct member to the
next 8-bit boundary.
Note: The maximum bit-width of any port or bus created by data packing is 8192 bits.
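The port widths that result from the options above can be checked with a short calculation. The following sketch assumes that field_level padding rounds every member, including each individual array element, up to the next multiple of 8 bits; the function names are hypothetical.

```cpp
#include <cstddef>

// Round a bit width up to the next multiple of 8.
inline unsigned round_up_to_byte(unsigned bits) {
    return ((bits + 7u) / 8u) * 8u;
}

// Width of the packed vector with no byte padding.
inline unsigned packed_width(const unsigned *widths, std::size_t n) {
    unsigned total = 0;
    for (std::size_t i = 0; i < n; ++i) total += widths[i];
    return total;
}

// struct_level: pad the whole packed struct to an 8-bit boundary.
inline unsigned struct_level_width(const unsigned *widths, std::size_t n) {
    return round_up_to_byte(packed_width(widths, n));
}

// field_level: pad each element to an 8-bit boundary before packing.
inline unsigned field_level_width(const unsigned *widths, std::size_t n) {
    unsigned total = 0;
    for (std::size_t i = 0; i < n; ++i) total += round_up_to_byte(widths[i]);
    return total;
}
```

For the my_data example (one 12-bit member, four 18-bit array elements, one 6-bit member) this gives 90 bits with no padding, 96 bits with struct_level padding, and 120 bits with field_level padding.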
Figure 1-40: DATA_PACK byte_pad Alignment Options
The figure has four panels: Struct Port Implementation, DATA_PACK optimization,
DATA_PACK optimization with byte_pad on the struct_level, and DATA_PACK optimization
with byte_pad on the field_level.
If a struct contains arrays, those arrays can be optimized using the ARRAY_PARTITION
directive to partition the array or the ARRAY_RESHAPE directive to partition the array and
re-combine the partitioned elements into a wider array. The DATA_PACK directive performs
a similar operation as ARRAY_RESHAPE and combines the reshaped array with the other
elements in the struct.
A struct cannot be optimized with DATA_PACK and then partitioned or reshaped. The
DATA_PACK, ARRAY_PARTITION and ARRAY_RESHAPE directives are mutually exclusive.
Interface Synthesis and Multi-Access Pointers
Using pointers which are accessed multiple times can introduce unexpected behavior after
synthesis. In the following example pointer d_i is read four times and pointer d_o is
written to twice: the pointers perform multiple accesses.
#include "pointer_stream_bad.h"
void pointer_stream_bad ( dout_t *d_o, din_t *d_i) {
  din_t acc = 0;
  acc += *d_i;
  acc += *d_i;
  *d_o = acc;
  acc += *d_i;
  acc += *d_i;
  *d_o = acc;
}
After synthesis, this code results in an RTL design which reads the input port once and
writes to the output port once. As with any standard C compiler, Vivado HLS optimizes
away the redundant pointer accesses. To implement the above code with the “anticipated”
4 reads on d_i and 2 writes to d_o, the pointers must be specified as volatile, as
shown in the next example.
#include "pointer_stream_better.h"
void pointer_stream_better ( volatile dout_t *d_o, volatile din_t *d_i) {
  din_t acc = 0;
  acc += *d_i;
  acc += *d_i;
  *d_o = acc;
  acc += *d_i;
  acc += *d_i;
  *d_o = acc;
}
Even this C code is problematic. Using a test bench, there is no way to supply anything but
a single value to d_i or verify any write to d_o other than the final write. Although
multi-access pointers are supported, it is highly recommended to implement the behavior
required using the hls::stream class. Details on the hls::stream class are in HLS
Stream Library in Chapter 2.
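The test bench limitation described above disappears once the data is modeled as a stream of distinct values. The following sketch uses std::queue purely as a stand-in so that it runs without Vivado HLS headers; it is not the hls::stream class, and the typedefs and function names are hypothetical.

```cpp
#include <queue>

// Hypothetical stand-ins for the types in the example headers.
typedef int din_t;
typedef int dout_t;

// Remove and return the next value from the input queue.
static din_t read_next(std::queue<din_t> &q) {
    din_t v = q.front();
    q.pop();
    return v;
}

// Same accumulate pattern as the pointer examples: four reads, two writes.
void pointer_stream_queue(std::queue<dout_t> &d_o, std::queue<din_t> &d_i) {
    din_t acc = 0;
    acc += read_next(d_i);
    acc += read_next(d_i);
    d_o.push(acc);  // first result: sum of the first two inputs
    acc += read_next(d_i);
    acc += read_next(d_i);
    d_o.push(acc);  // second result: sum of all four inputs
}
```

With the inputs 1, 2, 3, and 4 the test bench can observe both results, 3 and 10, which is not possible with the single-location pointer version.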
Specifying Interfaces
Interface synthesis is controlled by the INTERFACE directive or by using a configuration
setting. To specify the interface mode on ports, select the port in the GUI Directives tab and
right-click the mouse to open the Vivado HLS Directive Editor as shown in the following
figure.
Figure 1-41: Specifying Port Interfaces
In the Vivado HLS Directives Editor, set the following options:
•
mode
Select the interface mode from the drop-down menu.
•
register
If you select this option, all pass-by-value reads are performed in the first cycle of
operation. For output ports, the register option guarantees the output is registered. You
can apply the register option to any function in the design. For memory, FIFO, and AXI4
interfaces, the register option has no effect.
•
depth
This option specifies how many samples are provided to the design by the test bench
and how many output values the test bench must store. Use whichever number is
greater.
Note: For cases in which a pointer is read from or written to multiple times within a single
transaction, the depth option is required for C/RTL co-simulation. The depth option is not
required for arrays or when using the hls::stream construct. It is only required when using
pointers on the interface.
If the depth option is set too small, the C/RTL co-simulation might deadlock as follows:
°
The input reads might stall waiting for data that the test bench cannot provide.
°
The output writes might stall when trying to write data, because the storage is full.
•
port
This option is required. By default, Vivado HLS does not register ports.
Note: To specify a block-level I/O protocol, select the top-level function in the Vivado HLS GUI,
and specify the port as the function return.
•
offset
This option is used for AXI4 interfaces. For information, see Using AXI4 Interfaces.
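The same options can be given in the source code with the INTERFACE pragma. The following is a minimal sketch under assumptions: the function name and typedefs are hypothetical, the ap_fifo mode is chosen only for illustration, and the depth values simply match the four reads and two writes performed in one transaction.

```cpp
// Hypothetical stand-ins for the interface data types.
typedef int din_t;
typedef int dout_t;

// Multi-access pointers: d_i is read four times and d_o is written twice
// per transaction, so the depth option is given for C/RTL co-simulation.
void pointer_stream_depth(volatile dout_t *d_o, volatile din_t *d_i) {
#pragma HLS INTERFACE ap_fifo depth=4 port=d_i
#pragma HLS INTERFACE ap_fifo depth=2 port=d_o
    din_t acc = 0;
    acc += *d_i;
    acc += *d_i;
    *d_o = acc;
    acc += *d_i;
    acc += *d_i;
    *d_o = acc;
}
```

In a C test bench the pointer refers to a single location, so the same input value is read four times and only the final write remains visible.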
To set the interface configuration, select Solution > Solution Settings > General >
config_interface. You can use configuration settings to:
•
Add a global clock enable to the RTL design.
•
Remove dangling ports, such as those created by elements of a struct that are not used
in the design.
•
Create RTL ports for any global variables.
Any C function can use global variables: those variables defined outside the scope of any
function. By default, global variables do not result in the creation of RTL ports: Vivado HLS
assumes the global variable is inside the final design. The config_interface
configuration setting expose_global instructs Vivado HLS to create ports for global
variables. For more information on the synthesis of global variables, see Global Variables in
Chapter 3.
Interface Synthesis for SystemC
In general, interface synthesis is not supported for SystemC designs. The I/O ports for
SystemC designs are fully specified in the SC_MODULE interface and the behavior of the
ports fully described in the source code. Interface synthesis is provided to support:
•
Memory block RAM interfaces
•
AXI4-Stream interfaces
•
AXI4-Lite interfaces
•
AXI4 master interfaces
The process for performing interface synthesis on a SystemC design is different from
adding the same interfaces to C or C++ designs.
•
Memory block RAM and AXI4 master interfaces require that the SystemC data port be
replaced with a Vivado HLS port.
•
AXI4-Stream and AXI4-Lite slave interfaces only require directives but there is a
different process for adding directives to a SystemC design.
Applying Interface Directives with SystemC
When adding directives as pragmas to SystemC source code, the pragma directives cannot
be added where the ports are specified in the SC_MODULE declaration. They must be added
inside a function called by the SC_MODULE.
When adding directives using the GUI:
•
Open the C source code and directives tab.
•
Select the function which requires a directive.
•
Right-click with the mouse and add the INTERFACE directive to the function.
The directives can be applied to any member function of the SC_MODULE, however it is a
good design practice to add them to the function where the variables are used.
Block RAM Memory Ports
Given a SystemC design with an array port on the interface:
SC_MODULE(my_design) {
// "RAM" Port
sc_uint<20> my_array[256];
…
The port my_array is synthesized into an internal block RAM, not a block RAM interface
port.
Including the Vivado HLS header file ap_mem_if.h allows the same port to be specified as
an ap_mem_port port. The ap_mem_port data type is
synthesized into a standard block RAM interface with the specified data and address
bus-widths and using the ap_memory port protocol.
#include "ap_mem_if.h"
SC_MODULE(my_design) {
// "RAM" Port
ap_mem_port<sc_uint<20>, sc_uint<8>, 256> my_array;
…
When an ap_mem_port is added to a SystemC design, an associated ap_mem_chn must be
added to the SystemC test bench to drive the ap_mem_port. In the test bench, an
ap_mem_chn is defined and attached to the instance as shown:
#include "ap_mem_if.h"
ap_mem_chn bus_mem;
…
// Instantiate the top-level module
my_design U_dut ("U_dut");
U_dut.my_array.bind(bus_mem);
…
The header file ap_mem_if.h is located in the include directory located in the Vivado HLS
installation area and must be included if simulation is performed outside Vivado HLS.
SystemC AXI4-Stream Interface
An AXI4-Stream interface can be added to any SystemC ports that are of the sc_fifo_in
or sc_fifo_out type. The following shows the top-level of a typical SystemC design. As is
typical, the SC_MODULE and ports are defined in a header file:
SC_MODULE(sc_FIFO_port)
{
//Ports
sc_in<bool> clock;
sc_in<bool> reset;
sc_in<bool> start;
sc_out<bool> done;
sc_fifo_out<int> dout;
sc_fifo_in<int> din;
//Variables
int share_mem[100];
bool write_done;
//Process Declaration
void Prc1();
void Prc2();
//Constructor
SC_CTOR(sc_FIFO_port)
{
//Process Registration
SC_CTHREAD(Prc1,clock.pos());
reset_signal_is(reset,true);
SC_CTHREAD(Prc2,clock.pos());
reset_signal_is(reset,true);
}
};
To create an AXI4-Stream interface, the RESOURCE directive must be used to specify that the
ports are connected to an AXI4-Stream resource. For the example interface shown above, the
directives are shown added in the function called by the SC_MODULE: ports din and dout
are specified to have an AXI4-Stream resource.
#include "sc_FIFO_port.h"
void sc_FIFO_port::Prc1()
{
//Initialization
write_done = false;
wait();
while(true)
{
while (!start.read()) wait();
write_done = false;
for(int i=0;i<100; i++)
share_mem[i] = i;
write_done = true;
wait();
} //end of while(true)
}
void sc_FIFO_port::Prc2()
{
#pragma HLS resource core=AXI4Stream variable=din
#pragma HLS resource core=AXI4Stream variable=dout
//Initialization
done = false;
wait();
while(true)
{
while (!start.read()) wait();
wait();
while (!write_done) wait();
for(int i=0;i<100; i++)
{
dout.write(share_mem[i]+din.read());
}
done = true;
wait();
} //end of while(true)
}
When the SystemC design is synthesized, it results in an RTL design with standard RTL FIFO
ports. When the design is packaged as IP using the Export RTL toolbar button, the
output is a design with an AXI4-Stream interface.
SystemC AXI4-Lite Interface
An AXI4-Lite slave interface can be added to any SystemC ports of type sc_in or sc_out.
The following example shows the top-level of a typical SystemC design. In this case, as is
typical, the SC_MODULE and ports are defined in a header file:
SC_MODULE(sc_sequ_cthread){
//Ports
sc_in<bool> clk;
sc_in<bool> reset;
sc_in<bool> start;
sc_in<sc_uint<16> > a;
sc_in<bool> en;
sc_out<sc_uint<16> > sum;
sc_out<bool> vld;
//Variables
sc_uint<16> acc;
//Process Declaration
void accum();
//Constructor
SC_CTOR(sc_sequ_cthread){
//Process Registration
SC_CTHREAD(accum,clk.pos());
reset_signal_is(reset,true);
}
};
To create an AXI4-Lite interface the RESOURCE directive must be used to specify the ports
are connected to an AXI4-Lite resource. For the example interface shown above, the
following example shows how ports start, a, en, sum and vld are grouped into the same
AXI4-Lite interface slv0: all the ports are specified with the same bus_bundle name and
are grouped into the same AXI4-Lite interface.
#include "sc_sequ_cthread.h"
void sc_sequ_cthread::accum(){
//Group ports into AXI4 slave slv0
#pragma HLS resource core=AXI4LiteS metadata="-bus_bundle slv0" variable=start
#pragma HLS resource core=AXI4LiteS metadata="-bus_bundle slv0" variable=a
#pragma HLS resource core=AXI4LiteS metadata="-bus_bundle slv0" variable=en
#pragma HLS resource core=AXI4LiteS metadata="-bus_bundle slv0" variable=sum
#pragma HLS resource core=AXI4LiteS metadata="-bus_bundle slv0" variable=vld
//Initialization
acc=0;
sum.write(0);
vld.write(false);
wait();
// Process the data
while(true) {
// Wait for start
wait();
while (!start.read()) wait();
// Read if valid input available
if (en) {
acc = acc + a.read();
sum.write(acc);
vld.write(true);
} else {
vld.write(false);
}
}
}
When the SystemC design is synthesized, it results in an RTL design with standard RTL ports.
When the design is packaged as IP using the Export RTL toolbar button, the output is a
design with an AXI4-Lite interface.
SystemC AXI4 Master Interface
In most standard SystemC designs, you have no need to specify a port with the behavior of
the Vivado HLS ap_bus I/O protocol. However, if the design requires an AXI4 master bus
interface the ap_bus I/O protocol is required.
To specify an AXI4 master interface on a SystemC design:
•
Use the Vivado HLS type AXI4M_bus_port to create an interface with the ap_bus I/O
protocol.
•
Assign an AXI4M resource to the port.
The following example shows how an AXI4M_bus_port called bus_if is added to a
SystemC design.
•
The header file AXI4_if.h must be added to the design.
•
The port is defined as AXI4M_bus_port<type>, where type specifies the data type to
be used (in this example, an sc_fixed type is used).
Note: The width of the data type used in the AXI4M_bus_port must be a multiple of 8 bits.
In addition, structs are not supported for this data type.
#include "systemc.h"
#include "AXI4_if.h"
#include "tlm.h"
using namespace tlm;
#define DT sc_fixed<32, 8>
SC_MODULE(dut)
{
//Ports
sc_in<bool> clock; //clock input
sc_in<bool> reset;
sc_in<bool> start;
sc_out<int> dout;
AXI4M_bus_port<sc_fixed<32, 8> > bus_if;
//Variables
//Constructor
SC_CTOR(dut)
//:bus_if ("bus_if")
{
//Process Registration
SC_CTHREAD(P1,clock.pos());
reset_signal_is(reset,true);
}
}
The following shows how the variable bus_if can be accessed in the SystemC function to
produce standard or burst read and write operations.
//Process Declaration
void P1() {
//Initialization
dout.write(10);
int addr = 10;
DT tmp[10];
wait();
while(1) {
tmp[0]=10;
tmp[1]=11;
tmp[2]=12;
while (!start.read()) wait();
// Port read
tmp[0] = bus_if->read(addr);
// Port burst read
bus_if->burst_read(addr,2,tmp);
// Port write
bus_if->write(addr, tmp);
// Port burst write
bus_if->burst_write(addr,2,tmp);
dout.write(tmp[0].to_int());
addr+=2;
wait();
}
}
When the port class AXI4M_bus_port is used in a design, it must have a matching HLS bus
interface channel hls_bus_chn in the test bench, as shown in the
following example:
#include <systemc.h>
#include "tlm.h"
using namespace tlm;
#include "hls_bus_if.h"
#include "AE_clock.h"
#include "driver.h"
#ifdef __RTL_SIMULATION__
#include "dut_rtl_wrapper.h"
#define dut dut_rtl_wrapper
#else
#include "dut.h"
#endif
int sc_main (int argc , char *argv[])
{
sc_report_handler::set_actions("/IEEE_Std_1666/deprecated", SC_DO_NOTHING);
sc_report_handler::set_actions( SC_ID_LOGIC_X_TO_BOOL_, SC_LOG);
sc_report_handler::set_actions( SC_ID_VECTOR_CONTAINS_LOGIC_VALUE_, SC_LOG);
sc_report_handler::set_actions( SC_ID_OBJECT_EXISTS_, SC_LOG);
// hls_bus_chan
// bus_variable(“name”, start_addr, end_addr)
//
hls_bus_chn > bus_mem("bus_mem",0,1024);
sc_signal
sc_signal
sc_signal
sc_signal
AE_Clock
dut
driver
s_clk;
reset;
start;
dout;
U_AE_Clock("U_AE_Clock", 10);
U_dut("U_dut");
U_driver("U_driver");
U_AE_Clock.reset(reset);
U_AE_Clock.clk(s_clk);
U_dut.clock(s_clk);
U_dut.reset(reset);
U_dut.start(start);
U_dut.dout(dout);
U_dut.bus_if(bus_mem);
U_driver.clk(s_clk);
U_driver.start(start);
U_driver.dout(dout);
int end_time = 8000;
cout << "INFO: Simulating " << endl;
// start simulation
sc_start(end_time, SC_NS);
return U_driver.ret;
};
When the AXI4M_bus_port class is used, the synthesized RTL design contains an interface
with the ap_bus I/O protocol. When the design is packaged as IP using Export RTL, the
output is a design with an AXI4 master port.
Specifying Manual Interface
You can use Vivado HLS to identify blocks of code that define a specific I/O protocol. This
allows you to specify an I/O protocol using a directive instead of using Interface Synthesis
or SystemC.
Note: You can also specify an I/O protocol with SystemC designs to provide greater I/O control.
The following examples show the requirements and advantages of manual interface
specifications. In the first code example, the following occurs:
1. Input response[0] is read.
2. Output request is written.
3. Input response[1] is read.
void test (
  int          *z1,
  int           a,
  int           b,
  int          *mode,
  volatile int *request,
  volatile int  response[2],
  int          *z2
) {
  int read1, read2;
  int opcode;
  int i;

  P1: {
    read1    = response[0];
    opcode   = 5;
    *request = opcode;
    read2    = response[1];
  }
  C1: {
    *z1 = a + b;
    *z2 = read1 + read2;
  }
}
When Vivado HLS implements this code, the write to request does not need to occur
between the two reads on response. The code uses this I/O behavior, but no dependencies
in the code enforce it. Vivado HLS might schedule the I/O accesses using the same access
pattern as the C code or use a different access pattern.
If there is an external requirement that the I/O accesses must occur in this order, you can
use a protocol block to enforce a specific I/O protocol behavior. Because the accesses occur
in the scope defined by block P1, you can apply an I/O protocol as follows:
1. Include the ap_utils.h header file that defines ap_wait().
2. Place an ap_wait() statement after the write to request but before the read on
response[1].
Note: The ap_wait() statement does not alter the behavior of the C simulation. It instructs
Vivado HLS to insert a clock between the I/O accesses during synthesis.
The modified code now contains the header file and ap_wait() statements:
#include "ap_utils.h" // Added include file

void test (
  int          *z1,
  int           a,
  int           b,
  int          *mode,
  volatile int *request,
  volatile int  response[2],
  int          *z2
) {
  int read1, read2;
  int opcode;
  int i;

  P1: {
    read1    = response[0];
    opcode   = 5;
    ap_wait();// Added ap_wait statement
    *request = opcode;
    read2    = response[1];
  }
  C1: {
    *z1 = a + b;
    *z2 = read1 + read2;
  }
}
3. Specify that block P1 is a protocol region using the PROTOCOL directive:
set_directive_protocol test P1 -mode floating
This instructs Vivado HLS to schedule the code within this region as is. There is no
reordering of the I/O or ap_wait() statements.
This results in the following exact I/O behavior specified in the code:
1. Input response[0] is read.
2. Output request is written.
3. Input response[1] is read.
Note: If allowed by data dependencies, the -mode floating option allows other code to execute in
parallel with this block. The -mode fixed option prevents this.
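The protocol region can also be marked with a pragma inside the block instead of the Tcl directive. The following is a minimal sketch under stated assumptions, not the reference example: the function name protocol_demo is hypothetical, the unused arguments are dropped, ap_wait() is replaced by an empty stand-in so the code compiles without the Vivado HLS headers (a real design includes ap_utils.h instead), and the ap_wait() call is placed as described in step 2. The pragma spelling mirrors the #pragma HLS PROTOCOL fixed form used elsewhere in this section.

```cpp
// Stand-in for the ap_wait() provided by ap_utils.h (assumption: defined here
// only so the sketch compiles outside Vivado HLS). It does nothing in C simulation.
static inline void ap_wait() {}

// Hypothetical reduced version of the test() function shown above.
void protocol_demo(int *z1, int a, int b,
                   volatile int *request, volatile int response[2], int *z2) {
  int read1, read2;
  int opcode;

  P1: {
#pragma HLS PROTOCOL floating
    read1    = response[0];   // 1. input response[0] is read
    opcode   = 5;
    *request = opcode;        // 2. output request is written
    ap_wait();                // clock boundary before the second read
    read2    = response[1];   // 3. input response[1] is read
  }
  C1: {
    *z1 = a + b;
    *z2 = read1 + read2;
  }
}
```

Because the stand-in ap_wait() is empty, the C simulation result is the same with or without it; only the synthesized schedule is affected.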
Use the following guidelines when manually specifying I/O protocols:
• Do not use an I/O protocol on the ports used in a manual interface. Explicitly set all
ports to I/O protocol ap_none to ensure interface synthesis does not add any
additional protocol signals.
• You must specify all the control signals used in a manually specified interface in the C
code with the volatile type qualifier. These signals typically change value multiple times
within the function (for example, typically set to 0, then 1, then back to 0). Without
the volatile qualifier, Vivado HLS follows standard C semantics and optimizes out all
intermediate operations, leaving only the first read and final write.
• Use the volatile qualifier to specify data signals with values that will be updated
multiple times.
• If multiple clocks are required, use ap_wait_n() to specify multiple cycles.
Do not use multiple ap_wait() statements.
• Group signals that need to change in the same clock cycle using the latency directive.
For example:
{
#pragma HLS PROTOCOL fixed
  // A protocol block may span multiple clock cycles.
  // To ensure both these signals are scheduled in the exact same clock cycle,
  // create a region { } with a latency = 0.
  {
#pragma HLS LATENCY max=0 min=0
    *data = 0xFF;
    *data_vld = 1;
  }
  ap_wait_n(2);
}
Using AXI4 Interfaces
AXI4-Stream Interfaces
An AXI4-Stream interface can be applied to any input argument and any array or pointer
output argument. Because an AXI4-Stream interface transfers data in a sequential streaming
manner, it cannot be used with arguments that are both read and written. The data on an
AXI4-Stream interface is always sign-extended to the next byte. For example, a 12-bit data
value is sign-extended to 16 bits.
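The byte rounding and sign extension described above can be illustrated with a small host-side model. This is only a sketch of the arithmetic; the helper names axis_padded_width and sign_extend are hypothetical and are not part of any Vivado HLS header.

```cpp
#include <cassert>
#include <cstdint>

// Width of the data port after rounding up to a whole number of bytes.
constexpr int axis_padded_width(int bits) {
  return ((bits + 7) / 8) * 8;
}

// Sign-extend the low `bits` bits of `raw` (1 <= bits <= 32) to a 32-bit integer.
inline int32_t sign_extend(uint32_t raw, int bits) {
  assert(bits >= 1 && bits <= 32);
  if (bits == 32) return static_cast<int32_t>(raw);
  const uint32_t mask = (1u << bits) - 1u;
  const uint32_t sign = 1u << (bits - 1);
  raw &= mask;
  // Flip the sign bit, then subtract its weight to restore the signed value.
  return static_cast<int32_t>(raw ^ sign) - static_cast<int32_t>(sign);
}
```

For a 12-bit value, axis_padded_width(12) gives 16, and the raw pattern 0x800 (the most negative 12-bit value) becomes 0xF800 when the sign-extended result is viewed as 16 bits.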
AXI4-Stream interfaces are always implemented as registered interfaces to ensure no
combinational feedback paths are created when multiple HLS IP blocks with AXI4-Stream
interfaces are integrated into a larger design. For AXI4-Stream interfaces, four
register modes are provided to control how the AXI4-Stream interface registers are
implemented.
• Forward: Only the TDATA and TVALID signals are registered.
• Reverse: Only the TREADY signal is registered.
• Both: All signals (TDATA, TREADY and TVALID) are registered. This is the default.
• Off: None of the port signals are registered.
The AXI4-Stream side-channel signals, discussed later in AXI4-Stream Interfaces with
Side-Channels, are considered to be data signals and are registered whenever TDATA is
registered.
RECOMMENDED: When connecting HLS-generated IP blocks with AXI4-Stream interfaces, at
least one interface should be implemented as a registered interface, or the blocks should be
connected via an AXI4-Stream Register Slice.
There are two basic ways to use an AXI4-Stream in your design.
• Use an AXI4-Stream without side-channels.
• Use an AXI4-Stream with side-channels.
This second use model provides additional functionality, allowing the optional
side-channels, which are part of the AXI4-Stream standard, to be used directly in the C code.
AXI4-Stream Interfaces without Side-Channels
An AXI4-Stream is used without side-channels when the function argument does not
contain any AXI4 side-channel elements. The following example shows a design where the
data type is a standard C int type. In this example, both interfaces are implemented using
an AXI4-Stream.
void example(int A[50], int B[50]) {
  //Set the HLS native interface types
#pragma HLS INTERFACE axis port=A
#pragma HLS INTERFACE axis port=B
  int i;
  for(i = 0; i < 50; i++){
    B[i] = A[i] + 5;
  }
}
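A C test bench for this design can be sketched as follows. The design function is repeated so that the block is self-contained; the helper name check_example and the stimulus values are assumptions chosen for illustration. An ordinary C++ compiler ignores the HLS pragmas, so the same source also builds outside Vivado HLS.

```cpp
// Design under test, repeated from the example above.
void example(int A[50], int B[50]) {
#pragma HLS INTERFACE axis port=A
#pragma HLS INTERFACE axis port=B
  for (int i = 0; i < 50; i++) {
    B[i] = A[i] + 5;
  }
}

// Hypothetical test bench helper: returns the number of mismatching outputs.
int check_example() {
  int in[50], out[50];
  for (int i = 0; i < 50; i++) {
    in[i]  = i * 3 - 20;   // arbitrary stimulus, including negative values
    out[i] = 0;
  }
  example(in, out);
  int errors = 0;
  for (int i = 0; i < 50; i++) {
    if (out[i] != in[i] + 5) errors++;
  }
  return errors;
}
```

In a test bench, main() can return the value of check_example(), so that a non-zero return value reports a failure.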
After synthesis, both arguments are implemented with a data port and the standard
AXI4-Stream TVALID and TREADY protocol ports as shown in the following figure.
Figure 1-42: AXI4-Stream Interfaces Without Side-Channels
Multiple variables can be combined into the same AXI4-Stream interface by using a struct
and the DATA_PACK directive. If an argument to the top-level function is a struct, Vivado
HLS by default partitions the struct into separate elements and implements each member of
the struct as a separate port. However, the DATA_PACK directive may be used to pack the
elements of a struct into a single wide-vector, allowing all elements of the struct to be
implemented in the same AXI4-Stream interface. Complete details on packing structs and
using the byte padding option to align the data fields in the wide-vector are provided in
Interface Synthesis and Structs.
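The effect of packing a struct into a single wide vector can be modeled in software. The sketch below is illustrative only: the struct pixel_t and both helper functions are hypothetical, and the layout (first struct member in the least significant bits, no byte padding) is an assumption that should be checked against the description in Interface Synthesis and Structs.

```cpp
#include <cstdint>

// Hypothetical struct: 8 + 8 + 16 = 32 bits in total.
struct pixel_t {
  uint8_t  r;
  uint8_t  g;
  uint16_t depth;
};

// Software model of the packed vector (assumed layout: first member in the
// least significant bits, last member in the most significant bits).
inline uint32_t pack_pixel(const pixel_t &p) {
  return  static_cast<uint32_t>(p.r)
       | (static_cast<uint32_t>(p.g)     << 8)
       | (static_cast<uint32_t>(p.depth) << 16);
}

// Inverse of pack_pixel: split the wide vector back into struct members.
inline pixel_t unpack_pixel(uint32_t word) {
  pixel_t p;
  p.r     = static_cast<uint8_t>(word & 0xFFu);
  p.g     = static_cast<uint8_t>((word >> 8) & 0xFFu);
  p.depth = static_cast<uint16_t>(word >> 16);
  return p;
}
```

A model like this can be useful in a test bench that needs to build or decode the wide data words seen on the packed interface.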
AXI4-Stream Interfaces with Side-Channels
Side-channels are optional signals which are part of the AXI4-Stream standard. The
side-channel signals may be directly referenced and controlled in the C code using a struct,
provided the member elements of the struct match the names of the AXI4-Stream
side-channel signals. The AXI-Stream side-channel signals are considered data signals and
are registered whenever TDATA is registered. An example of this is provided with Vivado
HLS. The Vivado HLS include directory contains the file ap_axi_sdata.h. This header
file contains the following structs:
#include "ap_int.h"

template<int D,int U,int TI,int TD>
struct ap_axis{
  ap_int<D>    data;
  ap_uint<D/8> keep;
  ap_uint<D/8> strb;
  ap_uint<U>   user;
  ap_uint<1>   last;
  ap_uint<TI>  id;
  ap_uint<TD>  dest;
};

template<int D,int U,int TI,int TD>
struct ap_axiu{
  ap_uint<D>   data;
  ap_uint<D/8> keep;
  ap_uint<D/8> strb;
  ap_uint<U>   user;
  ap_uint<1>   last;
  ap_uint<TI>  id;
  ap_uint<TD>  dest;
};
Both structs contain, as top-level members, variables whose names match those of the
optional AXI4-Stream side-channel signals. Provided the struct contains elements with
these names, there is no requirement to use the header file provided. You can create your
own user-defined structs. Since the structs shown above use ap_int types and templates,
this header file is only for use in C++ designs.
Note: The valid and ready signals are mandatory signals in an AXI4-Stream and will always be
implemented by Vivado HLS. These cannot be controlled using a struct.
The following example shows how the side-channels can be used directly in the C code and
implemented on the interface. In this example a signed 32-bit data type is used.
#include "ap_axi_sdata.h"

void example(ap_axis<32,2,5,6> A[50], ap_axis<32,2,5,6> B[50]){
  //Map ports to Vivado HLS interfaces
#pragma HLS INTERFACE axis port=A
#pragma HLS INTERFACE axis port=B
  int i;
  for(i = 0; i < 50; i++){
    B[i].data = A[i].data.to_int() + 5;
    B[i].keep = A[i].keep;
    B[i].strb = A[i].strb;
    B[i].user = A[i].user;
    B[i].last = A[i].last;
    B[i].id = A[i].id;
    B[i].dest = A[i].dest;
  }
}
After synthesis, both arguments are implemented with data ports, the standard
AXI4-Stream TVALID and TREADY protocol ports and all of the optional ports described in
the struct.
Figure 1-43: AXI4-Stream Interfaces With Side-Channels
Packing Structs into AXI4-Stream Interfaces
There is a difference in the default synthesis behavior when using structs with AXI4-Stream
interfaces. The default synthesis behavior for struct is described in Interface Synthesis and
Structs in Chapter 1.
When using AXI4-Stream interfaces without side-channels and the function argument is a
struct:
• Vivado HLS automatically applies the DATA_PACK directive and all elements of the
struct are combined into a single wide-data vector. The interface is implemented as a
single wide-data vector with associated TVALID and TREADY signals.
• If the DATA_PACK directive is manually applied to the struct, all elements of the struct
are combined into a single wide-data vector and the AXI alignment options to the
DATA_PACK directive may be applied. The interface is implemented as a single
wide-data vector with associated TVALID and TREADY signals.
When using AXI4-Stream interfaces with side-channels, the function argument is itself a
struct (AXI-Stream struct). It can contain data which is itself a struct (data struct) along with
the side-channels:
• Vivado HLS automatically applies the DATA_PACK directive to the data struct and all
elements of the data struct are combined into a single wide-data vector. The interface is
implemented as a single wide-data vector with associated side-channels, TVALID and
TREADY signals.
• If the DATA_PACK directive is manually applied to the data struct, all elements of the
data struct are combined into a single wide-data vector and the AXI alignment options
to the DATA_PACK directive may be applied. The interface is implemented as a single
wide-data vector with associated side-channels, TVALID and TREADY signals.
• If the DATA_PACK directive is applied to the AXI-Stream struct (the function argument),
the data struct and the side-channel signals are combined into a single wide-vector. The
interface is implemented as a single wide-data vector with TVALID and TREADY
signals.
AXI4-Lite Interface
You can use an AXI4-Lite interface to allow the design to be controlled by a CPU or
microcontroller. Using the Vivado HLS AXI4-Lite interface, you can:
• Group multiple ports into the same AXI4-Lite interface.
• Output C driver files for use with the code running on a processor.
Note: This provides a set of C application program interface (API) functions, which allows you to
easily control the hardware from the software. This is useful when the design is exported to the
IP Catalog.
The following example shows how Vivado HLS implements multiple arguments, including
the function return, as an AXI4-Lite interface. Because each directive uses the same name
for the bundle option, each of the ports is grouped into the same AXI4-Lite interface.
void example(char *a, char *b, char *c)
{
#pragma HLS INTERFACE s_axilite port=return bundle=BUS_A
#pragma HLS INTERFACE s_axilite port=a      bundle=BUS_A
#pragma HLS INTERFACE s_axilite port=b      bundle=BUS_A
#pragma HLS INTERFACE s_axilite port=c      bundle=BUS_A offset=0x0400
#pragma HLS INTERFACE ap_vld    port=b

  *c += *a + *b;
}
Note: If you do not use the bundle option, Vivado HLS groups all arguments specified with an
AXI4-Lite interface into the same default bundle and automatically names the port.
You can also assign an I/O protocol to ports grouped into an AXI4-Lite interface. In the
example above, Vivado HLS implements port b as an ap_vld interface and groups port b
into the AXI4-Lite interface. As a result, the AXI4-Lite interface contains a register for the
port b data, a register for the output to acknowledge that port b was read, and a register
for the port b input valid signal.
Each time port b is read, Vivado HLS automatically clears the input valid register and resets
the register to logic 0. If the input valid register is not set to logic 1, the data in the b data
register is not considered valid, and the design stalls and waits for the valid register to be
set.
RECOMMENDED: For ease of use during the operation of the design, Xilinx recommends that you do not
include additional I/O protocols in the ports grouped into an AXI4-Lite interface. However, Xilinx
recommends that you include the block-level I/O protocol associated with the return port in the
AXI4-Lite interface.
You cannot assign arrays to an AXI4-Lite interface using the bram interface. You can only
assign arrays to an AXI4-Lite interface using the default ap_memory interface. You also
cannot assign any argument specified with ap_stable I/O protocol to an AXI4-Lite
interface.
Since the variables grouped into an AXI-Lite interface are function arguments, which
themselves cannot be assigned a default value in the C code, none of the registers in an
AXI-Lite interface may be assigned a default value. The registers can be implemented with
a reset with the config_rtl command, but they cannot be assigned any other default
value.
By default, Vivado HLS automatically assigns the address for each port that is grouped into
an AXI4-Lite interface. Vivado HLS provides the assigned addresses in the C driver files. For
more information, see C Driver Files. To explicitly define the address, you can use the
offset option, as shown for argument c in the example above.
IMPORTANT: In an AXI4-Lite interface, Vivado HLS reserves addresses 0x0000 through 0x000C for the
block-level I/O protocol signals and interrupt controls.
After synthesis, Vivado HLS implements the ports in the AXI4-Lite port, as shown in the
following figure. Vivado HLS creates the interrupt port by including the function return in
the AXI4-Lite interface. You can program the interrupt through the AXI4-Lite interface. You
can also drive the interrupt from the following block-level protocols:
• ap_done: Indicates when the function completes all operations.
• ap_ready: Indicates when the function is ready for new input data.
You can program the interface using the C driver files.
Figure 1-44: AXI4-Lite Slave Interfaces with Grouped RTL Ports
Control Clock and Reset in AXI4-Lite Interfaces
By default, Vivado HLS uses the same clock for the AXI4-Lite interface and the synthesized
design. Vivado HLS connects all registers in the AXI4-Lite interface to the clock used for the
synthesized logic (ap_clk).
Optionally, you can use the INTERFACE directive clock option to specify a separate clock
for each AXI4-Lite port. When connecting the clock to the AXI4-Lite interface, the
following requirements apply:
• The AXI4-Lite interface clock must be synchronous to the clock used for the synthesized
logic (ap_clk). That is, both clocks must be derived from the same master generator
clock.
• The AXI4-Lite interface clock frequency must be equal to or less than the frequency of the
clock used for the synthesized logic (ap_clk).
If you use the clock option with the interface directive, you only need to specify the
clock option on one function argument in each bundle. Vivado HLS implements all other
function arguments in the bundle with the same clock and reset. Vivado HLS names the
generated reset signal with the prefix ap_rst_ followed by the clock name. The generated
reset signal is active Low independent of the config_rtl command. For more
information, see Controlling the Reset Behavior.
The following example shows how Vivado HLS groups function arguments a and b into an
AXI4-Lite port with a clock named AXI_clk1 and an associated reset port.
// Default AXI-Lite interface implemented with independent clock called AXI_clk1
#pragma HLS interface s_axilite port=a clock=AXI_clk1
#pragma HLS interface s_axilite port=b
In the following example, Vivado HLS groups function arguments c and d into AXI4-Lite
port CTRL1 with a separate clock called AXI_clk2 and an associated reset port.
// CTRL1 AXI-Lite bundle implemented with a separate clock (called AXI_clk2)
#pragma HLS interface s_axilite port=c bundle=CTRL1 clock=AXI_clk2
#pragma HLS interface s_axilite port=d bundle=CTRL1
C Driver Files
When an AXI4-Lite slave interface is implemented, a set of C driver files are automatically
created. These C driver files provide a set of APIs that can be integrated into any software
running on a CPU and used to communicate with the device via the AXI4-Lite slave
interface.
The C driver files are created when the design is packaged as IP in the IP Catalog. For more
details on packaging IP, see Exporting the RTL Design.
Driver files are created for standalone and Linux modes. In standalone mode the drivers are
used in the same way as any other Xilinx standalone drivers. In Linux mode, copy all the C
files (.c) and header files (.h) into the software project.
The driver files and API functions derive their name from the top-level function for
synthesis. In the above example, the top-level function is called “example”. If the top-level
function was named “DUT” the name “example” would be replaced by “DUT” in the
following description. The driver files are created in the packaged IP (located in the impl
directory inside the solution).
Table 1-9: C Driver Files for a Design Named example
File Path              Usage Mode   Description
data/example.mdd       Standalone   Driver definition file.
data/example.tcl       Standalone   Used by SDK to integrate the software into an SDK project.
src/xexample_hw.h      Both         Defines address offsets for all internal registers.
src/xexample.h         Both         API definitions
src/xexample.c         Both         Standard API implementations
src/xexample_sinit.c   Standalone   Initialization API implementations
src/xexample_linux.c   Linux        Initialization API implementations
src/Makefile           Standalone   Makefile
In file xexample.h, two structs are defined.
• XExample_Config: This is used to hold the configuration information (base address of
each AXI4-Lite slave interface) of the IP instance.
• XExample: This is used to hold the IP instance pointer. Most APIs take this instance
pointer as the first argument.
The standard API implementations are provided in files xexample.c,
xexample_sinit.c, and xexample_linux.c, and provide functions to perform the
following operations.
• Initialize the device
• Control the device and query its status
• Read/write to the registers
• Set up, monitor, and control the interrupts
The following table lists each of the API functions provided in the C driver files.
Table 1-10: C Driver API Functions
API Function                     Description
XExample_Initialize              Writes a value to InstancePtr, which can then be used in other APIs.
                                 It is recommended to call this API to initialize a device except
                                 when an MMU is used in the system.
XExample_CfgInitialize           Initializes a device configuration. When an MMU is used in the
                                 system, replace the base address in the XExample_Config variable
                                 with the virtual base address before calling this function. Not for
                                 use on Linux systems.
XExample_LookupConfig            Obtains the configuration information of the device by ID. The
                                 configuration information contains the physical base address. Not
                                 for use on Linux systems.
XExample_Release                 Releases the UIO device in Linux. Deletes the mappings with munmap.
                                 The mappings are deleted automatically if the process terminates.
                                 Only for use on Linux systems.
XExample_Start                   Starts the device. This function asserts the ap_start port on the
                                 device. Available only if there is an ap_start port on the device.
XExample_IsDone                  Checks if the device has finished the previous execution. This
                                 function returns the value of the ap_done port on the device.
                                 Available only if there is an ap_done port on the device.
XExample_IsIdle                  Checks if the device is in the idle state. This function returns the
                                 value of the ap_idle port. Available only if there is an ap_idle
                                 port on the device.
XExample_IsReady                 Checks if the device is ready for the next input. This function
                                 returns the value of the ap_ready port. Available only if there is
                                 an ap_ready port on the device.
XExample_Continue                Asserts port ap_continue. Available only if there is an
                                 ap_continue port on the device.
XExample_EnableAutoRestart       Enables "auto restart" on the device. When this is set, the device
                                 automatically starts the next transaction when the current
                                 transaction completes.
XExample_DisableAutoRestart      Disables the "auto restart" function.
XExample_Set_ARG                 Writes a value to port ARG (a scalar argument of the top function).
                                 Available only if ARG is an input port.
XExample_Set_ARG_vld             Asserts port ARG_vld. Available only if ARG is an input port and
                                 implemented with an ap_hs or ap_vld interface protocol.
XExample_Set_ARG_ack             Asserts port ARG_ack. Available only if ARG is an output port and
                                 implemented with an ap_hs or ap_ack interface protocol.
XExample_Get_ARG                 Reads a value from ARG. Only available if port ARG is an output
                                 port on the device.
XExample_Get_ARG_vld             Reads a value from ARG_vld. Only available if port ARG is an output
                                 port on the device and implemented with an ap_hs or ap_vld
                                 interface protocol.
XExample_Get_ARG_ack             Reads a value from ARG_ack. Only available if port ARG is an input
                                 port on the device and implemented with an ap_hs or ap_ack
                                 interface protocol.
XExample_Get_ARG_BaseAddress     Returns the base address of the array inside the interface. Only
                                 available when ARG is an array grouped into the AXI4-Lite interface.
XExample_Get_ARG_HighAddress     Returns the address of the uppermost element of the array. Only
                                 available when ARG is an array grouped into the AXI4-Lite interface.
XExample_Get_ARG_TotalBytes      Returns the total number of bytes used to store the array. Only
                                 available when ARG is an array grouped into the AXI4-Lite interface.
                                 Note: If the elements in the array are less than 16-bit, Vivado HLS
                                 groups multiple elements into the 32-bit data width of the AXI4-Lite
                                 interface. If the bit width of the elements exceeds 32-bit, Vivado
                                 HLS stores each element over multiple consecutive addresses.
XExample_Get_ARG_BitWidth        Returns the bit width of each element in the array. Only available
                                 when ARG is an array grouped into the AXI4-Lite interface.
                                 Note: If the elements in the array are less than 16-bit, Vivado HLS
                                 groups multiple elements into the 32-bit data width of the AXI4-Lite
                                 interface. If the bit width of the elements exceeds 32-bit, Vivado
                                 HLS stores each element over multiple consecutive addresses.
XExample_Get_ARG_Depth           Returns the total number of elements in the array. Only available
                                 when ARG is an array grouped into the AXI4-Lite interface.
                                 Note: If the elements in the array are less than 16-bit, Vivado HLS
                                 groups multiple elements into the 32-bit data width of the AXI4-Lite
                                 interface. If the bit width of the elements exceeds 32-bit, Vivado
                                 HLS stores each element over multiple consecutive addresses.
XExample_Write_ARG_Words         Writes the specified number of 32-bit words to the specified address
                                 of the AXI4-Lite interface. This API requires the offset address
                                 from BaseAddress and the length of the data to be stored. Only
                                 available when ARG is an array grouped into the AXI4-Lite interface.
XExample_Read_ARG_Words          Reads the specified number of 32-bit words from the array. This API
                                 requires the data target, the offset address from BaseAddress, and
                                 the length of the data to be loaded. Only available when ARG is an
                                 array grouped into the AXI4-Lite interface.
XExample_Write_ARG_Bytes         Writes the specified number of bytes to the specified address of the
                                 AXI4-Lite interface. This API requires the offset address from
                                 BaseAddress and the length of the data to be stored. Only available
                                 when ARG is an array grouped into the AXI4-Lite interface.
XExample_Read_ARG_Bytes          Reads the specified number of bytes from the array. This API
                                 requires the data target, the offset address from BaseAddress, and
                                 the length of data to be loaded. Only available when ARG is an array
                                 grouped into the AXI4-Lite interface.
XExample_InterruptGlobalEnable   Enable the interrupt output. Interrupt functions are available
                                 only if there is ap_start.
XExample_InterruptGlobalDisable  Disable the interrupt output.
XExample_InterruptEnable         Enable the interrupt source. There may be at most 2 interrupt
                                 sources (source 0 for ap_done and source 1 for ap_ready).
XExample_InterruptDisable        Disable the interrupt source.
XExample_InterruptClear          Clear the interrupt status.
XExample_InterruptGetEnabled     Check which interrupt sources are enabled.
XExample_InterruptGetStatus      Check which interrupt sources are triggered.
IMPORTANT: The C driver APIs always use an unsigned 32-bit type (U32). You might be required to cast
the data in the C code into the expected type.
C Driver Files and Float Types
C driver files always use a 32-bit unsigned integer (U32) data type for data transfers. In the
following example, the function uses float type arguments a and r1. It sets the value of a
and returns the value of r1:
float caculate(float a, float *r1)
{
#pragma HLS INTERFACE ap_vld register port=r1
#pragma HLS INTERFACE s_axilite port=a
#pragma HLS INTERFACE s_axilite port=r1
#pragma HLS INTERFACE s_axilite port=return

  *r1 = 0.5f*a;
  return (a>0);
}
After synthesis, Vivado HLS groups all ports into the default AXI4-Lite interface and creates
C driver files. However, as shown in the following example, the driver files use type U32:
// API to set the value of A
void XCaculate_SetA(XCaculate *InstancePtr, u32 Data) {
  Xil_AssertVoid(InstancePtr != NULL);
  Xil_AssertVoid(InstancePtr->IsReady == XIL_COMPONENT_IS_READY);
  XCaculate_WriteReg(InstancePtr->Hls_periph_bus_BaseAddress,
                     XCACULATE_HLS_PERIPH_BUS_ADDR_A_DATA, Data);
}

// API to get the value of R1
u32 XCaculate_GetR1(XCaculate *InstancePtr) {
  u32 Data;
  Xil_AssertNonvoid(InstancePtr != NULL);
  Xil_AssertNonvoid(InstancePtr->IsReady == XIL_COMPONENT_IS_READY);
  Data = XCaculate_ReadReg(InstancePtr->Hls_periph_bus_BaseAddress,
                           XCACULATE_HLS_PERIPH_BUS_ADDR_R1_DATA);
  return Data;
}
If float values are passed to or read from these functions directly, the C compiler converts
them numerically to or from u32, so the values written and read are not consistent with the
expected float bit pattern. When using these functions in software, you can use the following
casts in the code:
float a=3.0f,r1;
u32 ua,ur1;
// cast float "a" to type U32
XCaculate_SetA(&caculate,*((u32*)&a));
ur1=XCaculate_GetR1(&caculate);
// cast return type U32 to float type for "r1"
r1=*((float*)&ur1);
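Pointer casts between float and integer types can conflict with the strict aliasing rules applied by some optimizing compilers. As an alternative, the bit pattern can be copied with memcpy. The following is a minimal sketch, not part of the generated driver: the helper names float_to_u32 and u32_to_float are hypothetical, and uint32_t stands in for the driver u32 type on the assumption that u32 is a 32-bit unsigned integer.

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret the bits of a float as a 32-bit unsigned value (no numeric conversion).
inline uint32_t float_to_u32(float f) {
  static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");
  uint32_t u;
  std::memcpy(&u, &f, sizeof u);
  return u;
}

// Reinterpret a 32-bit unsigned value read from a register as a float.
inline float u32_to_float(uint32_t u) {
  float f;
  std::memcpy(&f, &u, sizeof f);
  return f;
}
```

With these helpers, the value passed to XCaculate_SetA would be float_to_u32(a), and the value returned by XCaculate_GetR1 would be converted with u32_to_float.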
For a complete description of the API functions, see AXI4-Lite Slave C Driver Reference in
Chapter 4.
Controlling Hardware
The hardware header file xexample_hw.h (in this example) provides a complete list of the
memory mapped locations for the ports grouped into the AXI4-Lite slave interface.
// 0x00 : Control signals
//        bit 0  - ap_start (Read/Write/SC)
//        bit 1  - ap_done (Read/COR)
//        bit 2  - ap_idle (Read)
//        bit 3  - ap_ready (Read)
//        bit 7  - auto_restart (Read/Write)
//        others - reserved
// 0x04 : Global Interrupt Enable Register
//        bit 0  - Global Interrupt Enable (Read/Write)
//        others - reserved
// 0x08 : IP Interrupt Enable Register (Read/Write)
//        bit 0  - Channel 0 (ap_done)
//        bit 1  - Channel 1 (ap_ready)
// 0x0c : IP Interrupt Status Register (Read/TOW)
//        bit 0  - Channel 0 (ap_done)
//        others - reserved
// 0x10 : Data signal of a
//        bit 7~0 - a[7:0] (Read/Write)
//        others  - reserved
// 0x14 : reserved
// 0x18 : Data signal of b
//        bit 7~0 - b[7:0] (Read/Write)
//        others  - reserved
// 0x1c : reserved
// 0x20 : Data signal of c_i
//        bit 7~0 - c_i[7:0] (Read/Write)
//        others  - reserved
// 0x24 : reserved
// 0x28 : Data signal of c_o
//        bit 7~0 - c_o[7:0] (Read)
//        others  - reserved
// 0x2c : Control signal of c_o
//        bit 0  - c_o_ap_vld (Read/COR)
//        others - reserved
// (SC = Self Clear, COR = Clear on Read, TOW = Toggle on Write, COH = Clear on Handshake)
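The register map above can be transcribed into constants for use in software that accesses the registers directly. The sketch below is illustrative only: the namespace and constant names are hypothetical (the generated xexample_hw.h defines its own macros), and treating every offset below 0x10 as reserved assumes that each register occupies one 32-bit word.

```cpp
#include <cstdint>

namespace example_regs {
// Register offsets, taken from the map above.
constexpr uint32_t ADDR_AP_CTRL  = 0x00;
constexpr uint32_t ADDR_GIE      = 0x04;
constexpr uint32_t ADDR_IER      = 0x08;
constexpr uint32_t ADDR_ISR      = 0x0c;
constexpr uint32_t ADDR_A_DATA   = 0x10;
constexpr uint32_t ADDR_B_DATA   = 0x18;
constexpr uint32_t ADDR_C_I_DATA = 0x20;
constexpr uint32_t ADDR_C_O_DATA = 0x28;
constexpr uint32_t ADDR_C_O_CTRL = 0x2c;

// Bit masks of the control register at offset 0x00.
constexpr uint32_t AP_START     = 1u << 0;
constexpr uint32_t AP_DONE      = 1u << 1;
constexpr uint32_t AP_IDLE      = 1u << 2;
constexpr uint32_t AP_READY     = 1u << 3;
constexpr uint32_t AUTO_RESTART = 1u << 7;

// True when `offset` lies in the block-level control and interrupt registers
// (0x00 through 0x0c, each assumed to be one 32-bit word wide).
constexpr bool is_reserved(uint32_t offset) { return offset < ADDR_A_DATA; }

// Decode helpers for a value read from the control register.
constexpr bool is_done(uint32_t ctrl) { return (ctrl & AP_DONE) != 0; }
constexpr bool is_idle(uint32_t ctrl) { return (ctrl & AP_IDLE) != 0; }
}  // namespace example_regs
```

A check such as is_reserved() can be applied to any address chosen with the offset option to confirm that it does not collide with the block-level registers.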
To correctly program the registers in the AXI4-Lite slave interface, you must understand
how the hardware ports operate. The block operates with the same port protocols described
in Interface Synthesis.
For example, to start the block operation the ap_start register must be set to 1. The
device then proceeds to read any inputs grouped into the AXI4-Lite slave interface
from the registers in the interface. When the block completes operation, the ap_done,
ap_idle and ap_ready registers are set by the hardware output ports, and the results
for any output ports grouped into the AXI4-Lite slave interface can be read from the
appropriate registers. This is the same operation described in Figure 1-38.
The implementation of function argument c in the example above also highlights the
importance of understanding how the hardware ports operate. Function
argument c is both read and written to, and is therefore implemented as separate input and
output ports c_i and c_o, as explained in Interface Synthesis.
The first recommended flow for programming the AXI4-Lite slave interface is for a one-time
execution of the function:
•
Use the interrupt function to determine how you wish the interrupt to operate.
•
Load the register values for the block input ports. In the above example this is
performed using API functions XExample_Set_a, XExample_Set_b, and
XExample_Set_c_i.
•
Set the ap_start bit to 1 using XExample_Start to start executing the function.
This register is self-clearing as noted in the header file above. After one transaction, the
block will suspend operation.
•
Allow the function to execute. Address any interrupts which are generated.
•
Read the output registers. In the above example this is performed using API functions
XExample_Get_c_o_vld, to confirm the data is valid, and XExample_Get_c_o.
Note: The registers in the AXI4-Lite slave interface obey the same I/O protocol as the ports. In
this case, the output valid is set to logic 1 when the data is valid.
•
Repeat for the next transaction.
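The one-time execution flow above can be sketched against a small software model of the block. Everything in this sketch is hypothetical: the MockExample type, the mock_ functions, and the assumed function body (c_o = a + b + c_i) are illustrative stand-ins, not the generated driver or the real design. The model only mirrors the ordering of the steps: load inputs, start, wait for the valid flag, then read the result.

```c
#include <stdint.h>

/* Hypothetical software model of the block; not the generated driver. */
typedef struct {
    uint8_t a, b, c_i; /* input registers (0x10, 0x18, 0x20) */
    uint8_t c_o;       /* output register (0x28) */
    int c_o_vld;       /* output valid flag (0x2c), clear on read */
} MockExample;

/* Models loading the input registers. */
void mock_set_inputs(MockExample *m, uint8_t a, uint8_t b, uint8_t c_i) {
    m->a = a;
    m->b = b;
    m->c_i = c_i;
}

/* Models writing ap_start = 1: one transaction runs, then the block stops. */
void mock_start(MockExample *m) {
    m->c_o = (uint8_t)(m->a + m->b + m->c_i); /* assumed function body */
    m->c_o_vld = 1;
}

/* Models a clear-on-read valid flag. */
int mock_get_c_o_vld(MockExample *m) {
    int v = m->c_o_vld;
    m->c_o_vld = 0;
    return v;
}

/* Models reading the output data register. */
uint8_t mock_get_c_o(const MockExample *m) { return m->c_o; }

/* One-time execution: load inputs, start, wait for valid, read result. */
int run_once(MockExample *m, uint8_t a, uint8_t b, uint8_t c_i) {
    mock_set_inputs(m, a, b, c_i);
    mock_start(m);
    while (!mock_get_c_o_vld(m)) {
        /* poll until the output is flagged valid */
    }
    return mock_get_c_o(m);
}
```

Because the valid flag is modeled as clear on read, the sketch checks it once and then reads the data register, rather than polling the data register itself.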
The second recommended flow is for continuous execution of the block. In this mode, the
input ports included in the AXI4-Lite slave interface should only be ports which perform
configuration. The block will typically run much faster than a CPU. If the block must wait for
inputs, the block will spend most of its time waiting:
•
Use the interrupt function to determine how you wish the interrupt to operate.
•
Load the register values for the block input ports. In the above example this is
performed using API functions XExample_Set_a, XExample_Set_b, and
XExample_Set_c_i.
•
Set the auto-start function using API XExample_EnableAutoRestart.
•
Allow the function to execute. The individual port I/O protocols will synchronize the
data being processed through the block.
•
Address any interrupts which are generated. The output registers could be accessed
during this operation but the data may change often.
•
Use the API function XExample_DisableAutoRestart to prevent any more
executions.
•
Read the output registers. In the above example this is performed using API functions
XExample_Get_c_o and XExample_Get_c_o_vld.
Controlling Software
The API functions can be used in the software running on the CPU to control the hardware
block. An overview of the process is:
•
Create an instance of the HLS HW block
•
Look Up the device configuration
•
Initialize the Device
•
Set the input parameters of the HLS block
•
Start the device and read the results
An abstracted version of this process is shown below. Complete examples of the software
control are provided in the Zynq-7000 AP SoC tutorials noted in Table 1-4.
#include "xexample.h"     // Device driver for HLS HW block
#include "xparameters.h"
// HLS HW instance
XExample HlsExample;
XExample_Config *ExamplePtr;
int main() {
    int res_hw;
    int status;
    // Look Up the device configuration
    ExamplePtr = XExample_LookupConfig(XPAR_XEXAMPLE_0_DEVICE_ID);
    if (!ExamplePtr) {
        print("ERROR: Lookup of accelerator configuration failed.\n\r");
        return XST_FAILURE;
    }
    // Initialize the Device
    status = XExample_CfgInitialize(&HlsExample, ExamplePtr);
    if (status != XST_SUCCESS) {
        print("ERROR: Could not initialize accelerator.\n\r");
        return XST_FAILURE;
    }
    // Set the input parameters of the HLS block
    XExample_Set_a(&HlsExample, 42);
    XExample_Set_b(&HlsExample, 12);
    XExample_Set_c_i(&HlsExample, 1);
    // Start the device and read the results
    XExample_Start(&HlsExample);
    // Wait for valid data output (c_o_ap_vld is clear on read)
    while (XExample_Get_c_o_vld(&HlsExample) == 0);
    res_hw = XExample_Get_c_o(&HlsExample);
    print("Detected HLS peripheral complete. Result received.\n\r");
    return 0;
}
Customizing AXI4-Lite Slave Interfaces in IP Integrator
When an HLS RTL design using an AXI4-Lite slave interface is incorporated into a design in
Vivado IP Integrator, you can customize the block. From the block diagram in IP Integrator,
select the HLS block, right-click with the mouse button and select Customize Block.
The address width is by default configured to the minimum required size. Modify this to
connect to blocks with address sizes less than 32-bit.
Figure 1-45: Customizing AXI4-Lite Slave Interfaces in IP Integrator
AXI4 Master Interface
You can use an AXI4 master interface on array or pointer/reference arguments, which
Vivado HLS implements in one of the following modes:
•
Individual data transfers
•
Burst mode data transfers
With individual data transfers, Vivado HLS reads or writes a single element of data for each
address. The following example shows a single read and single write operation. In this
example, Vivado HLS generates an address on the AXI interface to read a single data value
and an address to write a single data value. The interface transfers one data value per
address.
void bus (int *d) {
static int acc = 0;
acc += *d;
*d = acc;
}
With burst mode transfers, Vivado HLS reads or writes data using a single base address
followed by multiple sequential data samples, which makes this mode capable of higher
data throughput. Burst mode of operation is possible when you use the C memcpy function
or a pipelined for loop.
Note: The C memcpy function is only supported for synthesis when used to transfer data to or from
a top-level function argument specified with an AXI4 master interface.
The following example shows a burst mode copy using the memcpy function. The
top-level function argument a is specified as an AXI4 master interface.
void example(volatile int *a){
#pragma HLS INTERFACE m_axi depth=50 port=a
#pragma HLS INTERFACE s_axilite port=return
//Port a is assigned to an AXI4 master interface
int i;
int buff[50];
//memcpy creates a burst access to memory
memcpy(buff,(const int*)a,50*sizeof(int));
for(i=0; i < 50; i++){
buff[i] = buff[i] + 100;
}
memcpy((int *)a,buff,50*sizeof(int));
}
When this example is synthesized, it results in the interface shown in the following figure.
Note: In this figure, the AXI4 interfaces are collapsed.
Figure 1-46: AXI4 Interface
The following example shows the same code as the preceding example but uses a for loop
to copy the data out:
void example(volatile int *a){
#pragma HLS INTERFACE m_axi depth=50 port=a
#pragma HLS INTERFACE s_axilite port=return
//Port a is assigned to an AXI4 master interface
int i;
int buff[50];
//memcpy creates a burst access to memory
memcpy(buff,(const int*)a,50*sizeof(int));
for(i=0; i < 50; i++){
buff[i] = buff[i] + 100;
}
for(i=0; i < 50; i++){
#pragma HLS PIPELINE
a[i] = buff[i];
}
}
When using a for loop to implement burst reads or writes, follow these requirements:
•
Pipeline the loop
•
Access addresses in increasing order
•
Do not place accesses inside a conditional statement
•
For nested loops, do not flatten loops, because this inhibits the burst operation
Note: Only one read and one write is allowed in a for loop unless the ports are bundled in different
AXI ports. The following example shows how to perform two reads in burst mode using different AXI
interfaces.
In the following example, Vivado HLS implements the port reads as burst transfers. Port a
is specified without using the bundle option and is implemented in the default AXI
interface. Port b is specified using a named bundle and is implemented in a separate AXI
interface called d2_port.
void example(volatile int *a, int *b){
#pragma HLS INTERFACE s_axilite port=return
#pragma HLS INTERFACE m_axi depth=50 port=a
#pragma HLS INTERFACE m_axi depth=50 port=b bundle=d2_port
int i;
int buff[50];
//copy data in
for(i=0; i < 50; i++){
#pragma HLS PIPELINE
buff[i] = a[i] + b[i];
}
...
}
Controlling AXI4 Burst Behavior
An optimal AXI4 interface is one in which the design never stalls while waiting to access the
bus, and after bus access is granted, the bus never stalls while waiting for the design to
read/write. To create the optimal AXI4 interface, the following options are provided in the
INTERFACE directive to specify the behavior of the bursts and optimize the efficiency of the
AXI4 interface.
Some of these options use internal storage to buffer data and may have an impact on area
and resources:
•
latency: Specifies the expected latency of the AXI4 interface, allowing the design to
initiate a bus request a number of cycles (latency) before the read or write is expected.
If this figure is too low, the design will be ready too soon and may stall waiting for the
bus. If this figure is too high, bus access may be granted but the bus may stall waiting
on the design to start the access.
•
max_read_burst_length: Specifies the maximum number of data values read
during a burst transfer.
•
num_read_outstanding: Specifies how many read requests can be made to the AXI4
bus, without a response, before the design stalls. This implies internal storage in the
design, a FIFO of size:
num_read_outstanding*max_read_burst_length*word_size.
•
max_write_burst_length: Specifies the maximum number of data values written
during a burst transfer.
•
num_write_outstanding: Specifies how many write requests can be made to the
AXI4 bus, without a response, before the design stalls. This implies internal storage in
the design, a FIFO of size:
num_write_outstanding*max_write_burst_length*word_size
The following example can be used to help explain these options:
#pragma HLS interface m_axi port=input offset=slave bundle=gmem0 \
depth=1024*1024*16/(512/8) \
latency=100 \
num_read_outstanding=32 \
num_write_outstanding=32 \
max_read_burst_length=16 \
max_write_burst_length=16
The interface is specified as having a latency of 100. Vivado HLS seeks to schedule the
request for burst access 100 clock cycles before the design is ready to access the AXI4 bus.
To further improve bus efficiency, the options num_write_outstanding and
num_read_outstanding ensure the design contains enough buffering to store up to 32
read and write accesses. This allows the design to continue processing until the bus
requests are serviced. Finally, the options max_read_burst_length and
max_write_burst_length ensure the maximum burst size is 16 and that the AXI4
interface does not hold the bus for longer than this.
These options allow the behavior of the AXI4 interface to be optimized for the system in
which it will operate. The efficiency of the operation depends on these values being set
accurately.
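The internal storage implied by the outstanding-request options can be estimated with a short calculation. The following is a rough sketch, not a statement of what Vivado HLS actually allocates: the function name axi_fifo_bytes is hypothetical, and the 64-byte word size is an assumption inferred from the 512/8 term in the depth expression of the example pragma.

```c
#include <stddef.h>

/* Hypothetical helper: estimated FIFO storage in bytes, following the
 * expression num_outstanding * max_burst_length * word_size. */
size_t axi_fifo_bytes(size_t num_outstanding,
                      size_t max_burst_length,
                      size_t word_bytes) {
    return num_outstanding * max_burst_length * word_bytes;
}
```

With 32 outstanding requests, a maximum burst length of 16, and an assumed 64-byte word, the estimate is 32 * 16 * 64 = 32768 bytes for each direction, which illustrates why these options can affect area and resources.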
Creating an AXI4 Interface with 64-bit Address Capability
By default, Vivado HLS implements the AXI4 port with a 32-bit address bus. Optionally, you
can implement the AXI4 interface with a 64-bit address bus using the m_axi_addr64
interface configuration option as follows:
1. Select Solution > Solution Settings.
2. In the Solution Settings dialog box, click the General category, and click Add.
3. In the Add Command dialog box, select config_interface, and enable m_axi_addr64.
IMPORTANT: When you select the m_axi_addr64 option, Vivado HLS implements all AXI4 interfaces in
the design with a 64-bit address bus.
Controlling the Address Offset in an AXI4 Interface
By default, the AXI4 master interface starts all read and write operations from address
0x00000000. For example, given the following code, the design reads data from addresses
0x00000000 to 0x000000c7 (50 32-bit words, gives 200 bytes), which represents 50 address
values. The design then writes data back to the same addresses.
void example(volatile int *a){
#pragma HLS INTERFACE m_axi depth=50 port=a
#pragma HLS INTERFACE s_axilite port=return bundle=AXILiteS
int i;
int buff[50];
memcpy(buff,(const int*)a,50*sizeof(int));
for(i=0; i < 50; i++){
buff[i] = buff[i] + 100;
}
memcpy((int *)a,buff,50*sizeof(int));
}
To apply an address offset, use the -offset option with the INTERFACE directive, and
specify one of the following options:
•
off: Does not apply an offset address. This is the default.
•
direct: Adds a 32-bit port to the design for applying an address offset.
•
slave: Adds a 32-bit register inside the AXI4-Lite interface for applying an address
offset.
In the final RTL, Vivado HLS applies the address offset directly to any read or write address
generated by the AXI4 master interface. This allows the design to access any address
location in the system.
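The effect of the offset on the generated addresses can be illustrated with a small calculation. This is a sketch under stated assumptions: the function name m_axi_byte_address is hypothetical, and the offset is assumed to be a byte address that is added to the byte address of each element.

```c
#include <stdint.h>

/* Hypothetical helper: byte address of element `index` of a port whose
 * elements are `word_bytes` wide, with an assumed byte-address offset. */
uint32_t m_axi_byte_address(uint32_t offset, uint32_t index,
                            uint32_t word_bytes) {
    return offset + index * word_bytes;
}
```

With no offset, the last of the 50 32-bit words starts at byte address 0xc4 and ends at 0xc7, which matches the address range quoted for the default case. With an offset of 0x10000000, the same element would start at 0x100000c4.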
If you use the slave option in an AXI interface, you must use an AXI4-Lite port on the
design interface. Xilinx recommends that you implement the AXI4-Lite interface using the
following pragma:
#pragma HLS INTERFACE s_axilite port=return
In addition, if you use the slave option and you use several AXI4-Lite interfaces, you
must ensure that the AXI master port offset register is bundled into the correct AXI4-Lite
interface. In the following example, port a is implemented as an AXI master interface with
an offset and AXI4-Lite interfaces called AXI_Lite_1 and AXI_Lite_2:
#pragma HLS INTERFACE m_axi port=a depth=50 offset=slave
#pragma HLS INTERFACE s_axilite port=return bundle=AXI_Lite_1
#pragma HLS INTERFACE s_axilite port=b bundle=AXI_Lite_2
The following INTERFACE directive is required to ensure that the offset register for port a is
bundled into the AXI4-Lite interface called AXI_Lite_1:
#pragma HLS INTERFACE s_axilite port=a bundle=AXI_Lite_1
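Taken together, the four directives could appear in a top-level function as sketched below. The function name, the type of argument b, and the function body are assumptions made for illustration; only the pragmas are taken from the text above. A standard C compiler ignores the HLS pragmas, so the function can still be exercised in a C test bench.

```c
/* Sketch only: function name, argument types, and body are hypothetical. */
void example_offset(volatile int *a, int b) {
#pragma HLS INTERFACE m_axi port=a depth=50 offset=slave
#pragma HLS INTERFACE s_axilite port=return bundle=AXI_Lite_1
#pragma HLS INTERFACE s_axilite port=b bundle=AXI_Lite_2
/* Place the offset register for port a in AXI_Lite_1 */
#pragma HLS INTERFACE s_axilite port=a bundle=AXI_Lite_1
    int i;
    /* Add the scalar b to each of the 50 elements accessed through a */
    for (i = 0; i < 50; i++) {
        a[i] = a[i] + b;
    }
}
```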
Customizing AXI4 Master Interfaces in IP Integrator
When you incorporate an HLS RTL design that uses an AXI4 master interface into a design
in the Vivado IP Integrator, you can customize the block. From the block diagram in IP
Integrator, select the HLS block, right-click, and select Customize Block to customize any
of the settings provided. A complete description of the AXI4 parameters is provided in
the AXI Reference Guide (UG1037) [Ref 8].
The following figure shows the Re-Customize IP dialog box for the design shown in
Figure 1-46. This design includes an AXI4-Lite port.
Figure 1-47: Customizing AXI4 Master Interfaces in IP Integrator
Managing Interfaces with SSI Technology Devices
Certain Xilinx devices use stacked silicon interconnect (SSI) technology. In these devices, the
total available resources are divided over multiple super logic regions (SLRs). The
connections between SLRs use super long line (SLL) routes. SLL routes incur delay costs
that are typically greater than standard FPGA routing. To ensure designs operate at
maximum performance, use the following guidelines:
•
Register all signals that cross between SLRs at both the SLR output and SLR input.
•
You do not need to register a signal if it enters or exits an SLR via an I/O buffer.
•
Ensure that the logic created by Vivado HLS fits within a single SLR.
Note: When you select an SSI technology device as the target technology, the utilization report
includes details on both the SLR usage and the total device usage.
If the logic is contained within a single SLR device, Vivado HLS provides a register_io
option to the config_interface command. This option provides a way to automatically
register all block inputs, outputs, or both. This option is only required for scalars. All array
ports are automatically registered.
The settings for the register_io option are:
•
off: None of the inputs or outputs are registered.
•
scalar_in: All inputs are registered.
•
scalar_out: All outputs are registered.
•
scalar_all: All inputs and outputs are registered.
Note: Using the register_io option with block-level floorplanning of the RTL ensures that logic
targeted to an SSI technology device executes at the maximum clock rate.
Optimizing the Design
This section outlines the various optimizations and techniques you can use to direct Vivado
HLS to produce a micro-architecture that satisfies the desired performance and area goals.
The following table lists the optimization directives provided by Vivado HLS.
Table 1-11: Vivado HLS Optimization Directives
| Directive | Description |
| --- | --- |
| ALLOCATION | Specify a limit for the number of operations, cores or functions used. This can force the sharing of hardware resources and may increase latency. |
| ARRAY_MAP | Combines multiple smaller arrays into a single large array to help reduce block RAM resources. |
| ARRAY_PARTITION | Partitions large arrays into multiple smaller arrays or into individual registers, to improve access to data and remove block RAM bottlenecks. |
| ARRAY_RESHAPE | Reshape an array from one with many elements to one with greater word-width. Useful for improving block RAM accesses without using more block RAM. |
| CLOCK | For SystemC designs multiple named clocks can be specified using the create_clock command and applied to individual SC_MODULEs using this directive. |
| DATA_PACK | Packs the data fields of a struct into a single scalar with a wider word width. |
| DATAFLOW | Enables task level pipelining, allowing functions and loops to execute concurrently. Used to minimize interval. |
| DEPENDENCE | Used to provide additional information that can overcome loop-carry dependencies and allow loops to be pipelined (or pipelined with lower intervals). |
| EXPRESSION_BALANCE | Allows automatic expression balancing to be turned off. |
| FUNCTION_INSTANTIATE | Allows different instances of the same function to be locally optimized. |
| INLINE | Inlines a function, removing all function hierarchy. Used to enable logic optimization across function boundaries and improve latency/interval by reducing function call overhead. |
| INTERFACE | Specifies how RTL ports are created from the function description. |
| LATENCY | Allows a minimum and maximum latency constraint to be specified. |
| LOOP_FLATTEN | Allows nested loops to be collapsed into a single loop with improved latency. |
| LOOP_MERGE | Merge consecutive loops to reduce overall latency, increase sharing and improve logic optimization. |
| LOOP_TRIPCOUNT | Used for loops which have variable bounds. Provides an estimate for the loop iteration count. This has no impact on synthesis, only on reporting. |
| OCCURRENCE | Used when pipelining functions or loops, to specify that the code in a location is executed at a lesser rate than the code in the enclosing function or loop. |
| PIPELINE | Reduces the initiation interval by allowing the concurrent execution of operations within a loop or function. |
| PROTOCOL | This command specifies a region of the code to be a protocol region. A protocol region can be used to manually specify an interface protocol. |
| RESET | This directive is used to add or remove reset on a specific state variable (global or static). |
| RESOURCE | Specify that a specific library resource (core) is used to implement a variable (array, arithmetic operation or function argument) in the RTL. |
| STREAM | Specifies that a specific array is to be implemented as a FIFO or RAM memory channel during dataflow optimization. |
| TOP | The top-level function for synthesis is specified in the project settings. This directive may be used to specify any function as the top-level for synthesis. This then allows different solutions within the same project to be specified as the top-level function for synthesis without needing to create a new project. |
| UNROLL | Unroll for-loops to create multiple independent operations rather than a single collection of operations. |
In addition to the optimization directives, Vivado HLS provides a number of configuration
settings. Configuration settings are used to change the default behavior of synthesis. The
configuration settings are shown in the following table.
Table 1-12: Vivado HLS Configurations
| GUI Directive | Description |
| --- | --- |
| Config Array Partition | This configuration determines how arrays are partitioned, including global arrays and if the partitioning impacts array ports. |
| Config Bind | Determines the effort level to use during the synthesis binding phase and can be used to globally minimize the number of operations used. |
| Config Compile | Controls synthesis specific optimizations such as the automatic loop pipelining and floating point math optimizations. |
| Config Dataflow | This configuration specifies the default memory channel and FIFO depth in dataflow optimization. |
| Config Interface | This configuration controls I/O ports not associated with the top-level function arguments and allows unused ports to be eliminated from the final RTL. |
| Config RTL | Provides control over the output RTL including file and module naming, reset style and FSM encoding. |
| Config Schedule | Determines the effort level to use during the synthesis scheduling phase and the verbosity of the output messages. |
Details on how to apply the optimizations and configurations are provided in Applying
Optimization Directives. The configurations are accessed using the menu Solution >
Solution Settings > General and selecting the configuration using the Add button.
The optimizations are presented in the context of how they are typically applied on a
design.
The Clock, Reset and RTL output are discussed together. The clock frequency along with the
target device is the primary constraint which drives optimization. Vivado HLS seeks to place
as many operations as the target device allows into each clock cycle. The reset style used in the
final RTL is controlled, along with settings such as the FSM encoding style, using the config_rtl
configuration.
The primary optimizations for Optimizing for Throughput are presented together in the
manner in which they are typically used: pipeline the tasks to improve performance,
improve the data flow between tasks and optimize structures to improve address issues
which may limit performance.
Optimizing for Latency uses the techniques of latency constraints and the removal of loop
transitions to reduce the number of clock cycles required to complete.
A focus on how operations are implemented - controlling the number of operations and
how those operations are implemented in hardware - is the principal technique for
improving the area.
Clock, Reset, and RTL Output
Specifying the Clock Frequency
For C and C++ designs only a single clock is supported. The same clock is applied to all
functions in the design.
For SystemC designs, each SC_MODULE may be specified with a different clock. To specify
multiple clocks in a SystemC design, use the -name option of the create_clock
command to create multiple named clocks and use the CLOCK directive or pragma to
specify which function contains the SC_MODULE to be synthesized with the specified clock.
Each SC_MODULE can only be synthesized using a single clock: clocks may be distributed
through functions, such as when multiple clocks are connected from the top-level ports to
individual blocks, but each SC_MODULE can only be sensitive to a single clock.
The clock period, in ns, is set in Solution > Solution Settings. Vivado HLS uses the
concept of a clock uncertainty to provide a user defined timing margin. Using the clock
frequency and device target information Vivado HLS estimates the timing of operations in
the design but it cannot know the final component placement and net routing: these
operations are performed by logic synthesis of the output RTL. As such, Vivado HLS cannot
know the exact delays.
To calculate the clock period used for synthesis, Vivado HLS subtracts the clock uncertainty
from the clock period, as shown in the following figure.
Figure 1-48: Clock Period and Margin (figure labels: Clock Period; Clock Uncertainty; Effective Clock Period used by Vivado HLS; Margin for Logic Synthesis and P&R)
This provides a user specified margin to ensure downstream processes, such as logic
synthesis and place & route, have enough timing margin to complete their operations. If
the FPGA device is mostly utilized the placement of cells and routing of nets to connect the
cells might not be ideal and might result in a design with larger than expected timing
delays. For a situation such as this, an increased timing margin ensures Vivado HLS does not
create a design with too much logic packed into each clock cycle and allows RTL synthesis
to satisfy timing in cases with less than ideal placement and routing options.
By default, the clock uncertainty is 12.5% of the cycle time. The value can be explicitly
specified beside the clock period.
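The relationship between the clock period, the clock uncertainty, and the period used for synthesis can be written out as a small calculation. The function names below are illustrative only and are not part of any Vivado HLS interface.

```c
/* Hypothetical helper: default clock uncertainty, 12.5% of the period. */
double default_uncertainty_ns(double period_ns) {
    return 0.125 * period_ns;
}

/* Hypothetical helper: effective period = period minus uncertainty. */
double effective_clock_period_ns(double period_ns, double uncertainty_ns) {
    return period_ns - uncertainty_ns;
}
```

For a 10 ns clock period, the default uncertainty is 1.25 ns, so the effective clock period used for synthesis is 8.75 ns. The remaining 1.25 ns is the margin left for logic synthesis and place and route.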
Vivado HLS aims to satisfy all constraints: timing, throughput, latency. However, if a
constraint cannot be satisfied, Vivado HLS always outputs an RTL design.
If the timing constraints inferred by the clock period cannot be met Vivado HLS issues
message SCHED-644, as shown below, and creates a design with the best achievable
performance.
@W [SCHED-644] Max operation delay ( 2.39ns) exceeds the effective
cycle time
Even if Vivado HLS cannot satisfy the timing requirements for a particular path, it still
achieves timing on all other paths. This behavior allows you to evaluate if higher
optimization levels or special handling of those failing paths by downstream logic
synthesis can pull in and ultimately satisfy the timing.
IMPORTANT: It is important to review the constraint report after synthesis to determine if all
constraints are met: the fact that Vivado HLS produces an output design does not guarantee the design
meets all performance constraints. Review the "Performance Estimates" section of the design report.
The option relax_ii_for_timing of the config_schedule command can be used to
change the default timing behavior. When this option is specified, Vivado HLS automatically
relaxes the II for any pipeline directive when it detects a path is failing to meet the clock
period. This option only applies to cases where the PIPELINE directive is specified without
an II value (and an II=1 is implied). If the II value is explicitly specified in the PIPELINE
directive, the relax_ii_for_timing option has no effect.
A design report is generated for each function in the hierarchy when synthesis completes
and can be viewed in the solution reports folder. The worst-case timing for the entire design
is reported as the worst case in each function report. There is no need to review every
report in the hierarchy.
If the timing violations are too severe to be further optimized and corrected by downstream
processes, review the techniques for specifying an exact latency and specifying exact
implementation cores before considering a faster target technology.
Specifying the Reset
Typically the most important aspect of RTL configuration is selecting the reset behavior.
When discussing reset behavior it is important to understand the difference between
initialization and reset.
Initialization Behavior
In C, variables defined with the static qualifier and those defined in the global scope, are by
default initialized to zero. Optionally, these variables may be assigned a specific initial
value. For these types of variables, the initial value in the C code is assigned at compile time
(at time zero) and never again. In both cases, the same initial value is implemented in the
RTL.
•
During RTL simulation the variables are initialized with the same values as the C code.
•
The same variables are initialized in the bitstream used to program the FPGA. When the
device powers up, the variables will start in their initialized state.
The variables start with the same initial state as the C code. However, there is no way to
force a return to this initial state. To return to their initial state the variables must be
implemented with a reset.
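The C-level behavior being described can be seen in a minimal example. The function name accumulate is illustrative; the static variable is initialized once and then keeps its value across calls, with nothing in the code that returns it to its initial state.

```c
/* Hypothetical example: a static accumulator initialized once to zero. */
int accumulate(int d) {
    static int acc = 0; /* initial value assigned once, at time zero */
    acc += d;           /* state is retained between calls */
    return acc;
}
```

After calls with 5 and then 3, the function returns 5 and then 8. In the RTL, the only way to bring acc back to 0 without reprogramming the device is to implement it with a reset.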
IMPORTANT: Top-level function arguments may be implemented in an AXI4-Lite interface. Since there
is no way to provide an initial value in C/C++ for function arguments, these variables cannot be
initialized in the RTL as doing so would create an RTL design with different functional behavior from the
C/C++ code which would fail to verify during C/RTL co-simulation.
Controlling the Reset Behavior
The reset port is used in an FPGA to return the registers and block RAM connected to the
reset port to an initial value any time the reset signal is applied. The presence and behavior
of the RTL reset port is controlled using the config_rtl configuration shown in the
following figure. To access this configuration, select Solution > Solution Settings >
General > Add > config_rtl.
Figure 1-49: RTL Configurations
The reset settings include the ability to set the polarity of the reset and whether the reset is
synchronous or asynchronous but more importantly it controls, through the reset option,
which registers are reset when the reset signal is applied.
IMPORTANT: When AXI4 interfaces are used on a design the reset polarity is automatically changed to
active-Low irrespective of the setting in the config_rtl configuration. This is required by the AXI4
standard.
The reset option has four settings:
•
none: No reset is added to the design.
•
control: This is the default and ensures all control registers are reset. Control registers
are those used in state machines and to generate I/O protocol signals. This setting
ensures the design can immediately start its operation state.
•
state: This option adds a reset to control registers (as in the control setting) plus any
registers or memories derived from static and global variables in the C code. This
setting ensures static and global variables initialized in the C code are reset to their
initialized value after the reset is applied.
•
all: This adds a reset to all registers and memories in the design.
Finer grain control over reset is provided through the RESET directive. If a variable is a static
or global, the RESET directive is used to explicitly add a reset, or the variable can be
removed from those being reset by using the RESET directive’s off option. This can be
particularly useful when static or global arrays are present in the design.
IMPORTANT: When using the reset state or all option, consider the effect on arrays.
Initializing and Resetting Arrays
Arrays are often defined as static variables, which implies that all elements are initialized
to zero, and arrays are typically implemented as block RAM. When the reset option state or
all is used, all arrays implemented as block RAM are forced to return to their initialized
state after reset. This can result in two very undesirable conditions in the RTL design:
• Unlike a power-up initialization, an explicit reset requires the RTL design to iterate
through each address in the block RAM to set the value. For an array of N elements, this
can take many clock cycles if N is large, and it requires more area resources to implement.
• A reset is added to every array in the design.
To prevent placing reset logic onto every such block RAM and incurring the cycle overhead
to reset all elements in the RAM:
• Use the default control reset mode and use the RESET directive to specify individual
static or global variables to be reset.
• Alternatively, use reset mode state and remove the reset from specific static or global
variables using the off option to the RESET directive.
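The second approach can be sketched in C as follows. This is a minimal illustration, assuming the solution uses the state reset mode; the function name accumulate, the array history, and the size HIST_LEN are hypothetical and chosen only for this example.

```c
#define HIST_LEN 64  /* hypothetical size, chosen for illustration */

/* Returns a running total and remembers the most recent samples. */
int accumulate(int sample) {
    static int total = 0;          /* scalar register: cheap to reset */
    static int history[HIST_LEN];  /* likely a block RAM: costly to reset */
    static int wr_idx = 0;

/* Assumes reset mode "state": the scalars keep their reset, while the
   array is explicitly excluded from the reset. */
#pragma HLS RESET variable=history off

    history[wr_idx] = sample;
    wr_idx = (wr_idx + 1) % HIST_LEN;
    total += sample;
    return total;
}
```

With the reset removed from history, the design should not rely on the array contents after a reset. A standard C compiler ignores the pragma, so the function can still be exercised in a C test bench.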
RTL Output
Various characteristics of the RTL output by Vivado HLS can be controlled using the
config_rtl configuration shown in Figure 1-49.
• Specify the type of FSM encoding used in the RTL state machines.
• Add an arbitrary comment string, such as a copyright notice, to all RTL files using the
-header option.
• Specify a unique name with the prefix option, which is added to all RTL output file
names.
• Force the RTL ports to use lower case names.
The default FSM encoding style is onehot. Other possible options are auto, binary, and
gray. If you select auto, Vivado HLS implements the style of encoding using the onehot
default, but Vivado Design Suite might extract and re-implement the FSM style during logic
synthesis. If you select any other encoding style (binary, onehot, gray), the encoding
style cannot be re-optimized by Xilinx logic synthesis tools.
The names of the RTL output files are derived from the name of the top-level function for
synthesis. If different RTL blocks are created from the same top-level function, the RTL files
will have the same name and cannot be combined in the same RTL project. The prefix
option allows RTL files generated from the same top-level function (and which by default
have the same name as the top-level function) to be easily combined in the same directory.
The lower_case_name option ensures that only lower case names are used in the output
RTL. This option ensures the IO protocol ports created by Vivado HLS, such as those for AXI
interfaces, are specified as s_axis__tdata in the final RTL rather than the default
port name of s_axis__TDATA.
Optimizing for Throughput
Use the following optimizations to improve throughput or reduce the initiation interval.
Task Pipelining
Pipelining allows operations to happen concurrently: the task does not have to complete all
operations before it begins the next operation. Pipelining is applied to functions and loops.
The throughput improvements in function pipelining are shown in the following figure.
void func(…) {
  op_Read;     // RD
  op_Compute;  // CMP
  op_Write;    // WR
}
Figure 1-50: Function Pipelining Behavior
(A) Without Function Pipelining: RD, CMP, and WR execute in sequence; 3 cycles between reads, 2 cycles from read to write.
(B) With Function Pipelining: a new RD starts every 1 cycle; 2 cycles from read to write.
Without pipelining, the function reads an input every 3 clock cycles and outputs a value
after 2 clock cycles. The function has an Initiation Interval (II) of 3 and a latency of 2. With
pipelining, a new input is read every cycle (II=1) with no change to the output latency or
resources used.
Loop pipelining allows the operations in a loop to be implemented in a concurrent manner
as shown in the following figure. In this figure, (A) shows the default sequential operation
where there are 3 clock cycles between each input read (II=3), and it requires 8 clock cycles
before the last output write is performed.
In the pipelined version of the loop shown in (B), a new input sample is read every cycle
(II=1) and the final output is written after only 4 clock cycles: substantially improving both
the II and latency while using the same hardware resources.
void func(m,n,o) {
  for (i=2;i>=0;i--) {
    op_Read;
    op_Compute;
    op_Write;
  }
}
Figure 1-51: Loop Pipelining
(A) Without Loop Pipelining: each iteration performs RD, CMP, WR before the next iteration starts.
(B) With Loop Pipelining: the RD, CMP, WR operations of successive iterations overlap.
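The cycle counts quoted for the loop example (8 cycles and 4 cycles) can be checked with a small model. This is an idealized sketch, assuming each iteration has three single-cycle stages (read, compute, write) and that successive iterations start II cycles apart; the function name is illustrative and is not part of any Vivado HLS API.

```c
/* Cycles that elapse before the final write of a loop, under an idealized
   model: iteration k starts at cycle k * ii, and the write is performed in
   the last of `depth` single-cycle stages. */
int cycles_before_last_write(int trip_count, int ii, int depth) {
    return (trip_count - 1) * ii + (depth - 1);
}
```

For the three-iteration loop in the figure, cycles_before_last_write(3, 3, 3) models the sequential case and cycles_before_last_write(3, 1, 3) models the pipelined case.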
Tasks are pipelined using the PIPELINE directive. The initiation interval defaults to 1 if not
specified but may be explicitly specified.
Pipelining is applied to the specified task, not to the hierarchy below: all loops in the
hierarchy below are automatically unrolled. Any sub-functions in the hierarchy below the
specified task must be pipelined individually. If the sub-functions are pipelined, the
pipelined tasks above them can take advantage of the pipeline performance. Conversely, any
sub-function below the pipelined task that is not pipelined may be the limiting factor in the
performance of the pipeline.
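As an illustration of applying the PIPELINE directive to a loop, consider the following minimal sketch. The function name scale_samples, the loop label, and the trip count LEN are hypothetical.

```c
#define LEN 16  /* hypothetical trip count */

/* Multiplies every input sample by a gain; the loop is the pipelined task. */
void scale_samples(const int in[LEN], int out[LEN], int gain) {
    int i;
    Scale_Loop: for (i = 0; i < LEN; i++) {
/* Request that a new iteration start every clock cycle. */
#pragma HLS PIPELINE II=1
        out[i] = in[i] * gain;
    }
}
```

Because the pragma is placed inside the loop body, it applies to Scale_Loop only and not to the enclosing function.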
There is a difference in how pipelined functions and loops behave.
• In the case of functions, the pipeline runs forever and never ends.
• In the case of loops, the pipeline executes until all iterations of the loop are completed.
This difference in behavior is summarized in the following figure.
Figure 1-52: Function and Loop Pipelining Behavior
Panels: Pipelined Function (Execute Function, Execute Next, Execute Next); Pipelined Function I/O Accesses (continuous RD and WR); Pipelined Loop (Execute Loop, Execute Next Loop); Pipelined Loop I/O Accesses (RD and WR separated by a Bubble between loops).
An implication of the difference in behavior is the difference in how inputs and outputs
to the pipeline are processed. As seen in the figure above, a pipelined function will
continuously read new inputs and write new outputs. By contrast, because a loop must first
finish all operations in the loop before starting the next loop, a pipelined loop causes a
“bubble” in the data stream: a point when no new inputs are read as the loop completes the
execution of the final iterations, and a point when no new outputs are written as the loop
starts new loop iterations.
Rewinding Pipelined Loops for Performance
Loops which are the top-level loop in a function or are used in a region where the
DATAFLOW optimization is used can be made to continuously execute using the PIPELINE
directive with the rewind option.
The following figure shows the operation when the rewind option is used when pipelining
a loop. At the end of the loop iteration count, the loop immediately starts to re-execute.
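A minimal sketch of a rewound loop is shown here. It assumes the loop is the top-level (and only) loop in the function, as the text requires; the function name offset_samples and the size LEN are hypothetical.

```c
#define LEN 16  /* hypothetical trip count */

/* Adds a constant offset to every sample. The loop is the top-level loop
   of the function, so the rewind option can be requested. */
void offset_samples(const int in[LEN], int out[LEN], int offset) {
    int i;
    Offset_Loop: for (i = 0; i < LEN; i++) {
#pragma HLS PIPELINE II=1 rewind
        out[i] = in[i] + offset;
    }
}
```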
Figure 1-53
Loop:for(i=1;i…
… Solution > Solution Settings > General > Add > config_compile.
The pipeline_loops option sets the iteration limit. All loops with an iteration count below
this limit are automatically pipelined. The default is 0: no automatic loop pipelining is
performed.
Given the following example code:
for (y = 0; y < 480; y++) {
for (x = 0; x < 640; x++) {
for (i = 0; i < 5; i++) {
// do something 5 times
…
}
}
}
If the pipeline_loops option is set to 10 (a value above 5 but below 5*640), the
following pipelining is performed automatically:
for (y = 0; y < 480; y++) {
for (x = 0; x < 640; x++) {
#pragma HLS PIPELINE II=1
for (i = 0; i < 5; i++) {
// This loop will be automatically unrolled
// do something 5 times in parallel
…
}
}
}
If there are loops in the design that you do not want to be pipelined automatically, apply the
PIPELINE directive with the off option to those loops. The off option prevents automatic
loop pipelining.
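The following sketch shows one way to exclude a loop from automatic pipelining. It assumes the pragma form of the PIPELINE directive accepts the same off keyword as the directive described above; the function name sum_grid and the array dimensions are hypothetical.

```c
#define ROWS 4  /* hypothetical dimensions */
#define COLS 8

/* Sums all elements of a small grid. */
int sum_grid(int grid[ROWS][COLS]) {
    int r, c, total = 0;
    Row_Loop: for (r = 0; r < ROWS; r++) {
        Col_Loop: for (c = 0; c < COLS; c++) {
/* Assumption: "off" keeps this loop out of automatic loop pipelining
   even when its trip count is below the pipeline_loops limit. */
#pragma HLS PIPELINE off
            total += grid[r][c];
        }
    }
    return total;
}
```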
IMPORTANT: Vivado HLS applies the config_compile pipeline_loops option after performing
all user-specified directives. For example, if Vivado HLS applies a user-specified UNROLL directive to a
loop, the loop is first unrolled, and automatic loop pipelining cannot be applied.
Addressing Failure to Pipeline
When a task is pipelined, all loops in the hierarchy are automatically unrolled. This is a
requirement for pipelining to proceed. If a loop has variable bounds, it cannot be unrolled.
This will prevent the task from being pipelined. Refer to Variable Loop Bounds in Chapter 3
for techniques to remove such loops from the design.
Partitioning Arrays to Improve Pipelining
A common issue when pipelining tasks is the following message:
INFO: [SCHED 204-61] Pipelining loop 'SUM_LOOP'.
WARNING: [SCHED 204-69] Unable to schedule 'load' operation ('mem_load_2', bottleneck.c:62) on array 'mem' due to limited memory ports.
WARNING: [SCHED 204-69] The resource limit of core:RAM:mem:p0 is 1, current assignments:
WARNING: [SCHED 204-69]   'load' operation ('mem_load', bottleneck.c:62) on array 'mem',
WARNING: [SCHED 204-69] The resource limit of core:RAM:mem:p1 is 1, current assignments:
WARNING: [SCHED 204-69]   'load' operation ('mem_load_1', bottleneck.c:62) on array 'mem',
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 2, Depth: 3.
In this example, Vivado HLS states it cannot reach the specified initiation interval (II) of 1
because it cannot schedule a load (read) operation onto the memory because of limited
memory ports. The above message notes that the resource limit for "core:RAM:mem:p0
is 1", and that this port is already used by the 'mem_load' operation on line 62. The second
port of the block RAM also has a resource limit of 1 and is also in use. The tool reports a
final II of 2 instead of the desired 1.
This issue is typically caused by arrays. Arrays are implemented as block RAM which only
has a maximum of two data ports. This can limit the throughput of a read/write (or
load/store) intensive algorithm. The bandwidth can be improved by splitting the array (a
single block RAM resource) into multiple smaller arrays (multiple block RAMs), effectively
increasing the number of ports.
Arrays are partitioned using the ARRAY_PARTITION directive. Vivado HLS provides three
types of array partitioning, as shown in the following figure. The three styles of partitioning
are:
• block: The original array is split into equally sized blocks of consecutive elements of
the original array.
• cyclic: The original array is split into equally sized blocks interleaving the elements of
the original array.
• complete: The default operation is to split the array into its individual elements. This
corresponds to resolving a memory into registers.
Figure 1-55: Array Partitioning
Panels: block, cyclic, complete.
For block and cyclic partitioning the factor option specifies the number of arrays that are
created. In the preceding figure, a factor of 2 is used, that is, the array is divided into two
smaller arrays. If the number of elements in the array is not an integer multiple of the factor,
the final array has fewer elements.
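The element-to-partition mapping described above can be modeled in plain C. This is a sketch of the mapping as the text describes it, not of the tool's internal implementation; the function names are illustrative, and the block size is assumed to be the element count divided by the factor, rounded up.

```c
/* Partition that holds element i under block partitioning. Each block holds
   ceil(n / factor) consecutive elements, so the final partition is the short
   one when n is not an integer multiple of factor. */
int block_partition(int i, int n, int factor) {
    int block_size = (n + factor - 1) / factor;
    return i / block_size;
}

/* Partition that holds element i under cyclic partitioning: elements are
   interleaved across the partitions in round-robin order. */
int cyclic_partition(int i, int factor) {
    return i % factor;
}
```

With 8 elements and a factor of 2, block partitioning places elements 0 to 3 in the first array and 4 to 7 in the second, while cyclic partitioning places the even elements in the first array and the odd elements in the second.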
When partitioning multi-dimensional arrays, the dimension option is used to specify
which dimension is partitioned. The following figure shows how the dimension option is
used to partition the following example code:
void foo (...) {
int my_array[10][6][4];
...
}
The examples in the figure demonstrate how partitioning dimension 3 results in 4 separate
arrays and partitioning dimension 1 results in 10 separate arrays. If zero is specified as the
dimension, all dimensions are partitioned.
my_array[10][6][4], partition dimension 3:
  my_array_0[10][6], my_array_1[10][6], my_array_2[10][6], my_array_3[10][6]
my_array[10][6][4], partition dimension 1:
  my_array_0[6][4], my_array_1[6][4], my_array_2[6][4], my_array_3[6][4], my_array_4[6][4],
  my_array_5[6][4], my_array_6[6][4], my_array_7[6][4], my_array_8[6][4], my_array_9[6][4]
my_array[10][6][4], partition dimension 0:
  10x6x4 = 240 registers
Figure 1-56: Partitioning Array Dimensions
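A minimal sketch of partitioning one dimension of a two-dimensional array is shown here. The function name row_sums, the local buffer buf, and the dimensions are hypothetical; the intent is that completely partitioning dimension 2 allows every column of a row to be read in the same clock cycle.

```c
#define ROWS 4  /* hypothetical dimensions */
#define COLS 8

/* Computes the sum of each row of a small 2-D buffer. */
void row_sums(int in[ROWS][COLS], int out[ROWS]) {
    int buf[ROWS][COLS];
/* Split buf along its second dimension into COLS separate arrays. */
#pragma HLS ARRAY_PARTITION variable=buf complete dim=2
    int r, c;

    Copy_Loop: for (r = 0; r < ROWS; r++) {
        for (c = 0; c < COLS; c++) {
            buf[r][c] = in[r][c];
        }
    }

    Sum_Loop: for (r = 0; r < ROWS; r++) {
#pragma HLS PIPELINE II=1
        int acc = 0;
        /* This inner loop is unrolled because the enclosing loop is pipelined. */
        for (c = 0; c < COLS; c++) {
            acc += buf[r][c];
        }
        out[r] = acc;
    }
}
```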
Automatic Array Partitioning
The config_array_partition configuration determines how arrays are automatically
partitioned based on the number of elements. This configuration is accessed through the
menu Solution > Solution Settings > General > Add > config_array_partition.
The partition thresholds can be adjusted and partitioning can be fully automated with the
throughput_driven option. When the throughput_driven option is selected Vivado
HLS automatically partitions arrays to achieve the specified throughput.
Dependencies with Vivado HLS
Vivado HLS constructs a hardware datapath that corresponds to the C source code.
When there is no pipeline directive, the execution is sequential, so there are no dependencies
to take into account. When the design has been pipelined, the tool needs to deal with
the same dependencies as found in processor architectures for the hardware that Vivado
HLS generates.
Data dependencies or memory dependencies occur when a read or a write follows a
previous read or write.
• A read-after-write (RAW) is a true dependency when an instruction (and data it
reads/uses) depends on the result of a previous operation.
° I1: t = a * b;
° I2: c = t + 1;
The read in I2 depends on the write of t in I1. If the instructions are reordered, it uses the
previous value of t.
• A write-after-read (WAR) is an anti-dependence when an instruction cannot update a
register or memory (by a write) before a previous instruction has read the data.
° I1: b = t + a;
° I2: t = 3;
The write in I2 cannot execute before I1 otherwise the result of b is invalid: this is a
write-after-read dependence.
• A write-after-write (WAW) is a dependence when a register or memory must be written
in specific order otherwise other instructions might be corrupted.
° I1: t = a * b;
° I2: c = t + 1;
° I3: t = 1;
The write in I3 must happen after the write in I1. Otherwise, the I2 result is incorrect.
• A read-after-read has no dependency as instructions can be freely reordered.
For example, when a pipeline is generated, the tool needs to take care that a register or
memory location read at a later stage has not been modified by a previous write. This is a
true dependency or read-after-write (RAW) dependency. A specific example is:
int top(int a, int b) {
int t,c;
I1: t = a * b;
I2: c = t + 1;
return c;
}
Instruction I2 cannot start before instruction I1 has completed because there is a
dependency on variable t. In hardware, if the multiplication takes 3 clock cycles, then I2 is
delayed for that amount of time. It would be incorrect for Vivado HLS to generate hardware
that takes the previous value of t. If this datapath is pipelined, then the latency would be 3 but
the initiation interval (II) would be 1 as this is a strict feed-forward datapath.
Memory dependencies arise when the example applies to an array and not just variables.
int top(int a) {
  int r=1,rnext,m,i,out;
  static int mem[256];
  L1: for(i=0;i<=254;i++) {
#pragma HLS PIPELINE II=1
    I1: m = r * a , mem[i+1]=m;     // line 7
    I2: rnext = mem[i], r = rnext;  // line 8
  }
  return r;
}
In the above example, scheduling of loop L1 leads to a scheduling warning message:
WARNING: [SCHED 204-68] Unable to enforce a carried dependency constraint (II = 1, distance = 1)
  between 'store' operation (top.cpp:7) of variable 'm', top.cpp:7 on array 'mem'
  and 'load' operation ('rnext', top.cpp:8) on array 'mem'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 2, Depth: 3.
There are no issues within the same iteration of the loop as you write an index and read
another one. The two instructions could execute at the same time, concurrently. However,
observe the read and writes over a few iterations:
// Iteration for i=0
I1: m = r * a , mem[1]=m;       // line 7
I2: rnext = mem[0], r = rnext;  // line 8
// Iteration for i=1
I1: m = r * a , mem[2]=m;       // line 7
I2: rnext = mem[1], r = rnext;  // line 8
// Iteration for i=2
I1: m = r * a , mem[3]=m;       // line 7
I2: rnext = mem[2], r = rnext;  // line 8
When considering 2 successive iterations, the multiplication result m (with a latency = 2)
from I1 is written to a location that is read by I2 of the next iteration of the loop into
rnext. In this situation, there is a RAW true dependence as the next loop iteration cannot
start reading mem[i] before the previous computation's write completes.
Figure 1-57: Dependency Example
Note that if the clock frequency is increased, then the multiplier needs more pipeline stages
and increased latency. This will force II to increase as well.
int top(int a) {
  int r,m,i;
  static int mem[256];
  L1: for(i=0;i<=254;i++) {
#pragma HLS PIPELINE II=1
    I1: r = mem[i];                // line 7
    I2: m = r * a , mem[i+1]=m;    // line 8
  }
  return r;
}
In the above example, the operations are swapped, changing the functionality. The
scheduling warning is:
INFO: [SCHED 204-61] Pipelining loop 'L1'.
WARNING: [SCHED 204-68] Unable to enforce a carried dependency constraint (II = 1, distance = 1)
  between 'store' operation (top.cpp:8) of variable 'm', top.cpp:8 on array 'mem'
  and 'load' operation ('r', top.cpp:7) on array 'mem'.
WARNING: [SCHED 204-68] Unable to enforce a carried dependency constraint (II = 2, distance = 1)
  between 'store' operation (top.cpp:8) of variable 'm', top.cpp:8 on array 'mem'
  and 'load' operation ('r', top.cpp:7) on array 'mem'.
WARNING: [SCHED 204-68] Unable to enforce a carried dependency constraint (II = 3, distance = 1)
  between 'store' operation (top.cpp:8) of variable 'm', top.cpp:8 on array 'mem'
  and 'load' operation ('r', top.cpp:7) on array 'mem'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 4, Depth: 4.
However, observe the continued read and writes over a few iterations:
// Iteration with i=0
I1: r = mem[0];                // line 7
I2: m = r * a , mem[1]=m;      // line 8
// Iteration with i=1
I1: r = mem[1];                // line 7
I2: m = r * a , mem[2]=m;      // line 8
// Iteration with i=2
I1: r = mem[2];                // line 7
I2: m = r * a , mem[3]=m;      // line 8
The longer II is needed because the RAW dependence spans the whole chain: reading r from
mem[i], performing the multiplication, and writing to mem[i+1], which the next iteration
then reads.
Removing False Dependencies to Improve Loop Pipelining
Loop pipelining can be prevented by loop-carried dependencies. Under certain complex
scenarios, automatic dependence analysis can be too conservative and fail to filter out false
dependencies.
In this example, Vivado HLS does not have any knowledge about the value of cols and
conservatively assumes that there is always a dependence between the write to
buff_A[1][col] and the read from buff_A[1][col].
void foo(int rows, int cols, ...)
for (row = 0; row < rows + 1; row++) {
for (col = 0; col < cols + 1; col++) {
#pragma HLS PIPELINE II=1
if (col < cols) {
buff_A[2][col] = buff_A[1][col]; // read from buff_A[1][col]
buff_A[1][col] = buff_A[0][col]; // write to buff_A[1][col]
buff_B[1][col] = buff_B[0][col];
temp = buff_A[0][col];
}
The issue is highlighted in the following figure. If cols=0, the next iteration of the rows
loop starts immediately, and the read from buff_A[0][cols] cannot happen at the same
time as the write.
Figure 1-58: Partitioning Array Dimensions
(Timing of the buff accesses by row and column when cols is zero: Read, Write, etc.)
In an algorithm such as this, it is unlikely cols will ever be zero but Vivado HLS cannot
make assumptions about data dependencies. To overcome this deficiency, you can use the
DEPENDENCE directive to provide Vivado HLS with additional information about the
dependencies. In this case, state there is no dependence between loop iterations (in this
case, for both buff_A and buff_B).
void foo(int rows, int cols, ...)
for (row = 0; row < rows + 1; row++) {
for (col = 0; col < cols + 1; col++) {
#pragma HLS PIPELINE II=1
#pragma HLS dependence variable=buff_A inter false
#pragma HLS dependence variable=buff_B inter false
if (col < cols) {
buff_A[2][col] = buff_A[1][col]; // read from buff_A[1][col]
buff_A[1][col] = buff_A[0][col]; // write to buff_A[1][col]
buff_B[1][col] = buff_B[0][col];
temp = buff_A[0][col];
}
Note: Specifying a false dependency, when in fact the dependency is not false, can result in
incorrect hardware. Be sure dependencies are correct (true or false) before specifying them.
When specifying dependencies there are two main types:
• Inter: Specifies the dependency is between different iterations of the same loop.
If this is specified as false, Vivado HLS can perform operations in parallel when the
loop is pipelined, unrolled, or partially unrolled. When specified as true, such
concurrent operation is prevented.
• Intra: Specifies dependence within the same iteration of a loop, for example an array
being accessed at the start and end of the same iteration.
When intra dependencies are specified as false, Vivado HLS may move operations freely
within the loop, increasing their mobility and potentially improving performance or
area. When the dependency is specified as true, the operations must be performed in
the order specified.
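The following sketch shows a case where declaring an inter dependence as false is safe only because of a caller contract that the tool cannot see. The function name shift_block, the sizes, and the contract on offset are all assumptions made for this example.

```c
#define LEN 32       /* hypothetical number of elements processed */
#define BUF_LEN 128  /* hypothetical buffer size */

/* Writes buf[i] + 1 into buf[i + offset] for the first LEN elements.
   ASSUMPTION (caller contract): LEN <= offset <= BUF_LEN - LEN, so the
   region that is read and the region that is written never overlap. */
void shift_block(int buf[BUF_LEN], int offset) {
    int i;
    Shift_Loop: for (i = 0; i < LEN; i++) {
#pragma HLS PIPELINE II=1
/* Only valid under the contract above: no iteration reads a location
   written by another iteration. */
#pragma HLS DEPENDENCE variable=buf inter false
        buf[i + offset] = buf[i] + 1;
    }
}
```

If the contract is violated (for example, offset=1), a later iteration reads a value written by an earlier one and the directive would describe the code incorrectly.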
Data dependencies are a much harder issue to resolve and often require changes to the
source code. A scalar data dependency could look like the following:
while (a != b) {
if (a > b) a -= b;
else b -= a;
}
The next iteration of this loop cannot start until the current iteration has calculated the
updated values of a and b, as shown in the following figure.
Figure 1-59: Scalar Dependency
If the result of the previous loop iteration must be available before the current iteration can
begin, loop pipelining is not possible. If Vivado HLS cannot pipeline with the specified
initiation interval, it increases the initiation interval. If it cannot pipeline at all, as shown by
the above example, it halts pipelining and proceeds to output a non-pipelined design.
Optimal Loop Unrolling to Improve Pipelining
By default, loops are kept rolled in Vivado HLS. That is to say that the loops are treated as a
single entity: all operations in the loop are implemented using the same hardware resources
for each iteration of the loop.
Vivado HLS provides the ability to unroll or partially unroll for-loops using the UNROLL
directive.
The following figure shows both the powerful advantages of loop unrolling and the
implications that must be considered when unrolling loops. This example assumes the
arrays a[i], b[i] and c[i] are mapped to block RAMs. This example shows how easy
it is to create many different implementations by the simple application of loop unrolling.
void top(...) {
  ...
  for_mult:for (i=3;i>=0;i--) {
    a[i] = b[i] * c[i];
  }
  ...
}
Figure 1-60: Loop Unrolling Details
Panels: Rolled Loop, Partially Unrolled Loop, Unrolled Loop. Each panel shows the schedule of the Read b[], Read c[], and Write a[] operations.
• Rolled Loop: When the loop is rolled, each iteration is performed in a separate clock
cycle. This implementation takes four clock cycles, only requires one multiplier and
each block RAM can be a single-port block RAM.
• Partially Unrolled Loop: In this example, the loop is partially unrolled by a factor of 2.
This implementation requires two multipliers and dual-port RAMs to support two reads
or writes to each RAM in the same clock cycle. This implementation does however only
take 2 clock cycles to complete: half the initiation interval and half the latency of the
rolled loop version.
• Unrolled Loop: In the fully unrolled version, all loop operations can be performed in a
single clock cycle. This implementation however requires four multipliers. More
importantly, this implementation requires the ability to perform 4 reads and 4 write
operations in the same clock cycle. Because a block RAM only has a maximum of two
ports, this implementation requires the arrays be partitioned.
To perform loop unrolling, you can apply the UNROLL directives to individual loops in the
design. Alternatively, you can apply the UNROLL directive to a function, which unrolls all
loops within the scope of the function.
If a loop is completely unrolled, all operations are performed in parallel if data
dependencies allow. If operations in one iteration of the loop require the result from a
previous iteration, they cannot execute in parallel but will execute as soon as the data is
available. A completely unrolled loop results in multiple copies of the logic in the loop
body.
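A partial unroll of a simple multiply loop might be written as in the following sketch. The function name vec_mult and the size LEN are hypothetical; with a factor of 2, the loop body is expected to be duplicated so that two elements are processed per iteration.

```c
#define LEN 8  /* hypothetical trip count, a multiple of the unroll factor */

/* Element-wise multiply of two small vectors. */
void vec_mult(const int b[LEN], const int c[LEN], int a[LEN]) {
    int i;
    Mult_Loop: for (i = 0; i < LEN; i++) {
/* Duplicate the loop body twice; omit "factor" to request a full unroll. */
#pragma HLS UNROLL factor=2
        a[i] = b[i] * c[i];
    }
}
```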
The following example code demonstrates how loop unrolling can be used to create an
optimal design. In this example, the data is stored in the arrays as interleaved channels. If
the loop is pipelined with II=1, each channel is only read and written every 8th clock cycle.
// Array Order :  0  1  2  3  4  5  6  7  8     9     10    etc. 16       etc...
// Sample Order:  A0 B0 C0 D0 E0 F0 G0 H0 A1    B1    C1    etc. A2       etc...
// Output Order:  A0 B0 C0 D0 E0 F0 G0 H0 A0+A1 B0+B1 C0+C1 etc. A0+A1+A2 etc...
#define CHANNELS 8
#define SAMPLES 400
#define N (CHANNELS * SAMPLES)
void foo (dout_t d_o[N], din_t d_i[N]) {
  int i, rem;
  // Store accumulated data
  static dacc_t acc[CHANNELS];
  // Accumulate each channel
  For_Loop: for (i=0;i…
…>= N) break;
a[i+1] = b[i+1] + c[i+1];
}
Because N is a variable, Vivado HLS may not be able to determine its maximum value (it
could be driven from an input port). If you know the unrolling factor, 2 in this case, is an
integer factor of the maximum iteration count N, the skip_exit_check option removes
the exit check and associated logic. The effect of unrolling can now be represented as:
for(int i = 0; i < N; i += 2) {
a[i] = b[i] + c[i];
a[i+1] = b[i+1] + c[i+1];
}
This helps minimize the area and simplify the control logic.
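The equivalence that skip_exit_check relies on can be checked in plain C. In this sketch, add_rolled carries the pragma and add_unrolled_no_check is a hand-written model of a factor-2 unroll with the exit check removed; both function names are hypothetical, and the model is only valid when n is even.

```c
/* Rolled loop with the directive applied. The caller is assumed to
   guarantee that n is a multiple of the unroll factor. */
void add_rolled(int n, const int *b, const int *c, int *a) {
    int i;
    for (i = 0; i < n; i++) {
#pragma HLS UNROLL factor=2 skip_exit_check
        a[i] = b[i] + c[i];
    }
}

/* Hand-written model of the unrolled loop without an exit check.
   If n were odd, the second statement would access one element too far. */
void add_unrolled_no_check(int n, const int *b, const int *c, int *a) {
    int i;
    for (i = 0; i < n; i += 2) {
        a[i] = b[i] + c[i];
        a[i + 1] = b[i + 1] + c[i + 1];
    }
}
```

For any even n, both functions produce the same output, which is why the exit check and its logic can be dropped in that case.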
Task Level Pipelining: Dataflow Optimization
The DATAFLOW optimization starts with a series of sequential tasks (functions, loops, or
both) as shown in the following figure.
Figure 1-61: Sequential Functional Description
(TOP contains function_ ... function_N in sequence; each function has in and out ports, connected through tmp variables.)
Using this series of sequential tasks, DATAFLOW optimization creates a parallel process
architecture as shown in the following figure. Dataflow optimization is a powerful method
for improving design throughput.
Figure 1-62: Parallel Process Architecture
(TOP contains Interface, Process_ ... Process_N connected by Channels, Interface.)
The channels shown in the preceding figure ensure a task is not required to wait until the
previous task has completed all operations before it can begin. The following figure shows
how DATAFLOW optimization allows the execution of tasks to overlap, increasing the overall
throughput of the design and reducing latency.
In the example without dataflow pipelining (A) in the following figure, the implementation
requires 8 cycles before a new input can be processed by func_A and 8 cycles before an
output is written by func_C.
In the example with dataflow pipelining (B) in the following figure, func_A can begin
processing a new input every 3 clock cycles (lower initiation interval) and it now only
requires 5 clocks to output a final value (shorter latency).
void top (a,b,c,d) {
  ...
  func_A(a,b,i1);
  func_B(c,i1,i2);
  func_C(i2,d);
  return d;
}
Figure 1-63: Dataflow Optimization
(A) Without Dataflow Pipelining: func_A, func_B, and func_C execute one after another.
(B) With Dataflow Pipelining: the executions of func_A, func_B, and func_C overlap.
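A compilable version of this three-stage structure might look like the following sketch. The stage functions, their arithmetic, and the size LEN are hypothetical; what matters is that each intermediate array is written by exactly one task and read by exactly one task.

```c
#define LEN 8  /* hypothetical number of samples */

static void stage_a(const int in[LEN], int gain, int out[LEN]) {
    int i;
    for (i = 0; i < LEN; i++) out[i] = in[i] * gain;
}

static void stage_b(const int in[LEN], int bias, int out[LEN]) {
    int i;
    for (i = 0; i < LEN; i++) out[i] = in[i] + bias;
}

static void stage_c(const int in[LEN], int out[LEN]) {
    int i;
    for (i = 0; i < LEN; i++) out[i] = in[i] * 2;
}

/* Top-level function: data flows a -> i1 -> i2 -> d, one task to the next. */
void top(const int a[LEN], int gain, int bias, int d[LEN]) {
#pragma HLS DATAFLOW
    int i1[LEN];
    int i2[LEN];
    stage_a(a, gain, i1);
    stage_b(i1, bias, i2);
    stage_c(i2, d);
}
```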
Dataflow Optimization Limitations
For the DATAFLOW optimization to work, the data must flow through the design from one
task to the next. The following coding styles prevent Vivado HLS from performing the
DATAFLOW optimization:
• Single-producer-consumer violations
• Bypassing tasks
• Feedback between tasks
• Conditional execution of tasks
• Loops with multiple exit conditions
IMPORTANT: If any of these coding styles are present, Vivado HLS issues a message and does not
perform DATAFLOW optimization.
Note: The dataflow viewer in the Analysis Perspective may be used to view the structure when the
DATAFLOW directive is applied. Refer to Analysis Perspective for more details.
For Vivado HLS to perform the DATAFLOW optimization, all elements passed between tasks
must follow a single-producer-consumer model. Each variable must be driven from a single
task and only be consumed by a single task. In the following code example, temp1 fans out
and is consumed by both Loop2 and Loop3. This violates the single-producer-consumer
model.
void foo(int data_in[N], int scale, int data_out1[N], int data_out2[N]) {
  int temp1[N];
  Loop1: for(int i = 0; i < N; i++) {
    temp1[i] = data_in[i] * scale;
  }
  Loop2: for(int j = 0; j < N; j++) {
    data_out1[j] = temp1[j] * 123;
  }
  Loop3: for(int k = 0; k < N; k++) {
    data_out2[k] = temp1[k] * 456;
  }
}
A modified version of this code uses function Split to create a single-producer-consumer
design. In this case, data flows from Loop1 to function Split and then to Loop2 and
Loop3. The data now flows between all four tasks, and Vivado HLS can perform the
DATAFLOW optimization.
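One possible shape for such a design is sketched below. This is an independent illustration under stated assumptions, not the guide's own listing: the names split and foo_split, the size N=8, and the placement of the DATAFLOW pragma are all chosen for this example.

```c
#define N 8  /* hypothetical array size */

/* Duplicates the input so that each copy has exactly one consumer. */
static void split(const int in[N], int out1[N], int out2[N]) {
    int i;
    Split_Loop: for (i = 0; i < N; i++) {
        out1[i] = in[i];
        out2[i] = in[i];
    }
}

void foo_split(const int data_in[N], int scale,
               int data_out1[N], int data_out2[N]) {
#pragma HLS DATAFLOW
    int temp1[N], temp2[N], temp3[N];
    int i, j, k;

    Loop1: for (i = 0; i < N; i++) {
        temp1[i] = data_in[i] * scale;   /* single producer of temp1 */
    }
    split(temp1, temp2, temp3);          /* single consumer of temp1 */
    Loop2: for (j = 0; j < N; j++) {
        data_out1[j] = temp2[j] * 123;   /* single consumer of temp2 */
    }
    Loop3: for (k = 0; k < N; k++) {
        data_out2[k] = temp3[k] * 456;   /* single consumer of temp3 */
    }
}
```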
void Split (in[N], out1[N], out2[N]) {
// Duplicated data
L1:for(int i=1;i…
…> scale;
}
Loop2: for(int j = 0; j < N; j++) {
  temp3[j] = temp1[j] + 123;
}
Loop3: for(int k = 0; k < N; k++) {
  data_out[k] = temp2[k] + temp3[k];
}
}
Because the loop iteration limits are all the same in this example, you can modify the code
so that Loop2 consumes temp2 and produces temp4 as follows. This ensures that the data
flows from one task to the next.
void foo(int data_in[N], int scale, int data_out[N]) {
  int temp1[N], temp2[N], temp3[N], temp4[N];
  Loop1: for(int i = 0; i < N; i++) {
    temp1[i] = data_in[i] * scale;
    temp2[i] = data_in[i] >> scale;
  }
  Loop2: for(int j = 0; j < N; j++) {
    temp3[j] = temp1[j] + 123;
    temp4[j] = temp2[j];
  }
  Loop3: for(int k = 0; k < N; k++) {
    data_out[k] = temp4[k] + temp3[k];
  }
}
Feedback occurs when the output from a task is consumed by a previous task in the
DATAFLOW region. Feedback between tasks is not permitted in a DATAFLOW region. When
Vivado HLS detects feedback, it issues a warning and does not perform the DATAFLOW
optimization.
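The following sketch illustrates what feedback between tasks can look like. It is a hypothetical example, not taken from the tool: the function name feedback_example and the static array fb are assumptions introduced for illustration. Loop2 writes fb, and Loop1 reads fb on the next call, so the output of a later task is consumed by an earlier task.

```c
#define N 8

/* Hypothetical example: Loop2 (the later task) writes fb, and Loop1
 * (the earlier task) reads fb on the next invocation. This is feedback,
 * so these loops are not suitable for a DATAFLOW region as written. */
void feedback_example(const int data_in[N], int data_out[N]) {
    static int fb[N];   /* state carried from Loop2 back to Loop1 */
    int temp[N];

    Loop1: for (int i = 0; i < N; i++) {
        temp[i] = data_in[i] + fb[i];    /* consumes the output of Loop2 */
    }
    Loop2: for (int j = 0; j < N; j++) {
        data_out[j] = temp[j] * 2;
        fb[j] = data_out[j];             /* produces the value Loop1 reads */
    }
}
```

One way to remove this kind of feedback, under the same assumptions, is to keep the fed-back state inside a single task so that data only moves forward between tasks.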
The DATAFLOW optimization does not optimize tasks that are conditionally executed. The
following example highlights this limitation. In this example, the conditional execution of
Loop1 and Loop2 prevents Vivado HLS from optimizing the data flow between these loops,
because the data does not flow from one loop into the next.
void foo(int data_in[N], int data_out[N], int sel) {
int temp1[N], temp2[N];
if (sel) {
Loop1: for(int i = 0; i < N; i++) {
temp1[i] = data_in[i] * 123;
temp2[i] = data_in[i];
}
} else {
Loop2: for(int j = 0; j < N; j++) {
temp1[j] = data_in[j] * 321;
temp2[j] = data_in[j];
}
}
Loop3: for(int k = 0; k < N; k++) {
data_out[k] = temp1[k] * temp2[k];
}
}
To ensure each loop is executed in all cases, you must transform the code as shown in the
following example. In this example, the conditional statement is moved into the first loop.
Both loops are always executed, and data always flows from one loop to the next.
void foo(int data_in[N], int data_out[N], int sel) {
int temp1[N], temp2[N];
Loop1: for(int i = 0; i < N; i++) {
if (sel) {
temp1[i] = data_in[i] * 123;
} else {
temp1[i] = data_in[i] * 321;
}
}
Loop2: for(int j = 0; j < N; j++) {
temp2[j] = data_in[j];
}
Loop3: for(int k = 0; k < N; k++) {
data_out[k] = temp1[k] * temp2[k];
}
}
Loops with multiple exit points cannot be used in a DATAFLOW region. In the following
example, Loop2 has three exit conditions:
• An exit defined by the value of N; the loop will exit when k>=N.
• An exit defined by the break statement.
• An exit defined by the continue statement.
#include "ap_cint.h"
#define N 16
typedef int8 din_t;
typedef int15 dout_t;
typedef uint8 dsc_t;
typedef uint1 dsel_t;
void multi_exit(din_t data_in[N], dsc_t scale, dsel_t select, dout_t data_out[N]) {
dout_t temp1[N], temp2[N];
int i,k;
Loop1: for(i = 0; i < N; i++) {
temp1[i] = data_in[i] * scale;
temp2[i] = data_in[i] >> scale;
}
Loop2: for(k = 0; k < N; k++) {
switch(select) {
case 0: data_out[k] = temp1[k] + temp2[k];
case 1: continue;
default: break;
}
}
}
Because a loop’s exit condition is always defined by the loop bounds, the use of break or
continue statements will prohibit the loop being used in a DATAFLOW region.
Finally, the DATAFLOW optimization has no hierarchical implementation. If a sub-function or
loop contains additional tasks that might benefit from the DATAFLOW optimization, you
must apply the DATAFLOW optimization to the loop, the sub-function, or inline the
sub-function.
Configuring Dataflow Memory Channels
Vivado HLS implements channels between the tasks as either ping-pong or FIFO buffers,
depending on the access patterns of the producer and the consumer of the data:
• For scalar, pointer, and reference parameters as well as the function return, Vivado HLS
implements the channel as a FIFO.
Note: For scalar values, the maximum channel size is one, that is, only one value is passed from
one function to another.
• If the parameter (producer or consumer) is an array, Vivado HLS implements the
channel as a ping-pong buffer or a FIFO as follows:
° If Vivado HLS determines the data is accessed in sequential order, Vivado HLS
implements the memory channel as a FIFO channel of depth 1.
° If Vivado HLS is unable to determine that the data is accessed in sequential order or
determines the data is accessed in an arbitrary manner, Vivado HLS implements the
memory channel as a ping-pong buffer, that is, as two block RAMs each defined by
the maximum size of the consumer or producer array.
Note: A ping-pong buffer ensures that the channel always has the capacity to hold all
samples without a loss. However, this might be an overly conservative approach in some
cases. For example, if tasks are pipelined with an interval of 1 and use data in a streaming,
sequential manner but Vivado HLS is unable to automatically determine the sequential data
usage, Vivado HLS implements a ping-pong buffer. In this case, the channel only requires a
single register and not two block RAMs defined by the size of the array.
To explicitly specify the default channel used between tasks, use the config_dataflow
configuration. This configuration sets the default channel for all channels in a design. To
reduce the size of the memory used in the channel, you can use a FIFO. To explicitly set the
depth or number of elements in the FIFO, use the fifo_depth option.
Specifying the size of the FIFO channels overrides the default safe approach. If any task in
the design can produce or consume samples at a greater rate than the specified size of the
FIFO, the FIFOs might become empty (or full). In this case, the design halts operation,
because it is unable to read (or write). This might result in a stalled, unrecoverable state.
Note: This issue only appears when executing C/RTL co-simulation or when the block is used in a
complete system.
When setting the depth of the FIFOs, it is recommended that you use FIFOs with the default
depth, confirm the design passes C/RTL co-simulation, and then reduce the size of the
FIFOs and confirm C/RTL co-simulation still completes without issues. If RTL co-simulation
fails, the size of the FIFO is likely too small to prevent stalling.
Specifying Arrays as Block RAM or FIFOs
By default all arrays are implemented as block RAM elements, unless complete partitioning
reduces them to individual registers. To use a FIFO instead of a block RAM, the array must
be specified as streaming using the STREAM directive.
The following arrays are automatically specified as streaming:
• If an array on the top-level function interface is set as interface type ap_fifo, axis, or
ap_hs, it is automatically set as streaming.
• The arrays used in a region where the DATAFLOW optimization is applied are
automatically set to streaming if Vivado HLS determines the data is streaming between
the tasks or if the config_dataflow configuration sets the default memory channel
as FIFO.
All other arrays must be specified as streaming using the STREAM directive if a FIFO is
required for the implementation.
Note: When the STREAM directive is applied to an array, the resulting FIFO implemented in the
hardware contains as many elements as the array. The -depth option can be used to specify the size
of the FIFO.
The STREAM directive is also used to change any arrays in a DATAFLOW region from the
default implementation specified by the config_dataflow configuration.
• If the config_dataflow default_channel is set as ping-pong, any array can be
implemented as a FIFO by applying the STREAM directive to the array.
Note: To use a FIFO implementation, the array must be accessed in a streaming manner.
• If the config_dataflow default_channel is set to FIFO or Vivado HLS has
automatically determined the data in a DATAFLOW region is accessed in a streaming
manner, any array can be implemented as a ping-pong implementation by applying the
STREAM directive to the array with the off option.
When an array in a DATAFLOW region is specified as streaming and implemented as a FIFO,
the FIFO is typically not required to hold the same number of elements as the original array.
The tasks in a DATAFLOW region consume each data sample as soon as it becomes
available. The config_dataflow command with the -fifo_depth option or the
STREAM directive with the -depth option can be used to reduce the size of the FIFO to the
minimum number of elements required to ensure the flow of data never stalls.
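As a minimal sketch of these options, the following hypothetical function applies the STREAM directive to an intermediate array in a DATAFLOW region. The names stream_example and temp and the depth value of 4 are assumptions chosen for illustration. Both loops access temp in sequential order, which is the access pattern a FIFO requires. The depth shown is not a recommendation; the depth a real design needs should be confirmed with C/RTL co-simulation.

```c
#define N 16

/* Hypothetical DATAFLOW region with one producer and one consumer.
 * A standard C compiler ignores the HLS pragmas, so the function can be
 * checked in C simulation before synthesis. */
void stream_example(const int data_in[N], int data_out[N]) {
#pragma HLS DATAFLOW
    int temp[N];
    /* Implement temp as a FIFO; depth=4 is an illustrative value only. */
#pragma HLS STREAM variable=temp depth=4

    Producer: for (int i = 0; i < N; i++) {
        temp[i] = data_in[i] * 3;      /* written once, in order */
    }
    Consumer: for (int j = 0; j < N; j++) {
        data_out[j] = temp[j] + 1;     /* read once, in the same order */
    }
}
```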
Optimizing for Latency
Using Latency Constraints
Vivado HLS supports the use of a latency constraint on any scope. Latency constraints are
specified using the LATENCY directive.
When a maximum and/or minimum LATENCY constraint is placed on a scope, Vivado HLS
tries to ensure all operations in the function complete within the range of clock cycles
specified.
The latency directive applied to a loop specifies the required latency for a single iteration of
the loop: it specifies the latency for the loop body, as the following example shows:
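Loop_A: for (i=0; ...) {
...
}
Rolled loops that execute one after another each add to the overall latency. The following
example contains two consecutive loops, Add and Sub:
Add: for (i=3;i>=0;i--) {
if (d[i])
a[i] = b[i] + c[i];
}
Sub: for (i=3;i>=0;i--) {
if (!d[i])
a[i] = b[i] - c[i];
}
...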
Figure 1-64: Loop Directives, showing the cycle counts (A) by default and (B) with loop merging
In the preceding figure, (A) shows how, by default, each rolled loop in the design creates at
least one state in the FSM. Moving between those states costs clock cycles: assuming each
loop iteration requires one clock cycle, it takes a total of 11 cycles to execute both loops:
• 1 clock cycle to enter the ADD loop.
• 4 clock cycles to execute the ADD loop.
• 1 clock cycle to exit ADD and enter SUB.
• 4 clock cycles to execute the SUB loop.
• 1 clock cycle to exit the SUB loop.
• For a total of 11 clock cycles.
In this simple example it is obvious that an else branch in the ADD loop would also solve the
issue but in a more complex example it may be less obvious and the more intuitive coding
style may have greater advantages.
The LOOP_MERGE optimization directive is used to automatically merge loops. The
LOOP_MERGE directive seeks to merge all loops within the scope in which it is placed. In the
above example, merging the loops creates a control structure similar to that shown in (B) in
the preceding figure, which requires only 6 clock cycles to complete.
Merging loops allows the logic within the loops to be optimized together. In the example
above, using a dual-port block RAM allows the add and subtraction operations to be
performed in parallel.
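The effect of merging can be sketched in C as follows. This is an illustrative assumption about the resulting structure, not the code Vivado HLS generates. The function add_sub applies the LOOP_MERGE directive to the function scope, and add_sub_merged is a hypothetical hand-merged equivalent that performs both operations in one loop. The names and the LEN macro are introduced for this sketch only.

```c
#define LEN 4

/* Two consecutive loops; the directive asks the tool to merge them. */
void add_sub(int a[LEN], const int b[LEN], const int c[LEN], const int d[LEN]) {
#pragma HLS LOOP_MERGE
    Add: for (int i = LEN - 1; i >= 0; i--) {
        if (d[i])
            a[i] = b[i] + c[i];
    }
    Sub: for (int i = LEN - 1; i >= 0; i--) {
        if (!d[i])
            a[i] = b[i] - c[i];
    }
}

/* Hypothetical hand-written equivalent of the merged loop. */
void add_sub_merged(int a[LEN], const int b[LEN], const int c[LEN], const int d[LEN]) {
    Merged: for (int i = LEN - 1; i >= 0; i--) {
        if (d[i])
            a[i] = b[i] + c[i];
        else
            a[i] = b[i] - c[i];
    }
}
```

Both functions compute the same result for every element, which is what makes the merge legal in this case.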
Currently, loop merging in Vivado HLS has the following restrictions:
• If loop bounds are all variables, they must have the same value.
• If loop bounds are constants, the maximum constant value is used as the bound of the
merged loop.
• Loops with both variable bound and constant bound cannot be merged.
• The code between loops to be merged cannot have side effects: multiple execution of
this code should generate the same results (a=b is allowed, a=a+1 is not).
• Loops cannot be merged when they contain FIFO accesses: merging would change the
order of the reads and writes from a FIFO, and these must always occur in sequence.
Flattening Nested Loops to Improve Latency
In a similar manner to the consecutive loops discussed in the previous section, it requires
additional clock cycles to move between rolled nested loops. It requires one clock cycle to
move from an outer loop to an inner loop and from an inner loop to an outer loop.
In the small example shown here, this implies 200 extra clock cycles to execute loop Outer.
void foo_top (a, b, c, d) {
...
Outer: while(j<100) {
Inner: while(i<6) { // 1 cycle to enter inner
...
LOOP_BODY
...
} // 1 cycle to exit inner
}
...
}
Vivado HLS provides the set_directive_loop_flatten command to allow labeled perfect and
semi-perfect nested loops to be flattened, removing the need to re-code for optimal
hardware performance and reducing the number of cycles it takes to perform the
operations in the loop.
• Perfect loop nest: only the innermost loop has loop body content, there is no logic
specified between the loop statements and all the loop bounds are constant.
• Semi-perfect loop nest: only the innermost loop has loop body content, there is no
logic specified between the loop statements but the outermost loop bound can be a
variable.
For imperfect loop nests, where the inner loop has variable bounds or the loop body is not
exclusively inside the inner loop, designers should try to restructure the code, or unroll the
loops in the loop body to create a perfect loop nest.
When the directive is applied to a set of nested loops, it should be applied to the innermost
loop that contains the loop body.
set_directive_loop_flatten foo_top/Inner
Loop flattening can also be performed using the directive tab in the GUI, either by applying
it to individual loops or applying it to all loops in a function by applying the directive at the
function level.
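As a sketch, the pragma form of the directive can be placed inside the innermost loop of a perfect loop nest. The function name sum_matrix and the ROWS and COLS dimensions are assumptions introduced for illustration. Both loop bounds are constants and only the inner loop has a body, so the nest meets the definition of a perfect loop nest given above.

```c
#define ROWS 4
#define COLS 6

/* Perfect loop nest: constant bounds, no logic between the two loop
 * statements, and only the innermost loop has a body. */
void sum_matrix(int in[ROWS][COLS], int *total) {
    int acc = 0;
    Outer: for (int j = 0; j < ROWS; j++) {
        Inner: for (int i = 0; i < COLS; i++) {
#pragma HLS LOOP_FLATTEN
            acc += in[j][i];
        }
    }
    *total = acc;
}
```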
Optimizing for Area
Data Types and Bit-Widths
The bit-widths of the variables in the C function directly impact the size of the storage
elements and operators used in the RTL implementation. If a variable requires only 12 bits
but is specified as an integer type (32-bit), the result is larger and slower 32-bit operators,
which reduces the number of operations that can be performed in a clock cycle and
potentially increases initiation interval and latency.
• Use the appropriate precision for the data types. Refer to Data Types for Efficient
Hardware.
• Confirm the size of any arrays that are to be implemented as RAMs or registers.
Over-sized elements waste hardware resources.
• Pay special attention to multiplications, divisions, modulus or other complex arithmetic
operations. If these variables are larger than they need to be, they negatively impact
both area and performance.
Function Inlining
Function inlining removes the function hierarchy. A function is inlined using the INLINE
directive. Inlining a function may improve area by allowing the components within the
function to be better shared or optimized with the logic in the calling function. This type of
function inlining is also performed automatically by Vivado HLS. Small functions are
automatically inlined.
Inlining allows function sharing to be better controlled. For functions to be shared, they
must be used within the same level of hierarchy. In this code example, function foo_top
calls foo twice and also calls function foo_sub, which calls foo a third time.
foo_sub (p, q) {
int q1 = q + 10;
foo(p,q1);// foo_3
...
}
void foo_top (a, b, c, d) {
...
foo(a,b);//foo_1
foo(a,c);//foo_2
foo_sub(a,d);
...
}
Inlining function foo_sub and using the ALLOCATION directive to limit the design to one
instance of function foo results in a design that has only one instance of function foo:
one-third the area of the example above.
foo_sub (p, q) {
#pragma HLS INLINE
int q1 = q + 10;
foo(p,q1);// foo_3
...
}
void foo_top (a, b, c, d) {
#pragma HLS ALLOCATION instances=foo limit=1 function
...
foo(a,b);//foo_1
foo(a,c);//foo_2
foo_sub(a,d);
...
}
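A compilable variant of this structure might look as follows. The argument types, return values, and the body of foo are placeholders assumed for illustration; only the placement of the INLINE and ALLOCATION pragmas follows the example above.

```c
/* Placeholder body; a real foo would contain the logic to be shared. */
int foo(int p, int q) {
    return p * q + 1;
}

int foo_sub(int p, int q) {
#pragma HLS INLINE
    int q1 = q + 10;
    return foo(p, q1);              /* foo_3 */
}

int foo_top(int a, int b, int c, int d) {
#pragma HLS ALLOCATION instances=foo limit=1 function
    int r1 = foo(a, b);             /* foo_1 */
    int r2 = foo(a, c);             /* foo_2 */
    int r3 = foo_sub(a, d);         /* reaches foo at this level once inlined */
    return r1 + r2 + r3;
}
```

Because a standard C compiler ignores the pragmas, the same source can be used to check functional behavior in C simulation before and after the directives are added.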
The INLINE directive optionally allows all functions below the specified function to be
recursively inlined by using the recursive option. If the recursive option is used on the
top-level function, all function hierarchy in the design is removed.
The INLINE off option can optionally be applied to functions to prevent them being
inlined. This option may be used to prevent Vivado HLS from automatically inlining a
function.
The INLINE directive can substantially modify the structure of the code without any
modification to the source code, which makes it a powerful method for architectural
exploration.
Mapping Many Arrays into One Large Array
When there are many small arrays in the C Code, mapping them into a single larger array
typically reduces the number of block RAM required.
Each array is mapped into a block RAM or UltraRAM, when supported by the device. The
basic block RAM unit provided in an FPGA is 18K. If many small arrays do not use the full 18K,
a better use of the block RAM resources is to map many of the small arrays into a larger array.
If a block RAM is larger than 18K, it is automatically mapped into multiple 18K units. In
the synthesis report, review Utilization Report > Details > Memory for a complete
understanding of the block RAMs in your design.
The ARRAY_MAP directive supports two ways of mapping small arrays into a larger one:
• Horizontal mapping: this corresponds to creating a new array by concatenating the
original arrays. Physically, this gets implemented as a single array with more elements.
• Vertical mapping: this corresponds to creating a new array by concatenating the
original words in the array. Physically, this gets implemented by a single array with a
larger bit-width.
Horizontal Array Mapping
The following code example has two arrays that would result in two RAM components.
void foo (...) {
int8 array1[M];
int12 array2[N];
...
loop_1: for(i=0; ...) {
...
}
...
}
Figure 1-65: Horizontal Mapping (array1[M] and array2[N] are concatenated into one longer array, array3[M+N], with more elements)
When using horizontal mapping, the smaller arrays are mapped into a larger array. The
mapping starts at location 0 in the larger array and follows in the order the commands are
specified. In the Vivado HLS GUI, this is based on the order the arrays are specified using
the menu commands. In the Tcl environment, this is based on the order the commands are
issued.
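The address arithmetic implied by horizontal mapping can be modeled in plain C. This is an illustrative model only, not tool output. The sizes M and N and the helper names array1_addr, array2_addr, and write_both are assumptions. In this sketch array1 is assumed to be mapped first, so it starts at location 0 and array2 follows it.

```c
#define M 4
#define N 6

/* Model of the merged array: M + N elements in one memory. */
static int array3[M + N];

/* array1 is mapped first, so its element i sits at address i. */
int array1_addr(int i) { return i; }

/* array2 is mapped second, so its element j sits at address M + j. */
int array2_addr(int j) { return M + j; }

/* Store both source arrays into the merged array. */
void write_both(const int a1[M], const int a2[N]) {
    for (int i = 0; i < M; i++) array3[array1_addr(i)] = a1[i];
    for (int j = 0; j < N; j++) array3[array2_addr(j)] = a2[j];
}
```

Reversing the order in which the arrays are mapped would swap the two address functions in this model.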
When you use the horizontal mapping shown in Figure 1-65, the implementation in the
block RAM appears as shown in the following figure.
Figure 1-66: Memory for Horizontal Mapping (a single RAM1P with addresses 0 to M+N-1: array1 elements 0 to M-1 followed by array2 elements 0 to N-1)
The offset option to the ARRAY_MAP directive is used to specify at which location
subsequent arrays are added when using the horizontal option. The following example
repeats the previous example but reverses the order of the commands (specifying array2
then array1) and adds an offset:
void foo (...) {
int8 array1[M];
int12 array2[N];
#pragma HLS ARRAY_MAP variable=array2 instance=array3 horizontal
#pragma HLS ARRAY_MAP variable=array1 instance=array3 horizontal offset=2
...
loop_1: for(i=0; ...) {
...
}
...
}
Figure 1-67: Horizontal Mapping with Offset (array2[N] is mapped first; array1[M] is added at an offset of 2 from the end of the array2 elements)
After mapping, the newly formed array, array3 in the above examples, can be targeted
into a specific block RAM or UltraRAM by applying the RESOURCE directive to any of the
variables mapped into the new instance.
Although horizontal mapping can result in using fewer block RAM components and therefore
improve area, it does have an impact on the throughput and performance as there are now
fewer block RAM ports. To overcome this limitation, Vivado HLS also provides vertical
mapping.
Mapping Vertical Arrays
In vertical mapping, arrays are concatenated to produce an array with higher
bit-widths. Vertical mapping is applied using the vertical option to the ARRAY_MAP directive.
The following code and figure show how the same example as before is transformed when
vertical mapping mode is applied.
void foo (...) {
int8 array1[M];
int12 array2[N];
#pragma HLS ARRAY_MAP variable=array2 instance=array3 vertical
#pragma HLS ARRAY_MAP variable=array1 instance=array3 vertical
...
loop_1: for(i=0; ...) {
...
}
...
}
Figure 1-68: Vertical Mapping (array1[M] and array2[N] are combined by vertical expansion into array3[N], which has more bits per word)
In vertical mapping, the arrays are concatenated in the order specified by the command,
with the first array starting at the LSB and the last array specified ending at the MSB. After
vertical mapping, the newly formed array is implemented in a single block RAM component
as shown in the following figure.
Figure 1-69: Memory for Vertical Mapping (a single RAM1P; each address holds one word, MSB to LSB)
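The word layout that results from vertical mapping can be modeled with simple bit packing. The sketch below is an assumption-laden model, not tool output: it treats the 8-bit and 12-bit elements as raw unsigned bit patterns, and the helper names pack_word, unpack_array1, and unpack_array2 are hypothetical. Because array2 is specified first in the example above, it is placed at the LSB end of the word and array1 occupies the bits above it.

```c
#include <stdint.h>

/* Pack one 12-bit array2 element (bits 11:0) and one 8-bit array1
 * element (bits 19:12) into a single 20-bit word. */
uint32_t pack_word(uint8_t a1, uint16_t a2) {
    return ((uint32_t)a1 << 12) | (a2 & 0xFFFu);
}

/* Recover the array1 element from bits 19:12. */
uint8_t unpack_array1(uint32_t word) {
    return (uint8_t)((word >> 12) & 0xFFu);
}

/* Recover the array2 element from bits 11:0. */
uint16_t unpack_array2(uint32_t word) {
    return (uint16_t)(word & 0xFFFu);
}
```

A single read of the packed word returns one element of each original array, which is why vertical mapping does not reduce the number of elements available per access.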
Array Mapping and Special Considerations
IMPORTANT: The object for an array transformation must be in the source code prior to any other
directives being applied.
To map elements from a partitioned array into a single array with horizontal mapping,
the individual elements of the array to be partitioned must be specified in the ARRAY_MAP
directive. For example, the following directives partition arrays m_accum and v_accum and
map the resulting elements back together.
#pragma HLS array_partition variable=m_accum cyclic factor=2 dim=1
#pragma HLS array_partition variable=v_accum cyclic factor=2 dim=1
#pragma HLS array_map variable=m_accum[0] instance=mv_accum horizontal
#pragma HLS array_map variable=v_accum[0] instance=mv_accum horizontal
#pragma HLS array_map variable=m_accum[1] instance=mv_accum_1 horizontal
#pragma HLS array_map variable=v_accum[1] instance=mv_accum_1 horizontal
It is possible to map a global array. However, the resulting array instance is global and any
local arrays mapped onto this same array instance become global. When local arrays of
different functions get mapped onto the same target array, then the target array instance
becomes global.
Array function arguments may only be mapped if they are arguments to the same function.
Array Reshaping
The ARRAY_RESHAPE directive combines ARRAY_PARTITION with the vertical mode of
ARRAY_MAP and is used to reduce the number of block RAMs while still allowing the
beneficial attribute of partitioning: parallel access to the data.
Given the following example code:
void foo (...) {
int array1[N];
int array2[N];
int array3[N];
#pragma HLS ARRAY_RESHAPE variable=array1 block factor=2 dim=1
#pragma HLS ARRAY_RESHAPE variable=array2 cyclic factor=2 dim=1
#pragma HLS ARRAY_RESHAPE variable=array3 complete dim=1
...
}
The ARRAY_RESHAPE directive transforms the arrays into the form shown in the following
figure.
Figure 1-70: Array Reshaping (array1 reshaped with block, array2 with cyclic, and array3 with complete, each shown from MSB to LSB)
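Which original elements end up sharing a word can be modeled with index arithmetic. This sketch assumes an array size of 8 and a factor of 2, and the helper names block_element and cyclic_element are hypothetical. It follows the usual meaning of block and cyclic partitioning and says nothing about which element sits at the MSB or LSB end of the word.

```c
#define N 8          /* assumed array size (a multiple of FACTOR) */
#define FACTOR 2

/* Block reshaping: the array is split into FACTOR consecutive blocks,
 * and word w holds element w of each block. */
int block_element(int word, int part) {
    return part * (N / FACTOR) + word;
}

/* Cyclic reshaping: elements are interleaved across FACTOR partitions,
 * and word w holds elements w*FACTOR through w*FACTOR + FACTOR - 1. */
int cyclic_element(int word, int part) {
    return word * FACTOR + part;
}
```

With these assumptions, word 0 holds elements 0 and 4 under block reshaping and elements 0 and 1 under cyclic reshaping.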
The ARRAY_RESHAPE directive allows more data to be accessed in a single clock cycle. In
cases where more data can be accessed in a single clock cycle, Vivado HLS may
automatically unroll any loops consuming this data, if doing so will improve the throughput.
The loop can be fully or partially unrolled to create enough hardware to consume the
additional data in a single clock cycle. This feature is controlled using the config_unroll
command and the option tripcount_threshold. In the following example, any loops
with a tripcount of less than 16 will be automatically unrolled if doing so improves the
throughput.
config_unroll -tripcount_threshold 16
Function Instantiation
Function instantiation is an optimization technique that has the area benefits of
maintaining the function hierarchy but provides an additional powerful option: performing
targeted local optimizations on specific instances of a function. This can simplify the control
logic around the function call and potentially improve latency and throughput.
The FUNCTION_INSTANTIATE directive exploits the fact that some inputs to a function may
be a constant value when the function is called and uses this to both simplify the
surrounding control structures and produce smaller more optimized function blocks. This is
best explained by example.
Given the following code:
void foo_sub(bool mode){
#pragma HLS FUNCTION_INSTANTIATE variable=mode
if (mode) {
// code segment 1
} else {
// code segment 2
}
}
void foo(){
foo_sub(true);
foo_sub(false);
}
It is clear that function foo_sub has been written to perform multiple but exclusive
operations (depending on whether mode is true or not). Each instance of function foo_sub
is implemented in an identical manner: this is great for function reuse and area optimization
but means that the control logic inside the function must be more complex.
The FUNCTION_INSTANTIATE optimization allows each instance to be independently
optimized, reducing the functionality and area. After FUNCTION_INSTANTIATE
optimization, the code above can effectively be transformed to have two separate
functions, each optimized for different possible values of mode, as shown:
void foo_sub1() {
// code segment 1
}
void foo_sub2() {
// code segment 2
}
void foo(){
foo_sub1();
foo_sub2();
}
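A compilable sketch of the original, untransformed form is shown below. The integer argument x, the return values, and the two placeholder operations are assumptions added so that the example can be compiled and simulated; only the pragma and the constant mode arguments follow the example above.

```c
/* mode selects one of two exclusive operations. */
int foo_sub(int x, int mode) {
#pragma HLS FUNCTION_INSTANTIATE variable=mode
    if (mode) {
        return x + 1;       /* code segment 1 (placeholder) */
    } else {
        return x * 2;       /* code segment 2 (placeholder) */
    }
}

int foo(int x) {
    /* Each call passes a constant mode, so each instance can be
     * specialized to a single code segment. */
    int r1 = foo_sub(x, 1);
    int r2 = foo_sub(x, 0);
    return r1 + r2;
}
```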
If the function is used at different levels of hierarchy such that function sharing is difficult
without extensive inlining or code modifications, function instantiation can provide the
best means of improving area: many small locally optimized copies are better than many
large copies that cannot be shared.
Controlling Hardware Resources
During synthesis Vivado HLS performs the following basic tasks:
• First, elaborates the C, C++ or SystemC source code into an internal database
containing operators.
The operators represent operations in the C code such as additions, multiplications,
array reads, and writes.
• Then, maps the operators on to cores which implement the hardware operations.
Cores are the specific hardware components used to create the design (such as adders,
multipliers, pipelined multipliers, and block RAM).
Control is provided over each of these steps, allowing you to control the hardware
implementation at a fine level of granularity.
Limiting the Number of Operators
Explicitly limiting the number of operators to reduce area may be required in some cases:
the default operation of Vivado HLS is to first maximize performance. Limiting the number
of operators in a design is a useful technique to reduce the area: it helps reduce area by
forcing sharing of the operations.
The ALLOCATION directive allows you to limit how many operators, cores, or functions
are used in a design. For example, suppose a design has 317 multiplications but the
FPGA has only 256 multiplier resources (DSP48s). The ALLOCATION directive shown below
directs Vivado HLS to create a design with a maximum of 256 multiplication (mul) operators:
dout_t array_arith (dio_t d[317]) {
static int acc;
int i;
#pragma HLS ALLOCATION instances=mul limit=256 operation
for (i=0;i<317;i++) {
#pragma HLS UNROLL
acc += acc * d[i];
}
return acc;
}
Note: If you specify an ALLOCATION limit that is greater than needed, Vivado HLS attempts to use
the number of resources specified by the limit, or the maximum necessary, which reduces the
amount of sharing.
You can use the type option to specify if the ALLOCATION directive limits operations,
cores, or functions. The following table lists all the operations that can be controlled using
the ALLOCATION directive.
Table 1-13: Vivado HLS Operators
Operator   Description
add        Integer Addition
ashr       Arithmetic Shift-Right
dadd       Double-precision floating point addition
dcmp       Double-precision floating point comparison
ddiv       Double-precision floating point division
dmul       Double-precision floating point multiplication
drecip     Double-precision floating point reciprocal
drem       Double-precision floating point remainder
drsqrt     Double-precision floating point reciprocal square root
dsub       Double-precision floating point subtraction
dsqrt      Double-precision floating point square root
fadd       Single-precision floating point addition
fcmp       Single-precision floating point comparison
fdiv       Single-precision floating point division
fmul       Single-precision floating point multiplication
frecip     Single-precision floating point reciprocal
frem       Single-precision floating point remainder
frsqrt     Single-precision floating point reciprocal square root
fsub       Single-precision floating point subtraction
fsqrt      Single-precision floating point square root
icmp       Integer Compare
lshr       Logical Shift-Right
mul        Multiplication
sdiv       Signed Divider
shl        Shift-Left
srem       Signed Remainder
sub        Subtraction
udiv       Unsigned Division
urem       Unsigned Remainder
Globally Minimizing Operators
The ALLOCATION directive, like all directives, is specified inside a scope: a function, a loop
or a region. The config_bind configuration allows the operators to be minimized
throughout the entire design.
The minimization of operators throughout the design is performed using the min_op option in
the config_bind configuration. Any of the operators listed in Table 1-13 can be
limited in this fashion.
After the configuration is applied it applies to all synthesis operations performed in the
solution: if the solution is closed and re-opened the specified configuration still applies to
any new synthesis operations.
Any configurations applied with the config_bind configuration can be removed by using
the reset option or by using open_solution -reset to open the solution.
Controlling the Hardware Cores
When synthesis is performed, Vivado HLS uses the timing constraints specified by the clock,
the delays specified by the target device together with any directives specified by you, to
determine which core is used to implement the operators. For example, to implement a
multiplier operation, Vivado HLS could use the combinational multiplier core or a
pipelined multiplier core.
The cores which are mapped to operators during synthesis can be limited in the same
manner as the operators. Instead of limiting the total number of multiplication operations,
you can choose to limit the number of combinational multiplier cores, forcing any
remaining multiplications to be performed using pipelined multipliers (or vice versa). This is
performed by specifying the ALLOCATION directive type option to be core.
The RESOURCE directive is used to explicitly specify which core to use for specific
operations. In the following example, the RESOURCE directive informs Vivado HLS to use a
2-stage pipelined multiplier to implement the multiplication for variable c. It is left to
Vivado HLS which core to use for variable d.
int foo (int a, int b) {
int c, d;
#pragma HLS RESOURCE variable=c latency=2
c = a*b;
d = a*c;
return d;
}
In the following example, the RESOURCE directive specifies that the add operation for
variable temp is implemented using the AddSub_DSP core. This ensures that the
operation is implemented using a DSP48 primitive in the final design. By default, add
operations are implemented using LUTs.
void apint_arith(dinA_t inA, dinB_t inB,
dout1_t *out1
) {
dout2_t temp;
#pragma HLS RESOURCE variable=temp core=AddSub_DSP
temp = inB + inA;
*out1 = temp;
}
The list_core command is used to obtain details on the cores available in the library. The
list_core command can only be used in the Tcl command interface, and a device must be
specified using the set_part command. If a device has not been selected, the command does
not have any effect.
The -operation option of the list_core command lists all the cores in the library that
can be implemented with the specified operation. The following table lists the cores used to
implement standard RTL logic operations (such as add, multiply, and compare).
Table 1-14: Functional Cores
| Core | Description |
| --- | --- |
| AddSub | This core is used to implement both adders and subtractors. |
| AddSubnS | N-stage pipelined adder or subtractor. Vivado HLS determines how many pipeline stages are required. |
| AddSub_DSP | This core ensures that the add or sub operation is implemented using a DSP48 (using the adder or subtractor inside the DSP48). |
| DivnS | N-stage pipelined divider. |
| DSP48 | Multiplications with bit-widths that allow implementation in a single DSP48 macrocell. This can include pipelined multiplications and multiplications grouped with a pre-adder, post-adder, or both. This core can only be pipelined with a maximum latency of 4. Values above 4 saturate at 4. |
| Mul | Combinational multiplier with bit-widths that exceed the size of a standard DSP48 macrocell. Note: Multipliers that can be implemented with a single DSP48 macrocell are mapped to the DSP48 core. |
| MulnS | N-stage pipelined multiplier with bit-widths that exceed the size of a standard DSP48 macrocell. Note: Multipliers that can be implemented with a single DSP48 macrocell are mapped to the DSP48 core. |
| Mul_LUT | Multiplier implemented with LUTs. |
In addition to the standard cores, the following floating point cores are used when the
operation uses floating-point types. Refer to the documentation for each device to
determine if the floating-point core is supported in the device.
Table 1-15: Floating Point Cores
| Core | Description |
| --- | --- |
| FAddSub_nodsp | Floating-point adder or subtractor implemented without any DSP48 primitives. |
| FAddSub_fulldsp | Floating-point adder or subtractor implemented using only DSP48 primitives. |
| FDiv | Floating-point divider. |
| FExp_nodsp | Floating-point exponential operation implemented without any DSP48 primitives. |
| FExp_meddsp | Floating-point exponential operation implemented with a balance of DSP48 primitives. |
| FExp_fulldsp | Floating-point exponential operation implemented with only DSP48 primitives. |
| FLog_nodsp | Floating-point logarithmic operation implemented without any DSP48 primitives. |
| FLog_meddsp | Floating-point logarithmic operation implemented with a balance of DSP48 primitives. |
| FLog_fulldsp | Floating-point logarithmic operation implemented with only DSP48 primitives. |
| FMul_nodsp | Floating-point multiplier implemented without any DSP48 primitives. |
| FMul_meddsp | Floating-point multiplier implemented with a balance of DSP48 primitives. |
| FMul_fulldsp | Floating-point multiplier implemented with only DSP48 primitives. |
| FMul_maxdsp | Floating-point multiplier implemented with the maximum number of DSP48 primitives. |
| FRSqrt_nodsp | Floating-point reciprocal square root implemented without any DSP48 primitives. |
| FRSqrt_fulldsp | Floating-point reciprocal square root implemented with only DSP48 primitives. |
| FRecip_nodsp | Floating-point reciprocal implemented without any DSP48 primitives. |
| FRecip_fulldsp | Floating-point reciprocal implemented with only DSP48 primitives. |
| FSqrt | Floating-point square root. |
| DAddSub_nodsp | Double precision floating-point adder or subtractor implemented without any DSP48 primitives. |
| DAddSub_fulldsp | Double precision floating-point adder or subtractor implemented using only DSP48 primitives. |
| DDiv | Double precision floating-point divider. |
| DExp_nodsp | Double precision floating-point exponential operation implemented without any DSP48 primitives. |
| DExp_meddsp | Double precision floating-point exponential operation implemented with a balance of DSP48 primitives. |
| DExp_fulldsp | Double precision floating-point exponential operation implemented with only DSP48 primitives. |
| DLog_nodsp | Double precision floating-point logarithmic operation implemented without any DSP48 primitives. |
| DLog_meddsp | Double precision floating-point logarithmic operation implemented with a balance of DSP48 primitives. |
| DLog_fulldsp | Double precision floating-point logarithmic operation implemented with only DSP48 primitives. |
| DMul_nodsp | Double precision floating-point multiplier implemented without any DSP48 primitives. |
| DMul_meddsp | Double precision floating-point multiplier implemented with a balance of DSP48 primitives. |
| DMul_fulldsp | Double precision floating-point multiplier implemented with only DSP48 primitives. |
| DMul_maxdsp | Double precision floating-point multiplier implemented with a maximum number of DSP48 primitives. |
| DRSqrt | Double precision floating-point reciprocal square root. |
| DRecip | Double precision floating-point reciprocal. |
| DSqrt | Double precision floating-point square root. |
| HAddSub_nodsp | Half-precision floating-point adder or subtractor implemented without DSP48 primitives. |
| HDiv | Half-precision floating-point divider. |
| HMul_nodsp | Half-precision floating-point multiplier implemented without DSP48 primitives. |
| HMul_fulldsp | Half-precision floating-point multiplier implemented with only DSP48 primitives. |
| HMul_maxdsp | Half-precision floating-point multiplier implemented with a maximum number of DSP48 primitives. |
| HSqrt | Half-precision floating-point square root. |
The following table lists the cores used to implement storage elements, such as registers or
memories.
Table 1-16: Storage Cores
| Core | Description |
| --- | --- |
| FIFO | A FIFO. Vivado HLS determines whether to implement this in the RTL with a block RAM or as distributed RAM. |
| FIFO_BRAM | A FIFO implemented with a block RAM. |
| FIFO_LUTRAM | A FIFO implemented as distributed RAM. |
| FIFO_SRL | A FIFO implemented with an SRL. |
| RAM_1P | A single-port RAM. Vivado HLS determines whether to implement this in the RTL with a block RAM or as distributed RAM. |
| RAM_1P_BRAM | A single-port RAM implemented with a block RAM. |
| RAM_1P_LUTRAM | A single-port RAM implemented as distributed RAM. |
| RAM_2P | A dual-port RAM that allows read operations on one port and both read and write operations on the other port. Vivado HLS determines whether to implement this in the RTL with a block RAM or as distributed RAM. |
| RAM_2P_BRAM | A dual-port RAM implemented with a block RAM that allows read operations on one port and both read and write operations on the other port. |
| RAM_2P_LUTRAM | A dual-port RAM implemented as distributed RAM that allows read operations on one port and both read and write operations on the other port. |
| RAM_S2P_BRAM | A dual-port RAM implemented with a block RAM that allows read operations on one port and write operations on the other port. |
| RAM_S2P_LUTRAM | A dual-port RAM implemented as distributed RAM that allows read operations on one port and write operations on the other port. |
| RAM_T2P_BRAM | A true dual-port RAM with support for both read and write on both ports implemented with a block RAM. |
| ROM_1P | A single-port ROM. Vivado HLS determines whether to implement this in the RTL with a block RAM or with LUTs. |
| ROM_1P_BRAM | A single-port ROM implemented with a block RAM. |
| ROM_nP_BRAM | A multi-port ROM implemented with a block RAM. Vivado HLS automatically determines the number of ports. |
| ROM_1P_LUTRAM | A single-port ROM implemented with distributed RAM. |
| ROM_nP_LUTRAM | A multi-port ROM implemented with distributed RAM. Vivado HLS automatically determines the number of ports. |
| ROM_2P | A dual-port ROM. Vivado HLS determines whether to implement this in the RTL with a block RAM or as distributed ROM. |
| ROM_2P_BRAM | A dual-port ROM implemented with a block RAM. |
| ROM_2P_LUTRAM | A dual-port ROM implemented as distributed ROM. |
| XPM_MEMORY | Specifies the array is to be implemented with an UltraRAM. This core is only usable with devices supporting UltraRAM blocks. |
The RESOURCE directive uses the assigned variable as the target for the resource. Given the
following code, the RESOURCE directive specifies that the multiplication for out1 is
implemented with a 3-stage pipelined multiplier.
void foo(...) {
#pragma HLS RESOURCE variable=out1 latency=3
// Basic arithmetic operations
*out1 = inA * inB;
*out2 = inB + inA;
*out3 = inC / inA;
*out4 = inD % inA;
}
If the assignment specifies multiple identical operators, the code must be modified to
ensure there is a single variable for each operator to be controlled. For example, suppose
only the first multiplication (inA * inB) in the following assignment is to be implemented
with a pipelined multiplier:
*out1 = inA * inB * inC;
The code should be changed to the following with the directive specified on the
Result_tmp variable:
#pragma HLS RESOURCE variable=Result_tmp latency=3
Result_tmp = inA * inB;
*out1 = Result_tmp * inC;
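For reference, the fragment above can be placed in a complete function. This is a sketch only: the function name mul3 and the use of plain int arguments are assumptions, while Result_tmp and the directive come from the fragment.

```c
/* Hypothetical wrapper around the fragment: the first multiplication is
   isolated in Result_tmp so the RESOURCE directive targets only it. */
void mul3(int inA, int inB, int inC, int *out1) {
    int Result_tmp;
#pragma HLS RESOURCE variable=Result_tmp latency=3
    Result_tmp = inA * inB;      /* controlled by the directive */
    *out1 = Result_tmp * inC;    /* core left to Vivado HLS */
}
```

The intermediate variable does not change the C result; it only gives the directive a unique target.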
Globally Optimizing Hardware Cores
The config_bind configuration provides control over the binding process. The
configuration allows you to direct how much effort is spent when binding cores to
operators. By default, Vivado HLS chooses cores which are the best balance between timing
and area. The config_bind configuration influences which operators are used.
config_bind -effort [low | medium | high] -min_op 
The config_bind command can only be issued inside an active solution. The default effort
level for the binding operation is medium.
• Low Effort: Spend less time sharing. Run time is faster but the final RTL may be
larger. Useful for cases when the designer knows there is little sharing possible or
desirable and does not wish to waste CPU cycles exploring possibilities.
• Medium Effort: The default, where Vivado HLS tries to share operations but endeavors
to finish in a reasonable time.
• High Effort: Try to maximize sharing and do not limit run time. Vivado HLS keeps
trying until all possible combinations of sharing are explored.
Optimizing Logic
Controlling Operator Pipelining
Vivado HLS automatically determines the level of pipelining to use for internal operations.
You can use the RESOURCE directive with the -latency option to explicitly specify the
number of pipeline stages and override the number determined by Vivado HLS.
RTL synthesis might use the additional pipeline registers to help improve timing issues that
might result after place and route. Registers added to the output of the operation typically
help improve timing in the output datapath. Registers added to the input of the operation
typically help improve timing in both the input datapath and the control logic from the
FSM.
The rules for adding these additional pipeline stages are:
• If the latency is specified as 1 cycle more than the latency decided by Vivado HLS,
Vivado HLS adds new output registers to the output of the operation.
• If the latency is specified as 2 cycles more than the latency decided by Vivado HLS,
Vivado HLS adds registers to the output of the operation and to the input side of the
operation.
• If the latency is specified as 3 or more cycles more than the latency decided by
Vivado HLS, Vivado HLS adds registers to the output of the operation and to the input
side of the operation. Vivado HLS automatically determines the location of any
additional registers.
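The three rules above can be summarized as a small lookup. The following C sketch is illustrative only: the enum and function names are hypothetical, and the case where the requested latency does not exceed the tool's choice is not described by the rules, so it is mapped to "no extra registers" purely as an assumption.

```c
/* Hypothetical summary of where the extra pipeline registers are placed. */
typedef enum {
    REGS_NONE,                 /* assumption: nothing extra requested */
    REGS_OUTPUT_ONLY,          /* 1 extra cycle */
    REGS_OUTPUT_AND_INPUT,     /* 2 extra cycles */
    REGS_OUTPUT_INPUT_AND_AUTO /* 3 or more: remainder placed by the tool */
} reg_placement_t;

reg_placement_t extra_register_placement(int requested_latency,
                                         int tool_latency) {
    int extra = requested_latency - tool_latency;
    if (extra <= 0) return REGS_NONE;
    if (extra == 1) return REGS_OUTPUT_ONLY;
    if (extra == 2) return REGS_OUTPUT_AND_INPUT;
    return REGS_OUTPUT_INPUT_AND_AUTO;
}
```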
You can use the config_core configuration to pipeline all instances of a specific core
used in the design so that they have the same pipeline depth. To set this configuration:
1. Select Solutions > Solution Settings.
2. In the Solution Settings dialog box, select the General category, and click Add.
3. In the Add Command dialog box, select the config_core command, and specify the
parameters.
For example, the following configuration specifies that all operations implemented with
the DSP48 core are pipelined with a latency of 4, which is the maximum latency allowed
by this core:
config_core DSP48 -latency 4
The following configuration specifies that all block RAM implemented with the
RAM_1P_BRAM core are pipelined with a latency of 3:
config_core RAM_1P_BRAM -latency 3
IMPORTANT: Vivado HLS only applies the core configuration to block RAM with an explicit RESOURCE
directive that specifies the core used to implement the array. If an array is implemented using a
default core, the core configuration does not affect the block RAM.
See Table 1-16 for a list of all the cores you can use to implement arrays.
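As a hedged illustration of the requirement above, the following sketch attaches an explicit RESOURCE directive to a local array so that a config_core RAM_1P_BRAM setting would have a core to apply to. The function name sum_buf, the array size N, and the computation are assumptions made for the example.

```c
#define N 16

/* Hypothetical function: buf is explicitly bound to the RAM_1P_BRAM core,
   so a core configuration for RAM_1P_BRAM can affect it. */
int sum_buf(const int in[N]) {
    int buf[N];
#pragma HLS RESOURCE variable=buf core=RAM_1P_BRAM
    int i, sum = 0;
    for (i = 0; i < N; i++)
        buf[i] = in[i] * 2;   /* write phase */
    for (i = 0; i < N; i++)
        sum += buf[i];        /* read phase */
    return sum;
}
```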
Optimizing Logic Expressions
During synthesis, several optimizations, such as strength reduction and bit-width
minimization, are performed. Included in the list of automatic optimizations is expression
balancing.
Expression balancing rearranges operators to construct a balanced tree and reduce latency.
• For integer operations, expression balancing is on by default but may be disabled.
• For floating-point operations, expression balancing is off by default but may be
enabled.
Given the highly sequential code using assignment operators such as += and *= in the
following example:
data_t foo_top (data_t a, data_t b, data_t c, data_t d)
{
data_t sum;
sum = 0;
sum += a;
sum += b;
sum += c;
sum += d;
return sum;
}
Without expression balancing, and assuming each addition requires one clock cycle, the
complete computation for sum requires four clock cycles, as shown in the following figure.
Figure 1-71: Adder Tree (inputs a, b, c, and d accumulated into sum over four cycles)
However, additions a+b and c+d can be executed in parallel, allowing the latency to be
reduced. After balancing, the computation completes in two clock cycles, as shown in the
following figure. Expression balancing prohibits sharing and results in increased area.
Figure 1-72: Adder Tree After Balancing (a+b and c+d computed in parallel; sum produced in two cycles)
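The two adder trees can be written out as C expressions to show that, for integers, regrouping does not change the result. This is a sketch under the assumption that data_t is int and that no overflow occurs; the function names are hypothetical.

```c
typedef int data_t;  /* assumption: data_t is a plain integer type */

/* Sequential chain: each addition depends on the previous one. */
data_t sum_sequential(data_t a, data_t b, data_t c, data_t d) {
    return ((a + b) + c) + d;
}

/* Balanced tree: a+b and c+d are independent and can run in parallel. */
data_t sum_balanced(data_t a, data_t b, data_t c, data_t d) {
    return (a + b) + (c + d);
}
```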
For integers, you can disable expression balancing using the EXPRESSION_BALANCE
optimization directive with the off option. By default, Vivado HLS does not perform the
EXPRESSION_BALANCE optimization for operations of type float or double. When
synthesizing float and double types, Vivado HLS maintains the order of operations
performed in the C code to ensure that the results are the same as the C simulation. For
example, in the following code example, all variables are of type float or double. The
values of O1 and O2 are not the same even though they appear to perform the same basic
calculation.
A=B*C;
D=E*F;
O1=A*D;
A=B*F;
D=E*C;
O2=A*D;
This behavior is a function of the saturation and rounding in the C standard when
performing operations with types float or double. Therefore, Vivado HLS always
maintains the exact order of operations when variables of type float or double are
present and does not perform expression balancing by default.
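The following sketch reproduces the O1/O2 pattern with concrete float values chosen so that one grouping overflows and underflows while the other does not. The function name and the input values used to exercise it are assumptions; volatile is used only to force each intermediate to be stored as a float.

```c
/* Computes the same four-factor product with two groupings, following
   the O1/O2 pattern. Returns 1 if the two results differ, 0 otherwise. */
int grouped_products(float B, float C, float E, float F,
                     float *O1, float *O2) {
    volatile float A, D;   /* force rounding to float at each step */
    A = B * C;
    D = E * F;
    *O1 = A * D;
    A = B * F;
    D = E * C;
    *O2 = A * D;
    return *O1 != *O2;
}
```

For example, with B = C = 1e30f and E = F = 1e-30f, the first grouping multiplies an overflowed value by an underflowed one, while the second grouping multiplies two values close to 1.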
You can enable expression balancing with float and double types using the
configuration config_compile option as follows:
1. Select Solution > Solution Settings.
2. In the Solution Settings dialog box, click the General category, and click Add.
3. In the Add Command dialog box, select config_compile, and enable
unsafe_math_operations.
With this setting enabled, Vivado HLS might change the order of operations to produce a
more optimal design. However, the results of C/RTL cosimulation might differ from the C
simulation.
The unsafe_math_operations feature also enables the no_signed_zeros
optimization. The no_signed_zeros optimization ensures that the following expressions
used with float and double types are identical:
x - 0.0 = x;
x + 0.0 = x;
0.0 - x = -x;
x - x = 0.0;
x*0.0 = 0.0;
Without the no_signed_zeros optimization the expressions above would not be
equivalent due to rounding. The optimization may be optionally used without expression
balancing by selecting only this option in the config_compile configuration.
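One way to see why these identities are not exact in standard floating-point arithmetic is to look at the sign of zero. The sketch below assumes IEEE 754 arithmetic with the default rounding mode and a compiler that preserves signed zeros; the function name is hypothetical.

```c
#include <math.h>

/* Returns 1 if x and x + 0.0f have different sign bits, 0 otherwise.
   With x = -0.0f the sum is +0.0f, so "x + 0.0 = x" does not hold
   bit-for-bit. */
int sign_changes_on_add_zero(float x) {
    volatile float y = x + 0.0f;  /* volatile: keep the addition */
    return (signbit(x) != 0) != (signbit(y) != 0);
}
```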
TIP: When the unsafe_math_operations and no_signed_zeros optimizations are used, the RTL
implementation will have different results than the C simulation. The test bench should be capable of
ignoring minor differences in the result: check for a range, do not perform an exact comparison.
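A range check of the kind suggested in the tip could look like the following. This is a minimal sketch: the function name, the combination of relative and absolute tolerances, and the tolerance values a test bench would pass are all assumptions.

```c
/* Returns 1 if actual is within abs_tol of expected, or within rel_tol
   of the larger magnitude of the two values; returns 0 otherwise. */
int within_tolerance(double actual, double expected,
                     double rel_tol, double abs_tol) {
    double diff = actual - expected;
    double mag_a = actual < 0 ? -actual : actual;
    double mag_e = expected < 0 ? -expected : expected;
    double scale = mag_a > mag_e ? mag_a : mag_e;
    if (diff < 0) diff = -diff;
    return diff <= abs_tol || diff <= rel_tol * scale;
}
```

A test bench would call this helper for each output value instead of comparing with ==.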
Verifying the RTL
Post-synthesis verification is automated through the C/RTL co-simulation feature which
reuses the pre-synthesis C test bench to perform verification on the output RTL.
Automatically Verifying the RTL
C/RTL co-simulation uses the C test bench to automatically verify the RTL design. The
verification process consists of three phases, shown in Figure 1-73.
• The C simulation is executed and the inputs to the top-level function, or the
Device-Under-Test (DUT), are saved as “input vectors”.
• The “input vectors” are used in an RTL simulation using the RTL created by Vivado HLS.
The outputs from the RTL are saved as “output vectors”.
• The “output vectors” from the RTL simulation are applied to the C test bench, after the
function for synthesis, to verify the results are correct. The C test bench performs the
verification of the results.
The following messages are output by Vivado HLS to show the progress of the verification.
C simulation:
[SIM-14] Instrumenting C test bench (wrapc)
[SIM-302] Generating test vectors(wrapc)
At this stage, since the C simulation was executed, any messages written by the C test bench
are output in the console window or log file.
RTL simulation:
[SIM-333] Generating C post check test bench
[SIM-12] Generating RTL test bench
[SIM-323] Starting Verilog simulation (Issued when Verilog is the RTL verified)
[SIM-322] Starting VHDL simulation (Issued when VHDL is the RTL verified)
At this stage, any messages from the RTL simulation are output in the console window or
log file.
C test bench results checking:
[SIM-316] Starting C post checking
[SIM-1000] C/RTL co-simulation finished: PASS (If test bench returns a 0)
[SIM-4] C/RTL co-simulation finished: FAIL (If the test bench returns non-zero)
The importance of the C test bench in the C/RTL co-simulation flow is discussed below.
Figure 1-73: RTL Verification Flow (stages: WrapC Simulation, RTL Simulation with AutoTB and RTL Module, Post-Checking Simulation; test vectors pass between stages as TV In and TV Out data files)
The following is required to use C/RTL co-simulation feature successfully:
• The test bench must be self-checking and return a value of 0 if the test passes or
a non-zero value if the test fails.
• The correct interface synthesis options must be selected.
• Any 3rd-party simulators must be available in the search path.
• Any arrays or structs on the design interface cannot use the optimization directives or
combinations of optimization directives listed in Unsupported Optimizations for
Cosimulation.
Test Bench Requirements
To verify the RTL design produces the same results as the original C code, use a
self-checking test bench to execute the verification. The following code example shows the
important features of a self-checking test bench:
#include <stdio.h>   // printf
#include <stdlib.h>  // system

int main () {
  int ret=0;
  …
  // Execute (DUT) Function
  …
  // Write the output results to a file
  …
  // Check the results
  ret = system("diff --brief -w output.dat output.golden.dat");
  if (ret != 0) {
    printf("Test failed !!!\n");
    ret=1;
  } else {
    printf("Test passed !\n");
  }
  …
  return ret;
}
This self-checking test bench compares the results against known good results in the
output.golden.dat file.
Note: There are many ways to perform this checking. This is just one example.
In the Vivado HLS design flow, the return value of function main() indicates the following:
• Zero: Results are correct.
• Non-zero value: Results are incorrect.
Note: The test bench can return any non-zero value. A complex test bench can return different
values depending on the type of difference or failure. If the test bench returns a non-zero value
after C simulation or C/RTL co-simulation, Vivado HLS reports an error and simulation fails.
RECOMMENDED: Because the system environment (for example, Linux, Windows, or Tcl) interprets the
return value of the main() function, it is recommended that you constrain the return value to an 8-bit
range for portability and safety.
CAUTION! You are responsible for ensuring that the test bench checks the results. If the test bench does
not check the results but returns zero, Vivado HLS indicates that the simulation test passed even though
the results were not actually checked.
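As an alternative to comparing files with diff, a test bench can compare results in memory. The following is a sketch under stated assumptions: dut_add is a hypothetical stand-in for the function being synthesized, the input and golden values are invented for illustration, and run_self_check is a helper whose return value main() would return directly.

```c
#include <stdio.h>

/* Hypothetical DUT standing in for the top-level function. */
static int dut_add(int a, int b) { return a + b; }

/* Returns 0 if every result matches the golden data, 1 otherwise.
   The value is limited to 0 or 1 so it stays within an 8-bit range. */
int run_self_check(void) {
    static const int in_a[4]   = {1, 2, 3, 4};
    static const int in_b[4]   = {10, 20, 30, 40};
    static const int golden[4] = {11, 22, 33, 44};
    int i, errors = 0;

    for (i = 0; i < 4; i++) {
        int got = dut_add(in_a[i], in_b[i]);
        if (got != golden[i]) {
            printf("Mismatch at %d: got %d, expected %d\n",
                   i, got, golden[i]);
            errors++;
        }
    }
    if (errors != 0)
        printf("Test failed !!!\n");
    else
        printf("Test passed !\n");
    return errors != 0 ? 1 : 0;
}
```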
Interface Synthesis Requirements
To use the C/RTL cosimulation feature to verify the RTL design, one or more of the following
conditions must be true:
• Top-level function must be synthesized using an ap_ctrl_hs or ap_ctrl_chain
block-level interface.
• Design must be purely combinational.
• Top-level function must have an initiation interval of 1.
• Interface must be all arrays that are streaming and implemented with ap_fifo, ap_hs,
or axis interface modes.
Note: The hls::stream variables are automatically implemented as ap_fifo interfaces.
If at least one of these conditions is not met, C/RTL co-simulation halts with the following
message:
@E [SIM-345] Cosim only supports the following 'ap_ctrl_none' designs: (1)
combinational designs; (2) pipelined design with task interval of 1; (3) designs with
array streaming or hls_stream ports.
@E [SIM-4] *** C/RTL co-simulation finished: FAIL ***
IMPORTANT: If the design is specified to use the block-level IO protocol ap_ctrl_none and the design
contains any hls::stream variables which employ non-blocking behavior, C/RTL co-simulation is not
guaranteed to complete.
If any top-level function argument is specified as an AXI-Lite interface, the function return
must also be specified as an AXI-Lite interface.
RTL Simulator Support
After ensuring that the preceding requirements are met, you can use C/RTL co-simulation to
verify the RTL design using Verilog or VHDL. The default simulation language is Verilog.
However, you can also specify VHDL. For information on changing the defaults, see Using
C/RTL Co-Simulation. While the default simulator is Vivado Simulator (XSim), you can use
any of the following simulators to run C/RTL co-simulation:
• Vivado Simulator (XSim)
• ModelSim simulator
• VCS simulator (Linux only)
• NC-Sim simulator (Linux only)
• Riviera simulator (PC only)
IMPORTANT: To verify an RTL design using the third-party simulators (for example, ModelSim, VCS,
Riviera), you must include the path to the simulator executable in the system search path, and the
appropriate license must be available. See the third-party vendor documentation for details on
configuring these simulators.
IMPORTANT: When verifying a SystemC design, you must select the ModelSim simulator and ensure it
includes C compiler capabilities with appropriate licensing.
Unsupported Optimizations for Cosimulation
The automatic RTL verification does not support cases where multiple transformations
are performed on arrays, or arrays within structs, on the interface.
For automatic verification to be performed, arrays on the function interface, or
arrays inside structs on the function interface, can use any one of the following
optimizations, but not two or more:
• Vertical mapping on arrays of the same size
• Reshape
• Partition
• Data Pack on structs
Verification by C/RTL co-simulation cannot be performed when the following optimizations
are used on the top-level function interface:
• Horizontal Mapping
• Vertical Mapping of arrays of different sizes
• Data Pack on structs containing other structs as members
Simulating IP Cores
When the design is implemented with floating-point cores, bit-accurate models of the
floating-point cores must be made available to the RTL simulator. This is automatically
accomplished if the RTL simulation is performed using the following:
• Verilog and VHDL using the Xilinx Vivado Simulator
• Verilog and VHDL using the Mentor Graphics Questa Advanced Simulator
For other supported HDL simulators, the Xilinx floating-point library must be pre-compiled
and added to the simulator libraries. The following example steps demonstrate how the
floating-point library may be compiled in Verilog for use with the VCS simulator:
1. Open Vivado (not Vivado HLS) and issue the following command in the Tcl console
window:
compile_simlib -simulator vcs_mx -family all -language verilog
2. This command creates the floating-point library in the current directory.
3. Refer to the Vivado console window for the directory name, for example ./rev3_1
This library may then be referred to from within Vivado HLS:
cosim_design -trace_level all -tool vcs -compiled_library_dir/
/rev3_1
Using C/RTL Co-Simulation
To perform C/RTL co-simulation from the GUI, click the C/RTL Cosimulation toolbar button.
This opens the simulation wizard window shown in the following figure.
Figure 1-74: C/RTL Co-Simulation Wizard
Select the RTL that is simulated (Verilog or VHDL). The drop-down menu allows the
simulator to be selected. The defaults and possible selections are noted above in RTL
Simulator Support.
Following are the options:
• Setup Only: This creates all the files (wrappers, adapters, and scripts) required to run
the simulation but does not execute the simulator. The simulation can be run in the
command shell from within the appropriate RTL simulation folder
/sim/.
• Dump Trace: This generates a trace file for every function, which is saved to the
/sim/ folder. The drop-down menu allows you to select which
signals are saved to the trace file. You can choose to trace all signals in the design,
trace just the top-level ports, or trace no signals. For details on using the trace file, see
the documentation for the selected RTL simulator.
• Optimizing Compile: This ensures a high level of optimization is used to compile the C
test bench. Using this option increases the compile time but the simulation executes
faster.
• Reduce Disk Space: The flow shown in Figure 1-73 saves the results for all transactions
before executing RTL simulation. In some cases, this can result in large data files. The
reduce_diskspace option can be used to execute one transaction at a time and
reduce the amount of disk space required for the file. If the function is executed N
times in the C test bench, the reduce_diskspace option ensures N separate RTL
simulations are performed. This causes the simulation to run slower.
• Compiled Library Location: This specifies the location of the compiled library for a
third-party RTL simulator.
Note: If you are simulating with a third-party RTL simulator and the design uses IP, you must use
an RTL simulation model for the IP before performing RTL simulation. To create or obtain the RTL
simulation model, contact your IP provider.
• Input Arguments: This allows the specification of any arguments required by the test
bench.
Executing RTL Simulation
Vivado HLS executes the RTL simulation in the project sub-directory:
/sim/
where
• SOLUTION is the name of the solution.
• RTL is the RTL type chosen for simulation.
Any files written by the C test bench during co-simulation and any trace files generated by
the simulator are written to this directory. For example, if the C test bench saves the output
results for comparison, review the output file in this directory and compare it with the
expected results.
Verification of Directives
C/RTL co-simulation automatically verifies aspects of the DEPENDENCE and DATAFLOW
directives.
If the DATAFLOW directive is used to pipeline tasks, it inserts channels between the tasks to
facilitate the flow of data between them. It is typical for the channels to be implemented
with FIFOs and the FIFO depth specified using the STREAM directive or the
config_dataflow command. If a FIFO depth is sized too small, the RTL simulation can
stall. For example, if a FIFO is specified with a depth of 2 but the producer task writes three
values before any data values are read by the consumer task, the FIFO blocks the producer.
In some conditions this can cause the entire design to stall.
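The FIFO sizing example above can be modeled in a few lines of C. This is only a behavioral sketch with a hypothetical function name: it counts how many back-to-back writes a FIFO accepts before the producer blocks, assuming the consumer has not yet read anything.

```c
/* Returns the number of writes accepted out of `burst` attempted writes
   into an empty FIFO of the given depth, with no reads in between. */
int writes_accepted_before_block(int depth, int burst) {
    int occupancy = 0;
    int accepted = 0;
    int i;
    for (i = 0; i < burst; i++) {
        if (occupancy == depth)
            break;          /* FIFO full: the producer is blocked */
        occupancy++;
        accepted++;
    }
    return accepted;
}
```

With a depth of 2 and a burst of 3, only two writes are accepted, which matches the blocking case described above; a depth of 3 accepts all three.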
C/RTL co-simulation issues a message, as shown below, indicating the channel in the
DATAFLOW region is causing the RTL simulation to stall.
//////////////////////////////////////////////////////////////////////////////
// ERROR!!! DEADLOCK DETECTED at 1292000 ns! SIMULATION WILL BE STOPPED! //
//////////////////////////////////////////////////////////////////////////////
/////////////////////////
// Dependence circle 1:
// (1): Process: hls_fft_1kxburst.fft_rank_rad2_nr_man_9_U0
//      Channel: hls_fft_1kxburst.stage_chan_in1_0_V_s_U, FULL
//      Channel: hls_fft_1kxburst.stage_chan_in1_1_V_s_U, FULL
//      Channel: hls_fft_1kxburst.stage_chan_in1_0_V_1_U, FULL
//      Channel: hls_fft_1kxburst.stage_chan_in1_1_V_1_U, FULL
// (2): Process: hls_fft_1kxburst.fft_rank_rad2_nr_man_6_U0
//      Channel: hls_fft_1kxburst.stage_chan_in1_2_V_s_U, EMPTY
//      Channel: hls_fft_1kxburst.stage_chan_in1_2_V_1_U, EMPTY
/////////////////////////////////
// Totally 1 circles detected!
/////////////////////////////////////////////////////////////
In this case, review the implementation of the channels between the tasks and ensure any
FIFOs are large enough to hold the data being generated.
In a similar manner, the RTL test bench is also configured to automatically confirm false
dependencies specified using the DEPENDENCE directive. If the RTL test bench reports a
violation, the dependency is not false and must be removed to achieve a functionally valid
design.
Analyzing RTL Simulations
When the C/RTL cosimulation completes, the simulation report opens and shows the
measured latency and II. These results may differ from the values reported after HLS
synthesis which are based on the absolute shortest and longest paths through the design.
The results provided after C/RTL cosimulation show the actual values of latency and II for
the given simulation data set (and may change if different input stimuli are used).
In non-pipelined designs, C/RTL Cosimulation measures latency between ap_start and
ap_done signals. The II is 1 more than the latency, because the design reads new inputs 1
cycle after all operations are complete. The design only starts the next transaction after the
current transaction is complete.
In pipelined designs, the design might read new inputs before the first transaction
completes, and there might be multiple ap_start and ap_ready signals before a
transaction completes. In this case, C/RTL cosimulation measures the latency as the number
of cycles between data input values and data output values. The II is the number of cycles
between ap_ready signals, which the design uses to request new inputs.
Note: For pipelined designs, the II value for C/RTL cosimulation is only valid if the design is
simulated for multiple transactions.
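The two II definitions above can be expressed as simple arithmetic. The following C sketch is illustrative: the function names are hypothetical, the cycle numbers at which ap_ready is asserted are assumed to be supplied in ascending order, and reporting the largest interval is an assumption rather than a statement of what the simulation report contains.

```c
/* Non-pipelined design: the II is one more than the latency. */
int non_pipelined_ii(int latency_cycles) {
    return latency_cycles + 1;
}

/* Pipelined design: largest number of cycles between consecutive
   ap_ready assertions. Returns -1 when fewer than two transactions
   were simulated, because no interval can be measured. */
int pipelined_ii(const int *ready_cycles, int count) {
    int i;
    int worst = -1;
    for (i = 1; i < count; i++) {
        int gap = ready_cycles[i] - ready_cycles[i - 1];
        if (gap > worst)
            worst = gap;
    }
    return worst;
}
```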
Optionally, you can review the waveform from C/RTL cosimulation using the Open Wave
Viewer toolbar button. To view RTL waveforms, you must select the following options
before executing C/RTL cosimulation:
• Verilog/VHDL Simulator Selection: Select Vivado Simulator. For Xilinx 7 series and
later devices, you can alternatively select Auto.
• Dump Trace: Select all or port.
When C/RTL cosimulation completes, the Open Wave Viewer toolbar button opens the RTL
waveforms in the Vivado IDE.
Note: When you open the Vivado IDE using this method, you can only use the waveform analysis
features, such as zoom, pan, and waveform radix.
Debugging C/RTL Cosimulation
When C/RTL cosimulation completes, Vivado HLS typically indicates that the simulations
passed and the functionality of the RTL design matches the initial C code. When the C/RTL
cosimulation fails, Vivado HLS issues the following message:
@E [SIM-4] *** C/RTL co-simulation finished: FAIL ***
Following are the primary reasons for a C/RTL cosimulation failure:
• Incorrect environment setup
• Unsupported or incorrectly applied optimization directives
• Issues with the C test bench or the C source code
To debug a C/RTL cosimulation failure, run the checks described in the following sections.
If you are unable to resolve the C/RTL cosimulation failure, see Xilinx Support for support
resources, such as answers, documentation, downloads, and forums.
Setting up the Environment
Check the environment setup as shown in the following table.
Table 1-17: Debugging Environment Setup
Question: Are you using a third-party simulator?
Action to take: Ensure the path to the simulator executable is specified in the system
search path.
Note: When using the Vivado simulator, you do not need to specify a search path.
Question: Are you running Linux?
Action to take: Ensure that your setup files (for example .cshrc or .bashrc) do not
have a change directory command. When C/RTL cosimulation starts, it
spawns a new shell process. If there is a cd command in your setup files,
it causes the shell to run in a different location and eventually C/RTL
cosimulation fails.
Optimization Directives
Check the optimization directives as shown in the following table.
Table 1-18: Debugging Optimization Directives
Question: Are you using the DEPENDENCE directive?
Action to take: Remove the DEPENDENCE directives from the design to see if C/RTL
cosimulation passes. If cosimulation passes, it likely indicates that the
TRUE or FALSE setting for the DEPENDENCE directive is incorrect.
Question: Does the design use volatile pointers on the top-level interface?
Action to take: Ensure the DEPTH option is specified on the INTERFACE directive. When
volatile pointers are used on the interface, you must specify the number
of read/writes performed on the port in each transaction or each
execution of the C function.
Question: Are you using FIFOs with the DATAFLOW optimization?
Actions to take:
• Check to see if C/RTL cosimulation passes with the standard
ping-pong buffers.
• Check to see if C/RTL cosimulation passes without specifying the size
for the FIFO channels. This ensures that the channel defaults to the
size of the array in the C code.
• Reduce the size of the FIFO channels until C/RTL cosimulation stalls.
Stalling indicates a channel size that is too small. Review your design
to determine the optimal size for the FIFOs. You can use the STREAM
directive to specify the size of individual FIFOs.
Question: Are you using supported interfaces?
Action to take: Ensure you are using supported interface modes. For details, see
Interface Synthesis Requirements.
Question: Are you applying multiple optimization directives to arrays on the interface?
Action to take: Ensure you are using optimizations that are designed to work together.
For details, see Unsupported Optimizations for Cosimulation.
C Test Bench and C Source Code
Check the C test bench and C source code as shown in the following table.
Table 1-19: Debugging the C Test Bench and C Source Code
Question: Does the C test bench check the results and return the value 0 (zero) if the
results are correct?
Action to take: Ensure the C test bench returns the value 0 for C/RTL cosimulation. Even
if the results are correct, the C/RTL cosimulation feature reports a failure
if the C test bench fails to return the value 0.
Question: Is the C test bench creating input data based on a random number?
Action to take: Change the test bench to use a fixed seed for any random number
generation. If the seed for random number generation is based on a
variable, such as a time-based seed, the data used for simulation is
different each time the test bench is executed, and the results are
different.
Question: Are you using pointers on the top-level interface that are accessed multiple
times?
Action to take: Use a volatile pointer for any pointer that is accessed multiple times
within a single transaction (one execution of the C function). If you do
not use a volatile pointer, everything except the first read and last
write is optimized out to adhere to the C standard.
Question: Does the C code contain undefined values or perform out-of-bounds array
accesses?
Actions to take:
• Confirm all arrays are correctly sized to match all accesses. Loop
bounds that exceed the size of the array are a common source of
issues (for example, N accesses for an array sized at N-1).
• Confirm that the results of the C simulation are as expected and that
output values were not assigned random data values.
• Consider using the industry-standard Valgrind application outside of
the Vivado HLS design environment to confirm that the C code does
not have undefined or out-of-bounds issues.
Note: It is possible for a C function to execute and complete even if some
variables are undefined or are out-of-bounds. In the C simulation, undefined
values are assigned a random number. In the RTL simulation, undefined values
are assigned an unknown or X value.
Question: Are you using floating-point math operations in the design?
Actions to take:
• Check that the C test bench results are within an acceptable error
range instead of performing an exact comparison. For some of the
floating point math operations, the RTL implementation is not
identical to the C. For details, see Verification and Math Functions in
Chapter 2.
• Ensure that the RTL simulation models for the floating-point cores are
provided to the third-party simulator. For details, see Simulating IP
Cores.
Question: Are you using Xilinx IP blocks and a third-party simulator?
Action to take: Ensure that the path to the Xilinx IP HDL models is provided to the
third-party simulator.
Question: Are you using the hls::stream construct in the design that changes
the data rate (for example, decimation or interpolation)?
Action to take: Analyze the design and use the STREAM directive to increase the size of
the FIFOs used to implement the hls::stream.
Note: By default, an hls::stream is implemented as a FIFO with a depth of 1.
If the design results in an increase in the data rate (for example, an interpolation
operation), a default FIFO size of 1 might be too small and cause the C/RTL
cosimulation to stall.
Question: Are you using very large data sets in the simulation?
Action to take: Use the reduce_diskspace option when executing C/RTL
cosimulation. In this mode, Vivado HLS only executes 1 transaction at a
time. The simulation might run marginally slower, but this limits storage
and system capacity issues.
Note: The C/RTL cosimulation feature verifies all transactions at one time. If the
top-level function is called multiple times (for example, to simulate multiple
frames of video), the data for the entire simulation input and output is stored on
disk. Depending on the machine setup and OS, this might cause performance or
execution issues.
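Several of the test bench checks in the preceding table (fixed random seed, comparison within an error range, and a return value of 0 on success) can be combined in one self-checking routine. The following is a minimal sketch under stated assumptions: scale_top is a hypothetical stand-in for the function under test, and FIXED_SEED, TOLERANCE, and N are illustrative values, not values taken from Vivado HLS.

```c
#include <stdlib.h>

#define N          8
#define FIXED_SEED 1234u   /* hypothetical; any constant seed gives repeatable data */
#define TOLERANCE  1e-5f   /* hypothetical acceptable error range */

/* Hypothetical stand-in for the top-level function to be synthesized. */
void scale_top(const float in[N], float out[N]) {
    for (int i = 0; i < N; ++i) {
        out[i] = in[i] * 0.5f;
    }
}

/* Returns 0 when every result is within TOLERANCE, 1 otherwise. */
int run_test_bench(void) {
    float in[N];
    float out[N];
    int errors = 0;

    srand(FIXED_SEED);                 /* fixed seed: same stimuli on every run */
    for (int i = 0; i < N; ++i) {
        in[i] = (float)(rand() % 1000) / 10.0f;
    }

    scale_top(in, out);

    for (int i = 0; i < N; ++i) {
        float diff = out[i] - in[i] * 0.5f;
        if (diff < 0.0f) {
            diff = -diff;              /* absolute error */
        }
        if (diff > TOLERANCE) {
            ++errors;                  /* tolerance check, not an exact compare */
        }
    }
    return errors == 0 ? 0 : 1;
}
```

In a real test bench, main() would end with a statement such as return run_test_bench(); so that the value 0 reaches the C/RTL cosimulation feature only when the results are correct.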
Exporting the RTL Design
The final step in the Vivado HLS flow is to export the RTL design as a block of Intellectual
Property (IP) which can be used by other tools in the Xilinx design flow. The RTL design can
be packaged into the following output formats:
• IP Catalog formatted IP for use with the Vivado Design Suite
• System Generator for DSP IP for use with Vivado System Generator for DSP
• Synthesized Checkpoint (.dcp)
The following table shows the formats you can export with details about each.
Table 1-20: RTL Export Selections
| Format Selection | Subfolder | Comments |
| IP Catalog | ip | Contains a ZIP file which can be added to the Vivado IP Catalog. The ip folder also contains the contents of the ZIP file (unzipped). This option is not available for FPGA devices older than 7-series or Zynq-7000 AP SoC. |
| System Generator for DSP | sysgen | This output can be added to the Vivado edition of System Generator for DSP. This option is not available for FPGA devices older than 7-series or Zynq-7000 AP SoC. |
| Synthesized Checkpoint (.dcp) | ip | This option creates Vivado checkpoint files which can be added directly into a design in the Vivado Design Suite. This option requires RTL synthesis to be performed. When this option is selected, the flow option with setting syn is automatically selected. The output includes an HDL wrapper you can use to instantiate the IP into an HDL file. |
In addition to the packaged output formats, the RTL files are available as standalone files
(not part of a packaged format) in the verilog and vhdl directories located within the
implementation directory //impl.
In addition to the RTL files, these directories also contain project files for the Vivado Design
Suite. Opening the file project.xpr causes the design (Verilog or VHDL) to be opened in
a Vivado project where the design may be analyzed. If C/RTL Cosimulation was executed in
the Vivado HLS project, the C/RTL Cosimulation files are available inside the Vivado
project.
Synthesizing the RTL
When Vivado HLS reports on the results of synthesis, it provides an estimation of the results
expected after RTL synthesis: the expected clock frequency, the expected number of
registers, LUTs and block RAMs. These results are estimations because Vivado HLS cannot
know what exact optimizations RTL synthesis performs or what the actual routing delays
will be, and hence cannot know the final area and timing values.
Before exporting a design, you have the opportunity to execute logic synthesis and confirm
the accuracy of the estimates. The flow option shown in the following figure invokes RTL
synthesis (the syn option) or RTL synthesis and implementation (the impl option)
during the export process, taking the RTL design to gates or to the placed and routed
implementation.
Note: The RTL synthesis option is provided to confirm the reported estimates. In most cases, these
RTL results are not included in the packaged IP.
Figure 1-75: Export RTL Dialog Box
For most export formats, the RTL synthesis is executed in the verilog or vhdl directories,
whichever HDL was chosen for RTL synthesis using the drop-down menu in the preceding
figure, but the results of RTL synthesis are not included in the packaged IP.
Synthesized Checkpoint (.dcp), a design checkpoint, is always exported as synthesized RTL.
The flow option may be used to evaluate the results of synthesis or implementation, but the
exported package always contains a synthesized netlist.
Packaging IP Catalog Format
Upon completion of synthesis and RTL verification, open the Export RTL dialog box by
clicking the Export RTL toolbar button.
Select the IP Catalog format in the Format Selection section.
The configuration options allow the following identification tags to be embedded in
the exported package. These fields can be used to help identify the packaged RTL inside the
Vivado IP Catalog.
The configuration information is used to differentiate between multiple instances of the
same design when the design is loaded into the IP Catalog. For example, if an
implementation is packaged for the IP Catalog and then a new solution is created and
packaged as IP, the new solution by default has the same name and configuration
information. If the new solution is also added to the IP Catalog, the IP Catalog will identify
it as an updated version of the same IP and the last version added to the IP Catalog will be
used.
An alternative method is to use the prefix option in the config_rtl configuration to
rename the output design and files with a unique prefix.
If no values are provided in the configuration setting, the following values are used:
• Vendor: xilinx.com
• Library: hls
• Version: 1.0
• Description: An IP generated by Vivado HLS
• Display Name: This field is left blank by default
• Taxonomy: This field is left blank by default
After the packaging process is complete, the .zip file archive in directory
//impl/ip can be imported into the Vivado IP
catalog and used in any Vivado design (RTL or IP Integrator).
Software Driver Files
For designs that include AXI4-Lite slave interfaces, a set of software driver files is created
during the export process. These C driver files can be included in an SDK C project and used
to access the AXI4-Lite slave port.
The software driver files are written to directory
//impl/ip/drivers and are included in the
package .zip archive. Refer to AXI4-Lite Interface for details on the C driver files.
Exporting IP to System Generator
Upon completion of synthesis and RTL verification, open the Export RTL dialog box by
clicking the Export RTL toolbar button.
Figure 1-76: Export RTL to System Generator
If post-place-and-route resource and timing statistics for the IP block are desired, select
the Flow option and select the desired RTL language.
Pressing OK generates the IP package. This package is written to the
//impl/sysgen directory and contains
everything needed to import the design into System Generator.
If the Flow option was selected, RTL synthesis is executed and the final timing and
resources are reported but not included in the IP package. See the RTL synthesis section above
for more details on this process.
Importing the RTL into System Generator
A Vivado HLS generated System Generator package may be imported into System Generator
using the following steps:
1. Inside the System Generator design, right-click and use the XilinxBlockAdd option to
instantiate a new block.
2. Scroll down the list in the dialog box and select Vivado HLS.
3. Double-click on the newly instantiated Vivado HLS block to open the Block Parameters
dialog box.
4. Browse to the solution directory where the Vivado HLS block was exported. Using the
example, //impl/sysgen, browse to the
/ directory and select Apply.
Optimizing Ports
If any top-level function arguments are transformed during the synthesis process into a
composite port, the type information for that port cannot be determined and included in
the System Generator IP block.
The implication of this limitation is that any design that uses the reshape, mapping or data
packing optimization on ports must have the port type information for these composite
ports manually specified in System Generator.
To manually specify the type information in System Generator, you should know how the
composite ports were created and then use slice and reinterpretation blocks inside System
Generator when connecting the Vivado HLS block to other blocks in the system.
For example, assume three 8-bit in-out ports R, G and B are packed into a 24-bit input port
(RGB_in) and a 24-bit output port (RGB_out).
After the IP block has been included in System Generator:
• The 24-bit input port (RGB_in) would need to be driven by a System Generator block
that correctly groups three 8-bit input signals (Rin, Gin and Bin) into a 24-bit input bus.
• The 24-bit output bus (RGB_out) would need to be correctly split into three 8-bit
signals (Rout, Gout and Bout).
See the System Generator documentation for details on how to use the slice and
reinterpretation blocks for connecting to composite type ports.
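The grouping and splitting described above amounts to shifting and masking. The sketch below models it in plain C for reference. The bit order is an assumption made only for illustration (R in bits 23:16, G in bits 15:8, B in bits 7:0); the actual order depends on how the composite port was created, and the function names are hypothetical.

```c
#include <stdint.h>

/* Group three 8-bit values into one 24-bit bus value.
 * Assumed layout (hypothetical): R = bits 23:16, G = bits 15:8, B = bits 7:0. */
uint32_t pack_rgb(uint8_t r, uint8_t g, uint8_t b) {
    return ((uint32_t)r << 16) | ((uint32_t)g << 8) | (uint32_t)b;
}

/* Split a 24-bit bus value back into its 8-bit slices. */
uint8_t unpack_r(uint32_t rgb) { return (uint8_t)((rgb >> 16) & 0xFFu); }
uint8_t unpack_g(uint32_t rgb) { return (uint8_t)((rgb >> 8) & 0xFFu); }
uint8_t unpack_b(uint32_t rgb) { return (uint8_t)(rgb & 0xFFu); }
```

If the composite port was built with a different ordering, only the shift amounts change; the slice widths stay at 8 bits.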
Exporting a Synthesized Checkpoint
Upon completion of synthesis and RTL verification, open the Export RTL dialog box by
clicking the Export RTL toolbar button.
Figure 1-77: Export RTL to Synthesized Checkpoint
When the design is packaged as a design checkpoint IP, the design is first synthesized
before being packaged.
Selecting OK generates the design checkpoint package. This package is written to the
//impl/ip directory. The design checkpoint files
can be used in a Vivado Design Suite project in the same manner as any other design
checkpoint.
Chapter 2
High-Level Synthesis C Libraries
Introduction to the Vivado HLS C Libraries
Vivado® HLS C libraries allow common hardware design constructs and functions to be
easily modeled in C and synthesized to RTL. The following C libraries are provided with
Vivado HLS:
• Arbitrary Precision Data Types Library
• HLS Stream Library
• HLS Math Library
• HLS Video Library
• HLS IP Library
• HLS Linear Algebra Library
• HLS DSP Library
You can use each of the C libraries in your design by including the library header file. These
header files are located in the include directory in the Vivado HLS installation area.
IMPORTANT: The header files for the Vivado HLS C libraries do not have to be in the include path if the
design is used in Vivado HLS. The paths to the library header files are automatically added.
Arbitrary Precision Data Types Library
C-based native data types are on 8-bit boundaries (8, 16, 32, 64 bits). RTL buses
(corresponding to hardware) support arbitrary lengths. HLS needs a mechanism to allow
the specification of arbitrary precision bit-width and not rely on the artificial boundaries of
native C data types: if a 17-bit multiplier is required, you should not be forced to implement
this with a 32-bit multiplier.
Vivado HLS provides both integer and fixed-point arbitrary precision data types for C and C++,
and supports the arbitrary precision data types which are part of SystemC.
The advantage of arbitrary precision data types is that they allow the C code to be updated
to use variables with smaller bit-widths and then for the C simulation to be re-executed to
validate the functionality remains identical or acceptable.
Using Arbitrary Precision Data Types
Vivado HLS provides arbitrary precision integer data types that manage the value of the
integer numbers within the boundaries of the specified width, as shown in the following
table.
Table 2-1: Integer Data Types
| Language | Integer Data Type | Required Header |
| C | [u]intN (1024 bits) | #include "ap_cint.h" |
| C++ | ap_[u]int<W> (1024 bits) | #include "ap_int.h" |
| System C | sc_[u]int<W> (64 bits), sc_big[u]int<W> (512 bits) | #include "systemc.h" |
Note: The header files that define the arbitrary precision types are also provided with Vivado HLS as a
standalone package with the rights to use them in your own source code. The package,
xilinx_hls_lib_.tgz is provided in the include directory in the Vivado
HLS installation area.
Arbitrary Integer Precision Types with C
For the C language, the header file ap_cint.h defines the arbitrary precision integer data
types [u]int.
Note: The package xilinx_hls_lib_.tgz does not include the C arbitrary
precision types defined in ap_cint.h. These types cannot be used with standard C compilers, only
with the Vivado HLS apcc compiler. More details on this are provided in Validating Arbitrary Precision
Types in C.
To use arbitrary precision integer data types in a C function:
• Add header file ap_cint.h to the source code.
• Change the bit types to intN for signed types or uintN for unsigned types, where N is
a bit-size from 1 to 1024.
The following example shows how the header file is added and two variables implemented
to use 9-bit integer and 10-bit unsigned integer types:
#include "ap_cint.h"
void foo_top (…) {
  int9 var1;     // 9-bit
  uint10 var2;   // 10-bit unsigned
  ...
}
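To make the effect of a 9-bit signed or 10-bit unsigned type concrete, the value ranges can be modeled with native 64-bit integers. This is a rough sketch using hypothetical helper names; it does not use ap_cint.h and only imitates the wrap-around behavior of an N-bit variable for widths from 1 to 63.

```c
#include <stdint.h>

/* Keep only the low `bits` bits of v, as an unsigned N-bit variable would. */
uint64_t wrap_unsigned(uint64_t v, unsigned bits) {
    return bits >= 64 ? v : (v & ((UINT64_C(1) << bits) - 1));
}

/* Keep the low `bits` bits of v and sign-extend from bit (bits - 1),
 * as a signed N-bit variable would. Valid for bits in the range 1..63. */
int64_t wrap_signed(int64_t v, unsigned bits) {
    uint64_t low  = wrap_unsigned((uint64_t)v, bits);
    uint64_t sign = UINT64_C(1) << (bits - 1);
    return (int64_t)(low ^ sign) - (int64_t)sign;
}
```

With these helpers, a 9-bit signed value covers -256 to 255 (wrap_signed(256, 9) yields -256), and a 10-bit unsigned value covers 0 to 1023 (wrap_unsigned(1024, 10) yields 0).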
Arbitrary Integer Precision Types with C++
The header file ap_int.h defines the C++ arbitrary precision integer data types
ap_[u]int<W> listed in Table 2-1. To use arbitrary precision integer data types in a
C++ function:
• Add header file ap_int.h to the source code.
• Change the bit types to ap_int<N> for signed types or ap_uint<N> for unsigned
types, where N is a bit-size from 1 to 1024.
The following example shows how the header file is added and two variables implemented
to use 9-bit integer and 10-bit unsigned integer types:
#include "ap_int.h"
void foo_top (…) {
  ap_int<9> var1;     // 9-bit
  ap_uint<10> var2;   // 10-bit unsigned
  ...
}
Arbitrary Precision Integer Types with SystemC
The arbitrary precision types used by SystemC are defined in the systemc.h header file
that is required to be included in all SystemC designs. The header file includes the SystemC
sc_int<>, sc_uint<>, sc_bigint<> and sc_biguint<> types.
Arbitrary Precision Fixed-Point Data Types
In Vivado HLS, it is important to use fixed-point data types, because the behavior of the
C++/SystemC simulations performed using fixed-point data types matches that of the
resulting hardware created by synthesis. This allows you to analyze the effects of
bit-accuracy, quantization, and overflow with fast C-level simulation.
Vivado HLS offers arbitrary precision fixed-point data types for use with C++ and SystemC
functions as shown in the following table.
Table 2-2: Fixed-Point Data Types
| Language | Fixed-Point Data Type | Required Header |
| C | -- Not Applicable -- | -- Not Applicable -- |
| C++ | ap_[u]fixed<W,I,Q,O,N> | #include "ap_fixed.h" |
| System C | sc_[u]fixed<W,I,Q,O,N> | #define SC_INCLUDE_FX, [#define SC_FX_EXCLUDE_OTHER], #include "systemc.h" |
These data types manage the value of real (non-integer) numbers within the boundaries of
a specified total width and integer width, as shown in the following figure.
[Figure: bit positions from MSB to LSB, numbered I-1 ... 1 0 to the left of the binary point and -1 ... -B to the right; W = I + B]
Figure 2-1: Fixed-Point Data Type
The following table provides a brief overview of operations supported by fixed-point types.
Table 2-3: Fixed-Point Identifier Summary
| Identifier | Description |
| W | Word length in bits |
| I | The number of bits used to represent the integer value (the number of bits above the decimal point) |
| Q | Quantization mode. This dictates the behavior when greater precision is generated than can be defined by the smallest fractional bit in the variable used to store the result. |
| O | Overflow mode. This dictates the behavior when the result of an operation exceeds the maximum (or minimum in the case of negative numbers) possible value that can be stored in the variable used to store the result. |
| N | This defines the number of saturation bits in overflow wrap modes. |
Quantization modes (Q):
| SystemC Types | ap_fixed Types | Description |
| SC_RND | AP_RND | Round to plus infinity |
| SC_RND_ZERO | AP_RND_ZERO | Round to zero |
| SC_RND_MIN_INF | AP_RND_MIN_INF | Round to minus infinity |
| SC_RND_INF | AP_RND_INF | Round to infinity |
| SC_RND_CONV | AP_RND_CONV | Convergent rounding |
| SC_TRN | AP_TRN | Truncation to minus infinity (default) |
| SC_TRN_ZERO | AP_TRN_ZERO | Truncation to zero |
Overflow modes (O):
| SystemC Types | ap_fixed Types | Description |
| SC_SAT | AP_SAT | Saturation |
| SC_SAT_ZERO | AP_SAT_ZERO | Saturation to zero |
| SC_SAT_SYM | AP_SAT_SYM | Symmetrical saturation |
| SC_WRAP | AP_WRAP | Wrap around (default) |
| SC_WRAP_SM | AP_WRAP_SM | Sign magnitude wrap around |
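The difference between a few of the quantization and overflow modes can be illustrated with ordinary integer arithmetic. The following sketch does not use ap_fixed.h. It treats a value as a scaled integer, drops a given number of fractional bits, and imitates the behavior described for AP_TRN, AP_TRN_ZERO, AP_RND, AP_SAT, and AP_WRAP. All function names are hypothetical, and the helpers assume widths from 1 to 63 bits.

```c
#include <stdint.h>

/* Divide by 2^drop, rounding toward minus infinity. */
int64_t floor_div_pow2(int64_t x, unsigned drop) {
    int64_t d = INT64_C(1) << drop;
    int64_t q = x / d;          /* C division truncates toward zero */
    if (x % d < 0) {
        --q;                    /* step down when the remainder is negative */
    }
    return q;
}

/* AP_TRN-like: truncation to minus infinity. */
int64_t quant_trn(int64_t x, unsigned drop) {
    return floor_div_pow2(x, drop);
}

/* AP_TRN_ZERO-like: truncation to zero. */
int64_t quant_trn_zero(int64_t x, unsigned drop) {
    return x / (INT64_C(1) << drop);
}

/* AP_RND-like: round to plus infinity (add half an LSB, then floor).
 * Requires drop >= 1. */
int64_t quant_rnd(int64_t x, unsigned drop) {
    return floor_div_pow2(x + (INT64_C(1) << (drop - 1)), drop);
}

/* AP_SAT-like: clamp to the range of a signed `bits`-bit value. */
int64_t overflow_sat(int64_t v, unsigned bits) {
    int64_t max = (INT64_C(1) << (bits - 1)) - 1;
    int64_t min = -max - 1;
    return v > max ? max : (v < min ? min : v);
}

/* AP_WRAP-like: keep the low `bits` bits and sign-extend. */
int64_t overflow_wrap(int64_t v, unsigned bits) {
    uint64_t mask = (UINT64_C(1) << bits) - 1;
    uint64_t sign = UINT64_C(1) << (bits - 1);
    uint64_t low  = (uint64_t)v & mask;
    return (int64_t)(low ^ sign) - (int64_t)sign;
}
```

For example, dropping one fractional bit from the scaled value -5 (that is, -2.5) gives -3 with truncation to minus infinity, but -2 with truncation to zero or with rounding to plus infinity. Storing 40 in a 6-bit signed integer field gives 31 with saturation and -24 with wrap-around.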
Example Using ap_fixed
In this example, the Vivado HLS ap_fixed type is used to define an 18-bit variable with 6
bits representing the numbers above the decimal point and 12 bits representing the value
below the decimal point. The variable is specified as signed, the quantization mode is set to
round to plus infinity and the default wrap-around mode is used for overflow.
#include <ap_fixed.h>
...
ap_fixed<18,6,AP_RND> my_type;
...
Example Using sc_fixed
In this sc_fixed example, a 22-bit variable is shown with 21 bits representing the numbers
above the decimal point, which leaves a single fractional bit and therefore a minimum
accuracy of 0.5. Rounding to zero is used, such that any result less than 0.5 rounds to 0,
and saturation is specified for overflow.
#define SC_INCLUDE_FX
#define SC_FX_EXCLUDE_OTHER
#include <systemc.h>
...
sc_fixed<22,21,SC_RND_ZERO,SC_SAT> my_type;
...
C Arbitrary Precision Integer Data Types
The native data types in C are on 8-bit boundaries (8, 16, 32 and 64 bits). RTL signals and
operations support arbitrary bit-lengths. Vivado HLS provides arbitrary precision data types
for C to allow variables and operations in the C code to be specified with any arbitrary
bit-widths: for example, 6-bit, 17-bit, and 234-bit, up to 1024 bits.
Vivado HLS also provides arbitrary precision data types in C++ and supports the arbitrary
precision data types that are part of SystemC. These types are discussed in the respective
C++ and SystemC sections.
Advantages of C Arbitrary Precision Data Types
The primary advantages of arbitrary precision data types are:
• Better quality hardware
If, for example, a 17-bit multiplier is required, you can use arbitrary precision types to
require exactly 17 bits in the calculation.
Without arbitrary precision data types, a 17-bit multiplication must be
implemented using 32-bit integer data types. This results in the multiplication being
implemented with multiple DSP48 components.
• Accurate C simulation and analysis
Arbitrary precision data types in the C code allow the C simulation to be executed using
accurate bit-widths and to validate the functionality (and accuracy)
of the algorithm before synthesis.
For the C language, the header file ap_cint.h defines the arbitrary precision integer data
types [u]int#W. For example:
• int8 represents an 8-bit signed integer data type.
• uint234 represents a 234-bit unsigned integer type.
The ap_cint.h file is located in the directory:
$HLS_ROOT/include
where $HLS_ROOT is the Vivado HLS installation directory.
The code shown in the following example is a repeat of the code shown in the Example 3-20
on basic arithmetic. In both examples, the data types in the top-level function to be
synthesized are specified as dinA_t, dinB_t, etc.
#include "apint_arith.h"
void apint_arith(dinA_t inA, dinB_t inB, dinC_t inC, dinD_t inD,
                 dout1_t *out1, dout2_t *out2, dout3_t *out3, dout4_t *out4
) {
  // Basic arithmetic operations
  *out1 = inA * inB;
  *out2 = inB + inA;
  *out3 = inC / inA;
  *out4 = inD % inA;
}
Example 2-1: Basic Arithmetic Revisited
The real difference between the two examples is in how the data types are defined. To use
arbitrary precision integer data types in a C function:
• Add header file ap_cint.h to the source code.
• Change the native C types to arbitrary precision types intN or uintN, where N is a
bit size from 1 to 1024.
The data types are defined in the header apint_arith.h. See the following example
compared with Example 3-20:
• The input data types have been reduced to represent the maximum size of the real
input data. For example, 8-bit input inA is reduced to 6-bit input.
• The output types have been refined to be more accurate. For example, out2 (the sum
of inA and inB) needs to be only 13-bit, not 32-bit.
#include 
#include "ap_cint.h"
// Previous data types
//typedef char dinA_t;
//typedef short dinB_t;
//typedef int dinC_t;
//typedef long long dinD_t;
//typedef int dout1_t;
//typedef unsigned int dout2_t;
//typedef int32_t dout3_t;
//typedef int64_t dout4_t;
typedef int6 dinA_t;
typedef int12 dinB_t;
typedef int22 dinC_t;
typedef int33 dinD_t;
typedef int18 dout1_t;
typedef uint13 dout2_t;
typedef int22 dout3_t;
typedef int6 dout4_t;
void apint_arith(dinA_t inA,dinB_t inB,dinC_t inC,dinD_t inD,dout1_t
*out1,dout2_t *out2,dout3_t *out3,dout4_t *out4);
Example 2-2: Basic Arithmetic apint Types
Synthesizing the preceding example results in a design that is functionally identical to
Example 3-20 (given data in the range specified by the preceding example). The final RTL
design is smaller in area and has a faster clock speed, because smaller bit-widths result in
reduced logic.
The function must be compiled and validated before synthesis.
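The output widths chosen in Example 2-2 follow the usual bit-growth rules for two's complement arithmetic: a product needs the sum of the operand widths, and a sum needs one bit more than the wider operand. The helper functions below are a hypothetical sketch of that arithmetic, not part of any Vivado HLS library.

```c
/* Width needed to hold the product of an a_bits-wide and a b_bits-wide value. */
unsigned product_bits(unsigned a_bits, unsigned b_bits) {
    return a_bits + b_bits;
}

/* Width needed to hold the sum of an a_bits-wide and a b_bits-wide value. */
unsigned sum_bits(unsigned a_bits, unsigned b_bits) {
    return (a_bits > b_bits ? a_bits : b_bits) + 1u;
}
```

Applied to the 6-bit inA and 12-bit inB, these rules give 18 bits for out1 (the product) and 13 bits for out2 (the sum), which matches the widths used in the example.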
Validating Arbitrary Precision Types in C
To create arbitrary precision types, attributes are added to define the bit-sizes in file
ap_cint.h. Standard C compilers such as gcc compile the attributes used in the header
file, but they do not know what the attributes mean. This results in computations that do
not reflect the bit-accurate behavior of the code. For example, a 3-bit integer value with
binary representation 100 is treated by gcc (or any other third-party C compiler) as having
a decimal value 4 and not -4.
Note: This issue is only present when using C arbitrary precision types. There are no such issues with
C++ or SystemC arbitrary precision types.
Vivado HLS solves this issue by automatically using its own built-in C compiler apcc, when
it recognizes arbitrary precision C types are being used. This compiler is gcc compatible
but correctly interprets arbitrary precision types and arithmetic. You can invoke the apcc
compiler at the command prompt by replacing “gcc” by “apcc”.
$ apcc -o foo_top foo_top.c tb_foo_top.c
$ ./foo_top
When arbitrary precision types are used in C, the design can no longer be analyzed using
the Vivado HLS C debugger. If it is necessary to debug the design, Xilinx recommends one
of the following methodologies:
• Use the printf or fprintf functions to output the data values for analysis.
• Replace the arbitrary precision types with native C types (int, char, short, etc). This
approach helps debug the operation of the algorithm itself but does not help when you
must analyze the bit-accurate results of the algorithm.
• Change the C function to C++ and use C++ arbitrary precision types for which there
are no debugger limitations.
Integer Promotion
Take care when the result of arbitrary precision operations crosses the native 8, 16, 32 and
64-bit boundaries. In the following example, the intent is that two 18-bit values are
multiplied and the result stored in a 36-bit number:
#include "ap_cint.h"
int18 a,b;
int36 tmp;
tmp = a * b;
Integer promotion occurs when using this method. The result might not be as expected.
In integer promotion, the C compiler:
• Promotes the multiplication inputs to the native integer size (32-bit).
• Performs multiplication, which generates a 32-bit result.
• Assigns the result to the 36-bit variable tmp.
This results in the behavior and incorrect result shown in the following figure.
Figure 2-2: Integer Promotion (figure not reproduced; it shows, in hex, the values of a and b,
the multiplication result, and the promoted result assigned to tmp)
Because Vivado HLS produces the same results as C simulation, Vivado HLS creates
hardware in which a 32-bit multiplier result is sign-extended to a 36-bit result.
To overcome the integer promotion issue, cast operator inputs to the output size. The
following example shows where the inputs to the multiplier are cast to a 36-bit value before
the multiplication. This results in the correct (expected) results during C simulation and the
expected 36-bit multiplication in the RTL.
#include "ap_cint.h"
typedef int18 din_t;
typedef int36 dout_t;
dout_t apint_promotion(din_t a,din_t b) {
    dout_t tmp;
    tmp = (dout_t)a * (dout_t)b;
    return tmp;
}
Example 2-3: Cast to Avoid Integer Promotion
Casting to avoid the integer promotion issue is required only when the result of an operation
is greater than the next native boundary (8, 16, 32, or 64). This behavior is more typical with
multipliers than with addition and subtraction operations.
There are no integer promotion issues when using C++ or SystemC arbitrary precision
types.
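The promotion effect described above can be reproduced without the Vivado HLS headers. The following is a minimal sketch in standard C++, assuming 18-bit signed inputs held in int32_t and a 36-bit result held in int64_t. The helper names (fits_int18, wrap_to_32, mul_promoted, mul_cast) are hypothetical and are not part of ap_cint.h; the 32-bit wrap is modeled explicitly because signed overflow is undefined in standard C++.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true if v is representable as a signed 18-bit value.
bool fits_int18(int32_t v) { return v >= -131072 && v <= 131071; }

// Interpret the low 32 bits of v as a two's complement number,
// modeling the output of a 32-bit multiplier.
int64_t wrap_to_32(int64_t v) {
    const uint64_t low = static_cast<uint64_t>(v) & 0xFFFFFFFFull;
    return low >= 0x80000000ull ? static_cast<int64_t>(low) - 0x100000000ll
                                : static_cast<int64_t>(low);
}

// Models "tmp = a * b": the product is limited to 32 bits and the
// 32-bit result is then sign-extended into the wider destination.
int64_t mul_promoted(int32_t a, int32_t b) {
    assert(fits_int18(a) && fits_int18(b));
    return wrap_to_32(static_cast<int64_t>(a) * static_cast<int64_t>(b));
}

// Models "tmp = (dout_t)a * (dout_t)b": the operands are widened first,
// so the full product is kept.
int64_t mul_cast(int32_t a, int32_t b) {
    assert(fits_int18(a) && fits_int18(b));
    return static_cast<int64_t>(a) * static_cast<int64_t>(b);
}
```

With both inputs at the 18-bit maximum (131071), mul_cast returns 17179607041, while mul_promoted returns -262143 because only the low 32 bits of the product survive before sign extension.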
C Arbitrary Precision Integer Types: Reference Information
C Arbitrary Precision Types in Chapter 4 provides information on:
•
Techniques for assigning constant and initialization values to arbitrary precision
integers (including values greater than 64-bit).
•
A description of Vivado HLS helper functions, such as printing, concatenating,
bit-slicing and range selection functions.
•
A description of operator behavior, including a description of shift operations (a
negative shift value results in a shift in the opposite direction).
C++ Arbitrary Precision Integer Types
The native data types in C++ are on 8-bit boundaries (8, 16, 32 and 64 bits). RTL signals and
operations support arbitrary bit-lengths.
Vivado HLS provides arbitrary precision data types for C++ to allow variables and
operations in the C++ code to be specified with any arbitrary bit-widths: 6-bit, 17-bit,
234-bit, up to 1024 bits.
TIP: The default maximum width allowed is 1024 bits. You can override this default by defining the
macro AP_INT_MAX_W with a positive integer value less than or equal to 32768 before inclusion of the
ap_int.h header file.
C++ supports use of the arbitrary precision types defined in the SystemC standard. Include
the SystemC header file systemc.h, and use SystemC data types. For more information on
SystemC types, see SystemC Synthesis in Chapter 3.
Arbitrary precision data types have two primary advantages over the native C++ types:
•
Better quality hardware: If, for example, a 17-bit multiplier is required, arbitrary
precision types can specify that exactly 17 bits are used in the calculation.
Without arbitrary precision data types, such a multiplication (17-bit) must be
implemented using 32-bit integer data types, which results in the multiplication being
implemented with multiple DSP48 components.
•
Accurate C++ simulation/analysis: Arbitrary precision data types in the C++ code
allow the C++ simulation to be performed using accurate bit-widths and allow the C++
simulation to validate the functionality (and accuracy) of the algorithm before
synthesis.
The arbitrary precision types in C++ have none of the disadvantages of those in C:
•
C++ arbitrary types can be compiled with standard C++ compilers (there is no C++
equivalent of apcc, as discussed in Validating Arbitrary Precision Types in C).
•
C++ arbitrary precision types do not suffer from Integer Promotion Issues.
It is not uncommon for users to change a file extension from .c to .cpp so the file can be
compiled as C++, where neither of these issues is present.
For the C++ language, the header file ap_int.h defines the arbitrary precision integer
data types ap_(u)int. For example, ap_int<8> represents an 8-bit signed integer
data type and ap_uint<234> represents a 234-bit unsigned integer type.
The ap_int.h file is located in the directory $HLS_ROOT/include, where $HLS_ROOT is the
Vivado HLS installation directory.
The code shown in the following example, is a repeat of the code shown in the earlier
example on basic arithmetic (Example 3-20 and again in Example 2-1). In this example the
data types in the top-level function to be synthesized are specified as dinA_t, dinB_t ...
#include "cpp_ap_int_arith.h"
void cpp_ap_int_arith(dinA_t inA, dinB_t inB, dinC_t inC, dinD_t inD,
    dout1_t *out1, dout2_t *out2, dout3_t *out3, dout4_t *out4
    ) {
    // Basic arithmetic operations
    *out1 = inA * inB;
    *out2 = inB + inA;
    *out3 = inC / inA;
    *out4 = inD % inA;
}
Example 2-4: Basic Arithmetic Revisited with C++ Types
In this latest update to this example, the C++ arbitrary precision types are used:
•
Add header file ap_int.h to the source code.
•
Change the native C++ types to arbitrary precision types ap_int<N> or ap_uint<N>,
where N is a bit-size from 1 to 1024 (as noted above, this can be extended to 32K-bits if
required).
The data types are defined in the header cpp_ap_int_arith.h as shown in Example 2-2.
Compared with Example 3-20, the input data types have simply been reduced to represent
the maximum size of the real input data (for example, 8-bit input inA is reduced to 6-bit
input). The output types have been refined to be more accurate, for example, out2, the
sum of inA and inB, need only be 13-bit and not 32-bit.
#ifndef _CPP_AP_INT_ARITH_H_
#define _CPP_AP_INT_ARITH_H_
#include <stdio.h>
#include "ap_int.h"
#define N 9
// Old data types
//typedef char dinA_t;
//typedef short dinB_t;
//typedef int dinC_t;
//typedef long long dinD_t;
//typedef int dout1_t;
//typedef unsigned int dout2_t;
//typedef int32_t dout3_t;
//typedef int64_t dout4_t;
typedef ap_int<6> dinA_t;
typedef ap_int<12> dinB_t;
typedef ap_int<22> dinC_t;
typedef ap_int<33> dinD_t;
typedef ap_int<18> dout1_t;
typedef ap_uint<13> dout2_t;
typedef ap_int<22> dout3_t;
typedef ap_int<6> dout4_t;
void cpp_ap_int_arith(dinA_t inA,dinB_t inB,dinC_t inC,dinD_t inD,dout1_t
*out1,dout2_t *out2,dout3_t *out3,dout4_t *out4);
#endif
Example 2-5: Basic Arithmetic with C++ Arbitrary Precision Types
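The 18-bit product and 13-bit sum widths used for out1 and out2 follow from how far the result of a signed operation can grow. The following is a minimal sketch in standard C++, assuming signed two's complement operands; the helper names are hypothetical and are not part of ap_int.h.

```cpp
#include <cassert>
#include <cstdint>

// Width that always holds the product of a w1-bit and a w2-bit signed value.
constexpr int product_width(int w1, int w2) { return w1 + w2; }

// Width that always holds the sum of a w1-bit and a w2-bit signed value.
constexpr int sum_width(int w1, int w2) { return (w1 > w2 ? w1 : w2) + 1; }

// Range limits of a w-bit signed two's complement value.
int64_t min_signed(int w) {
    assert(w >= 1 && w <= 62);
    return -(int64_t(1) << (w - 1));
}
int64_t max_signed(int w) {
    assert(w >= 1 && w <= 62);
    return (int64_t(1) << (w - 1)) - 1;
}

// True if v is representable in w signed bits.
bool fits(int64_t v, int w) { return v >= min_signed(w) && v <= max_signed(w); }

static_assert(product_width(6, 12) == 18, "6-bit x 12-bit product needs 18 bits");
static_assert(sum_width(6, 12) == 13, "6-bit + 12-bit sum needs 13 bits");
```

The worst cases confirm the rules: the product of the two most negative inputs, (-32) * (-2048) = 65536, fits in 18 bits but not in 17, and the sum of the two most negative inputs, -2080, fits in 13 bits but not in 12.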
If Example 2-4 is synthesized, it results in a design that is functionally identical to
Example 3-20 and Example 2-2. To keep the test bench as similar as possible to
Example 2-2, rather than using the C++ cout operator to output the results to a file, the
built-in ap_int method .to_int() is used to convert the ap_int results to integer types
used with the standard fprintf function.
fprintf(fp, "%d*%d=%d; %d+%d=%d; %d/%d=%d; %d mod %d=%d;\n",
    inA.to_int(), inB.to_int(), out1.to_int(),
    inB.to_int(), inA.to_int(), out2.to_int(),
    inC.to_int(), inA.to_int(), out3.to_int(),
    inD.to_int(), inA.to_int(), out4.to_int());
C++ Arbitrary Precision Integer Types: Reference Information
For comprehensive information on the methods, synthesis behavior, and all aspects of using
the ap_(u)int arbitrary precision data types, see C++ Arbitrary Precision Types in
Chapter 4. This section includes:
•
Techniques for assigning constant and initialization values to arbitrary precision
integers (including values greater than 1024-bit).
•
A description of Vivado HLS helper methods, such as printing, concatenating,
bit-slicing and range selection functions.
•
A description of operator behavior, including a description of shift operations (a
negative shift value results in a shift in the opposite direction).
C++ Arbitrary Precision Fixed-Point Types
C++ functions can take advantage of the arbitrary precision fixed-point types included with
Vivado HLS. The following figure summarizes the basic features of these fixed-point types:
•
The word can be signed (ap_fixed) or unsigned (ap_ufixed).
•
A word of any arbitrary size W can be defined.
•
The number of places above the decimal point, I, also defines the number of decimal
places in the word, W-I (represented by B in the following figure).
•
The type of rounding or quantization (Q) can be selected.
•
The overflow behavior (O and N) can be selected.
Figure 2-3: Arbitrary Precision Fixed-Point Types (figure not reproduced; it shows an
ap_[u]fixed<W,I,Q,O,N> word with I bits above the binary point and B bits below it, where W = I + B)
The arbitrary precision fixed-point types can be used when header file ap_fixed.h is
included in the code.
TIP: Arbitrary precision fixed-point types use more memory during C simulation. If using very large
arrays of ap_[u]fixed types, refer to the discussion of C simulation in Arrays in Chapter 3.
The advantages of using fixed-point types are:
•
They allow fractional numbers to be easily represented.
•
When variables have a different number of integer and decimal place bits, the
alignment of the decimal point is handled.
•
There are numerous options to handle how rounding should happen when there are
too few decimal bits to represent the precision of the result.
•
There are numerous options to handle how variables should overflow when the result
is greater than the number of integer bits can represent.
These attributes are summarized by examining the code in Example 2-6. First, the header
file ap_fixed.h is included. The ap_fixed types are then defined using the typedef
statement:
•
A 10-bit input: 8-bit integer value with 2 decimal places.
•
A 6-bit input: 3-bit integer value with 3 decimal places.
•
A 22-bit variable for the accumulation: 17-bit integer value with 5 decimal places.
•
A 36-bit variable for the result: 30-bit integer value with 6 decimal places.
The function contains no code to manage the alignment of the decimal point after
operations are performed. The alignment is done automatically.
#include "ap_fixed.h"
typedef ap_ufixed<10,8, AP_RND, AP_SAT> din1_t;
typedef ap_fixed<6,3, AP_RND, AP_WRAP> din2_t;
typedef ap_fixed<22,17, AP_TRN, AP_SAT> dint_t;
typedef ap_fixed<36,30> dout_t;
dout_t cpp_ap_fixed(din1_t d_in1, din2_t d_in2) {
    static dint_t sum;
    sum += d_in1;
    return sum * d_in2;
}
Example 2-6: ap_fixed Type Example
The following table shows the quantization and overflow modes. For detailed information,
see C++ Arbitrary Precision Fixed-Point Types in Chapter 4.
TIP: Quantization and overflow modes that do more than the default behavior of standard hardware
arithmetic (wrap and truncate) result in operators with more associated hardware. It costs logic (LUTs)
to implement the more advanced modes, such as round to minus infinity or saturate symmetrically.
Table 2-4: Fixed-Point Identifier Summary

Identifier   Description
W            Word length in bits
I            The number of bits used to represent the integer value (the number of bits above the
             decimal point)
Q            Quantization mode dictates the behavior when greater precision is generated than can
             be defined by the smallest fractional bit in the variable used to store the result.
             Mode              Description
             AP_RND            Rounding to plus infinity
             AP_RND_ZERO       Rounding to zero
             AP_RND_MIN_INF    Rounding to minus infinity
             AP_RND_INF        Rounding to infinity
             AP_RND_CONV       Convergent rounding
             AP_TRN            Truncation to minus infinity (default)
             AP_TRN_ZERO       Truncation to zero
O            Overflow mode dictates the behavior when more bits are generated than the variable to
             store the result contains.
             Mode              Description
             AP_SAT            Saturation
             AP_SAT_ZERO       Saturation to zero
             AP_SAT_SYM        Symmetrical saturation
             AP_WRAP           Wrap around (default)
             AP_WRAP_SM        Sign magnitude wrap around
N            The number of saturation bits in wrap modes.
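To make the quantization and overflow modes concrete, the following is a minimal sketch in standard C++ that models four of them on a signed word of W bits with I integer bits. It assumes that AP_TRN behaves as a floor, that AP_RND rounds to nearest with ties toward plus infinity, and that overflow handling is applied after quantization. The enum and function names are hypothetical and are not part of ap_fixed.h; the real types remain the reference for bit-accurate behavior.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-ins for the AP_TRN/AP_RND and AP_WRAP/AP_SAT modes.
enum class Quant { Trn, Rnd };
enum class Ovf { Wrap, Sat };

// Quantize `value` to a signed fixed-point format with W total bits,
// of which I are integer bits (sign included) and W-I are fractional.
double quantize_signed(double value, int W, int I, Quant q_mode, Ovf o_mode) {
    assert(W >= 1 && W <= 52 && I <= W);
    const double scale = std::ldexp(1.0, W - I);        // 2^(W-I)
    const double scaled = value * scale;

    // Quantization: drop the precision below the smallest fractional bit.
    double raw = (q_mode == Quant::Rnd) ? std::floor(scaled + 0.5)
                                        : std::floor(scaled);

    // Overflow: keep the raw integer within the W-bit signed range.
    const double max_raw = std::ldexp(1.0, W - 1) - 1.0;
    const double min_raw = -std::ldexp(1.0, W - 1);
    if (o_mode == Ovf::Sat) {
        if (raw > max_raw) raw = max_raw;
        if (raw < min_raw) raw = min_raw;
    } else {
        const double modulus = std::ldexp(1.0, W);      // 2^W
        raw -= modulus * std::floor((raw - min_raw) / modulus);
    }
    return raw / scale;
}
```

For a 6-bit word with 3 integer bits (the shape of din2_t in Example 2-6), 1.45 truncates to 1.375 and rounds to 1.5, while 5.0 saturates to 3.875 and wraps to -3.0.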
Using ap_(u)fixed types, the C++ simulation is bit accurate. Fast simulation can validate
the algorithm and its accuracy. After synthesis, the RTL exhibits the identical bit-accurate
behavior.
Arbitrary precision fixed-point types can be freely assigned literal values in the code. See
the test bench (Example 2-7) used with Example 2-6, in which the values of in1 and
in2 are declared and assigned constant values.
When assigning literal values involving operators, the literal values must first be cast to
ap_(u)fixed types. Otherwise, the C compiler and Vivado HLS interpret the literal as an
integer or float/double type and may fail to find a suitable operator. As shown in the
following example, in the assignment of in1 = in1 + din1_t(0.25), the literal 0.25 is
cast to an ap_fixed type.
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <iomanip>
#include <iostream>
using namespace std;
#include "ap_fixed.h"
typedef ap_ufixed<10,8, AP_RND, AP_SAT> din1_t;
typedef ap_fixed<6,3, AP_RND, AP_WRAP> din2_t;
typedef ap_fixed<22,17, AP_TRN, AP_SAT> dint_t;
typedef ap_fixed<36,30> dout_t;
dout_t cpp_ap_fixed(din1_t d_in1, din2_t d_in2);
int main()
{
    ofstream result;
    din1_t in1 = 0.25;
    din2_t in2 = 2.125;
    dout_t output;
    int retval=0;
    result.open("result.dat");
    // Persistent manipulators
    result << right << fixed << setbase(10) << setprecision(15);
    for (int i = 0; i <= 250; i++)
    {
        output = cpp_ap_fixed(in1,in2);
        result << setw(10) << i;
        result << setw(20) << in1;
        result << setw(20) << in2;
        result << setw(20) << output;
        result << endl;
        in1 = in1 + din1_t(0.25);
        in2 = in2 - din2_t(0.125);
    }
    result.close();
    // Compare the results file with the golden results
    retval = system("diff --brief -w result.dat result.golden.dat");
    if (retval != 0) {
        printf("Test failed !!!\n");
        retval=1;
    } else {
        printf("Test passed !\n");
    }
    // Return 0 if the test passes
    return retval;
}
Example 2-7: ap_fixed Type Test Bench Coding Example
C++ Arbitrary Precision Fixed-Point Types: Reference Information
For comprehensive information on the methods, synthesis behavior, and all aspects of using
the ap_(u)fixed arbitrary precision fixed-point data types, see C++ Arbitrary
Precision Fixed-Point Types in Chapter 4. This section includes:
•
Techniques for assigning constant and initialization values to arbitrary precision
integers (including values greater than 1024-bit).
•
A detailed description of the overflow and saturation modes.
•
A description of Vivado HLS helper methods, such as printing, concatenating,
bit-slicing and range selection functions.
•
A description of operator behavior, including a description of shift operations (a
negative shift value results in a shift in the opposite direction).
HLS Stream Library
Streaming data is a type of data transfer in which data samples are sent in sequential order
starting from the first sample. Streaming requires no address management.
Modeling designs that use streaming data can be difficult in C. As discussed in Multi-Access
Pointer Interfaces: Streaming Data in Chapter 3, the approach of using pointers to perform
multiple read and/or write accesses can introduce issues, because there are implications for
the type qualifier and how the test bench is constructed.
Vivado HLS provides a C++ template class hls::stream<> for modeling streaming data
structures. The streams implemented with the hls::stream<> class have the following
attributes.
•
In the C code, an hls::stream<> behaves like a FIFO of infinite depth. There is no
requirement to define the size of an hls::stream<>.
•
They are read from and written to sequentially. That is, after data is read from an
hls::stream<>, it cannot be read again.
•
An hls::stream<> on the top-level interface is by default implemented with an
ap_fifo interface.
•
An hls::stream<> internal to the design is implemented as a FIFO with a depth of 1.
The optimization directive STREAM is used to change this default size.
This section shows how the hls::stream<> class can more easily model designs with
streaming data. The topics in this section provide:
•
An overview of modeling with streams and the RTL implementation of streams.
•
Rules for global stream variables.
•
How to use streams.
•
Blocking reads and writes.
•
Non-Blocking Reads and writes.
•
Controlling the FIFO depth.
Note: The hls::stream class should always be passed between functions as a C++ reference
argument. For example, &my_stream.
IMPORTANT: The hls::stream class is only used in C++ designs.
C Modeling and RTL Implementation
Streams are modeled as an infinite queue in software (and in the test bench during RTL
co-simulation). There is no need to specify any depth to simulate streams in C++. Streams
can be used inside functions and on the interface to functions. Internal streams may be
passed as function parameters.
Streams can be used only in C++ based designs. Each hls::stream<> object must be
written by a single process and read by a single process.
If an hls::stream is used on the top-level interface, it is by default implemented in the
RTL as a FIFO interface (ap_fifo) but may be optionally implemented as a handshake
interface (ap_hs) or an AXI-Stream interface (axis).
If an hls::stream is used inside the design function and synthesized into hardware, it is
implemented as a FIFO with a default depth of 1. In some cases, such as when interpolation
is used, the depth of the FIFO might have to be increased to ensure the FIFO can hold all the
elements produced by the hardware. Failure to ensure the FIFO is large enough to hold all
the data samples generated by the hardware can result in a stall in the design (seen in C/RTL
co-simulation and in the hardware implementation). The depth of the FIFO can be adjusted
using the STREAM directive with the depth option. An example of this is provided in the
example design hls_stream, as shown in Table 1-5.
IMPORTANT: Ensure hls::stream variables are correctly sized when used in the default
non-DATAFLOW regions.
If an hls::stream is used to transfer data between tasks (sub-functions or loops), you
should immediately consider implementing the tasks in a DATAFLOW region where data
streams from one task to the next. The default (non-DATAFLOW) behavior is to complete
each task before starting the next task, in which case the FIFOs used to implement the
hls::stream variables must be sized to ensure they are large enough to hold all the data
samples generated by the producer task. Failure to increase the size of the hls::stream
variables results in the error below:
ERROR: [XFORM 203-733] An internal stream xxxx.xxxx.V.user.V' with default size is
used in a non-dataflow region, which may result in deadlock. Please consider to
resize the stream using the directive 'set_directive_stream' or the 'HLS stream'
pragma.
This error informs you that, in a non-DATAFLOW region, the default FIFOs (with a depth of 1)
may not be large enough to hold all the data samples written to the FIFO by the producer task.
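The sizing requirement can be estimated by tracking how many samples sit in the FIFO over time. The following is a minimal sketch in standard C++, assuming an idealized schedule in which each event is a single write ('W') or a single read ('R') and ignoring hardware latency. The function name required_depth is hypothetical and is not a Vivado HLS API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns the peak number of samples held in the FIFO for a given
// sequence of producer writes ('W') and consumer reads ('R').
// The FIFO depth must be at least this value to avoid a stall.
std::size_t required_depth(const std::string& schedule) {
    std::size_t occupancy = 0;
    std::size_t peak = 0;
    for (char event : schedule) {
        if (event == 'W') {
            ++occupancy;
            if (occupancy > peak) peak = occupancy;
        } else {
            assert(event == 'R');      // only 'W' and 'R' are valid events
            assert(occupancy > 0);     // a read needs a sample to be present
            --occupancy;
        }
    }
    return peak;
}
```

When the producer completes before the consumer starts ("WWWWRRRR"), the FIFO must hold all four samples, whereas an interleaved schedule ("WRWRWRWR"), as in a DATAFLOW region, needs a depth of only 1.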
Global and Local Streams
Streams may be defined either locally or globally. Local streams are always implemented as
internal FIFOs. Global streams can be implemented as internal FIFOs or ports:
•
Globally-defined streams that are only read from, or only written to, are inferred as
external ports of the top-level RTL block.
•
Globally-defined streams that are both read from and written to (in the hierarchy below
the top-level function) are implemented as internal FIFOs.
Streams defined in the global scope follow the same rules as any other global variables. For
more information on the synthesis of global variables, see Data Types and Bit-Widths in
Chapter 1.
Using HLS Streams
To use hls::stream<> objects, include the header file hls_stream.h. Streaming data
objects are defined by specifying the type and variable name. In this example, a 128-bit
unsigned integer type is defined and used to create a stream variable called
my_wide_stream.
#include "ap_int.h"
#include "hls_stream.h"
typedef ap_uint<128> uint128_t; // 128-bit user defined type
hls::stream<uint128_t> my_wide_stream; // A stream declaration
Streams must use scoped naming. Xilinx recommends using the scoped hls:: naming
shown in the example above. However, if you want to use the hls namespace, you can
rewrite the preceding example as:
#include <ap_int.h>
#include <hls_stream.h>
using namespace hls;
typedef ap_uint<128> uint128_t; // 128-bit user defined type
stream<uint128_t> my_wide_stream; // hls:: no longer required
Given a stream specified as hls::stream<T>, the type T may be:
•
Any C++ native data type
•
A Vivado HLS arbitrary precision type (for example, ap_int<>, ap_ufixed<>)
•
A user-defined struct containing either of the above types
Note: General user-defined classes (or structures) that contain methods (member functions) should
not be used as the type (T) for a stream variable.
Streams may be optionally named. Providing a name for the stream allows the name to be
used in reporting. For example, Vivado HLS automatically checks to ensure all elements
from an input stream are read during simulation. Given the following two streams:
stream<uint8_t> bytestr_in1;
stream<uint8_t> bytestr_in2("input_stream2");
Any warnings about elements left in the streams are reported as follows, where it is clear
which message relates to bytestr_in2:
WARNING: Hls::stream 'hls::stream.1' contains leftover data, which
may result in RTL simulation hanging.
WARNING: Hls::stream 'input_stream2' contains leftover data, which may result in RTL
simulation hanging.
When streams are passed into and out of functions, they must be passed-by-reference as in
the following example:
void stream_function (
    hls::stream<uint8_t> &strm_out,
    hls::stream<uint8_t> &strm_in,
    uint16_t strm_len
)
Vivado HLS supports both blocking and non-blocking access methods.
•
Non-blocking accesses can be implemented only as FIFO interfaces.
•
Streaming ports that are implemented as ap_fifo ports and that are defined with an
AXI4-Stream resource must not use non-blocking accesses.
A complete design example using streams is provided in the Vivado HLS examples. Refer to
the hls_stream example in the design examples available from the GUI welcome screen.
Blocking Reads and Writes
The basic accesses to an hls::stream<> object are blocking reads and writes. These are
accomplished using class methods. These methods stall (block) execution if a read is
attempted on an empty stream FIFO, a write is attempted to a full stream FIFO, or until a full
handshake is accomplished for a stream mapped to an ap_hs interface protocol.
A stall can be observed in C/RTL co-simulation as the continued execution of the simulator
without any progress in the transactions. The following shows a classic example of a stall
situation, where the RTL simulation time keeps increasing, but there is no progress in the
inter or intra transactions:
// RTL Simulation : "Inter-Transaction Progress" ["Intra-Transaction Progress"] @
"Simulation Time"
///////////////////////////////////////////////////////////////////////////////////
// RTL Simulation : 0 / 1 [0.00%] @ "110000"
// RTL Simulation : 0 / 1 [0.00%] @ "202000"
// RTL Simulation : 0 / 1 [0.00%] @ "404000"
Blocking Write Methods
In this example, the value of variable src_var is pushed into the stream.
// Usage of void write(const T & wdata)
hls::stream<int> my_stream;
int src_var = 42;
my_stream.write(src_var);
The << operator is overloaded such that it may be used in a similar fashion to the stream
insertion operators for C++ stream (for example, iostreams and filestreams). The
hls::stream<> object to be written to is supplied as the left-hand side argument and the
value to be written as the right-hand side.
// Usage of void operator << (T & wdata)
hls::stream<int> my_stream;
int src_var = 42;
my_stream << src_var;
Blocking Read Methods
This method reads from the head of the stream and assigns the values to the variable
dst_var.
// Usage of void read(T &rdata)
hls::stream<int> my_stream;
int dst_var;
my_stream.read(dst_var);
Alternatively, the next object in the stream can be read by assigning (using for example =,
+=) the stream to an object on the left-hand side:
// Usage of T read(void)
hls::stream<int> my_stream;
int dst_var = my_stream.read();
The '>>' operator is overloaded to allow use similar to the stream extraction operator for
C++ stream (for example, iostreams and filestreams). The hls::stream is supplied as the
LHS argument and the destination variable the RHS.
// Usage of void operator >> (T & rdata)
hls::stream<int> my_stream;
int dst_var;
my_stream >> dst_var;
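The access methods above can be tried without the Vivado HLS headers by modeling the C simulation view of a stream, an unbounded FIFO. The following is a minimal sketch in standard C++; the class sketch::stream_model is a hypothetical stand-in and is not the hls::stream<> implementation. Because the model is unbounded, a write never blocks, and a read from an empty stream is treated as a usage error rather than a stall.

```cpp
#include <cassert>
#include <queue>

namespace sketch {

// Hypothetical stand-in for the C simulation behavior of a stream:
// an unbounded FIFO that is written and read sequentially.
template <typename T>
class stream_model {
public:
    // Push a value onto the tail of the FIFO.
    void write(const T& wdata) { fifo_.push(wdata); }

    // Pop and return the value at the head of the FIFO.
    T read() {
        assert(!fifo_.empty());   // in hardware, this read would block instead
        T value = fifo_.front();
        fifo_.pop();
        return value;
    }

    // Pop the head of the FIFO into rdata.
    void read(T& rdata) { rdata = read(); }

    // Operator forms, mirroring stream insertion and extraction.
    void operator<<(const T& wdata) { write(wdata); }
    void operator>>(T& rdata) { read(rdata); }

    bool empty() const { return fifo_.empty(); }

private:
    std::queue<T> fifo_;
};

}  // namespace sketch
```

Values come out in the order they were written, and once a value has been read it is no longer in the stream.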
Non-Blocking Reads and Writes
Non-blocking write and read methods are also provided. These allow execution to continue
even when a read is attempted on an empty stream or a write to a full stream.
These methods return a Boolean value indicating the status of the access (true if
successful, false otherwise). Additional methods are included for testing the status of an
hls::stream<> stream.
IMPORTANT: Non-blocking behavior is only supported on interfaces using the ap_fifo protocol.
More specifically, the AXI-Stream standard and the Xilinx ap_hs IO protocol do not support
non-blocking accesses.
During C simulation, streams have an infinite size. It is therefore not possible to validate
with C simulation if the stream is full. These methods can be verified only during RTL
simulation when the FIFO sizes are defined (either the default size of 1, or an arbitrary size
defined with the STREAM directive).
IMPORTANT: If the design is specified to use the block-level I/O protocol ap_ctrl_none and the design
contains any hls::stream variables that employ non-blocking behavior, C/RTL co-simulation is not
guaranteed to complete.
Non-Blocking Writes
This method attempts to push variable src_var into the stream my_stream, returning a
boolean true if successful. Otherwise, false is returned and the queue is unaffected.
// Usage of bool write_nb(const T & wdata)
hls::stream<int> my_stream;
int src_var = 42;
if (my_stream.write_nb(src_var)) {
// Perform standard operations
...
} else {
// Write did not occur
return;
}
Fullness Test
bool full(void)
Returns true, if and only if the hls::stream<> object is full.
// Usage of bool full(void)
hls::stream<int> my_stream;
int src_var = 42;
bool stream_full;
stream_full = my_stream.full();
Non-Blocking Read
bool read_nb(T & rdata)
This method attempts to read a value from the stream, returning true if successful.
Otherwise, false is returned and the queue is unaffected.
// Usage of bool read_nb(T & rdata)
hls::stream<int> my_stream;
int dst_var;
if (my_stream.read_nb(dst_var)) {
// Perform standard operations
...
} else {
// Read did not occur
return;
}
Emptiness Test
bool empty(void)
Returns true if the hls::stream<> is empty.
// Usage of bool empty(void)
hls::stream<int> my_stream;
int dst_var;
bool stream_empty;
stream_empty = my_stream.empty();
The following example shows how a combination of non-blocking accesses and full/empty
tests can provide error handling functionality when the RTL FIFOs are full or empty:
#include "hls_stream.h"
using namespace hls;
typedef struct {
    short data;
    bool valid;
    bool invert;
} input_interface;
bool invert(stream<input_interface>& in_data_1,
            stream<input_interface>& in_data_2,
            stream<short>& output
            ) {
    input_interface in;
    bool full_n;
    // Read an input value or return
    if (!in_data_1.read_nb(in))
        if (!in_data_2.read_nb(in))
            return false;
    // If the valid data is written, return not-full (full_n) as true
    if (in.valid) {
        if (in.invert)
            full_n = output.write_nb(~in.data);
        else
            full_n = output.write_nb(in.data);
    }
    return full_n;
}
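Because C simulation treats streams as infinite, the full condition can only be exercised in software with a depth-limited model. The following is a minimal sketch in standard C++ of a bounded FIFO with non-blocking accesses and status tests; the class sketch::bounded_fifo and its depth constructor argument are hypothetical and are not part of hls_stream.h.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

namespace sketch {

// Hypothetical depth-limited FIFO, modeling the RTL view of a stream.
template <typename T>
class bounded_fifo {
public:
    explicit bounded_fifo(std::size_t depth = 1) : depth_(depth) {
        assert(depth >= 1);
    }

    bool full() const { return data_.size() >= depth_; }
    bool empty() const { return data_.empty(); }

    // Returns false, leaving the FIFO unchanged, if there is no room.
    bool write_nb(const T& wdata) {
        if (full()) return false;
        data_.push_back(wdata);
        return true;
    }

    // Returns false, leaving rdata and the FIFO unchanged, if no data is present.
    bool read_nb(T& rdata) {
        if (empty()) return false;
        rdata = data_.front();
        data_.pop_front();
        return true;
    }

private:
    std::size_t depth_;
    std::deque<T> data_;
};

}  // namespace sketch
```

With the default depth of 1, a second write_nb before any read returns false, which is the situation the full_n return value in the preceding example reports.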
Controlling the RTL FIFO Depth
For most designs using streaming data, the default RTL FIFO depth of 1 is sufficient.
Streaming data is generally processed one sample at a time.
For multirate designs in which the implementation requires a FIFO with a depth greater than
1, you must determine (and set using the STREAM directive) the depth necessary for the RTL
simulation to complete. If the FIFO depth is insufficient, RTL co-simulation stalls.
Because stream objects cannot be viewed in the GUI directives pane, the STREAM directive
cannot be applied directly in that pane.
Right-click the function in which an hls::stream<> object is declared (or is used, or exists
in the argument list) to:
•
Select the STREAM directive.
•
Populate the variable field manually with name of the stream variable.
Alternatively, you can:
•
Specify the STREAM directive manually in the directives.tcl file, or
•
Add it as a pragma in source.
C/RTL Co-Simulation Support
The Vivado HLS C/RTL co-simulation feature does not support structures or classes
containing hls::stream<> members in the top-level interface. Vivado HLS supports
these structures or classes for synthesis.
typedef struct {
    hls::stream<uint8_t> a;
    hls::stream<uint16_t> b;
} strm_strct_t;
void dut_top(strm_strct_t indata, strm_strct_t outdata) { … }
These restrictions apply to both top-level function arguments and globally declared
objects. If structs of streams are used for synthesis, the design must be verified using an
external RTL simulator and user-created HDL test bench. There are no such restrictions on
hls::stream<> objects with strictly internal linkage.
HLS Math Library
The Vivado HLS Math Library (hls_math.h) provides support for the synthesis of the
standard C (math.h) and C++ (cmath) libraries and is automatically used to specify the
math operations during synthesis. The support includes floating point (single-precision,
double-precision and half-precision) for all functions and fixed-point support for some
functions.
The hls_math.h library can optionally be used in the C source in place of the standard C
math library. The difference between using the standard C math library and using the
hls_math.h math library in the C source is the accuracy of the results reported in C
simulation and C/RTL co-simulation.
HLS Math Library Accuracy
The HLS math functions are implemented as synthesizable bit-approximate functions from
the hls_math.h library. Bit-approximate HLS math library functions do not provide the
same accuracy as the standard C function. To achieve the desired result, the bit-approximate
implementation may use a different underlying algorithm than the standard C math library
version. The accuracy of the function is specified in terms of ULP (Unit of Least Precision).
This difference in accuracy has implications for both C simulation and C/RTL co-simulation.
The ULP difference is typically in the range of 1-4 ULP.
•
If the standard C math library is used in the C source code, there may be a difference
between the C simulation and the C/RTL co-simulation due to the fact that some
functions exhibit a ULP difference from the standard C math library.
•
If the HLS math library is used in the C source code, there will be no difference between
the C simulation and the C/RTL co-simulation. However, a C simulation using the HLS
math library may differ from a C simulation using the standard C math library.
The Verification and Math Functions section below details a number of options for verifying
the synthesized design will perform with the required accuracy.
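The following is a minimal sketch of how a test bench could express the difference between two single-precision results in ULP, so that it can be compared against the typical 1-4 ULP range mentioned above. The helper names ordered_bits and ulp_distance are hypothetical and are not part of hls_math.h. The sketch assumes IEEE 754 32-bit floats on the host and does not handle NaN inputs.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Map a float's bit pattern onto a signed integer line so that adjacent
// representable floats map to adjacent integers (+0.0f and -0.0f both map to 0).
static int64_t ordered_bits(float x) {
    int32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    return (bits < 0) ? static_cast<int64_t>(INT32_MIN) - bits
                      : static_cast<int64_t>(bits);
}

// Number of representable float steps between a and b.
// 0 means the values are identical (ignoring the sign of zero).
int64_t ulp_distance(float a, float b) {
    assert(!std::isnan(a) && !std::isnan(b));  // NaN is outside this sketch
    return std::llabs(ordered_bits(a) - ordered_bits(b));
}
```

A test bench could, for example, accept a result when ulp_distance(reference, result) is no greater than a limit chosen for the design.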
In addition, the following seven functions might show some differences, depending on the
C standard used to compile and run the C simulation:
•
copysign
•
fpclassify
•
isinf
•
isfinite
•
isnan
•
isnormal
•
signbit
C90 mode
Only isinf, isnan, and copysign are usually provided by the system header files, and
they operate on doubles. In particular, copysign always returns a double result. This might
result in unexpected results after synthesis if it must be returned to a float, because a
double-to-float conversion block is introduced into the hardware.
C99 mode (-std=c99)
All seven functions are usually provided under the expectation that the system header files
will redirect them to __isnan(double) and __isnan(float). The usual GCC header
files do not redirect isnormal, but implement it in terms of fpclassify.
C++ Using math.h
All seven are provided by the system header files, and they operate on doubles.
copysign always returns a double result. This might cause unexpected results after
synthesis if it must be returned to a float, because a double-to-float conversion block is
introduced into the hardware.
C++ Using cmath
Similar to C99 mode (-std=c99), except that:
°
The system header files are usually different.
°
The functions are properly overloaded for float and double arguments, for example:
-
isnan(double)
-
isinf(double)
copysign and copysignf are handled as built-ins even when using namespace std;.
C++ Using cmath and namespace std
No issues. Xilinx recommends using the following for best results:
•
-std=c99 for C
•
-fno-builtin for C and C++
Note: To specify the C compile options, such as -std=c99, use the Tcl command add_files with
the -cflags option. Alternatively, use the Edit CFLAGs button in the Project Settings dialog box.
The HLS Math Library
The following functions are provided in the HLS math library. Each function supports
half-precision (type half), single-precision (type float) and double precision (type
double).
IMPORTANT: For each function func listed below, there is also an associated half-precision
only function named half_func and single-precision only function named funcf
provided in the library.
When mixing half-precision, single-precision and double-precision data types, review
Common Synthesis Errors to prevent introducing type-conversion hardware in the final
FPGA implementation.
Trigonometric Functions
acos
acospi
asin
asinpi
atan
atan2
atan2pi
atanpi
cos
cospi
sin
sincos
sinpi
tan
tanpi
Hyperbolic Functions
acosh
asinh
sinh
tanh
atanh
cosh
Exponential Functions
exp
exp10
exp2
expm1
frexp
ldexp
modf
scalbln
scalbn
Logarithmic Functions
ilogb
log
log10
log1p
log2
logb
Power Functions
cbrt
hypot
pow
pown
powr
rootn
rsqrt
sqrt
Error Functions
erf
erfc
Gamma Functions
lgamma
lgamma_r
tgamma
Rounding Functions
ceil
floor
llrint
llround
lrint
lround
nearbyint
rint
round
trunc
Remainder Functions
fmod
remainder
remquo
Floating-point
copysign
nan
nextafter
nexttoward
fma
Difference Functions
fdim
fmax
fmin
maxmag
minmag
Other Functions
abs
divide
fabs
fract
mad
recip
Classification Functions
fpclassify
isfinite
isinf
isnan
isnormal
signbit
Comparison Functions
isgreater
isgreaterequal
isless
islessequal
islessgreater
isunordered
Relational Functions
all
any
bitselect
isnotequal
isordered
select
isequal
Fixed-Point Math Functions
Fixed-point implementations are also provided for the following math functions:
Trigonometric Functions
These are supported for ap_fixed data types with bit-width specification
ap_fixed<W,I> where W<=32:
cos
cospi
sin
sinpi
Exponential Functions
These are supported for ap_fixed data types with bit-width specifications
ap_fixed<16,8> and ap_fixed<8,4>:
exp
Power Functions
These are supported for ap_fixed data types with bit-width specification
ap_fixed<W,I> where W<=32:
sqrt
The fixed-point type provides a slightly-less accurate version of the function value, but a
smaller and faster RTL implementation.
The methodology for implementing a math function with fixed-point data types is:
1. Determine if a fixed-point implementation is supported.
2. Update the math functions to use ap_fixed types.
3. Perform C simulation to validate the design still operates with the required precision.
The C simulation is performed using the same bit-accurate types as the RTL
implementation.
4. Synthesize the design.
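As a rough illustration of step 3, the following plain C++ sketch estimates the error that a fixed-point format introduces before any ap_fixed code is written. It is not the ap_fixed implementation and not the algorithm used by the HLS fixed-point sin. The function names quantize_fixed and fixed_sin_error are hypothetical, and the sketch assumes truncation toward minus infinity and inputs that are already within the representable range.

```cpp
#include <cassert>
#include <cmath>

// Model of storing x in a signed fixed-point format with W total bits and
// I integer bits (sign bit included). Assumes truncation toward minus
// infinity and an input that does not overflow.
double quantize_fixed(double x, int W, int I) {
    assert(W > 0 && I <= W);
    const double scale = std::ldexp(1.0, W - I);            // 2^(fractional bits)
    const double lo = -std::ldexp(1.0, I - 1);              // most negative value
    const double hi = std::ldexp(1.0, I - 1) - 1.0 / scale; // most positive value
    assert(x >= lo && x <= hi);
    return std::floor(x * scale) / scale;
}

// Absolute error of evaluating sin on the quantized input and then
// quantizing the result, compared with double-precision sin(x).
double fixed_sin_error(double x, int W, int I) {
    const double xq = quantize_fixed(x, W, I);
    const double yq = quantize_fixed(std::sin(xq), W, I);
    return std::fabs(yq - std::sin(x));
}
```

Sweeping fixed_sin_error over the expected input range gives an early indication of whether a given width is likely to meet the required precision. The bit-accurate C simulation in step 3 remains the authoritative check.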
For example, a fixed-point implementation of the function sin is specified by using
fixed-point types with the math function as follows:
#include "hls_math.h"
#include "ap_fixed.h"
ap_fixed<32,2> my_input, my_output;
my_input = 1.2345; // must lie within [-2, 2), the range of ap_fixed<32,2>
my_output = sin(my_input);
When using fixed-point math functions, the result type must have the same width and
integer bits as the input.
Verification and Math Functions
If the standard C math library is used in the C source code, the C simulation results and the
C/RTL co-simulation results may be different: if any of the math functions in the source code
has a ULP difference from the standard C math library, the results may differ when
the RTL is simulated.
If the hls_math.h library is used in the C source code, the C simulation and C/RTL
co-simulation results are identical. However, the results of C simulation using hls_math.h
are not the same as those using the standard C libraries. The hls_math.h library simply
ensures the C simulation matches the C/RTL co-simulation results. In both cases, the same
RTL implementation is created. The following explains each of the possible options which
are used to perform verification when using math functions.
Verification Option 1: Standard Math Library and Verify Differences
In this option, the standard C math libraries are used in the source code. If any of the
functions synthesized do not have exact accuracy, the C/RTL co-simulation results differ from
the C simulation results. The following example highlights this approach.
#include <cmath>
using namespace std;
typedef float data_t;
data_t cpp_math(data_t angle) {
data_t s = sinf(angle);
data_t c = cosf(angle);
return sqrtf(s*s+c*c);
}
Example 2-8: Standard C Math Library Example
In this case, the results between C simulation and C/RTL co-simulation are different. Keep in
mind when comparing the outputs of simulation, any results written from the test bench are
written to the working directory where the simulation executes:
•
C simulation: Folder <project>/<solution>/csim/build
•
C/RTL co-simulation: Folder <project>/<solution>/sim/<RTL>
where <project> is the project folder, <solution> is the name of the solution folder and
<RTL> is the type of RTL verified (verilog or vhdl). The following figure shows a typical
comparison of the pre-synthesis results file on the left-hand side and the post-synthesis RTL
results file on the right-hand side. The output is shown in the third column.
Figure 2-4: Pre-Synthesis and Post-Synthesis Simulation Differences
The results of pre-synthesis simulation and post-synthesis simulation differ by fractional
amounts. You must decide whether these fractional amounts are acceptable in the final RTL
implementation.
The recommended flow for handling these differences is using a test bench that checks the
results to ensure that they lie within an acceptable error range. This can be accomplished by
creating two versions of the same function, one for synthesis and one as a reference
version. In this example, only function cpp_math is synthesized.
#include <cmath>
#include <cstdio>
using namespace std;
typedef float data_t;
data_t cpp_math(data_t angle) {
data_t s = sinf(angle);
data_t c = cosf(angle);
return sqrtf(s*s+c*c);
}
data_t cpp_math_sw(data_t angle) {
data_t s = sinf(angle);
data_t c = cosf(angle);
return sqrtf(s*s+c*c);
}
The test bench to verify the design compares the outputs of both functions to determine
the difference, using variable diff in the following example. During C simulation both
functions produce identical outputs. During C/RTL co-simulation, function cpp_math
produces different results and the difference in results is checked.
int main() {
    data_t angle = 0.01;
    data_t output, exp_output, diff;
    int retval=0;
    for (data_t i = 0; i <= 250; i++) {
        output = cpp_math(angle);
        exp_output = cpp_math_sw(angle);
        // Check for differences
        diff = ( (exp_output > output) ? exp_output - output : output - exp_output);
        if (diff > 0.0000005) {
            printf("Difference %.10f exceeds tolerance at angle %.10f \n", diff, angle);
            retval=1;
        }
        angle = angle + .1;
    }
    if (retval != 0) {
        printf("Test failed !!!\n");
        retval=1;
    } else {
        printf("Test passed !\n");
    }
    // Return 0 if the test passes
    return retval;
}
If the margin of difference is lowered to 0.00000005, this test bench highlights the margin
of error during C/RTL co-simulation:
Difference 0.0000000596 at angle 1.1100001335
Difference 0.0000000596 at angle 1.2100001574
Difference 0.0000000596 at angle 1.5100002289
Difference 0.0000000596 at angle 1.6100002527
etc..
When using the standard C math libraries (math.h and cmath), create a “smart” test
bench to verify that any differences in accuracy are acceptable.
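One possible shape for the comparison at the heart of such a test bench is sketched below. It combines an absolute tolerance, which is useful near zero, with a relative tolerance for larger magnitudes. The function names within_tolerance and count_mismatches are hypothetical, and the tolerance values are assumptions that must be chosen for each design.

```cpp
#include <cassert>
#include <cmath>

// True when dut is close enough to ref: within abs_tol of it, or within
// rel_tol of the larger of the two magnitudes. NaN never matches.
bool within_tolerance(float ref, float dut, float abs_tol, float rel_tol) {
    assert(abs_tol >= 0.0f && rel_tol >= 0.0f);
    if (std::isnan(ref) || std::isnan(dut)) return false;
    const float diff = std::fabs(ref - dut);
    const float mag = std::fmax(std::fabs(ref), std::fabs(dut));
    return diff <= abs_tol || diff <= rel_tol * mag;
}

// Count how many of the n samples fall outside the tolerance.
int count_mismatches(const float* ref, const float* dut, int n,
                     float abs_tol, float rel_tol) {
    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (!within_tolerance(ref[i], dut[i], abs_tol, rel_tol)) ++bad;
    return bad;
}
```

A test bench would typically return a non-zero value from main when count_mismatches is greater than zero, in the same way as the retval variable in the example above.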
Verification Option 2: HLS Math Library and Validate Differences
An alternative verification option is to convert the source code to use the HLS math library.
With this option, there are no differences between the C simulation and C/RTL
co-simulation results. The following example shows how the code above is modified to use
the hls_math.h library.
Note: This option is only available in C++.
•
Include the hls_math.h header file.
•
Replace the math functions with the equivalent hls:: function.
#include <cmath>
#include "hls_math.h"
using namespace std;
typedef float data_t;
data_t cpp_math(data_t angle) {
data_t s = hls::sinf(angle);
data_t c = hls::cosf(angle);
return hls::sqrtf(s*s+c*c);
}
With this verification option there is now a difference between the C simulation results
using the HLS math library and those previously obtained using the standard C math
libraries. These differences should be validated with C simulation using a “smart” test bench
similar to option 1.
In cases where there are many math functions and updating the code is painful, a third
option can be used.
Verification Option 3: HLS Math Library File and Validate Differences
Including the HLS math library file lib_hlsm.cpp as a design file ensures Vivado HLS uses
the HLS math library for C simulation. This option is identical to option 2; however, it does
not require the C code to be modified.
The HLS math library file is located in the src directory in the Vivado HLS installation area.
Simply copy the file to your local folder and add the file as a standard design file.
Note: This option is only available in C++.
As with option 2, with this option there is now a difference between the C simulation results
using the HLS math library file and those previously obtained without adding this file.
These differences should be validated with C simulation using a “smart” test bench similar to
option 1.
Common Synthesis Errors
The following are common use errors when synthesizing math functions. These are often
(but not exclusively) caused by converting C functions to C++ to take advantage of
synthesis for math functions.
C++ cmath
If the C++ cmath header file is used, the floating point functions (for example, sinf
and cosf) can be used. These result in 32-bit operations in hardware. The cmath header
file also overloads the standard functions (for example, sin and cos) so they can be used
for float and double types.
C math.h
If the C math.h library is used, the single-precision functions (for example, sinf and
cosf) are required to synthesize 32-bit floating point operations. All standard function calls
(for example, sin and cos) result in doubles and 64-bit double-precision operations being
synthesized.
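The type rules described in the two subsections above can be checked on the host compiler before synthesis. The sketch below assumes a C++11 or later host compiler and uses only the standard cmath overloads. The function name unit_norm is hypothetical. Each static_assert documents which precision an expression selects, which indicates where 64-bit operations or conversion blocks could be introduced.

```cpp
#include <cassert>
#include <cmath>
#include <type_traits>

// With the C++ overloads, the argument type selects the precision.
static_assert(std::is_same<decltype(std::sin(1.0f)), float>::value,
              "std::sin(float) stays in single precision");
static_assert(std::is_same<decltype(std::sin(1.0)), double>::value,
              "std::sin(double) is double precision");
// A double literal silently promotes the whole expression to double.
static_assert(std::is_same<decltype(std::sin(1.0f) * 2.0), double>::value,
              "float * double literal is evaluated in double precision");
static_assert(std::is_same<decltype(std::sin(1.0f) * 2.0f), float>::value,
              "float * float literal stays in single precision");

// Single precision throughout: no float/double conversion in this expression.
float unit_norm(float angle) {
    assert(std::isfinite(angle));
    const float s = std::sin(angle);
    const float c = std::cos(angle);
    return std::sqrt(s * s + c * c);
}
```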
Cautions
When converting C functions to C++ to take advantage of math.h support, be sure that the
new C++ code compiles correctly before synthesizing with Vivado HLS. For example, if
sqrtf() is used in the code with math.h, the following extern declaration must be added to
the C++ code to support it:
#include <math.h>
extern "C" float sqrtf(float);
To avoid unnecessary hardware caused by type conversion, follow the warnings on mixing
double and float types discussed in Floats and Doubles in Chapter 3.
HLS Video Library
The video library contains functions to help address several aspects of modeling video
design in C++. The following topics are addressed in this section:
•
Video Functions
•
Data Types
•
Memory Line Buffer
•
Memory Window
Using the Video Library
The Vivado HLS video library requires the hls_video.h header file. This file includes all
image and video processing specific video types and functions provided by Vivado HLS.
When using the Vivado HLS video library, the only additional usage requirement is as
follows.
The design is written in C++ and uses the hls namespace:
#include <hls_video.h>
hls::rgb_8 video_data[1920][1080];
Alternatively, you can use scoped naming as shown in the following example:
#include <hls_video.h>
using namespace hls;
rgb_8 video_data[1920][1080];
Video Data Types
The data types provided in the HLS Video Library are used to ensure the output RTL created
by synthesis can be seamlessly integrated with any Xilinx® Video IP blocks used in the
system.
When using any Xilinx Video IP in your system, refer to the IP data sheet and determine the
format used to send or receive the video data. Use the appropriate video data type in the C
code and the RTL created by synthesis may be connected to the Xilinx Video IP.
The library includes the following data types. All data types support 8-bit data only.
Table 2-5: Video Data Types
| Data Type Name | Field 0 (8 bits) | Field 1 (8 bits) | Field 2 (8 bits) | Field 3 (8 bits) |
| yuv422_8 | Y | UV | Not Used | Not Used |
| yuv444_8 | Y | U | V | Not Used |
| rgb_8 | G | B | R | Not Used |
| yuva422_8 | Y | UV | A | Not Used |
| yuva444_8 | Y | U | V | A |
| rgba_8 | G | B | R | A |
| yuva420_8 | Y | AUV | Not Used | Not Used |
| yuvd422_8 | Y | UV | D | Not Used |
| yuvd444_8 | Y | U | V | D |
| rgbd_8 | G | B | R | D |
| bayer_8 | RGB | Not Used | Not Used | Not Used |
| luma_8 | Y | Not Used | Not Used | Not Used |
After the hls_video.h library is included, the data types can be freely used in the source
code.
#include "hls_video.h"
hls::rgb_8 video_data[1920][1080];
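To make the field ordering in Table 2-5 concrete, the following plain C++ sketch models a three-field pixel using the rgb_8 order (field 0 = G, field 1 = B, field 2 = R) and packs it into a 24-bit word. The type Rgb8Model and the functions make_rgb, pack24, and unpack24 are hypothetical stand-ins and are not part of hls_video.h. Placing field 0 in the least significant byte is an assumption of this sketch and should be confirmed against the data sheet of the connected video IP.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for a three-field 8-bit pixel in rgb_8 field order.
struct Rgb8Model {
    uint8_t field[3];  // field[0] = G, field[1] = B, field[2] = R
};

Rgb8Model make_rgb(uint8_t r, uint8_t g, uint8_t b) {
    Rgb8Model p;
    p.field[0] = g;
    p.field[1] = b;
    p.field[2] = r;
    return p;
}

// Pack with field 0 in bits 7:0, field 1 in bits 15:8, field 2 in bits 23:16
// (assumed bit placement).
uint32_t pack24(const Rgb8Model& p) {
    return static_cast<uint32_t>(p.field[0])
         | static_cast<uint32_t>(p.field[1]) << 8
         | static_cast<uint32_t>(p.field[2]) << 16;
}

// Inverse of pack24; the upper byte of the word must be zero.
Rgb8Model unpack24(uint32_t word) {
    assert((word >> 24) == 0);
    Rgb8Model p;
    p.field[0] = static_cast<uint8_t>(word & 0xFF);
    p.field[1] = static_cast<uint8_t>((word >> 8) & 0xFF);
    p.field[2] = static_cast<uint8_t>((word >> 16) & 0xFF);
    return p;
}
```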
Memory Line Buffer
The LineBuffer class is a C++ class that allows you to easily declare and manage line buffers
within your algorithmic code. This class provides all the methods required for instantiating
and working with line buffers. The LineBuffer class works with all data types.
The main features of the LineBuffer class are:
•
Support for all data types through parameterization
•
User-defined number of rows and columns
•
Automatic banking of rows into separate memory banks for increased memory
bandwidth
•
Provides all the methods for using and debugging line buffers in an algorithmic design
The LineBuffer class has the following methods, explained below:
•
shift_pixels_up()
•
shift_pixels_down()
•
insert_bottom_row()
•
insert_top_row()
•
getval(row,column)
To illustrate the usage of the LineBuffer class, the following data set is assumed at the start
of all examples.
Table 2-6: Data Set for LineBuffer Examples
| Row | Column 0 | Column 1 | Column 2 | Column 3 | Column 4 |
| Row 0 | 1 | 2 | 3 | 4 | 5 |
| Row 1 | 6 | 7 | 8 | 9 | 10 |
| Row 2 | 11 | 12 | 13 | 14 | 15 |
A line buffer can be instantiated in an algorithm by using the LineBuffer data type,
shown in this example specifying a LineBuffer variable for the data in the table above:
// hls::LineBuffer<rows, columns, type> variable;
hls::LineBuffer<3,5, char> Buff_A;
The LineBuffer class assumes the data entering the block instantiating the line buffer is
arranged in raster scan order. Each new data item is therefore stored in a different column
than the previous data item.
Inserting new values, while preserving a finite number of previous values in a column,
requires a vertical shift between rows for a given column. After the shift is complete, a new
data value can be inserted at either the top or the bottom of the column.
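The shift-then-insert pattern described above can be sketched in plain C++ as follows. This is a behavioral model only, not the hls::LineBuffer source. The names RasterBuffer, push_bottom, and feed_raster are hypothetical. Pixels arrive in raster-scan order, and each new pixel shifts its own column up by one row before entering at the bottom.

```cpp
#include <cassert>

const int ROWS = 3;
const int COLS = 5;

// Plain C++ stand-in for a 3x5 line buffer; row 0 is the top row.
struct RasterBuffer {
    char val[ROWS][COLS];

    // Push one pixel into column col: every value in that column moves up
    // one row (the top value is discarded) and the pixel enters at the bottom.
    void push_bottom(char pixel, int col) {
        assert(col >= 0 && col < COLS);
        for (int r = 0; r < ROWS - 1; ++r)
            val[r][col] = val[r + 1][col];
        val[ROWS - 1][col] = pixel;
    }
};

// Stream `lines` image lines of width COLS in raster-scan order.
// Pixel values count up from 1 so the result is easy to inspect.
RasterBuffer feed_raster(int lines) {
    RasterBuffer buf = {};
    char next = 1;
    for (int line = 0; line < lines; ++line)
        for (int col = 0; col < COLS; ++col)
            buf.push_bottom(next++, col);
    return buf;
}
```

After three lines have been streamed, the buffer holds the values 1 to 15 in raster order, with the oldest line in row 0. After a fourth line, the oldest line has been discarded and row 0 starts with the value 6.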
For example, to insert the value 100 to the top of column 2 of the line buffer set:
Buff_A.shift_pixels_down(2);
Buff_A.insert_top_row(100,2);
This results in the new data set shown in the following table.
Table 2-7: Data Set After Shift Down and Insert Top Classes Used
| Line | Column 0 | Column 1 | Column 2 | Column 3 | Column 4 |
| Row 0 | 1 | 2 | 100 | 4 | 5 |
| Row 1 | 6 | 7 | 3 | 9 | 10 |
| Row 2 | 11 | 12 | 8 | 14 | 15 |
To insert the value 100 to the bottom of column 2 of the line buffer set in Table 2-6, use
the following:
Buff_A.shift_pixels_up(2);
Buff_A.insert_bottom_row(100,2);
This results in the new data set shown in the following table.
Table 2-8: Data Set After Shift Up and Insert Bottom Classes Used
| Line | Column 0 | Column 1 | Column 2 | Column 3 | Column 4 |
| Row 0 | 1 | 2 | 8 | 4 | 5 |
| Row 1 | 6 | 7 | 13 | 9 | 10 |
| Row 2 | 11 | 12 | 100 | 14 | 15 |
The shift and insert methods both require the column value on which to operate.
All values stored by a LineBuffer instance are available using the getval(row,column)
method, which returns the value of any location inside the line buffer. For example, the
following results in variable Value being assigned the value 9:
Value = Buff_A.getval(1,3);
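The behavior shown in Tables 2-6 to 2-8 can be reproduced with a small plain C++ model, which is useful for reasoning about shift directions without the HLS headers. The class LineBufferModel and the function make_table_2_6 are hypothetical. The model follows the tables in this section and is not the hls::LineBuffer source.

```cpp
#include <cassert>

// Behavioral model of the methods described above; row 0 is the top row.
template <int ROWS, int COLS, typename T>
class LineBufferModel {
public:
    T val[ROWS][COLS];

    // Each row of the column takes the value of the row above it.
    void shift_pixels_down(int col) {
        for (int r = ROWS - 1; r > 0; --r) val[r][col] = val[r - 1][col];
    }
    // Each row of the column takes the value of the row below it.
    void shift_pixels_up(int col) {
        for (int r = 0; r < ROWS - 1; ++r) val[r][col] = val[r + 1][col];
    }
    void insert_top_row(T value, int col)    { val[0][col] = value; }
    void insert_bottom_row(T value, int col) { val[ROWS - 1][col] = value; }

    T getval(int row, int col) const {
        assert(row >= 0 && row < ROWS && col >= 0 && col < COLS);
        return val[row][col];
    }
};

// Build the data set of Table 2-6: values 1..15 in raster order.
LineBufferModel<3, 5, char> make_table_2_6() {
    LineBufferModel<3, 5, char> b;
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 5; ++c)
            b.val[r][c] = static_cast<char>(r * 5 + c + 1);
    return b;
}
```

Calling shift_pixels_down(2) followed by insert_top_row(100,2) on the result of make_table_2_6() yields column 2 values of 100, 3, and 8, matching Table 2-7.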
Memory Window Buffer
The memory window C++ class allows you to declare and manage two-dimensional
memory windows. The main features of this class are:
•
Support for all data types through parametrization
•
User-defined number of rows and columns
•
Automatic partitioning into individual registers for maximum bandwidth
•
Provides all the methods to use and debug memory windows in the context of an
algorithm
The memory window class is supported by the following methods, explained below:
•
shift_pixels_up()
•
shift_pixels_down()
•
shift_pixels_left()
•
shift_pixels_right()
•
insert_pixel(value,row,column)
•
insert_row()
•
insert_bottom_row()
•
insert_top_row()
•
insert_col()
•
insert_left_col()
•
insert_right_col()
•
getval(row, column)
You can instantiate a memory window in an algorithm by specifying a Window variable for
the following data type:
// hls::Window<rows, columns, type> variable;
hls::Window<3,3,char> Buff_B;
The memory window class examples in this section use the data set in the following table.
Table 2-9: Data Set for Memory Window Examples
| Row | Column 0 | Column 1 | Column 2 |
| Row 0 | 1 | 2 | 3 |
| Row 1 | 6 | 7 | 8 |
| Row 2 | 11 | 12 | 13 |
The Window class provides methods for moving data stored within the memory window up,
down, left, and right. Each shift operation clears space in the memory window for new data.
Buff_B.shift_pixels_up(); produces the following results.
Table 2-10: Memory Window Data Set After Shift Up
| Row | Column 0 | Column 1 | Column 2 |
| Row 0 | 6 | 7 | 8 |
| Row 1 | 11 | 12 | 13 |
| Row 2 | New | New | New |
Note: The New data has undefined, arbitrary values.
Buff_B.shift_pixels_down(); produces the following results.
Table 2-11: Memory Window Data Set After Shift Down
| Row | Column 0 | Column 1 | Column 2 |
| Row 0 | New | New | New |
| Row 1 | 1 | 2 | 3 |
| Row 2 | 6 | 7 | 8 |
Note: The New data has undefined, arbitrary values.
Buff_B.shift_pixels_left(); produces the following results.
Table 2-12: Memory Window Data Set After Shift Left
| Row | Column 0 | Column 1 | Column 2 |
| Row 0 | 2 | 3 | New |
| Row 1 | 7 | 8 | New |
| Row 2 | 12 | 13 | New |
Note: The New data has undefined, arbitrary values.
Buff_B.shift_pixels_right(); produces the following results.
Table 2-13: Memory Window Data Set After Shift Right
| Row | Column 0 | Column 1 | Column 2 |
| Row 0 | New | 1 | 2 |
| Row 1 | New | 6 | 7 |
| Row 2 | New | 11 | 12 |
Note: The New data has undefined, arbitrary values.
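The four shift operations in Tables 2-10 to 2-13 can be modeled in plain C++ as shown below. The struct WindowShiftModel and the function make_table_2_9 are hypothetical, and the model follows the tables in this section rather than the hls::Window source. In this model the cells marked New simply keep their previous contents, which is one possible outcome of the undefined values noted above and should not be relied upon.

```cpp
#include <cassert>

// Behavioral model of the window shifts; row 0 is the top row and
// column 0 is the left column.
template <int ROWS, int COLS, typename T>
struct WindowShiftModel {
    T val[ROWS][COLS];

    void shift_pixels_up() {     // rows move up; the bottom row becomes New
        for (int r = 0; r < ROWS - 1; ++r)
            for (int c = 0; c < COLS; ++c) val[r][c] = val[r + 1][c];
    }
    void shift_pixels_down() {   // rows move down; the top row becomes New
        for (int r = ROWS - 1; r > 0; --r)
            for (int c = 0; c < COLS; ++c) val[r][c] = val[r - 1][c];
    }
    void shift_pixels_left() {   // columns move left; the right column becomes New
        for (int r = 0; r < ROWS; ++r)
            for (int c = 0; c < COLS - 1; ++c) val[r][c] = val[r][c + 1];
    }
    void shift_pixels_right() {  // columns move right; the left column becomes New
        for (int r = 0; r < ROWS; ++r)
            for (int c = COLS - 1; c > 0; --c) val[r][c] = val[r][c - 1];
    }
    T getval(int row, int col) const {
        assert(row >= 0 && row < ROWS && col >= 0 && col < COLS);
        return val[row][col];
    }
};

// Data set of Table 2-9.
WindowShiftModel<3, 3, char> make_table_2_9() {
    WindowShiftModel<3, 3, char> w;
    const char init[3][3] = {{1, 2, 3}, {6, 7, 8}, {11, 12, 13}};
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 3; ++c) w.val[r][c] = init[r][c];
    return w;
}
```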
The Window class allows you to insert and retrieve data from any location within the
memory window. It also supports block insertion of data on the boundaries of the memory
window.
To insert data into any location of the memory window, use the following:
insert_pixel(value,row,column);
For example, you can place the value 100 into row 1, column 1 of the memory window
using:
Buff_B.insert_pixel(100,1,1);
This operation produces the following results.
Table 2-14: Memory Window Data Set After Insertion Operation at Location 1,1
| Row | Column 0 | Column 1 | Column 2 |
| Row 0 | 1 | 2 | 3 |
| Row 1 | 6 | 100 | 8 |
| Row 2 | 11 | 12 | 13 |
Block level insertion requires that you provide an array of data elements to insert on a
boundary. The methods provided by the window class are:
•
insert_row()
•
insert_bottom_row()
•
insert_top_row()
•
insert_col()
•
insert_left_col()
•
insert_right_col()
The insert_row and insert_col methods take an array and a row or column location argument
and place the contents in the specified row or column. For example, the insert_row method:
char C[3] = {50, 50, 50};
Buff_B.insert_row(C,1);
results in the following:
Table 2-15: Memory Window Data Set After the Insertion at Row 1 Using an Array
| Row | Column 0 | Column 1 | Column 2 |
| Row 0 | 1 | 2 | 3 |
| Row 1 | 50 | 50 | 50 |
| Row 2 | 11 | 12 | 13 |
The insert_bottom_row, insert_top_row, insert_left_col, and insert_right_col methods simply
take an array argument. For example, when C is an array of three elements in which each
element has the value of 50, you can insert the value 50 across the bottom boundary of the
memory window using the following operation:
char C[3] = {50, 50, 50};
Buff_B.insert_bottom_row(C);
This operation produces the following results.
Table 2-16: Memory Window Data Set After Insert Bottom Operation Using an Array
| Row | Column 0 | Column 1 | Column 2 |
| Row 0 | 1 | 2 | 3 |
| Row 1 | 6 | 7 | 8 |
| Row 2 | 50 | 50 | 50 |
The other edge insertion methods for the window class work in the same way as the
insert_bottom_row() method.
To retrieve data from a memory window, use:
getval(row,column)
For example, with the data set in Table 2-16:
A = Buff_B.getval(2,1);
results in:
A = 50
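The insertion and retrieval methods can be modeled in the same way. The sketch below is a plain C++ stand-in that follows Tables 2-14 to 2-16 and is not the hls::Window source. The names WindowInsertModel and make_window_data are hypothetical.

```cpp
#include <cassert>

// Behavioral model of the insert and getval methods; row 0 is the top row
// and column 0 is the left column.
template <int ROWS, int COLS, typename T>
struct WindowInsertModel {
    T val[ROWS][COLS];

    void insert_pixel(T value, int row, int col) {
        assert(row >= 0 && row < ROWS && col >= 0 && col < COLS);
        val[row][col] = value;
    }
    // Replace one whole row with the contents of a COLS-element array.
    void insert_row(const T (&data)[COLS], int row) {
        assert(row >= 0 && row < ROWS);
        for (int c = 0; c < COLS; ++c) val[row][c] = data[c];
    }
    // Replace one whole column with the contents of a ROWS-element array.
    void insert_col(const T (&data)[ROWS], int col) {
        assert(col >= 0 && col < COLS);
        for (int r = 0; r < ROWS; ++r) val[r][col] = data[r];
    }
    void insert_top_row(const T (&data)[COLS])    { insert_row(data, 0); }
    void insert_bottom_row(const T (&data)[COLS]) { insert_row(data, ROWS - 1); }
    void insert_left_col(const T (&data)[ROWS])   { insert_col(data, 0); }
    void insert_right_col(const T (&data)[ROWS])  { insert_col(data, COLS - 1); }

    T getval(int row, int col) const {
        assert(row >= 0 && row < ROWS && col >= 0 && col < COLS);
        return val[row][col];
    }
};

// Data set of Table 2-9.
WindowInsertModel<3, 3, char> make_window_data() {
    WindowInsertModel<3, 3, char> w;
    const char init[3][3] = {{1, 2, 3}, {6, 7, 8}, {11, 12, 13}};
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 3; ++c) w.val[r][c] = init[r][c];
    return w;
}
```

Passing the arrays by reference to a fixed-size array lets the compiler reject an argument whose length does not match the window dimension. This is a design choice of the sketch, not a statement about the library signatures.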
Video Functions
The video processing functions included in the HLS Video library are compatible with
existing OpenCV functions and are similarly named. They do not directly replace existing
OpenCV video library functions. The video processing functions use a data type hls::Mat.
This data type allows the functions to be synthesized and implemented as high
performance hardware.
Three types of functions are provided in the HLS Video Library:
•
OpenCV Interface Functions: Converts data to and from the AXI4 streaming data type
and the standard OpenCV data types. These functions allow any OpenCV functions
executed in software to transfer data, via the AXI4 streaming functions, to and from the
hardware block created by HLS.
•
AXI Interface Functions: These functions are used to convert the video data to and from
the hls::Mat data type used in the Video functions.
•
Video Processing Functions: Compatible with standard OpenCV functions for
manipulating and processing video images. These functions use the hls::Mat data
type and are synthesized by Vivado HLS.
OpenCV Interface Functions
In a typical video system using OpenCV functions, most of the algorithm remains on the
CPU using OpenCV functions. Only those parts of the algorithm that require acceleration in
the FPGA fabric are synthesized and therefore updated to use the Vivado HLS video
functions.
Because the AXI4 streaming protocol is commonly used as the interface between the code
that remains on the CPU and the functions to be synthesized, the OpenCV interface
functions are provided to enable the data transfer between the OpenCV code running on
the CPU and the synthesized hardware function running on FPGA fabric.
Using the interface functions to transform the data before passing it to the function to be
synthesized ensures a high-performance system. In addition to transforming the data, the
functions also include the means of converting OpenCV data formats to and from the
Vivado HLS Video Library data types, for example hls::Mat.
To use the OpenCV interface functions, you must include the header file hls_opencv.h.
These functions are used in the code that remains on the CPU.
AXI4-Interface Functions
The AXI4-Interface functions are used to transfer data into and out of the function to be
synthesized. The video functions to be synthesized use the hls::Mat data type for an
image.
The AXI4-Interface I/O functions discussed below allow you to convert the hls::Mat data
type.
Video Processing Functions
The video processing functions included in the Vivado HLS Video Library are specifically for
manipulating video images. Most of these functions are designed for accelerating
corresponding OpenCV functions, which have a similar signature and usage.
Using Video Functions
The following example demonstrates how each of the three types of video functions is used.
In the test bench shown below:
•
The data starts as standard OpenCV image data.
•
This is converted to AXI4-Stream format using one of the OpenCV Interface Functions.
•
The AXI4-Stream format is used for the input and output to the function for synthesis.
•
Finally, the data is converted back into standard OpenCV formatted data.
This process ensures the test bench operates using the standard OpenCV functions used in
many software applications. The test bench may be executed on a CPU with the following:
#include "hls_video.h"
#include "hls_opencv.h" // OpenCV interface functions such as IplImage2AXIvideo
int main (int argc, char** argv) {
    // Load data in OpenCV image format
    IplImage* src = cvLoadImage(INPUT_IMAGE);
    IplImage* dst = cvCreateImage(cvGetSize(src), src->depth, src->nChannels);
    AXI_STREAM src_axi, dst_axi;
    // Convert OpenCV format to AXI4 Stream format
    IplImage2AXIvideo(src, src_axi);
    // Call the function to be synthesized
    image_filter(src_axi, dst_axi, src->height, src->width);
    // Convert the AXI4 Stream data to OpenCV format
    AXIvideo2IplImage(dst_axi, dst);
    // Standard OpenCV image functions
    cvSaveImage(OUTPUT_IMAGE, dst);
    opencv_image_filter(src, dst);
    cvSaveImage(OUTPUT_IMAGE_GOLDEN, dst);
    cvReleaseImage(&src);
    cvReleaseImage(&dst);
    // Compare the synthesized-function output with the golden OpenCV output
    char tempbuf[2000];
    sprintf(tempbuf, "diff --brief -w %s %s", OUTPUT_IMAGE, OUTPUT_IMAGE_GOLDEN);
    int ret = system(tempbuf);
    if (ret != 0) {
        printf("Test Failed!\n");
        ret = 1;
    } else {
        printf("Test Passed!\n");
    }
    return ret;
}
The function to be synthesized, image_filter, is shown below. The characteristics of this
function are:
•
The input data type is the AXI4-Interface formatted data.
•
The AXI4-Interface formatted data is converted to hls::Mat format using an
AXI4-Interface function.
•
The Video Processing Functions, named in a similar manner to their equivalent OpenCV
functions, process the image and will synthesize into a high-quality FPGA
implementation.
•
The data is converted back to AXI4-Stream format and output.
#include "hls_video.h"
typedef hls::stream > AXI_STREAM;
typedef hls::Scalar<3, unsigned char> RGB_PIXEL;
typedef hls::Mat RGB_IMAGE;
void image_filter(AXI_STREAM& INPUT_STREAM, AXI_STREAM& OUTPUT_STREAM, int rows, int cols) {
    //Create AXI streaming interfaces for the core
    RGB_IMAGE img_0(rows, cols);
    RGB_IMAGE img_1(rows, cols);
    RGB_IMAGE img_2(rows, cols);
    RGB_IMAGE img_3(rows, cols);
    RGB_IMAGE img_4(rows, cols);
    RGB_IMAGE img_5(rows, cols);
    RGB_PIXEL pix(50, 50, 50);
    // Convert AXI4 Stream data to hls::Mat format
    hls::AXIvideo2Mat(INPUT_STREAM, img_0);
    // Execute the video pipelines
    hls::Sobel<1,0,3>(img_0, img_1);
    hls::SubS(img_1, pix, img_2);
    hls::Scale(img_2, img_3, 2, 0);
    hls::Erode(img_3, img_4);
    hls::Dilate(img_4, img_5);
    // Convert the hls::Mat format to AXI4 Stream format
    hls::Mat2AXIvideo(img_5, OUTPUT_STREAM);
}
Using all three types of functions allows you to implement video functions on an FPGA and
maintain a seamless transfer of data between the video functions optimized for synthesis
and the OpenCV functions and data which remain in the test bench (executing on the CPU).
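To illustrate what two of the pipeline stages in image_filter do to a single 8-bit channel value, the following plain C++ sketch models the subtract-scalar step (with the value 50 used for pix) and the scale step (scale 2, shift 0). The function names are hypothetical. The formulas, a subtraction followed by pixel * scale + shift with the result saturated to the range 0 to 255, are assumptions modeled on the usual OpenCV-style behavior and should be checked against the library reference before being relied upon.

```cpp
#include <cassert>
#include <cstdint>

// Clamp an intermediate result to the 8-bit pixel range.
static uint8_t saturate_u8(int v) {
    return static_cast<uint8_t>(v < 0 ? 0 : (v > 255 ? 255 : v));
}

// Per-channel model of subtracting a scalar from a pixel value (assumed saturating).
uint8_t sub_scalar(uint8_t pixel, uint8_t scalar) {
    return saturate_u8(static_cast<int>(pixel) - static_cast<int>(scalar));
}

// Per-channel model of a linear scale: pixel * scale + shift (assumed saturating).
uint8_t scale_pixel(uint8_t pixel, int scale, int shift) {
    // Keep the intermediate product well inside the range of int.
    assert(scale >= -1000 && scale <= 1000);
    assert(shift >= -1000000 && shift <= 1000000);
    return saturate_u8(pixel * scale + shift);
}

// The subtract-50 and scale-by-2 steps of the example, for one channel.
uint8_t sub_then_scale(uint8_t pixel) {
    return scale_pixel(sub_scalar(pixel, 50), 2, 0);
}
```

A host-side model like this can help explain differences seen when the output image is compared with the golden OpenCV image, because it shows where values are clipped at either end of the 8-bit range.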
The following table summarizes the functions provided in the HLS Video Library.
Table 2-17:
HLS Video Library
Function Type
Function
Description
OpenCV
Interface
AXIvideo2cvMat
Converts data from AXI4 video stream (hls::stream)
format to OpenCV cv::Mat format
OpenCV
Interface
AXIvideo2CvMat
Converts data from AXI4 video stream (hls::stream)
format to OpenCV CvMat format
OpenCV
Interface
AXIvideo2IplImage
Converts data from AXI4 video stream (hls::stream)
format to OpenCV IplImage format
OpenCV
Interface
cvMat2AXIvideo
Converts data from OpenCV cv::Mat format to AXI4 video
stream (hls::stream) format
OpenCV
Interface
CvMat2AXIvideo
Converts data from OpenCV CvMat format to AXI4 video
stream (hls::stream) format
OpenCV
Interface
cvMat2hlsMat
Converts data from OpenCV cv::Mat format to hls::Mat
format
OpenCV
Interface
CvMat2hlsMat
Converts data from OpenCV CvMat format to hls::Mat
format
OpenCV
Interface
CvMat2hlsWindow
Converts data from OpenCV CvMat format to
hls::Window format
OpenCV
Interface
hlsMat2cvMat
Converts data from hls::Mat format to OpenCV cv::Mat
format
OpenCV
Interface
hlsMat2CvMat
Converts data from hls::Mat format to OpenCV CvMat
format
OpenCV
Interface
hlsMat2IplImage
Converts data from hls::Mat format to OpenCV IplImage
format
OpenCV
Interface
hlsWindow2CvMat
Converts data from hls::Window format to OpenCV
CvMat format
OpenCV
Interface
IplImage2AXIvideo
Converts data from OpenCV IplImage format to AXI4
video stream (hls::stream) format
OpenCV
Interface
IplImage2hlsMat
Converts data from OpenCV IplImage format to hls::Mat
format
AXI4-Interface
AXIvideo2Mat
Converts image data stored in AXI4 video stream
(hls::stream) format to an image of hls::Mat format
AXI4-Interface
Mat2AXIvideo
Converts image data stored in hls::Mat format to an AXI4
video stream (hls::stream) format
AXI4-Interface
Array2Mat
Converts image data stored in an array to an image of
hls::Mat format.
AXI4-Interface
Mat2Array
Converts image data stored in hls::Mat format to an array.
Video
Processing
AbsDiff
Computes the absolute difference between two input
images src1 and src2 and saves the result in dst
Video
Processing
AddS
Computes the per-element sum of an image src and a
scalar scl
Video
Processing
AddWeighted
Computes the weighted per-element sum of two image
src1 and src2
Video
Processing
And
Calculates the per-element bitwise logical conjunction of
two images src1 and src2
Video
Processing
Avg
Calculates an average of elements in image src
Video
Processing
AvgSdv
Calculates an average of elements in image src
Video
Processing
Cmp
Performs the per-element comparison of two input
images src1 and src2
Video
Processing
CmpS
Performs the comparison between the elements of input
images src and the input value and saves the result in dst
High-Level Synthesis
UG902 (v2017.1) April 5, 2017
www.xilinx.com
Send Feedback
252
Chapter 2: High-Level Synthesis C Libraries
Table 2-17:
HLS Video Library (Cont’d)
Function Type
Function
Description
Video
Processing
CornerHarris
This function implements a Harris edge/corner detector
Video
Processing
CvtColor
Converts a color image from or to a grayscale image
Video
Processing
Dilate
Dilates the image src using the specified structuring
element constructed within the kernel
Video
Processing
Duplicate
Copies the input image src to two output images dst1
and dst2, for divergent point of two datapaths
Video
Processing
EqualizeHist
Computes a histogram of each frame and uses it to
normalize the range of the following frame
Video
Processing
Erode
Erodes the image src using the specified structuring
element constructed within kernel
Video
Processing
FASTX
Implements the FAST corner detector, generating either a
mask of corners, or an array of coordinates
Video
Processing
Filter2D
Applies an arbitrary linear filter to the image src using the
specified kernel
Video
Processing
GaussianBlur
Applies a normalized 2D Gaussian Blur filter to the input
Video
Processing
Harris
This function implements a Harris edge or corner
detector
Video
Processing
HoughLines2
Implements the Hough line transform
Video
Processing
Integral
Implements the computation of an integral image
Video
Processing
InitUndistortRectifyMap
Generates map1 and map2, based on a set of parameters,
where map1 and map2 are suitable inputs for
hls::Remap()
Video
Processing
Max
Calculates per-element maximum of two input images
src1 and src2 and saves the result in dst
Video
Processing
MaxS
Calculates the maximum between the elements of input
images src and the input value and saves the result in dst
Video
Processing
Mean
Calculates an average of elements in image src, and
return the value of first channel of result scalar
Video
Processing
Merge
Composes a multichannel image dst from several
single-channel images
Video
Processing
Min
Calculates per-element minimum of two input images
src1 and src2 and saves the result in dst
Video
Processing
MinMaxLoc
Finds the global minimum and maximum and their
locations in input image src
Video
Processing
MinS
Calculates the minimum between the elements of input
images src and the input value and saves the result in dst
High-Level Synthesis
UG902 (v2017.1) April 5, 2017
www.xilinx.com
Send Feedback
253
Chapter 2: High-Level Synthesis C Libraries
Table 2-17:
HLS Video Library (Cont’d)
Function Type
Function
Description
Video
Processing
Mul
Calculates the per-element product of two input images
src1 and src2
Video
Processing
Not
Performs per-element bitwise inversion of image src
Video
Processing
PaintMask
Each pixel of the destination image is either set to color
(if mask is not zero) or the corresponding pixel from the
input image
Video
Processing
PyrDown
Blurs the image and then reduces the size by a factor of 2.
Video
Processing
PyrUp
Upsamples the image by a factor of 2 and then blurs the
image.
Video
Processing
Range
Sets all value in image src by the following rule and return
the result as image dst
Video
Processing
Remap
Remaps the source image src to the destination image dst
according to the given remapping
Video
Processing
Reduce
Reduces 2D image src along dimension dim to a vector
dst
Video
Processing
Resize
Resizes the input image to the size of the output image
using bilinear interpolation
Video
Processing
Set
Sets elements in image src to a given scalar value scl
Video
Processing
Scale
Converts an input image src with optional linear
transformation
Video
Processing
Sobel
Computes a horizontal or vertical Sobel filter, returning
an estimate of the horizontal or vertical derivative, using
a filter
Video
Processing
Split
Divides a multichannel image src from several
single-channel images
Video
Processing
SubRS
Computes the differences between scalar value scl and
elements of image src
Video
Processing
SubS
Computes the differences between elements of image src
and scalar value scl
Video
Processing
Sum
Sums the elements of an image
Video
Processing
Threshold
Performs a fixed-level threshold to each element in a
single-channel image
Video
Processing
Zero
Sets elements in image src to 0
As shown in the example above, the video functions are not direct replacements for
OpenCV functions. They use input and output arrays to process the data and typically use
template parameters.
A complete description of all functions in the HLS video library is provided in Chapter 4,
High-Level Synthesis Reference Guide.
Optimizing Video Functions for Performance
The HLS video functions are pre-optimized to ensure a high-quality and high-performance
implementation. The functions already include the optimization directives required to
process data at a rate of one sample per clock.
The exact performance metrics of the video functions depend upon the clock rate and the
target device specifications. Refer to the synthesis report for complete details on the final
performance achieved after synthesis.
The previous example is repeated below to highlight the only optimizations required to
achieve a complete high-performance design.
• Because the functions are already pipelined, adding the DATAFLOW optimization
ensures the pipelined functions will execute in parallel.
• In this example, the data type is an hls::stream, which is automatically implemented
as a FIFO of depth 1: there is no requirement to use the config_dataflow
configuration to control the size of the dataflow memory channels.
• Implementing the input and output ports with an AXI4-Stream interface (axis) ensures a
high-performance streaming interface.
• Optionally, implementing the block-level protocol with an AXI4-Lite slave interface
would allow the synthesized block to be controlled from a CPU.
#include "hls_video.h"
typedef hls::stream<ap_axiu<32,1,1,1> > AXI_STREAM;
typedef hls::Scalar<3, unsigned char> RGB_PIXEL;
typedef hls::Mat<MAX_HEIGHT, MAX_WIDTH, HLS_8UC3> RGB_IMAGE;
void image_filter(AXI_STREAM& INPUT_STREAM, AXI_STREAM& OUTPUT_STREAM, int rows, int cols) {
// Create AXI streaming interfaces for the core
#pragma HLS INTERFACE axis port=INPUT_STREAM
#pragma HLS INTERFACE axis port=OUTPUT_STREAM
#pragma HLS dataflow
RGB_IMAGE img_0(rows, cols);
RGB_IMAGE img_1(rows, cols);
RGB_IMAGE img_2(rows, cols);
RGB_IMAGE img_3(rows, cols);
RGB_IMAGE img_4(rows, cols);
RGB_IMAGE img_5(rows, cols);
RGB_PIXEL pix(50, 50, 50);
// Convert AXI4 Stream data to hls::Mat format
hls::AXIvideo2Mat(INPUT_STREAM, img_0);
// Execute the video pipelines
hls::Sobel<1,0,3>(img_0, img_1);
hls::SubS(img_1, pix, img_2);
hls::Scale(img_2, img_3, 2, 0);
hls::Erode(img_3, img_4);
hls::Dilate(img_4, img_5);
// Convert the hls::Mat format to AXI4 Stream format
hls::Mat2AXIvideo(img_5, OUTPUT_STREAM);
}
HLS IP Libraries
Vivado HLS provides C libraries to implement a number of Xilinx IP blocks. The C libraries
allow the following Xilinx IP blocks to be directly inferred from the C source code ensuring
a high-quality implementation in the FPGA.
Table 2-18: HLS IP Libraries
| Library Header File | Description |
| hls_fft.h | Allows the Xilinx LogiCORE IP FFT to be simulated in C and implemented using the Xilinx LogiCORE block. |
| hls_fir.h | Allows the Xilinx LogiCORE IP FIR to be simulated in C and implemented using the Xilinx LogiCORE block. |
| hls_dds.h | Allows the Xilinx LogiCORE IP DDS to be simulated in C and implemented using the Xilinx LogiCORE block. |
| ap_shift_reg.h | Provides a C++ class to implement a shift register which is implemented directly using a Xilinx SRL primitive. |
FFT IP Library
The Xilinx FFT IP block can be called within a C++ design using the library hls_fft.h. This
section explains how the FFT can be configured in your C++ code.
RECOMMENDED: Xilinx highly recommends that you review the LogiCORE IP Fast Fourier Transform
Product Guide (PG109) [Ref 5] for information on how to implement and use the features of the IP.
To use the FFT in your C++ code:
1. Include the hls_fft.h library in the code
2. Set the default parameters using the pre-defined struct hls::ip_fft::params_t
3. Define the run time configuration
4. Call the FFT function
5. Optionally, check the run time status
The following code examples provide a summary of how each of these steps is performed.
Each step is discussed in more detail below.
First, include the FFT library in the source code. This header file resides in the include
directory in the Vivado HLS installation area which is automatically searched when Vivado
HLS executes.
#include "hls_fft.h"
Define the static parameters of the FFT. These include settings such as the input width, the
number of channels, and the type of architecture, which do not change dynamically. The FFT
library includes a parameterization struct hls::ip_fft::params_t, which can be used to
initialize all static parameters with default values.
In this example, the default values for output ordering and the widths of the configuration
and status ports are over-ridden using a user-defined struct param1 based on the
pre-defined struct.
struct param1 : hls::ip_fft::params_t {
static const unsigned ordering_opt = hls::ip_fft::natural_order;
static const unsigned config_width = FFT_CONFIG_WIDTH;
static const unsigned status_width = FFT_STATUS_WIDTH;
};
Define types and variables for both the run time configuration and run time status. These
values can be dynamic and are therefore defined as variables in the C code which can
change and are accessed through APIs.
typedef hls::ip_fft::config_t<param1> config_t;
typedef hls::ip_fft::status_t<param1> status_t;
config_t fft_config1;
status_t fft_status1;
Next, set the run time configuration. This example sets the direction of the FFT (Forward or
Inverse) based on the value of variable “direction” and also sets the value of the scaling
schedule.
fft_config1.setDir(direction);
fft_config1.setSch(0x2AB);
Call the FFT function using the HLS namespace with the defined static configuration
(param1 in this example). The function parameters are, in order, input data, output data,
output status and input configuration.
hls::fft<param1> (xn1, xk1, &fft_status1, &fft_config1);
Finally, check the output status. This example checks the overflow flag and stores the results
in variable “ovflo”.
*ovflo = fft_status1.getOvflo();
Design examples using the FFT C library are provided in the Vivado HLS examples and can
be accessed using menu option Help > Welcome > Open Example Project > Design
Examples > FFT.
FFT Static Parameters
The static parameters of the FFT define how the FFT is configured and specify the fixed
parameters, such as the size of the FFT, whether the size can be changed dynamically, and
whether the implementation is pipelined or radix_4_burst_io.
The hls_fft.h header file defines a struct hls::ip_fft::params_t which can be used
to set default values for the static parameters. If the default values are to be used, the
parameterization struct can be used directly with the FFT function.
hls::fft<hls::ip_fft::params_t> (xn1, xk1, &fft_status1, &fft_config1);
A more typical use is to change some of the parameters to non-default values. This is
performed by creating a new user-defined parameterization struct based on the default
parameterization struct and changing some of the default values.
In this example, a new user struct my_fft_config is defined with a new value for the
output ordering (changed to natural_order). All other static parameters to the FFT use the
default values (shown below in Table 2-20).
struct my_fft_config : hls::ip_fft::params_t {
static const unsigned ordering_opt = hls::ip_fft::natural_order;
};
hls::fft<my_fft_config> (xn1, xk1, &fft_status1, &fft_config1);
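The override mechanism relies on ordinary C++ name hiding: a member redeclared in the derived struct hides the inherited default, and the function template reads every setting from its template argument at compile time. The following sketch illustrates this under stated assumptions. The namespace sketch, its reduced params_t, and the helper templates fft_length and natural_output are hypothetical stand-ins so that the code compiles without hls_fft.h; they are not the library definitions.

```cpp
#include <cassert>

// Hypothetical stand-ins for the library enum and default struct.
// The real definitions live in hls_fft.h and contain many more members.
namespace sketch {
enum ordering { bit_reversed_order = 0, natural_order };

struct params_t {
    static const unsigned input_width  = 16;
    static const unsigned max_nfft     = 10;
    static const unsigned ordering_opt = bit_reversed_order;
};

// A template reads each setting from its CONFIG argument at compile time.
template <typename CONFIG>
unsigned fft_length() { return 1u << CONFIG::max_nfft; }

template <typename CONFIG>
bool natural_output() { return CONFIG::ordering_opt == natural_order; }
}  // namespace sketch

// Redeclaring a member in the derived struct hides the inherited default.
// Members that are not redeclared keep the default values.
struct my_fft_config : sketch::params_t {
    static const unsigned ordering_opt = sketch::natural_order;
};

// Usage example: only ordering_opt differs from the defaults.
void check_override() {
    assert(sketch::natural_output<my_fft_config>());
    assert(!sketch::natural_output<sketch::params_t>());
    assert(sketch::fft_length<my_fft_config>() == 1024u);
}
```

Because the values are compile-time constants, a different configuration produces a different template instantiation, which is consistent with the parameters being static.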
The parameters of the parameterization struct hls::ip_fft::params_t are explained
in the following table. The default values for the parameters and a list of possible values
are provided in Table 2-20.
RECOMMENDED: Xilinx highly recommends that you review the LogiCORE IP Fast Fourier Transform
Product Guide (PG109) [Ref 5] for details on the parameters and the implication for their settings.
Table 2-19: FFT Struct Parameters
| Parameter | Description |
| input_width | Data input port width. |
| output_width | Data output port width. |
| status_width | Output status port width. |
| config_width | Input configuration port width. |
| max_nfft | The size of the FFT data set is specified as 1 << max_nfft. |
| has_nfft | Determines if the size of the FFT can be run time configurable. |
| channels | Number of channels. |
| arch_opt | The implementation architecture. |
| phase_factor_width | Configure the internal phase factor precision. |
| ordering_opt | The output ordering mode. |
| ovflo | Enable overflow mode. |
| scaling_opt | Define the scaling options. |
| rounding_opt | Define the rounding modes. |
| mem_data | Specify using block or distributed RAM for data memory. |
| mem_phase_factors | Specify using block or distributed RAM for phase factors memory. |
| mem_reorder | Specify using block or distributed RAM for output reorder memory. |
| stages_block_ram | Defines the number of block RAM stages used in the implementation. |
| mem_hybrid | When block RAMs are specified for data, phase factor, or reorder buffer, mem_hybrid specifies whether or not to use a hybrid of block and distributed RAMs to reduce block RAM count in certain configurations. |
| complex_mult_type | Defines the types of multiplier to use for complex multiplications. |
| butterfly_type | Defines the implementation used for the FFT butterfly. |
When specifying parameter values which are not integer or boolean, the HLS FFT
namespace should be used.
For example the possible values for parameter butterfly_type in the following table are
use_luts and use_xtremedsp_slices. The values used in the C program should be
butterfly_type = hls::ip_fft::use_luts and butterfly_type =
hls::ip_fft::use_xtremedsp_slices.
The following table covers all features and functionality of the FFT IP. Features and
functionality not described in this table are not supported in the Vivado HLS
implementation.
Table 2-20: FFT Struct Parameters Values
| Parameter | C Type | Default Value | Valid Values |
| input_width | unsigned | 16 | 8-34 |
| output_width | unsigned | 16 | input_width to (input_width + max_nfft + 1) |
| status_width | unsigned | 8 | Depends on FFT configuration |
| config_width | unsigned | 16 | Depends on FFT configuration |
| max_nfft | unsigned | 10 | 3-16 |
| has_nfft | bool | false | true, false |
| channels | unsigned | 1 | 1-12 |
| arch_opt | unsigned | pipelined_streaming_io | automatically_select, pipelined_streaming_io, radix_4_burst_io, radix_2_burst_io, radix_2_lite_burst_io |
| phase_factor_width | unsigned | 16 | 8-34 |
| ordering_opt | unsigned | bit_reversed_order | bit_reversed_order, natural_order |
| ovflo | bool | true | false, true |
| scaling_opt | unsigned | scaled | scaled, unscaled, block_floating_point |
| rounding_opt | unsigned | truncation | truncation, convergent_rounding |
| mem_data | unsigned | block_ram | block_ram, distributed_ram |
| mem_phase_factors | unsigned | block_ram | block_ram, distributed_ram |
| mem_reorder | unsigned | block_ram | block_ram, distributed_ram |
| stages_block_ram | unsigned | (max_nfft < 10) ? 0 : (max_nfft - 9) | 0-11 |
| mem_hybrid | bool | false | false, true |
| complex_mult_type | unsigned | use_mults_resources | use_luts, use_mults_resources, use_mults_performance |
| butterfly_type | unsigned | use_luts | use_luts, use_xtremedsp_slices |
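Two entries in Table 2-20 are expressed as formulas rather than fixed values. The following sketch evaluates them with plain functions so the results can be checked before a configuration struct is written. The function names are illustrative only and are not part of hls_fft.h.

```cpp
#include <cassert>

// Default for stages_block_ram as given in Table 2-20.
unsigned default_stages_block_ram(unsigned max_nfft) {
    return (max_nfft < 10) ? 0 : (max_nfft - 9);
}

// Upper bound of the valid output_width range from Table 2-20.
unsigned max_output_width(unsigned input_width, unsigned max_nfft) {
    return input_width + max_nfft + 1;
}

// True when output_width lies in input_width .. (input_width + max_nfft + 1).
bool output_width_is_valid(unsigned output_width, unsigned input_width,
                           unsigned max_nfft) {
    return output_width >= input_width &&
           output_width <= max_output_width(input_width, max_nfft);
}

// Usage example with the default input_width (16) and max_nfft (10).
void check_defaults() {
    assert(default_stages_block_ram(10) == 1);
    assert(max_output_width(16, 10) == 27);
    assert(output_width_is_valid(16, 16, 10));
}
```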
FFT Run Time Configuration and Status
The FFT supports run time configuration and run time status monitoring through the
configuration and status ports. These ports are defined as arguments to the FFT function,
shown here as variables fft_status1 and fft_config1:
hls::fft<param1> (xn1, xk1, &fft_status1, &fft_config1);
The run time configuration and status can be accessed using the predefined structs from
the FFT C library:
• hls::ip_fft::config_t<param1>
• hls::ip_fft::status_t<param1>
Note: In both cases, the struct requires the name of the static parameterization struct, shown in
these examples as param1. Refer to the previous section for details on defining the static
parameterization struct.
The run time configuration struct allows the following actions to be performed in the C
code:
• Set the FFT length, if run time configuration is enabled
• Set the FFT direction as forward or inverse
• Set the scaling schedule
The FFT length can be set as follows:
typedef hls::ip_fft::config_t<param1> config_t;
config_t fft_config1;
// Set FFT length to 512 => log2(512) => 9
fft_config1.setNfft(9);
IMPORTANT: The length specified during run time cannot exceed the size defined by max_nfft in the
static configuration.
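Because the length is passed as log2 of the number of points, a test bench may find it convenient to derive and validate that value before calling setNfft. The following is a minimal sketch; the helper names nfft_for_length and length_is_valid are hypothetical and are not provided by hls_fft.h.

```cpp
#include <cassert>

// Returns log2(length) when length is a power of two, otherwise -1.
int nfft_for_length(unsigned length) {
    if (length == 0 || (length & (length - 1)) != 0) return -1;
    int nfft = 0;
    while ((1u << nfft) < length) ++nfft;
    return nfft;
}

// True when the requested run time length does not exceed the size
// defined by max_nfft in the static configuration.
bool length_is_valid(unsigned length, unsigned max_nfft) {
    int nfft = nfft_for_length(length);
    return nfft >= 0 && static_cast<unsigned>(nfft) <= max_nfft;
}

// Usage example: 512 points corresponds to setNfft(9).
void check_length() {
    assert(nfft_for_length(512) == 9);
    assert(length_is_valid(512, 10));
    assert(!length_is_valid(2048, 10));
}
```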
The FFT direction can be set as follows:
typedef hls::ip_fft::config_t<param1> config_t;
config_t fft_config1;
// Forward FFT
fft_config1.setDir(1);
// Inverse FFT
fft_config1.setDir(0);
The FFT scaling schedule can be set as follows:
typedef hls::ip_fft::config_t<param1> config_t;
config_t fft_config1;
fft_config1.setSch(0x2AB);
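The following sketch shows one way to unpack a schedule word such as 0x2AB. It assumes the encoding described in PG109 for the Radix-4 and pipelined streaming architectures: two bits per stage (or per pair of Radix-2 stages), with the first stage in the two least significant bits, each field giving a right shift of 0 to 3 bits. That encoding is an assumption taken from the product guide, not from this section, and the helper names stage_shift and total_shift are illustrative only.

```cpp
#include <cassert>

// Right shift (0-3 bits) applied at one stage, assuming two bits per
// stage with stage 0 in the least significant bits (see PG109).
unsigned stage_shift(unsigned schedule, unsigned stage) {
    return (schedule >> (2 * stage)) & 0x3u;
}

// Total right shift applied across the given number of stages.
unsigned total_shift(unsigned schedule, unsigned stages) {
    unsigned shift = 0;
    for (unsigned s = 0; s < stages; ++s) shift += stage_shift(schedule, s);
    return shift;
}

// Usage example: 0x2AB is binary 10 10 10 10 11, which under this
// assumption shifts by 3 at stage 0 and by 2 at stages 1 to 4.
void check_schedule() {
    assert(stage_shift(0x2AB, 0) == 3);
    assert(total_shift(0x2AB, 5) == 11);
}
```

Under this reading, a 1024-point transform (five such stages) is scaled by a total of 11 bits. Confirm the encoding for the selected architecture in PG109 before relying on it.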
The output status port can be accessed using the pre-defined struct to determine:
• If any overflow occurred during the FFT
• The value of the block exponent
The FFT overflow mode can be checked as follows:
typedef hls::ip_fft::status_t<param1> status_t;
status_t fft_status1;
// Check the overflow flag
bool ovflo = fft_status1.getOvflo();
IMPORTANT: After each transaction completes, check the overflow status to confirm the correct
operation of the FFT.
And the block exponent value can be obtained using:
typedef hls::ip_fft::status_t<param1> status_t;
status_t fft_status1;
// Obtain the block exponent
unsigned int blk_exp = fft_status1.getBlkExp();
Using the FFT Function
The FFT function is defined in the HLS namespace and can be called as follows:
hls::fft<STATIC_PARAM> (
INPUT_DATA_ARRAY,
OUTPUT_DATA_ARRAY,
OUTPUT_STATUS,
INPUT_RUN_TIME_CONFIGURATION);
The STATIC_PARAM is the static parameterization struct discussed in the earlier section FFT
Static Parameters. This defines the static parameters for the FFT.
Both the input and output data are supplied to the function as arrays (INPUT_DATA_ARRAY
and OUTPUT_DATA_ARRAY). In the final implementation, the ports on the FFT RTL block will
be implemented as AXI4-Stream ports. Xilinx recommends always using the FFT function in
a region using dataflow optimization (set_directive_dataflow), because this ensures
the arrays are implemented as streaming arrays. An alternative is to specify both arrays as
streaming using the set_directive_stream command.
IMPORTANT: The FFT cannot be used in a region which is pipelined. If high-performance operation is
required, pipeline the loops or functions before and after the FFT then use dataflow optimization on all
loops and functions in the region.
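The recommended structure can be sketched as a dataflow region in which the loops before and after the FFT are pipelined and the FFT stage itself is not. In the sketch below, fft_stage is a stand-in that only copies data so that the code compiles without hls_fft.h; in a real design it would contain the hls::fft call. The function names, array size, and data type are assumptions chosen for illustration. A standard C++ compiler ignores the HLS pragmas.

```cpp
#include <cassert>
#include <complex>

typedef std::complex<float> cmpx_t;
const unsigned FFT_LENGTH = 8;  // hypothetical size for illustration

// Producer: a pipelined loop that fills the FFT input array.
void read_input(const cmpx_t in[FFT_LENGTH], cmpx_t xn[FFT_LENGTH]) {
    for (unsigned i = 0; i < FFT_LENGTH; ++i) {
#pragma HLS PIPELINE II=1
        xn[i] = in[i];
    }
}

// Stand-in for the stage that would call hls::fft. It only copies data,
// and it deliberately carries no PIPELINE pragma.
void fft_stage(const cmpx_t xn[FFT_LENGTH], cmpx_t xk[FFT_LENGTH]) {
    for (unsigned i = 0; i < FFT_LENGTH; ++i) xk[i] = xn[i];
}

// Consumer: a pipelined loop that drains the FFT output array.
void write_output(const cmpx_t xk[FFT_LENGTH], cmpx_t out[FFT_LENGTH]) {
    for (unsigned i = 0; i < FFT_LENGTH; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = xk[i];
    }
}

// Top level: dataflow lets the three stages overlap through xn and xk.
void fft_top(const cmpx_t in[FFT_LENGTH], cmpx_t out[FFT_LENGTH]) {
#pragma HLS DATAFLOW
    cmpx_t xn[FFT_LENGTH];
    cmpx_t xk[FFT_LENGTH];
    read_input(in, xn);
    fft_stage(xn, xk);
    write_output(xk, out);
}

// Usage example: with the copy stand-in, the output equals the input.
bool top_passes_data_through() {
    cmpx_t in[FFT_LENGTH], out[FFT_LENGTH];
    for (unsigned i = 0; i < FFT_LENGTH; ++i) in[i] = cmpx_t(float(i), -float(i));
    fft_top(in, out);
    for (unsigned i = 0; i < FFT_LENGTH; ++i)
        if (out[i] != in[i]) return false;
    return true;
}
```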
The data types for the arrays can be float or ap_fixed.
typedef float data_t;
std::complex<data_t> xn[FFT_LENGTH];
std::complex<data_t> xk[FFT_LENGTH];
To use fixed-point data types, the Vivado HLS arbitrary precision type ap_fixed should be
used.
#include "ap_fixed.h"
typedef ap_fixed<FFT_INPUT_WIDTH,1> data_in_t;
typedef ap_fixed<FFT_OUTPUT_WIDTH,FFT_OUTPUT_WIDTH-FFT_INPUT_WIDTH+1> data_out_t;
#include <complex>
typedef std::complex<data_in_t> cmpxData;
typedef std::complex<data_out_t> cmpxDataOut;
In both cases, the width parameters of the FFT must match the sizes of the data types used.
In the case of floating point data, the data widths will always be 32-bit and any other
specified size will be considered invalid.
TIP: The input and output width of the FFT can be configured to any arbitrary value within the
supported range. The variables which connect to the input and output parameters must be defined in
increments of 8-bit. For example, if the output width is configured as 33-bit, the output variable must
be defined as a 40-bit variable.
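The rounding rule in the tip can be written as a one-line calculation. The following sketch uses a hypothetical helper, storage_width, which is not part of the HLS libraries.

```cpp
#include <cassert>

// Rounds a configured port width up to the next multiple of 8 bits,
// which is the size required for the connecting variable.
unsigned storage_width(unsigned configured_width) {
    return ((configured_width + 7) / 8) * 8;
}

// Usage example: a 33-bit output width requires a 40-bit variable.
void check_storage_width() {
    assert(storage_width(33) == 40);
    assert(storage_width(16) == 16);
}
```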
The multichannel functionality of the FFT can be used by using two-dimensional arrays for
the input and output data. In this case, the array data should be configured with the first
dimension representing each channel and the second dimension representing the FFT data.
typedef float data_t;
static std::complex<data_t> xn[CHANNEL][FFT_LENGTH];
static std::complex<data_t> xk[CHANNEL][FFT_LENGTH];
The FFT core consumes and produces data as interleaved channels (for example, ch0-data0,
ch1-data0, ch2-data0, etc., ch0-data1, ch1-data1, ch2-data1, etc.). Therefore, to stream the
input or output arrays of the FFT using the same sequential order that the data was read or
written, you must fill or empty the two-dimensional arrays for multiple channels by
iterating through the channel index first, as shown in the following example:
cmpxData in_fft[FFT_CHANNELS][FFT_LENGTH];
cmpxData out_fft[FFT_CHANNELS][FFT_LENGTH];
// Write to FFT Input Array
for (unsigned i = 0; i < FFT_LENGTH; i++) {
for (unsigned j = 0; j < FFT_CHANNELS; ++j) {
in_fft[j][i] = in.read().data;
}
}
// Read from FFT Output Array
for (unsigned i = 0; i < FFT_LENGTH; i++) {
for (unsigned j = 0; j < FFT_CHANNELS; ++j) {
out.data = out_fft[j][i];
}
}
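The access order can be checked in a test bench without the FFT itself. The following sketch uses plain int samples and a std::vector in place of the stream, with small hypothetical sizes, and confirms that filling and draining the arrays channel-first preserves the interleaved order. The function names are illustrative only.

```cpp
#include <cassert>
#include <vector>

const unsigned FFT_CHANNELS = 3;  // hypothetical sizes for illustration
const unsigned FFT_LENGTH   = 4;

// De-interleave a sequential stream (ch0-data0, ch1-data0, ch2-data0,
// ch0-data1, ...) into a [channel][sample] array.
void fill_channels(const std::vector<int>& stream,
                   int in_fft[FFT_CHANNELS][FFT_LENGTH]) {
    unsigned k = 0;
    for (unsigned i = 0; i < FFT_LENGTH; i++)
        for (unsigned j = 0; j < FFT_CHANNELS; ++j)
            in_fft[j][i] = stream[k++];
}

// Re-interleave a [channel][sample] array into a sequential stream.
std::vector<int> drain_channels(int out_fft[FFT_CHANNELS][FFT_LENGTH]) {
    std::vector<int> stream;
    for (unsigned i = 0; i < FFT_LENGTH; i++)
        for (unsigned j = 0; j < FFT_CHANNELS; ++j)
            stream.push_back(out_fft[j][i]);
    return stream;
}

// Usage example: a round trip returns the samples in their original order.
bool round_trip_preserves_order() {
    std::vector<int> stream;
    for (unsigned k = 0; k < FFT_CHANNELS * FFT_LENGTH; ++k)
        stream.push_back(static_cast<int>(k));
    int buf[FFT_CHANNELS][FFT_LENGTH];
    fill_channels(stream, buf);
    // Stream element 1 is ch1-data0; stream element 3 is ch0-data1.
    if (buf[1][0] != 1 || buf[0][1] != 3) return false;
    return drain_channels(buf) == stream;
}
```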
The OUTPUT_STATUS and INPUT_RUN_TIME_CONFIGURATION are the structs discussed in
the earlier section FFT Run Time Configuration and Status.
Design examples using the FFT C library are provided in the Vivado HLS examples and can
be accessed using menu option Help > Welcome > Open Example Project > Design
Examples > FFT.
FIR Filter IP Library
The Xilinx FIR IP block can be called within a C++ design using the library hls_fir.h. This
section explains how the FIR can be configured in your C++ code.
RECOMMENDED: Xilinx highly recommends that you review the LogiCORE IP FIR Compiler Product
Guide (PG149) [Ref 6] for information on how to implement and use the features of the IP.
To use the FIR in your C++ code:
1. Include the hls_fir.h library in the code.
2. Set the static parameters using the pre-defined struct hls::ip_fir::params_t.
3. Call the FIR function.
4. Optionally, define a run time input configuration to modify some parameters
dynamically.
The following code examples provide a summary of how each of these steps is performed.
Each step is discussed in more detail below.
First, include the FIR library in the source code. This header file resides in the include
directory in the Vivado HLS installation area. This directory is automatically searched when
Vivado HLS executes. There is no need to specify the path to this directory if compiling
inside Vivado HLS.
#include "hls_fir.h"
Define the static parameters of the FIR. These include static attributes such as the input
width, the coefficients, and the filter rate (single, decimation, hilbert). The FIR library
includes a parameterization struct hls::ip_fir::params_t which can be used to initialize all
static parameters with default values.
In this example, the coefficients are defined as residing in array coeff_vec and the default
values for the number of coefficients, the input width and the quantization mode are
over-ridden using a user-defined struct myconfig based on the pre-defined struct.
struct myconfig : hls::ip_fir::params_t {
static const double coeff_vec[sg_fir_srrc_coeffs_len];
static const unsigned num_coeffs = sg_fir_srrc_coeffs_len;
static const unsigned input_width = INPUT_WIDTH;
static const unsigned quantization = hls::ip_fir::quantize_only;
};
Create an instance of the FIR function using the HLS namespace with the defined static
parameters (myconfig in this example) and then call the function with the run method to
execute the function. The function arguments are, in order, input data and output data.
static hls::FIR<myconfig> fir1;
fir1.run(fir_in, fir_out);
Optionally, a run time input configuration can be used. In some modes of the FIR, the data
on this input determines how the coefficients are used during interleaved channels or when
coefficient reloading is required. This configuration can be dynamic and is therefore
defined as a variable. For a complete description of which modes require this input
configuration, refer to the LogiCORE IP FIR Compiler Product Guide (PG149) [Ref 6].
When the run time input configuration is used, the FIR function is called with three
arguments: input data, output data and input configuration.
// Define the configuration type
typedef ap_uint<8> config_t;
// Define the configuration variable
config_t fir_config = 8;
// Use the configuration in the FIR
static hls::FIR<myconfig> fir1;
fir1.run(fir_in, fir_out, &fir_config);
Design examples using the FIR C library are provided in the Vivado HLS examples and can
be accessed using menu option Help > Welcome > Open Example Project > Design
Examples > FIR.
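When verifying a design in the C test bench, it can help to compare the filter output against an independent reference. The following sketch is a plain double-precision, single-rate, direct-form model, y[n] = sum over k of coeff[k] * x[n - k]. It is not the implementation used by hls_fir.h and does not model quantization, rounding, rate changes, or multiple channels; the function names are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Single-rate direct-form FIR reference. Samples before the start of
// the input are treated as zero.
std::vector<double> fir_reference(const std::vector<double>& coeffs,
                                  const std::vector<double>& in) {
    std::vector<double> out(in.size(), 0.0);
    for (std::size_t n = 0; n < in.size(); ++n)
        for (std::size_t k = 0; k < coeffs.size() && k <= n; ++k)
            out[n] += coeffs[k] * in[n - k];
    return out;
}

// Usage example: the response to a unit impulse is the coefficient list
// followed by zeros.
bool impulse_response_matches() {
    std::vector<double> coeffs;
    coeffs.push_back(0.5);
    coeffs.push_back(0.25);
    coeffs.push_back(0.125);
    std::vector<double> impulse(5, 0.0);
    impulse[0] = 1.0;
    std::vector<double> y = fir_reference(coeffs, impulse);
    return y[0] == 0.5 && y[1] == 0.25 && y[2] == 0.125 &&
           y[3] == 0.0 && y[4] == 0.0;
}
```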
FIR Static Parameters
The static parameters of the FIR define how the FIR IP is parameterized and specify
non-dynamic items such as the input and output widths, the number of fractional bits, the
coefficient values, and the interpolation and decimation rates. Most of these configurations
have default values: there are no default values for the coefficients.
The hls_fir.h header file defines a struct hls::ip_fir::params_t that can be
used to set the default values for most of the static parameters.
IMPORTANT: There are no defaults defined for the coefficients. Therefore, Xilinx does not recommend
using the pre-defined struct to directly initialize the FIR. A new user defined struct which specifies the
coefficients should always be used to perform the static parameterization.
In this example, a new user struct myconfig is defined with a new value for the
coefficients. The coefficients are specified as residing in array coeff_vec. All other
parameters to the FIR will use the default values (shown below in Table 2-22).
struct myconfig : hls::ip_fir::params_t {
static const double coeff_vec[sg_fir_srrc_coeffs_len];
};
static hls::FIR<myconfig> fir1;
fir1.run(fir_in, fir_out);
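The struct above only declares coeff_vec. In standard C++, a static const array of double declared inside a struct also needs a definition outside the struct, which is where the coefficient values are supplied. The following sketch shows the pattern with a hypothetical stand-in for the default struct (namespace sketch, reduced to two members) and an illustrative helper, coeff_sum, so that it compiles without hls_fir.h. The four coefficient values are placeholders.

```cpp
#include <cassert>

// Hypothetical stand-in for the library default struct.
namespace sketch {
struct params_t {
    static const unsigned num_coeffs  = 21;
    static const unsigned input_width = 16;
};

// A template reads the coefficient array through its CONFIG argument.
template <typename CONFIG>
double coeff_sum() {
    double sum = 0.0;
    for (unsigned i = 0; i < CONFIG::num_coeffs; ++i)
        sum += CONFIG::coeff_vec[i];
    return sum;
}
}  // namespace sketch

// The struct declares the coefficient array and overrides its length.
struct myconfig : sketch::params_t {
    static const unsigned num_coeffs = 4;
    static const double coeff_vec[num_coeffs];
};

// Out-of-class definition that supplies the (placeholder) values.
// In a multi-file project this belongs in exactly one source file.
const double myconfig::coeff_vec[myconfig::num_coeffs] = {0.25, 0.25, 0.25, 0.25};

// Usage example: the template sees the four user-supplied coefficients.
void check_coefficients() {
    assert(sketch::coeff_sum<myconfig>() == 1.0);
}
```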
The following table describes the parameters used for the parametrization struct
hls::ip_fir::params_t. Table 2-22 provides the default values for the parameters and
a list of possible values.
RECOMMENDED: Xilinx highly recommends that you refer to the LogiCORE IP FIR Compiler Product
Guide (PG149) [Ref 6] for details on the parameters and the implication for their settings.
Table 2-21: FIR Struct Parameters
| Parameter | Description |
| input_width | Data input port width |
| input_fractional_bits | Number of fractional bits on the input port |
| output_width | Data output port width |
| output_fractional_bits | Number of fractional bits on the output port |
| coeff_width | Bit-width of the coefficients |
| coeff_fractional_bits | Number of fractional bits in the coefficients |
| num_coeffs | Number of coefficients |
| coeff_sets | Number of coefficient sets |
| input_length | Number of samples in the input data |
| output_length | Number of samples in the output data |
| num_channels | Specify the number of channels of data to process |
| total_number_coeff | Total number of coefficients |
| coeff_vec[total_num_coeff] | The coefficient array |
| filter_type | The type of implementation used for the filter |
| rate_change | Specifies integer or fractional rate changes |
| interp_rate | The interpolation rate |
| decim_rate | The decimation rate |
| zero_pack_factor | Number of zero coefficients used in interpolation |
| rate_specification | Specify the rate as frequency or period |
| hardware_oversampling_rate | Specify the rate of over-sampling |
| sample_period | The hardware oversample period |
| sample_frequency | The hardware oversample frequency |
| quantization | The quantization method to be used |
| best_precision | Enable or disable the best precision |
| coeff_structure | The type of coefficient structure to be used |
| output_rounding_mode | Type of rounding used on the output |
| filter_arch | Selects a systolic or transposed architecture |
| optimization_goal | Specify a speed or area goal for optimization |
| inter_column_pipe_length | The pipeline length required between DSP columns |
| column_config | Specifies the number of DSP48 columns |
| config_method | Specifies how the DSP48 columns are configured |
| coeff_padding | Amount of zero padding added to the front of the filter |
When specifying parameter values that are not integer or boolean, the HLS FIR namespace
should be used.
For example the possible values for rate_change are shown in the following table to be
integer and fixed_fractional. The values used in the C program should be
rate_change = hls::ip_fir::integer and rate_change =
hls::ip_fir::fixed_fractional.
The following table covers all features and functionality of the FIR IP. Features and
functionality not described in this table are not supported in the Vivado HLS
implementation.
Table 2-22: FIR Struct Parameters Values
| Parameter | C Type | Default Value | Valid Values |
| input_width | unsigned | 16 | No limitation |
| input_fractional_bits | unsigned | 0 | Limited by size of input_width |
| output_width | unsigned | 24 | No limitation |
| output_fractional_bits | unsigned | 0 | Limited by size of output_width |
| coeff_width | unsigned | 16 | No limitation |
| coeff_fractional_bits | unsigned | 0 | Limited by size of coeff_width |
| num_coeffs | unsigned | 21 | Full |
| coeff_sets | unsigned | 1 | 1-1024 |
| input_length | unsigned | 21 | No limitation |
| output_length | unsigned | 21 | No limitation |
| num_channels | unsigned | 1 | 1-1024 |
| total_number_coeff | unsigned | 21 | num_coeffs * coeff_sets |
| coeff_vec[total_num_coeff] | double array | None | Not applicable |
| filter_type | unsigned | single_rate | single_rate, interpolation, decimation, hilbert_filter, interpolated |
| rate_change | unsigned | integer | integer, fixed_fractional |
| interp_rate | unsigned | 1 | 1-1024 |
| decim_rate | unsigned | 1 | 1-1024 |
| zero_pack_factor | unsigned | 1 | 1-8 |
| rate_specification | unsigned | period | frequency, period |
| hardware_oversampling_rate | unsigned | 1 | No limitation |
| sample_period | bool | 1 | No limitation |
| sample_frequency | unsigned | 0.001 | No limitation |
| quantization | unsigned | integer_coefficients | integer_coefficients, quantize_only, maximize_dynamic_range |
| best_precision | unsigned | false | false, true |
| coeff_structure | unsigned | non_symmetric | inferred, non_symmetric, symmetric, negative_symmetric, half_band, hilbert |
| output_rounding_mode | unsigned | full_precision | full_precision, truncate_lsbs, non_symmetric_rounding_down, non_symmetric_rounding_up, symmetric_rounding_to_zero, symmetric_rounding_to_infinity, convergent_rounding_to_even, convergent_rounding_to_odd |
| filter_arch | unsigned | systolic_multiply_accumulate | systolic_multiply_accumulate, transpose_multiply_accumulate |
| optimization_goal | unsigned | area | area, speed |
| inter_column_pipe_length | unsigned | 4 | 1-16 |
| column_config | unsigned | 1 | Limited by number of DSP48s used |
| config_method | unsigned | single | single, by_channel |
| coeff_padding | bool | false | false, true |
Using the FIR Function
The FIR function is defined in the HLS namespace and can be called as follows:
// Create an instance of the FIR
static hls::FIR<STATIC_PARAM> fir1;
// Execute the FIR instance fir1
fir1.run(INPUT_DATA_ARRAY, OUTPUT_DATA_ARRAY);
The STATIC_PARAM is the static parameterization struct discussed in the earlier section
FIR Static Parameters. This defines most static parameters for the FIR.
Both the input and output data are supplied to the function as arrays (INPUT_DATA_ARRAY
and OUTPUT_DATA_ARRAY). In the final implementation, these ports on the FIR IP will be
implemented as AXI4-Stream ports. Xilinx recommends always using the FIR function in a
region using the dataflow optimization (set_directive_dataflow), because this
ensures the arrays are implemented as streaming arrays. An alternative is to specify both
arrays as streaming using the set_directive_stream command.
IMPORTANT: The FIR cannot be used in a region which is pipelined. If high-performance operation is
required, pipeline the loops or functions before and after the FIR then use dataflow optimization on all
loops and functions in the region.
The multichannel functionality of the FIR is supported through interleaving the data in a
single input and single output array.
• The size of the input array should be large enough to accommodate all samples:
  num_channels * input_length.
• The output array size should be specified to contain all output samples:
  num_channels * output_length.
The following code example demonstrates, for two channels, how the data is interleaved. In
this example, the top-level function has two channels of input data (din_i, din_q) and
two channels of output data (dout_i, dout_q). Two functions, at the front-end (fe) and
back-end (be) are used to correctly order the data in the FIR input array and extract it from
the FIR output array.
// Front end: interleave the two input channels into the FIR input array
void dummy_fe(din_t din_i[LENGTH], din_t din_q[LENGTH], din_t out[FIR_LENGTH]) {
    for (unsigned i = 0; i < LENGTH; ++i) {
        out[2*i] = din_i[i];
        out[2*i + 1] = din_q[i];
    }
}
// Back end: extract the two output channels from the FIR output array
void dummy_be(dout_t in[FIR_LENGTH], dout_t dout_i[LENGTH], dout_t dout_q[LENGTH]) {
    for (unsigned i = 0; i < LENGTH; ++i) {
        dout_i[i] = in[2*i];
        dout_q[i] = in[2*i + 1];
    }
}
void fir_top(din_t din_i[LENGTH], din_t din_q[LENGTH],
             dout_t dout_i[LENGTH], dout_t dout_q[LENGTH]) {
    din_t fir_in[FIR_LENGTH];
    dout_t fir_out[FIR_LENGTH];
    // STATIC_PARAM is the static parameterization struct for this FIR
    static hls::FIR<STATIC_PARAM> fir1;
    dummy_fe(din_i, din_q, fir_in);
    fir1.run(fir_in, fir_out);
    dummy_be(fir_out, dout_i, dout_q);
}
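The interleaving order can be checked in plain C++ before synthesis. The following is a minimal sketch using std::vector; the function names interleave2 and deinterleave2 are hypothetical helpers for this illustration and are not part of the FIR library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Interleave two equal-length channels: out = {i0, q0, i1, q1, ...}.
std::vector<int> interleave2(const std::vector<int>& ch_i,
                             const std::vector<int>& ch_q) {
    assert(ch_i.size() == ch_q.size());
    std::vector<int> out(2 * ch_i.size());
    for (std::size_t n = 0; n < ch_i.size(); ++n) {
        out[2 * n]     = ch_i[n];
        out[2 * n + 1] = ch_q[n];
    }
    return out;
}

// Split an interleaved stream back into its two channels.
void deinterleave2(const std::vector<int>& in,
                   std::vector<int>& ch_i, std::vector<int>& ch_q) {
    ch_i.assign(in.size() / 2, 0);
    ch_q.assign(in.size() / 2, 0);
    for (std::size_t n = 0; n < in.size() / 2; ++n) {
        ch_i[n] = in[2 * n];
        ch_q[n] = in[2 * n + 1];
    }
}
```

The interleaved array holds num_channels * input_length samples, which matches the sizing rule given above for two channels.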
Optional FIR Run Time Configuration
In some modes of operation, the FIR requires an additional input to configure how the
coefficients are used. For a complete description of which modes require this input
configuration, refer to the LogiCORE IP FIR Compiler Product Guide (PG149) [Ref 6].
This input configuration can be performed in the C code using a standard ap_int.h 8-bit
data type. In this example, the header file fir_top.h specifies the use of the FIR and
ap_fixed libraries, defines a number of the design parameter values and then defines
some fixed-point types based on these:
#include "ap_fixed.h"
#include "hls_fir.h"
const unsigned FIR_LENGTH = 21;
const unsigned INPUT_WIDTH = 16;
const unsigned INPUT_FRACTIONAL_BITS = 0;
const unsigned OUTPUT_WIDTH = 24;
const unsigned OUTPUT_FRACTIONAL_BITS = 0;
const unsigned COEFF_WIDTH = 16;
const unsigned COEFF_FRACTIONAL_BITS = 0;
const unsigned COEFF_NUM = 7;
const unsigned COEFF_SETS = 3;
const unsigned INPUT_LENGTH = FIR_LENGTH;
const unsigned OUTPUT_LENGTH = FIR_LENGTH;
const unsigned CHAN_NUM = 1;
// Fixed-point types: total width, then integer bits (width minus fractional bits)
typedef ap_fixed<INPUT_WIDTH, INPUT_WIDTH - INPUT_FRACTIONAL_BITS> s_data_t;
typedef ap_fixed<OUTPUT_WIDTH, OUTPUT_WIDTH - OUTPUT_FRACTIONAL_BITS> m_data_t;
typedef ap_uint<8> config_t;
In the top-level code, the information in the header file is included, the static
parameterization struct is created using the same constant values used to specify the
bit-widths, ensuring the C code and FIR configuration match, and the coefficients are
specified. At the top-level, an input configuration, defined in the header file as 8-bit data,
is passed into the FIR.
#include "fir_top.h"
struct param1 : hls::ip_fir::params_t {
    static const double coeff_vec[total_num_coeff];
    static const unsigned input_length = INPUT_LENGTH;
    static const unsigned output_length = OUTPUT_LENGTH;
    static const unsigned num_coeffs = COEFF_NUM;
    static const unsigned coeff_sets = COEFF_SETS;
};
const double param1::coeff_vec[total_num_coeff] =
    {6,0,-4,-3,5,6,-6,-13,7,44,64,44,7,-13,-6,6,5,-3,-4,0,6};
// Front end: forward the configuration and copy the input samples
void dummy_fe(s_data_t in[INPUT_LENGTH], s_data_t out[INPUT_LENGTH],
              config_t* config_in, config_t* config_out)
{
    *config_out = *config_in;
    for (unsigned i = 0; i < INPUT_LENGTH; ++i)
        out[i] = in[i];
}
// Back end: copy the output samples
void dummy_be(m_data_t in[OUTPUT_LENGTH], m_data_t out[OUTPUT_LENGTH])
{
    for (unsigned i = 0; i < OUTPUT_LENGTH; ++i)
        out[i] = in[i];
}
// DUT
void fir_top(s_data_t in[INPUT_LENGTH],
             m_data_t out[OUTPUT_LENGTH],
             config_t* config)
{
    s_data_t fir_in[INPUT_LENGTH];
    m_data_t fir_out[OUTPUT_LENGTH];
    config_t fir_config;
    // Create the FIR instance using the param1 struct
    static hls::FIR<param1> fir1;
    //==================================================
    // Dataflow process
    dummy_fe(in, fir_in, config, &fir_config);
    fir1.run(fir_in, fir_out, &fir_config);
    dummy_be(fir_out, out);
    //==================================================
}
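For intuition only, the effect of the configuration input can be modeled in software. The sketch below assumes that the 8-bit configuration value selects one of the coefficient sets and that the sets are stored back to back in coeff_vec, num_coeffs values per set; both points are assumptions to confirm against PG149. The function fir_select is a hypothetical behavioral stand-in, not the FIR IP.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Behavioral stand-in: direct-form FIR that uses coefficient set `sel`
// from a flat vector laid out as [set0 | set1 | ...] (assumed layout).
std::vector<double> fir_select(const std::vector<double>& x,
                               const std::vector<double>& coeff_vec,
                               std::size_t num_coeffs,
                               unsigned char sel) {
    const std::size_t base = static_cast<std::size_t>(sel) * num_coeffs;
    assert(base + num_coeffs <= coeff_vec.size());
    std::vector<double> y(x.size(), 0.0);
    for (std::size_t n = 0; n < x.size(); ++n) {
        // Sum over the taps that overlap the samples seen so far.
        for (std::size_t k = 0; k < num_coeffs && k <= n; ++k) {
            y[n] += coeff_vec[base + k] * x[n - k];
        }
    }
    return y;
}
```

Feeding a unit impulse through this model returns the selected coefficient set, which gives a simple way to see which set a given configuration value would pick under the stated assumptions.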
Design examples using the FIR C library are provided in the Vivado HLS examples and can
be accessed using menu option Help > Welcome > Open Example Project > Design
Examples > FIR.
DDS IP Library
You can use the Xilinx Direct Digital Synthesizer (DDS) IP block within a C++ design using
the hls_dds.h library. This section explains how to configure DDS IP in your C++ code.
RECOMMENDED: Xilinx highly recommends that you review the LogiCORE IP DDS Compiler Product
Guide (PG141) [Ref 7] for information on how to implement and use the features of the IP.
IMPORTANT: The C IP implementation of the DDS IP core supports the fixed mode for the
Phase_Increment and Phase_Offset parameters and supports the none mode for Phase_Offset, but it
does not support programmable and streaming modes for these parameters.
To use the DDS in the C++ code:
1. Include the hls_dds.h library in the code.
2. Set the default parameters using the pre-defined struct hls::ip_dds::params_t.
3. Call the DDS function.
First, include the DDS library in the source code. This header file resides in the include
directory in the Vivado HLS installation area, which is automatically searched when Vivado
HLS executes.
#include "hls_dds.h"
Define the static parameters of the DDS. For example, define the phase width, clock rate,
and phase increment and phase offset values. The DDS C library includes a parameterization struct
hls::ip_dds::params_t, which is used to initialize all static parameters with default
values. By redefining any of the values in this struct, you can customize the implementation.
The following example shows how to override the default values for the phase width, clock
rate, phase increment, and phase offset using a user-defined struct param1, which
is based on the existing predefined struct hls::ip_dds::params_t:
struct param1 : hls::ip_dds::params_t {
    static const unsigned Phase_Width = PHASEWIDTH;
    static const double DDS_Clock_Rate = 25.0;
    static const double PINC[16];
    static const double POFF[16];
};
Create an instance of the DDS function using the HLS namespace with the defined static
parameters (for example, param1). Then, call the function with the run method to execute
the function. Following are the data and phase function arguments shown in order:
static hls::DDS<param1> dds1;
dds1.run(data_channel, phase_channel);
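To make the roles of the phase increment and phase offset concrete, the following is a minimal software sketch of a phase-accumulator DDS. DdsModel is a hypothetical stand-in written for this illustration: it uses integer phase words and a floating-point cosine, and it does not reproduce the data formats, latency, or noise shaping of the DDS IP.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Behavioral stand-in for a phase-accumulator DDS (not the DDS Compiler IP).
struct DdsModel {
    unsigned phase_width;   // accumulator width in bits (1..31 in this sketch)
    std::uint32_t pinc;     // phase increment added every sample
    std::uint32_t poff;     // fixed phase offset
    std::uint32_t acc;      // accumulated phase

    DdsModel(unsigned width, std::uint32_t inc, std::uint32_t off)
        : phase_width(width), pinc(inc), poff(off), acc(0) {}

    // Return the current phase (accumulator plus offset), then advance.
    std::uint32_t next_phase() {
        const std::uint32_t mask = (1u << phase_width) - 1u;
        const std::uint32_t phase = (acc + poff) & mask;
        acc = (acc + pinc) & mask;
        return phase;
    }

    // Map a phase word to a cosine sample in [-1, 1].
    double cosine(std::uint32_t phase) const {
        const double two_pi = 6.283185307179586;
        return std::cos(two_pi * phase / static_cast<double>(1u << phase_width));
    }
};
```

With a 16-bit accumulator and an increment of 16384, the phase advances a quarter turn per sample and wraps back to zero on the fifth call.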
To access design examples that use the DDS C library, select Help > Welcome > Open
Example Project > Design Examples > DDS.
DDS Static Parameters
The static parameters of the DDS define how to configure the DDS, such as the clock rate,
phase interval, and modes. The hls_dds.h header file defines an
hls::ip_dds::params_t struct, which sets the default values for the static parameters.
To use the default values, you can use the parameterization struct directly with the DDS
function.
static hls::DDS< hls::ip_dds::params_t > dds1;
dds1.run(data_channel, phase_channel);
The following table describes the parameters for the hls::ip_dds::params_t
parameterization struct. For a list of possible values for the parameters, including default
values, see Table 2-24.
RECOMMENDED: Xilinx highly recommends that you review the LogiCORE IP DDS Compiler Product
Guide (PG141) [Ref 7] for details on the parameters and values.
Table 2-23:
DDS Struct Parameters
Parameter
Description
DDS_Clock_Rate
Specifies the clock rate for the DDS output.
Channels
Specifies the number of channels. The DDS and phase generator
can support up to 16 channels. The channels are time-multiplexed,
which reduces the effective clock frequency per channel.
Mode_of_Operation
Specifies one of the following operation modes:
• Standard mode for use when the accumulated phase can be
truncated before it is used to access the SIN/COS LUT.
• Rasterized mode for use when the desired frequencies and
system clock are related by a rational fraction.
Modulus
Describes the relationship between the system clock frequency and
the desired frequencies.
Note: Use this parameter in rasterized mode only.
Spurious_Free_Dynamic_Range
Specifies the targeted purity of the tone produced by the DDS.
Frequency_Resolution
Specifies the minimum frequency resolution in Hz and determines
the Phase Width used by the phase accumulator, including
associated phase increment (PINC) and phase offset (POFF) values.
Noise_Shaping
Controls whether to use phase truncation, dithering, or Taylor
series correction.
Phase_Width
Sets the width of the following:
• PHASE_OUT field within m_axis_phase_tdata
• Phase field within s_axis_phase_tdata when the DDS is
configured to be a SIN/COS LUT only
• Phase accumulator
• Associated phase increment and offset registers
• Phase field in s_axis_config_tdata
Note: For rasterized mode, the phase width is fixed as the number of bits
required to describe the valid input range [0, Modulus-1], that is, log2
(Modulus-1) rounded up.
Output_Width
Sets the width of SINE and COSINE fields within
m_axis_data_tdata. The SFDR provided by this parameter
depends on the selected Noise Shaping option.
Phase_Increment
Selects the phase increment value.
Phase_Offset
Selects the phase offset value.
Output_Selection
Sets the output selection to SINE, COSINE, or both in the
m_axis_data_tdata bus.
Negative_Sine
Negates the SINE field at run time.
Negative_Cosine
Negates the COSINE field at run time.
Amplitude_Mode
Sets the amplitude to full range or unit circle.
Memory_Type
Controls the implementation of the SIN/COS LUT.
Optimization_Goal
Controls whether the implementation decisions target highest
speed or lowest resource.
DSP48_Use
Controls the implementation of the phase accumulator and
addition stages for phase offset, dither noise addition, or both.
Latency_Configuration
Sets the latency of the core to the optimum value based upon the
Optimization Goal.
Latency
Specifies the manual latency value.
Output_Form
Sets the output form to two’s complement or to sign and
magnitude. In general, the output of SINE and COSINE is in two’s
complement form. However, when quadrant symmetry is used, the
output form can be changed to sign and magnitude.
PINC[XIP_DDS_CHANNELS_MAX]
Sets the values for the phase increment for each output channel.
POFF[XIP_DDS_CHANNELS_MAX]
Sets the values for the phase offset for each output channel.
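The link between Frequency_Resolution, Modulus, and the phase width can be illustrated with two small helper functions. This is a sketch under stated assumptions: the standard-mode helper assumes the usual single-channel phase-accumulator relation (resolution = clock frequency / 2^width, both in Hz), and both function names are hypothetical. Confirm the exact rules in PG141.

```cpp
#include <cassert>
#include <cmath>

// Standard mode (assumed relation): smallest accumulator width B such that
// clock_hz / 2^B is no coarser than the requested resolution.
unsigned phase_width_standard(double clock_hz, double resolution_hz) {
    return static_cast<unsigned>(std::ceil(std::log2(clock_hz / resolution_hz)));
}

// Rasterized mode: number of bits needed to represent every value in the
// valid input range [0, modulus - 1].
unsigned phase_width_rasterized(unsigned modulus) {
    unsigned bits = 0;
    for (unsigned max_value = modulus - 1; max_value != 0; max_value >>= 1) {
        ++bits;
    }
    return bits;
}
```

For example, a 100 MHz clock with a 0.4 Hz resolution gives a 28-bit accumulator under this relation, and every Modulus value in the range 129 to 256 needs 8 bits.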
The following table shows the possible values for the hls::ip_dds::params_t
parameterization struct parameters.
Table 2-24: DDS Struct Parameters Values
| Parameter | C Type | Default Value | Valid Values |
|---|---|---|---|
| DDS_Clock_Rate | double | 20.0 | Any double value |
| Channels | unsigned | 1 | 1 to 16 |
| Mode_of_Operation | unsigned | XIP_DDS_MOO_CONVENTIONAL | • XIP_DDS_MOO_CONVENTIONAL truncates the accumulated phase. • XIP_DDS_MOO_RASTERIZED selects rasterized mode. |
| Modulus | unsigned | 200 | 129 to 256 |
| Spurious_Free_Dynamic_Range | double | 20.0 | 18.0 to 150.0 |
| Frequency_Resolution | double | 10.0 | 0.000000001 to 125000000 |
| Noise_Shaping | unsigned | XIP_DDS_NS_NONE | • XIP_DDS_NS_NONE produces phase truncation DDS. • XIP_DDS_NS_DITHER uses phase dither to improve SFDR at the expense of increased noise floor. • XIP_DDS_NS_TAYLOR interpolates sine/cosine values using the otherwise discarded bits from phase truncation. • XIP_DDS_NS_AUTO automatically determines noise-shaping. |
| Phase_Width | unsigned | 16 | Must be an integer multiple of 8 |
| Output_Width | unsigned | 16 | Must be an integer multiple of 8 |
| Phase_Increment | unsigned | XIP_DDS_PINCPOFF_FIXED | XIP_DDS_PINCPOFF_FIXED fixes PINC at generation time, and PINC cannot be changed at run time. Note: This is the only value supported. |
| Phase_Offset | unsigned | XIP_DDS_PINCPOFF_NONE | • XIP_DDS_PINCPOFF_NONE does not generate phase offset. • XIP_DDS_PINCPOFF_FIXED fixes POFF at generation time, and POFF cannot be changed at run time. |
| Output_Selection | unsigned | XIP_DDS_OUT_SIN_AND_COS | • XIP_DDS_OUT_SIN_ONLY produces sine output only. • XIP_DDS_OUT_COS_ONLY produces cosine output only. • XIP_DDS_OUT_SIN_AND_COS produces both sine and cosine output. |
| Negative_Sine | unsigned | XIP_DDS_ABSENT | • XIP_DDS_ABSENT produces standard sine wave. • XIP_DDS_PRESENT negates sine wave. |
| Negative_Cosine | bool | XIP_DDS_ABSENT | • XIP_DDS_ABSENT produces standard cosine wave. • XIP_DDS_PRESENT negates cosine wave. |
| Amplitude_Mode | unsigned | XIP_DDS_FULL_RANGE | • XIP_DDS_FULL_RANGE normalizes amplitude to the output width with the binary point in the first place. For example, an 8-bit output has a binary amplitude of 100000000 - 10 giving values between 01111110 and 11111110, which corresponds to just less than 1 and just more than -1 respectively. • XIP_DDS_UNIT_CIRCLE normalizes amplitude to half full range, that is, values range from 01000 .. (+0.5) to 110000 .. (-0.5). |
| Memory_Type | unsigned | XIP_DDS_MEM_AUTO | • XIP_DDS_MEM_AUTO selects distributed ROM for small cases where the table can be contained in a single layer of memory and selects block ROM for larger cases. • XIP_DDS_MEM_BLOCK always uses block RAM. • XIP_DDS_MEM_DIST always uses distributed RAM. |
| Optimization_Goal | unsigned | XIP_DDS_OPTGOAL_AUTO | • XIP_DDS_OPTGOAL_AUTO automatically selects the optimization goal. • XIP_DDS_OPTGOAL_AREA optimizes for area. • XIP_DDS_OPTGOAL_SPEED optimizes for performance. |
| DSP48_Use | unsigned | XIP_DDS_DSP_MIN | • XIP_DDS_DSP_MIN implements the phase accumulator and the stages for phase offset, dither noise addition, or both in FPGA logic. • XIP_DDS_DSP_MAX implements the phase accumulator and the phase offset, dither noise addition, or both using DSP slices. In the case of single channel, the DSP slice can also provide the register to store programmable phase increment, phase offset, or both and thereby save further fabric resources. |
| Latency_Configuration | unsigned | XIP_DDS_LATENCY_AUTO | • XIP_DDS_LATENCY_AUTO automatically determines the latency. • XIP_DDS_LATENCY_MANUAL manually specifies the latency using the Latency option. |
| Latency | unsigned | 5 | Any value |
| Output_Form | unsigned | XIP_DDS_OUTPUT_TWOS | • XIP_DDS_OUTPUT_TWOS outputs two's complement. • XIP_DDS_OUTPUT_SIGN_MAG outputs signed magnitude. |
| PINC[XIP_DDS_CHANNELS_MAX] | unsigned array | {0} | Any value for the phase increment for each channel |
| POFF[XIP_DDS_CHANNELS_MAX] | unsigned array | {0} | Any value for the phase offset for each channel |
SRL IP Library
C code is written to satisfy several different requirements: reuse, readability, and
performance. Until now, it is unlikely that the C code was written to result in the most ideal
hardware after high-level synthesis.
As with reuse, readability, and performance, certain coding techniques or
pre-defined constructs can help the synthesis output result in more optimal
hardware, or model the hardware more closely in C for easier validation of the algorithm.
Mapping Directly into SRL Resources
Many C algorithms sequentially shift data through arrays. They add a new value to the start
of the array, shift the existing data through the array, and drop the oldest data value. This
operation is implemented in hardware as a shift register.
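The array-shifting pattern described above can be written as a short loop. The following is an illustrative sketch in plain C++; the depth of 4 and the name shift_in are chosen for this example only.

```cpp
#include <cassert>

const int DEPTH = 4;  // hypothetical shift-register depth for this sketch

// Load `in` at index 0, move every existing element one place up, and
// return the value that falls off the end (the oldest sample).
int shift_in(int regs[DEPTH], int in) {
    const int oldest = regs[DEPTH - 1];
    for (int i = DEPTH - 1; i > 0; --i) {
        regs[i] = regs[i - 1];
    }
    regs[0] = in;
    return oldest;
}
```

The loop walks from the highest index down so that each element is copied before it is overwritten.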
The most common way to implement a shift register from C into hardware is to completely
partition the array into individual elements, and allow the data dependencies between the
elements in the RTL to imply a shift register.
Logic synthesis typically implements the RTL shift register into a Xilinx SRL resource, which
efficiently implements shift registers. The issue is that sometimes logic synthesis does not
implement the RTL shift register using an SRL component:
• When data is accessed in the middle of the shift register, logic synthesis cannot directly
  infer an SRL.
• Sometimes, even when the SRL is ideal, logic synthesis may implement the shift register
  in flip-flops due to other factors. (Logic synthesis is itself a complex process.)
Vivado HLS provides a C++ class (ap_shift_reg) to ensure that the shift register defined
in the C code is always implemented using an SRL resource. The ap_shift_reg class has
two methods to perform the various read and write accesses supported by an SRL
component.
Read from the Shifter
The read method allows a specified location to be read from the shift register.
The ap_shift_reg.h header file that defines the ap_shift_reg class is also included
with Vivado HLS as a standalone package. You have the right to use it in your own source
code. The package xilinx_hls_lib_<release_number>.tgz is located in the
include directory in the Vivado HLS installation area.
// Include the Class
#include "ap_shift_reg.h"
// Define a variable of type ap_shift_reg
// - Sreg must use the static qualifier
// - Sreg will hold integer data types
// - Sreg will hold 4 data values
static ap_shift_reg<int, 4> Sreg;
int var1;
// Read location 2 of Sreg into var1
var1 = Sreg.read(2);
Read, Write, and Shift Data
A shift method allows a read, write, and shift operation to be performed.
// Include the Class
#include "ap_shift_reg.h"
// Define a variable of type ap_shift_reg
// - Sreg must use the static qualifier
// - Sreg will hold integer data types
// - Sreg will hold 4 data values
static ap_shift_reg<int, 4> Sreg;
int var1, In1;
// Read location 3 of Sreg into var1
// THEN shift all values up one and load In1 into location 0
var1 = Sreg.shift(In1,3);
Read, Write, and Enable-Shift
The shift method also supports an enabled input, allowing the shift process to be
controlled and enabled by a variable.
// Include the Class
#include "ap_shift_reg.h"
// Define a variable of type ap_shift_reg
// - Sreg must use the static qualifier
// - Sreg will hold integer data types
// - Sreg will hold 4 data values
static ap_shift_reg<int, 4> Sreg;
int var1, In1;
bool En;
// Read location 3 of Sreg into var1
// THEN if En=1
// Shift all values up one and load In1 into location 0
var1 = Sreg.shift(In1,3,En);
When using the ap_shift_reg class, Vivado HLS creates a unique RTL component for
each shifter. When logic synthesis is performed, this component is synthesized into an SRL
resource.
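The read-then-shift ordering of these methods can be checked against a small software model. ShiftRegModel below is a hypothetical stand-in written for this illustration; it follows the semantics stated in the comments of the examples above, is not the ap_shift_reg class, and says nothing about how the hardware is mapped.

```cpp
#include <cassert>

// Software stand-in mirroring the read and shift semantics described above.
template <typename T, unsigned DEPTH>
class ShiftRegModel {
    T data[DEPTH];
public:
    ShiftRegModel() { for (unsigned i = 0; i < DEPTH; ++i) data[i] = T(); }

    // Read one location without shifting.
    T read(unsigned addr) const { return data[addr]; }

    // Read `addr` first, then (if enabled) shift up and load `in` at 0.
    T shift(T in, unsigned addr, bool enable = true) {
        const T out = data[addr];
        if (enable) {
            for (unsigned i = DEPTH - 1; i > 0; --i) data[i] = data[i - 1];
            data[0] = in;
        }
        return out;
    }
};
```

Because the read happens before the shift, the value returned from location 0 is the previously loaded sample, not the one passed in on the same call.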
HLS Linear Algebra Library
The HLS Linear Algebra Library provides a number of commonly used linear algebra
functions. The functions in the HLS Linear Algebra Library all use two-dimensional arrays to
represent matrices and are listed in the following table.
Table 2-25: HLS Linear Algebra Library
| Function | Data Type | Implementation Style |
|---|---|---|
| cholesky | float, ap_fixed, x_complex<float>, x_complex<ap_fixed> | Synthesized |
| cholesky_inverse | float, ap_fixed, x_complex<float>, x_complex<ap_fixed> | Synthesized |
| matrix_multiply | float, ap_fixed, x_complex<float>, x_complex<ap_fixed> | Synthesized |
| qrf | float, x_complex<float> | Synthesized |
| qr_inverse | float, x_complex<float> | Synthesized |
| svd | float, x_complex<float> | Synthesized |
The linear algebra functions all use two-dimensional arrays to represent matrices. All
functions support float (single precision) inputs, for real and complex data. A subset of the
functions support ap_fixed (fixed-point) inputs, for real and complex data. The precision
and rounding behavior of the ap_fixed types may be user defined, if desired.
A complete description of all linear algebra functions is provided in the HLS Linear Algebra
Library Functions in Chapter 4.
Using the Linear Algebra Library
You can reference the HLS linear algebra functions using one of the following methods:
• Using scoped naming:
  #include "hls_linear_algebra.h"
  hls::cholesky(In_Array,Out_Array);
• Using the hls namespace:
  #include "hls_linear_algebra.h"
  using namespace hls; // Namespace specified after the header files
  cholesky(In_Array,Out_Array);
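When validating a design in C simulation, it can help to compare the library output against a plain reference model. The following is a minimal sketch of a real-valued, lower-triangular Cholesky factorization on two-dimensional arrays. The name cholesky_ref and the dimension of 3 are hypothetical, and the code is a textbook algorithm, not the implementation used by hls::cholesky.

```cpp
#include <cassert>
#include <cmath>

const int DIM = 3;  // hypothetical matrix dimension for this sketch

// Reference lower-triangular Cholesky factorization, A = L * L^T, for a
// real symmetric positive-definite A. Returns false if A is not
// positive definite.
bool cholesky_ref(const float A[DIM][DIM], float L[DIM][DIM]) {
    for (int i = 0; i < DIM; ++i) {
        for (int j = 0; j < DIM; ++j) L[i][j] = 0.0f;
    }
    for (int j = 0; j < DIM; ++j) {
        // Diagonal element of column j.
        float diag = A[j][j];
        for (int k = 0; k < j; ++k) diag -= L[j][k] * L[j][k];
        if (diag <= 0.0f) return false;
        L[j][j] = std::sqrt(diag);
        // Elements below the diagonal in column j.
        for (int i = j + 1; i < DIM; ++i) {
            float sum = A[i][j];
            for (int k = 0; k < j; ++k) sum -= L[i][k] * L[j][k];
            L[i][j] = sum / L[j][j];
        }
    }
    return true;
}
```

A test bench could call this model and the library function on the same input and compare the two results within a tolerance suited to the data type.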
Optimizing the Linear Algebra Functions
When using linear algebra functions, you must determine the level of optimization for the
RTL implementation. The level and type of optimization depend on how the C code is
written and how the Vivado HLS directives are applied to the C code.
To simplify the process of optimization, Vivado HLS provides the linear algebra library
functions, which include several C code architectures and embedded optimization
directives. Using a C++ configuration class, you can select the C code to use and the
optimization directives to apply.
Although the exact optimizations vary from function to function, the configuration class
typically allows you to specify the level of optimization for the RTL implementation as
follows:
• Small: Lower resources and throughput
• Balanced: Compromise between resources and throughput
• Fast: Higher throughput at the expense of higher resources
Vivado HLS provides example projects that show how to use the configuration class for
each function in the linear algebra library. You can use these examples as templates to learn
how to configure Vivado HLS for each of the functions for a specific implementation target.
Each example provides a C++ source file with multiple C code architectures as different
C++ functions.
Note: To identify the top-level C++ function, look for the TOP directive in the directives.tcl
file or the Vivado HLS GUI Directive tab.
You can open these examples from the Vivado HLS Welcome screen:
1. Click Open Example Project.
2. In the Examples dialog box, expand Design Examples > linear_algebra >
implementation_targets.
Note: The Welcome Page appears when you invoke the Vivado HLS GUI. You can access it at any
time by selecting Help > Welcome.
To determine which optimization works best for your design, you can compare the
performance and utilization estimates for each solution using the Vivado HLS Compare
Reports feature. To compare the estimates, you must run synthesis for all of the project
solutions by selecting Solution > Run C Synthesis > All Solutions. Then, use the Compare
Reports toolbar button.
Cholesky
Implementation Controls
The following table summarizes the key factors that influence resource utilization, function
throughput (initiation interval), and function latency. The values of Low, Medium, and High
are relative to the other key factors.
Table 2-26: Cholesky Key Factor Summary
| Key Factor | Value | Resources | Throughput | Latency |
|---|---|---|---|---|
| Architecture (ARCH) | 0 | Low | Low | High |
| | 1 | Medium | Medium | Medium |
| | 2 | High | High | Low |
| Inner loop pipelining (INNER_II) | 1 | High | High | Low |
| | >1 | Low | Low | High |
| Inner loop unrolling (UNROLL_FACTOR) | 1 | Low | Low | High |
| | >1 | High | High | Low |
Key Factors
Following is additional information about the key factors in the preceding table:
• Architecture
  ° 0: Uses the lowest DSP utilization and lowest throughput.
  ° 1: Uses higher DSP utilization but minimized memory utilization with increased
    throughput. This value does not support inner loop unrolling to further increase
    throughput.
  ° 2: Uses highest DSP and memory utilization. This value supports inner loop
    unrolling to improve overall throughput with a limited increase in DSP resources.
    This is the most flexible architecture for design exploration.
• Inner loop pipelining
  ° >1: For ARCH 2, enables Vivado HLS to resource share and reduce the DSP
    utilization. When using complex floating-point data types, setting the value to 2 or
    4 significantly reduces DSP utilization.
• Inner loop unrolling
  ° For ARCH 2, duplicates the hardware required to implement the loop processing by
    a specified factor, executes the corresponding number of loop iterations in parallel,
    and increases throughput but also increases DSP and memory utilization.
Specifications
You can specify all factors using a configuration class derived from the following
hls::cholesky_traits base class by redefining the appropriate class member:
struct MY_CONFIG : hls::cholesky_traits {
    static const int ARCH = 2;
    static const int INNER_II = 2;
    static const int UNROLL_FACTOR = 1;
};
The configuration class is supplied to the hls::cholesky_top function as a template
parameter as follows:
hls::cholesky_top(A,L);
The hls::cholesky function uses the following default configuration:
hls::cholesky(A,L);
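The relationship between a traits base class, a derived configuration, and the two entry points can be sketched without the library headers. In the block below, base_traits, my_config, and both functions are hypothetical stand-ins, and the default values are placeholders rather than the defaults in hls_linear_algebra.h.

```cpp
#include <cassert>

// Stand-in base class holding default settings (placeholder values).
struct base_traits {
    static const int ARCH = 1;
    static const int INNER_II = 1;
    static const int UNROLL_FACTOR = 1;
};

// Derived configuration: redeclared members hide the inherited defaults.
struct my_config : base_traits {
    static const int ARCH = 2;
    static const int INNER_II = 2;
};

// A "_top"-style function takes the configuration as a template parameter
// and reads its settings at compile time.
template <typename Traits>
int selected_arch() { return Traits::ARCH; }

// The plain entry point forwards to the "_top" function with the defaults.
inline int selected_arch_default() { return selected_arch<base_traits>(); }
```

Only the members that are redeclared change; UNROLL_FACTOR in my_config keeps the inherited value.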
Examples
The following table shows example implementation solutions for the Cholesky function. The
performance metrics are generated using the Cholesky example project, which defines a
solution for each implementation target. The throughput and latency figures are based on
post-synthesis simulation.
The example project uses the following specifications:
• A input: 16x16 floating point complex matrix
• Synthesis wrapper: Local arrays for the input and output matrix
• Device: Kintex®-7 (xc7k160tfbg484-1)
• Nominal clock period: 4 ns
Table 2-27: Cholesky Implementation Targets
| Solution | Architecture (ARCH) | Inner loop pipelining (INNER_II) | Inner loop unrolling (UNROLL_FACTOR) | DSP | BRAM | FF | LUT | Throughput cycles | Latency cycles |
|---|---|---|---|---|---|---|---|---|---|
| small | 0 | N/A | N/A | 8 | 8 | 5850 | 4271 | 33724 | 33724 |
| balanced | 1 | N/A | N/A | 10 | 8 | 4582 | 3367 | 14466 | 14466 |
| alt_balanced | 2 | 4 | 1 | 10 | 6 | 5115 | 3552 | 15412 | 15412 |
| fast | 2 | 1 | 1 | 36 | 6 | 7820 | 5288 | 9322 | 9322 |
| faster | 2 | 1 | 2 | 72 | 12 | 12569 | 8494 | 8370 | 8370 |
Notes:
1. Bold row indicates the default configuration.
2. N/A indicates key factors that are not utilized or have a limited effect.
3. Values are representative only and are not intended to be exact.
Cholesky Inverse and QR Inverse
Implementation Controls
The following table summarizes the key factors that influence resource utilization, function
throughput (initiation interval), and function latency. The values of Low, Medium, and High
are relative to the other key factors.
Table 2-28: Inverse Key Factor Summary
| Key Factor | Value | Resources | Throughput | Latency |
|---|---|---|---|---|
| Sub-function implementation target (Cholesky/QRF and matrix multiply) | Small | Low | Low | High |
| | Balanced | Medium | Medium | Medium |
| | Fast | High | High | Low |
| Back substitution inner and diagonal loop pipelining | 1 | High | High | Low |
| | >1 | Low | Low | High |
| DATAFLOW directive | Yes | Medium | High | High |
| INLINE directive | Yes | Low | Low | High |
Key Factors
Following is additional information about the key factors shown in the preceding table:
• Sub-function implementation
  ° Utilizes the following sub-functions executed sequentially: Cholesky or QRF, back
    substitution, and matrix multiply. The implementation selected for these
    sub-functions determines the resource utilization and function throughput/latency
    of the Inverse function.
• Back substitution inner and diagonal loop pipelining
  ° >1: Enables Vivado HLS to resource share and reduce the DSP utilization.
• DATAFLOW directive
  ° Pipelines sequential tasks, which increases the function throughput to an initiation
    interval based on the maximum sub-function latency rather than the sum of the
    individual sub-function latencies. The function throughput substantially increases
    along with an increase in overall latency. Additional memory resources are required.
• INLINE directive
  ° Removes the sub-function hierarchy and allows Vivado HLS to better share
    resources and can reduce DSP and memory utilization.
TIP: You can adjust the resources and throughput of the Inverse functions to meet specific requirements
by combining the DATAFLOW directive with the appropriate sub-function implementations.
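The throughput effect of the DATAFLOW directive can be estimated with simple arithmetic. The sketch below is a first-order model only: it ignores the extra cycles spent in the buffers between stages, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Interval between accepted inputs when the three sub-functions run one
// after another: the sum of the stage latencies (in clock cycles).
long sequential_interval(long chol, long back_sub, long mult) {
    return chol + back_sub + mult;
}

// First-order estimate with DATAFLOW: stages overlap on successive inputs,
// so the interval approaches the latency of the slowest stage.
long dataflow_interval(long chol, long back_sub, long mult) {
    return std::max(chol, std::max(back_sub, mult));
}
```

With made-up stage latencies of 1900, 1500, and 1100 cycles, the sequential interval is 4500 cycles and the dataflow estimate is 1900 cycles. This is why balancing the sub-function implementations matters once DATAFLOW is applied.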
Specifications
The DATAFLOW directive is applied to the hls::cholesky_inverse_top or
hls::qr_inverse_top function as follows:
set_directive_dataflow "cholesky_inverse_top"
The INLINE directive is applied in the same manner:
set_directive_inline -recursive "cholesky_inverse_top"
You can specify the individual sub-function implementations using a configuration class
derived from the following hls::cholesky_inverse_traits or
hls::qr_inverse_traits base class by redefining the appropriate class member:
typedef hls::cholesky_inverse_traits MY_DFLT_CFG;
struct MY_CONFIG : MY_DFLT_CFG {
    struct CHOLESKY_TRAITS : hls::cholesky_traits {
        static const int ARCH = 1;
    };
    struct BACK_SUB_CONFIG : hls::back_substitute_traits {
        static const int INNER_II = 2;
        static const int DIAG_II = 2;
    };
    struct MULTIPLIER_CONFIG : hls::matrix_multiply_traits {
        static const int INNER_II = 2;
    };
};
The configuration class is supplied to the hls::cholesky_inverse_top or
hls::qr_inverse_top function as a template parameter as follows:
hls::cholesky_inverse_top(A,INVERSE_A,inverse_OK);
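The configuration mechanism relies on ordinary C++ inheritance and template parameters, so the pattern can be illustrated without the HLS headers. The following is a minimal sketch only: the demo namespace, solver_traits, solver_top_inner_ii, and solver_inner_ii are hypothetical stand-ins and are not names from the HLS library.

```cpp
#include <cassert>

namespace demo {  // hypothetical stand-in, not the hls:: library

// Default settings, playing the role of a library traits base class.
struct solver_traits {
    static const int ARCH = 0;
    static const int INNER_II = 1;
};

// A "_top" style entry point takes the configuration class as a
// template parameter and reads its static members at compile time.
template <typename Traits>
int solver_top_inner_ii() {
    assert(Traits::INNER_II >= 1);
    return Traits::INNER_II;
}

// The convenience entry point forwards the default configuration.
inline int solver_inner_ii() {
    return solver_top_inner_ii<solver_traits>();
}

}  // namespace demo

// User configuration: derive from the defaults and redefine only the
// members that should change. ARCH is inherited unchanged.
struct MY_CONFIG : demo::solver_traits {
    static const int INNER_II = 2;
};
```

In this sketch, demo::solver_top_inner_ii<MY_CONFIG>() sees the redefined INNER_II, while demo::solver_inner_ii() continues to use the default value.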
The hls::cholesky_inverse or hls::qr_inverse function uses the following
default configuration:
hls::cholesky_inverse(A,INVERSE_A,inverse_OK);
Examples
The following table shows example implementation solutions for the Cholesky and matrix
multiply sub-functions. The performance metrics are generated using the Cholesky
Inverse example project, which defines a solution for each implementation target. The
throughput and latency figures are based on post-synthesis simulation.
The example projects use the following specifications:
• A input: 8x8 floating-point complex matrix
• Synthesis wrapper: Local arrays for the input and output matrix
• Device: Kintex-7 (xc7k160tfbg484-1)
• Nominal clock period: 4 ns
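Table 2-29: Cholesky Inverse Implementation Targets
Column groups: Key Factor (INLINE directive through Back Subst. DIAG_II), Resources (DSP through LUT), Performance Metric (Throughput and Latency).
| Solution | INLINE directive | DATAFLOW directive | Cholesky and Multiply Target | Back Subst. INNER_II | Back Subst. DIAG_II | DSP | BRAM | FF | LUT | Throughput cycles | Latency cycles |
| smaller | ✓ | N/A | Small | 8 | 8 | 8 | 17 | 6786 | 5268 | 13972 | 13972 |
| small | N/A | N/A | Small | 8 | 8 | 18 | 21 | 9924 | 6887 | 11762 | 11762 |
| balanced | N/A | N/A | Balanced | 2 | 2 | 38 | 16 | 10625 | 7808 | 2181 | 10182 |
| balanced_high_throughput | N/A | ✓ | Balanced | 2 | 2 | 38 | 26 | 10566 | 7708 | 5820 | 5820 |
| default | N/A | N/A | Default | 1 | 1 | 66 | 16 | 13464 | 9286 | 4885 | 4885 |
| fast | N/A | N/A | Fast | 1 | 1 | 92 | 18 | 16588 | 11179 | 4533 | 4533 |
| fast_high_throughput | N/A | ✓ | Fast | 1 | 1 | 92 | 28 | 16562 | 11112 | 1900 | 8428 |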
Notes:
1. Bold row indicates the default configuration.
2. N/A indicates key factors that are not utilized or have a limited effect.
3. Values are representative only and are not intended to be exact.
The following table shows example implementation solutions for the QRF and matrix
multiply sub-functions. The performance metrics are generated using the QR Inverse
example project, which defines a solution for each implementation target. The throughput
and latency figures are based on post-synthesis simulation.
The example projects use the following specifications:
• A input: 8x8 floating-point complex matrix
• Synthesis wrapper: Local arrays for the input and output matrix
• Device: Kintex-7 (xc7k160tfbg484-1)
• Nominal clock period: 4 ns
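Table 2-30: QRF Inverse Implementation Targets
Column groups: Key Factor (INLINE directive through Back Subst. DIAG_II), Resources (DSP through LUT), Performance Metric (Throughput and Latency).
| Solution | INLINE directive | DATAFLOW directive | QRF and Multiply Target | Back Subst. INNER_II | Back Subst. DIAG_II | DSP | BRAM | FF | LUT | Throughput cycles | Latency cycles |
| smaller | ✓ | N/A | Small | 8 | 8 | 18 | 23 | 13530 | 9715 | 10734 | 10734 |
| small | N/A | N/A | Small | 8 | 8 | 33 | 25 | 16249 | 11721 | 10705 | 10705 |
| balanced | N/A | N/A | Balanced | 2 | 2 | 92 | 26 | 39436 | 21675 | 6277 | 6277 |
| balanced_high_throughput | N/A | ✓ | Balanced | 2 | 2 | 92 | 38 | 39461 | 21653 | 2975 | 12458 |
| default | N/A | N/A | Default | 1 | 1 | 110 | 26 | 41254 | 22532 | 5982 | 5982 |
| fast | N/A | N/A | Fast | 1 | 1 | 146 | 26 | 45026 | 25471 | 5576 | 5576 |
| fast_high_throughput | N/A | ✓ | Fast | 1 | 1 | 146 | 38 | 45051 | 25449 | 2650 | 11066 |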
Notes:
1. Bold row indicates the default configuration.
2. N/A indicates key factors that are not utilized or have a limited effect.
3. Values are representative only and are not intended to be exact.
Matrix Multiply
Implementation Controls
The following table summarizes the key factors that influence resource utilization, function
throughput (initiation interval), and function latency. The values of Low, Medium, and High
are relative to the other key factors.
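Table 2-31: Matrix Multiply Key Factor Summary
| Key Factor | Value | Resources | Throughput | Latency |
| Architecture (ARCH) | 2 (Floating Point) | Low | Low | High |
| Architecture (ARCH) | 3 (Floating Point) | High | High | Low |
| Architecture (ARCH) | 0 (Fixed Point) | Low | Low | High |
| Architecture (ARCH) | 2 (Fixed Point) | Medium | Medium | Medium |
| Architecture (ARCH) | 4 (Fixed Point) | High | High | Low |
| Inner loop pipelining (INNER_II) | 1 | High | High | Low |
| Inner loop pipelining (INNER_II) | >1 | Low | Low | High |
| Inner loop unrolling (UNROLL_FACTOR) | 1 | Low | Low | High |
| Inner loop unrolling (UNROLL_FACTOR) | >1 | High | High | Low |
| Resource directive (RESOURCE) | LUTRAM | Medium | N/A | N/A |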
Key Factors
Following is additional information about the key factors in the preceding table:
• Architecture
  The ARCH key factor selects the architecture based on the implementation data type.
  ° Floating-point data types
    - 2: Ensures the inner accumulation loop achieves the maximum throughput with an II of 1. This value supports inner loop partial unrolling, which improves overall throughput with a limited increase in DSP resources.
    - 3: Implements a fully unrolled inner accumulation loop, which uses the most DSP resources and provides the highest throughput.
  ° Fixed-point data types
    - 0: Uses the fewest resources and provides the lowest throughput.
    - 2: Supports inner loop partial unrolling to improve overall throughput with a limited increase in DSP resources.
    - 4: Implements a fully unrolled inner accumulation loop, which uses the most DSP resources and provides the highest throughput.
• Inner loop pipelining
  ° >1: When using complex floating-point data types, shares resources and reduces DSP utilization. Setting the value to 2 or 4 significantly reduces DSP utilization.
• Inner loop unrolling
  ° For ARCH 2, duplicates the hardware required to implement the loop processing by a specified factor and executes the corresponding number of loop iterations in parallel. This increases throughput but also increases DSP and memory utilization.
  ° For ARCH 3 or 4, fully unrolls the accumulation loop.
• Resource directive
  By default, Vivado HLS uses Block RAM to implement arrays.
  ° For ARCH 2, partially unrolling the accumulation loop results in Vivado HLS splitting the sum_mult array across multiple Block RAMs.
  ° When the partitioned size does not require a Block RAM, use the RESOURCE directive to specify a LUTRAM.
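Partial unrolling of an accumulation loop can be sketched in plain C++ as follows. This is an illustration of the general technique under stated assumptions, not the library's implementation: dot_partial_unroll and its template parameters are hypothetical names, and the sketch requires the vector length to be a multiple of the unroll factor.

```cpp
#include <cassert>

// Hypothetical illustration of partial unrolling of an accumulation loop.
// UNROLL independent partial sums are kept so that UNROLL multiply-adds
// can proceed in parallel. N must be a multiple of UNROLL in this sketch.
template <int N, int UNROLL>
float dot_partial_unroll(const float (&a)[N], const float (&b)[N]) {
    static_assert(N % UNROLL == 0, "N must be a multiple of UNROLL");
    float partial[UNROLL] = {};  // one accumulator per parallel lane

    for (int i = 0; i < N; i += UNROLL) {
        // In hardware, the iterations of this inner loop would map to
        // UNROLL parallel multiply-accumulate units.
        for (int u = 0; u < UNROLL; ++u) {
            partial[u] += a[i + u] * b[i + u];
        }
    }

    float sum = 0.0f;  // final reduction over the lanes
    for (int u = 0; u < UNROLL; ++u) {
        sum += partial[u];
    }
    return sum;
}
```

Each lane needs its own accumulator storage, which is one way to picture why partial unrolling splits an accumulation array into several smaller memories.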
Specifications
Except for the RESOURCE directive, you can specify all factors using a configuration class
derived from the following hls::matrix_multiply_traits base class by redefining
the appropriate class member:
struct MY_CONFIG: hls::matrix_multiply_traits{
    static const int ARCH          = 2;
    static const int INNER_II      = 1;
    static const int UNROLL_FACTOR = 2;
};
The configuration class is supplied to the hls::matrix_multiply_top function as a
template parameter as follows:
hls::matrix_multiply_top(A,B,C);
The hls::matrix_multiply function uses the following default configuration:
hls::matrix_multiply(A,B,C);
If you select ARCH 2, the RESOURCE directive is applied to the sum_mult array in function
hls::matrix_multiply_alt2 as follows:
set_directive_resource -core RAM_S2P_LUTRAM "matrix_multiply_alt2" sum_mult
Examples
The following table shows example implementation solutions for the matrix multiply
function. The performance metrics are generated using the Matrix Multiply Float
and Matrix Multiply Fixed example projects, which define a solution for each
implementation target. The throughput and latency values are based on post-synthesis
simulation.
The example projects use the following specifications:
• A and B inputs: 8x8 complex matrices
• Synthesis wrapper: Local arrays for the input and output matrix
• Device: Kintex-7 (xc7k160tfbg484-1)
• Nominal clock period: 4 ns
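Table 2-32: Matrix Multiply Implementation Targets
Column groups: Key Factor (Architecture through RESOURCE directive), Resources (DSP through LUT), Performance Metric (Throughput and Latency).
| Solution | Data type | Architecture (ARCH) | Inner loop pipelining (INNER_II) | Inner loop unrolling (INNER_UNROLL) | RESOURCE directive | DSP | BRAM | FF | LUT | Throughput cycles | Latency cycles |
| small | Fixed | 0 | 4 | 1 | N/A | 4 | 6 | 693 | 509 | 2194 | 2194 |
| default | Fixed | 0 | 1 | 1 | N/A | 4 | 6 | 627 | 491 | 659 | 659 |
| fast | Fixed | 2 | 1 | 4 | ✓ | 16 | 2 | 3164 | 1369 | 401 | 401 |
| faster | Fixed | 4 | N/A | N/A | N/A | 32 | 2 | 2869 | 713 | 210 | 210 |
| small | Float | 2 | 4 | 1 | N/A | 5 | 9 | 1401 | 948 | 2217 | 2217 |
| default | Float | 2 | 1 | 1 | N/A | 20 | 10 | 3023 | 1993 | 683 | 683 |
| fast | Float | 2 | 1 | 4 | ✓ | 40 | 2 | 7842 | 4885 | 425 | 425 |
| faster | Float | 3 | N/A | N/A | N/A | 156 | 2 | 21680 | 12506 | 251 | 251 |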
Notes:
1. Bold row indicates the default configuration.
2. N/A indicates key factors that are not utilized or have a limited effect.
3. Values are representative only and are not intended to be exact.
QRF
Implementation Controls
The following table summarizes the key factors that influence resource utilization, function
throughput (initiation interval), and function latency. The values of Low, Medium, and High
are relative to the other key factors.
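Table 2-33: QRF Key Factor Summary
| Key Factor | Value | Resources | Throughput | Latency |
| Q and R update loop pipelining (UPDATE_II) | 2 | High | High | Low |
| Q and R update loop pipelining (UPDATE_II) | >2 | Low | Low | High |
| Q and R update loop unrolling (UNROLL_FACTOR) | 1 | Low | Low | High |
| Q and R update loop unrolling (UNROLL_FACTOR) | >1 | High | High | Low |
| Rotation loop pipelining (CALC_ROT_II) | 1 | High | High | Low |
| Rotation loop pipelining (CALC_ROT_II) | >1 | Low | Low | High |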
Key Factors
Following is additional information about the key factors in the preceding table:
• Q and R update loop pipelining
  ° 2: Sets the minimum achievable initiation interval (II) of 2, which satisfies the Q and R matrix array requirement of two writes every iteration of the update loop.
  ° >2: Enables Vivado HLS to share resources further and reduce DSP utilization. With complex floating-point data types, setting the value to 4 or 8 significantly reduces DSP utilization.
• Q and R update loop unrolling
  ° Duplicates the hardware required to implement the loop processing by a specified factor and executes the corresponding number of loop iterations in parallel. This increases throughput but also increases DSP and memory utilization.
• Rotation loop pipelining
  ° Enables Vivado HLS to share resources and reduce DSP utilization.
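The two-writes-per-iteration requirement of the update loop can be pictured with a rotation-based update. The following is a minimal sketch that assumes a Givens-rotation formulation on real data, which this section does not state explicitly. Rotation, calc_rotation, and apply_rotation are hypothetical names and not library functions.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical sketch of one Givens rotation step on real data.
// The pair (c, s) is chosen so that the element b is zeroed when the
// two rows are combined.
struct Rotation { double c; double s; };

Rotation calc_rotation(double a, double b) {
    const double r = std::hypot(a, b);
    if (r == 0.0) return {1.0, 0.0};  // nothing to rotate
    return {a / r, b / r};
}

// Update loop: every iteration reads one element from each row and
// writes one element back to each row, that is, two writes per iteration.
template <int N>
void apply_rotation(const Rotation& g, double (&top)[N], double (&bottom)[N]) {
    for (int k = 0; k < N; ++k) {
        const double t = top[k];
        const double u = bottom[k];
        top[k]    =  g.c * t + g.s * u;  // first write
        bottom[k] = -g.s * t + g.c * u;  // second write
    }
}
```

In this picture, calc_rotation corresponds to the rotation calculation and apply_rotation to the Q and R update, so the pipelining and unrolling factors above would apply to the loop inside apply_rotation.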
Specifications
You can specify all factors using a configuration class derived from the following
hls::qrf_traits base class by redefining the appropriate class member:
struct MY_CONFIG : hls::qrf_traits{
    static const int CALC_ROT_II   = 4;
    static const int UPDATE_II     = 4;
    static const int UNROLL_FACTOR = 2;
};
The configuration class is supplied to the hls::qrf_top function as a template parameter
as follows:
hls::qrf_top(A,Q,R);
The hls::qrf function uses the following default configuration:
hls::qrf(A,Q,R);
Examples
The following table shows example implementation solutions for the QRF function. The
performance metrics are generated using the QRF example project, which defines a solution
for each implementation target. The throughput and latency figures are based on
post-synthesis simulation.
The example project uses the following specifications:
• A input: 16x16 floating-point complex matrix
• Synthesis wrapper: Local arrays for the input and output matrix
• Device: Kintex-7 (xc7k160tfbg484-1)
• Nominal clock period: 4 ns
Table 2-34: QRF Implementation Targets
Column groups: Key Factor (CALC_ROT_II through UNROLL_FACTOR), Resources (DSP through LUT), Performance Metric (Throughput and Latency).
| Solution | Rotation loop pipelining (CALC_ROT_II) | Q and R update loop pipelining (UPDATE_II) | Q and R update loop unrolling (UNROLL_FACTOR) | DSP | BRAM | FF | LUT | Throughput cycles | Latency cycles |
| small | 8 | 8 | N/A | 23 | 14 | 12252 | 9203 | 25620 | 25620 |
| balanced | 1 | 4 | 1 | 54 | 14 | 32624 | 16825 | 16746 | 16746 |
| fast | 1 | 2 | 1 | 90 | 14 | 36396 | 19764 | 13116 | 13116 |
| faster | 1 | 2 | 2 | 162 | 22 | 46004 | 27043 | 11180 | 11180 |
Notes:
1. Bold row indicates the default configuration.
2. N/A indicates key factors that are not utilized or have a limited effect.
3. Values are representative only and are not intended to be exact.
SVD
Implementation Controls
The following table summarizes the key factors that influence resource utilization, function
throughput (initiation interval), and function latency. The values of Low, Medium, and High
are relative to the other key factors.
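Table 2-35: SVD Key Factor Summary
| Key Factor | Value | Resources | Throughput | Latency |
| ALLOCATION directive (vm2x1_base limit) | 1 | Low | Low | High |
| ALLOCATION directive (vm2x1_base limit) | >1 | High | High | Low |
| Off-diagonal loop pipelining (OFF_DIAG_II) | 4 | High | High | Low |
| Off-diagonal loop pipelining (OFF_DIAG_II) | >4 | Low | Low | High |
| Diagonal loop pipelining (DIAG_II) | 1 | High | High | Low |
| Diagonal loop pipelining (DIAG_II) | >1 | Low | Low | High |
| Iterations (NUM_SWEEPS) | <10 | N/A | High | Low |
| Reciprocal Square Root operator | Combined operator | Medium | High | Low |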
Key Factors
Following is additional information about the key factors in the preceding table:
• ALLOCATION directive
  ° Limits the number of implemented 2x1 vector dot products. Vivado HLS schedules the SVD function to use the specified number of 2x1 vector dot product kernels.
  Note: The SVD algorithm is computationally intensive, particularly for complex data types. The ALLOCATION directive is the most effective method to balance resource utilization and throughput.
• Off-diagonal loop pipelining
  ° 4: Sets the minimum achievable initiation interval (II) of 4, which satisfies the S, U, and V array requirement of four writes every iteration of the off-diagonal loop.
  ° >4: Enables Vivado HLS to share resources further and reduce DSP utilization.
• Diagonal loop pipelining
  ° >1: Enables Vivado HLS to share resources.
• Iterations
  The SVD function uses the iterative two-sided Jacobi method.
  ° 10: Sets the default number of iterations.
  ° <10: Maximizes the function throughput by setting the minimum number of iterations that meets the desired performance.
• Reciprocal Square Root operator
  ° Ensures a much lower latency than the discrete operators.
  Note: By default, Vivado HLS does not use the combined rsqrt operator but uses discrete divide and sqrt operators. Selecting the -unsafe_math_optimizations compiler option enables the use of the rsqrt operator.
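The trade-off behind the iteration count can be seen in a simplified Jacobi iteration. The following sketch is an assumption-laden illustration, not the library's SVD: it handles only a small real symmetric matrix, where the same rotation is applied on both sides, and the names DIM, off_diagonal_energy, rotate_pair, and jacobi_sweeps are hypothetical.

```cpp
#include <cassert>
#include <cmath>

const int DIM = 3;  // small fixed size for the sketch

// Sum of squares of the off-diagonal entries; zero when A is diagonal.
double off_diagonal_energy(const double (&A)[DIM][DIM]) {
    double e = 0.0;
    for (int i = 0; i < DIM; ++i)
        for (int j = 0; j < DIM; ++j)
            if (i != j) e += A[i][j] * A[i][j];
    return e;
}

// One two-sided rotation, A <- J^T * A * J, chosen to zero A[p][q].
void rotate_pair(double (&A)[DIM][DIM], int p, int q) {
    assert(p < q);
    if (A[p][q] == 0.0) return;
    const double theta = (A[q][q] - A[p][p]) / (2.0 * A[p][q]);
    const double t = (theta >= 0.0 ? 1.0 : -1.0) /
                     (std::fabs(theta) + std::sqrt(theta * theta + 1.0));
    const double c = 1.0 / std::sqrt(t * t + 1.0);
    const double s = t * c;
    for (int k = 0; k < DIM; ++k) {  // right multiplication: columns p, q
        const double akp = A[k][p], akq = A[k][q];
        A[k][p] = c * akp - s * akq;
        A[k][q] = s * akp + c * akq;
    }
    for (int k = 0; k < DIM; ++k) {  // left multiplication: rows p, q
        const double apk = A[p][k], aqk = A[q][k];
        A[p][k] = c * apk - s * aqk;
        A[q][k] = s * apk + c * aqk;
    }
}

// One sweep visits every off-diagonal pair once. num_sweeps plays the
// role of the iteration count discussed above.
void jacobi_sweeps(double (&A)[DIM][DIM], int num_sweeps) {
    for (int sweep = 0; sweep < num_sweeps; ++sweep)
        for (int p = 0; p < DIM - 1; ++p)
            for (int q = p + 1; q < DIM; ++q)
                rotate_pair(A, p, q);
}
```

Each additional sweep reduces the remaining off-diagonal energy at the cost of more cycles, so a smaller sweep count trades accuracy for throughput.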
Specifications
You can apply the ALLOCATION directive to the hls::svd_pairs function in combination
with the INLINE directive as follows:
set_directive_inline -off "vm2x1_base"
set_directive_allocation -limit 1 -type function "svd_pairs" vm2x1_base
You can select the -unsafe_math_optimizations compiler option as follows:
config_compile -unsafe_math_optimizations
You can specify all other factors using a configuration class derived from the following
hls::svd_traits base class by redefining the appropriate class member:
struct MY_CONFIG : hls::svd_traits{
    static const int NUM_SWEEPS  = 6;
    static const int DIAG_II     = 4;
    static const int OFF_DIAG_II = 4;
};
The configuration class is supplied to the hls::svd_top function as a template parameter
as follows:
hls::svd_top