Lab Instructions

Environment Setup

bash

cp -rp /mnt/hld_fpga-sysc $HOME

export HLD_ROOT=$HOME/hld_fpga-sysc/

export SC_DIR=/opt/systemc/2.3.1

export GTEST_DIR=/opt/googletest/googletest

export PYTHONPATH=.:$HLD_ROOT/scripts/systemc-gen

WIKI

• firefox https://github.com/intel/rapid-design-methods-for-developing-hardware-accelerators/wiki

• or firefox $HLD_ROOT/docs/wiki/index.html

VCS and Quartus setup:

source /mnt/harpv2/env.sh

Install AAL SDK (if not done previously):

cp /mnt/harpv2/SR-5.0.3-Release.tar.gz $HOME

tar xzvf SR-5.0.3-Release.tar.gz

cd SR-5.0.3-Release/Base/SW

tar xzvf aalsdk-5.0.3.BSD-License.tar.gz

cd aalsdk-5.0.3

./configure --prefix=$HOME/SR-5.0.3-Release/Base/SW/install

make; make install

cd ../install

export AALSDK=`pwd`

Install MPF (memory layer building block required to use virtual addressing in HW):

cp /mnt/harpv2/harpv2_mpf.tar.gz $HOME;

cd $HOME; tar xzvf $HOME/harpv2_mpf.tar.gz

cd harpv2_mpf/BBB_cci_mpf/sw; make prefix=$AALSDK

export CCI_MPF_SW_ROOT=`pwd`

Compile FPGA APP SW (common SW driver for HLD apps)

cd $HLD_ROOT/fpga_sw/5.0.3/AcclAppVtp/

make prefix=$AALSDK

Lab Overview

This lab walks through implementing Floyd's cycle detection algorithm on a Xeon/FPGA platform (based on the SR-5.0.3 SDK).

Input: SW implementation of the algorithm

Output:

− Generated RTL for FPGA

− Refactored SW that offloads the computation to FPGA

− SW-ASE simulation running

Please keep the wiki handy (firefox https://github.com/intel/rapid-design-methods-for-developing-hardware-accelerators/wiki).

Every lab has a solution in the labN-solution folder. Each lab is independent.

Algorithm

1. Use two pointers, a slow pointer and a fast pointer, to traverse a singly linked list at different speeds. The slow pointer (also called the Tortoise) moves one step forward while the fast pointer (also called the Hare) moves two steps forward. Start the Tortoise and the Hare at the first node of the list.

2. If the Hare reaches the end of the list, return: there is no loop in the list.

3. Otherwise, move the Hare one step forward.

4. If the Hare reaches the end of the list, return: there is no loop in the list.

5. Otherwise, move the Hare and the Tortoise one step forward.

6. If the Hare and the Tortoise point to the same node, return: the list contains a loop.

7. Otherwise, go back to step 2.

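cycle_detection.c

int cycle_detect(listnode_t *head) {
    listnode_t *fast, *slow;
    slow = head;
    fast = head;
    /* Loop until either pointer runs off the end of the list. */
    while (!(slow == NULL || fast == NULL)) {
        fast = fast->next;      /* Hare: first step */
        if (!fast)
            return 0;           /* reached the end: no cycle */
        if (fast == slow)
            return 1;           /* Hare caught the Tortoise: cycle */
        fast = fast->next;      /* Hare: second step */
        slow = slow->next;      /* Tortoise: one step */
    }
    return 0;
}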
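To try the routine on small inputs, a sketch along the following lines may help. The listnode_t layout (a single next pointer) and the link_nodes() helper are assumptions made for illustration; the lab's own list type and test data may differ.

```cpp
#include <cstddef>
#include <vector>

// Assumed node layout; the lab's real listnode_t may also carry a payload.
struct listnode_t {
    listnode_t *next;
};

// Same control flow as the cycle_detect() listing in this section.
int cycle_detect(listnode_t *head) {
    listnode_t *slow = head;
    listnode_t *fast = head;
    while (!(slow == NULL || fast == NULL)) {
        fast = fast->next;
        if (!fast) return 0;
        if (fast == slow) return 1;
        fast = fast->next;
        slow = slow->next;
    }
    return 0;
}

// Hypothetical helper: link nodes[0..n-1] into a chain. If loop_to >= 0 the
// last node points back to nodes[loop_to]; otherwise the chain ends in NULL.
listnode_t *link_nodes(std::vector<listnode_t> &nodes, int loop_to) {
    if (nodes.empty()) return NULL;
    for (std::size_t i = 0; i + 1 < nodes.size(); ++i)
        nodes[i].next = &nodes[i + 1];
    nodes.back().next = (loop_to >= 0) ? &nodes[loop_to] : NULL;
    return &nodes[0];
}
```

For example, a five-node chain built with link_nodes(nodes, -1) should be reported as acyclic, while link_nodes(nodes, 2) closes a loop from the last node back to the third one.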

Lab1: Partition SW following SW/HW Interface

The original SW resides in the file $HLD_ROOT/labs/linkedlist-lab/cycle_detect.c. You can compile it using the “make” command. The original code has to be refactored to conform to the SW/HW interface specified by the IFpgaApp C++ interface ($HLD_ROOT/common/fpga_app_if.h).

We provide the FpgaSWAlloc class ($HLD_ROOT/common/fpga_app_sw.h), which implements the alloc(), join() and free() methods for simulation in software, so only the compute() method has to be implemented.

We will create a Config struct for this application with address offsets to the head of the list and to the result. The Config object is passed to the compute() method. This is the only way to pass initial data from SW to the future HW. All other communication between SW and HW is carried out over shared memory. Thus, Config often holds pointers into shared memory to let the HW know where to read data from and where to write data to.
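The division of labor described above can be sketched as follows: a small config names two locations in a shared buffer, and the compute step touches nothing but that buffer. Every name here (ExampleConfig, ExampleNode, kEndOfList, example_compute, run_example) is hypothetical, and treating the config fields as byte offsets from the buffer base is an assumption; the real definitions live in the lab's Config.h.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// All names below are hypothetical; the lab's Config.h defines the real ones.
const std::uint64_t kEndOfList = ~std::uint64_t(0);  // assumed end marker

struct ExampleNode {
    std::uint64_t next_offset;  // byte offset of the next node, or kEndOfList
};

struct ExampleConfig {
    std::uint64_t head_offset;    // byte offset of the first node
    std::uint64_t result_offset;  // byte offset of the 64-bit result word
};
static_assert(sizeof(ExampleConfig) * 8 == 128, "two 64-bit fields");

// Read the next-node offset stored at a given offset in shared memory.
static std::uint64_t load_next(const unsigned char *shared, std::uint64_t off) {
    ExampleNode n;
    std::memcpy(&n, shared + off, sizeof(n));
    return n.next_offset;
}

// Stand-in for a compute() body: the config is the only direct input,
// everything else is read from and written to the shared buffer.
void example_compute(unsigned char *shared, const void *config_ptr,
                     unsigned int config_size) {
    ExampleConfig cfg;
    assert(config_size == sizeof(cfg));
    std::memcpy(&cfg, config_ptr, sizeof(cfg));

    std::uint64_t slow = cfg.head_offset;
    std::uint64_t fast = cfg.head_offset;
    std::uint64_t found = 0;
    while (slow != kEndOfList && fast != kEndOfList) {
        fast = load_next(shared, fast);
        if (fast == kEndOfList) break;
        if (fast == slow) { found = 1; break; }
        fast = load_next(shared, fast);
        slow = load_next(shared, slow);
    }
    std::memcpy(shared + cfg.result_offset, &found, sizeof(found));
}

// Hypothetical driver: lay out n >= 1 nodes back to back, followed by the
// result word. loop_to < 0 means the list is not cyclic.
std::uint64_t run_example(int n, int loop_to) {
    std::vector<unsigned char> shared(n * sizeof(ExampleNode) + 8, 0);
    for (int i = 0; i < n; ++i) {
        ExampleNode node;
        if (i + 1 < n)
            node.next_offset = (i + 1) * sizeof(ExampleNode);
        else if (loop_to < 0)
            node.next_offset = kEndOfList;
        else
            node.next_offset = loop_to * sizeof(ExampleNode);
        std::memcpy(&shared[i * sizeof(ExampleNode)], &node, sizeof(node));
    }
    ExampleConfig cfg;
    cfg.head_offset = 0;
    cfg.result_offset = n * sizeof(ExampleNode);
    example_compute(shared.data(), &cfg, sizeof(cfg));
    std::uint64_t result;
    std::memcpy(&result, &shared[cfg.result_offset], sizeof(result));
    return result;
}
```

Working with offsets rather than raw pointers keeps the data layout meaningful to a consumer that sees the same buffer at a different address, which is the situation the text describes for the future HW.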

Go to $HLD_ROOT/labs/linkedlist-lab/lab1.

The Config structure is defined in Config.h. The SW code is refactored to use the IFpgaApp interface in cycle_detection_sw.cpp. The code that emulates HW and implements the compute() method is located in cycle_detection_hw_emul.h. The cycle_detect() function from the original SW is the one we offload in this lab.

LAB INSTRUCTIONS:

- Implement the compute() function by processing the config and redirecting the call to the cycle_detect() function you moved from the original SW

- Hint1: Look at the Config struct methods, in particular the ones that return pointers to the linked list head and to the result

- When done, run “make” and then “cycle_detection_hw_emul” to simulate the refactored SW

Note: The refactored SW code (cycle_detection_sw.cpp) will be the driver reused across different execution models, including the physical platform. The ifdefs below include a different implementation of the AcclApp class depending on the model.

#ifdef SC_KERNEL
#include "cycle_detection_hls_tb.h"
#elif SC_ACC
#include "cycle_detection_acc_tb.h"
#elif FPGA_ACC
#include "AcclApp.h"
#else
#include "cycle_detection_hw_emul.h"
#endif

fpga_app_if.h

class IFpgaApp {
public:
    virtual void *alloc(unsigned long long size_in_bytes) = 0;
    virtual void compute(const void *config_ptr, const unsigned int config_size) = 0;
    virtual void join() = 0;
    virtual void free() = 0;
};

Lab2: Python DSL and SystemC Accelerator Code Generation

Most of the accelerator SystemC code and testbench is automatically generated. The user needs to specify the memory ports needed for the computation and their types. The compute kernel module (cycle_detection_hls) will be generated with ports corresponding to the memory port selection. The complete accelerator unit (cycle_detection_acc) will be generated with load/store units corresponding to the selected memory port types, as shown in the figure.

Memory port specifications are captured with a Python DSL that we provide. The DSL captures the following about a design:

− DUT name (the generated files will use the DUT name as their base name)

− user types that will be used as the payload for the memory ports

− memory inputs and outputs, their types, and parameters for the load/store units

− extra config fields (by default, the config holds the base address of the corresponding data in memory for every memory port)

− SystemC thread specification, memory port mapping to threads, and module hierarchy

Go to lab2.

Open dut_params.py to see the DUT name and user types specified in the Python DSL.

LAB INSTRUCTIONS:

− In the dut_params.py file, specify two user memory ports: a read port for type Node (name it “inp”) and a write port for type CycleInfoExist (name it “out”).

− Hint1: Follow the Python Interface Spec API from the wiki, using the TypedRead and TypedWrite types respectively. Make sure you read about the max_burst_count and buf_size_in_cl parameters. You need to pick values for them at this point, but you will be able to change them later.

− Generate all SystemC collaterals following the CHEATSHEET

− make MODE=kernel (the simulation should hang, as there is no functionality yet)

Note: multiple simulation collaterals were generated. We provide testbenches for simulation at the kernel level (cycle_detection_hls) and at the accelerator level (cycle_detection_acc). The different simulation modes are selected by the MODE parameter passed to the Makefile.

Lab3: Complete Cycle Detect Functionality in SystemC

Go to lab3. In the generated files, e.g. cycle_detection_hls.h, you will see cog-related comments. They allow the files to be regenerated (preserving user modifications) when changes to dut_params are required. These comments can also be removed by passing the -d flag to the cog scripts. Open cycle_detection_hls.h:

− The detection() function is the CTHREAD SystemC process that implements the cycle detection algorithm

− Note the special structure of any CTHREAD process: a reset section before the while(1) loop, the while(1) loop with the main functionality, and wait() statements

− See how the memory port bundles (inpReqOut, inpRespIn) and (outReqOut, outDataOut) are used to get the next Node in the linked list and to write CycleExistInfo, respectively

LAB INSTRUCTIONS:

− Complete the code that issues memory requests to fetch the next Node in the linked list and updates the slow and fast offsets

− Hint1: see how fast_offset gets updated in the same function

− When done, build with “make MODE=kernel” and then run the test: “./cycle_detection_kernel”

Lab 4: Measuring Memory BW

The cycle_detection_acc module is fully defined by the generated file cycle_detection_acc.h. The testbench harness for this block is instantiated in cycle_detection_acc_tb.h. To compile the testbench, run:

− make MODE=acc

Run it to see the achievable memory BW: ./cycle_detection_acc

LAB INSTRUCTIONS:

− In cycle_detection_acc_tb.h, change the frequency of the design to 200 MHz and observe the changes in the memory BW report

− In cycle_detection_acc_tb.h, change the memory latency to 500 ns and observe the changes in the memory BW report

− After every change, recompile with “make MODE=acc” and rerun

Note: In this simulation, we use a mock memory model that simply models the latency and bandwidth of memory. Mock memory is useful for getting quick feedback on design issues such as the one seen in the cycle detection algorithm, which cannot utilize the available memory bandwidth.

To help saturate memory BW, multi-threading or multiple AUs are often required:

− For cases where increasing the AU frequency does not help utilize the BW, multi-threading inside one AU will help

− For cases where memory BW scales proportionally to the AU frequency, multiple AUs will help (we provide a multi_au_template to facilitate multiple AU integration)
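A rough way to see why a pointer-chasing design is latency-bound is the first-order model sketched below. It is not the lab's mock memory implementation: the function names, the 64-byte request size used in the example and the "cycles of work per hop" parameter are all assumptions chosen for illustration.

```cpp
// Hypothetical first-order model, not the lab's mock memory implementation.
// Each hop issues one request and must wait for its response before the next
// address is known, so one thread moves bytes_per_req every hop_time_ns.
double hop_time_ns(double mem_latency_ns, double cycles_per_hop,
                   double freq_mhz) {
    double clock_period_ns = 1000.0 / freq_mhz;
    return mem_latency_ns + cycles_per_hop * clock_period_ns;
}

// Bytes per nanosecond equals GB/s (10^9 bytes per second). The threads
// argument models independent traversals that overlap their waiting time.
double chase_bw_gbps(double bytes_per_req, double mem_latency_ns,
                     double cycles_per_hop, double freq_mhz, int threads) {
    return threads * bytes_per_req /
           hop_time_ns(mem_latency_ns, cycles_per_hop, freq_mhz);
}
```

With 64-byte requests, 500 ns of latency and an assumed 4 cycles of work per hop, doubling the clock from 200 MHz to 400 MHz shortens a hop only from 520 ns to 510 ns in this model, whereas eight independent threads scale the bandwidth by eight.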

Lab 5: HLS Flow and generated RTL integration

We synthesize the cycle_detection_acc module with HLS, using the C-to-Silicon Compiler from Cadence®. The script for the synthesis tool is provided in the lab5/ctos directory. The first synthesis run takes some time because it characterizes all components by running gate-level synthesis for them. The characterization is cached in the rc_work directory for subsequent runs.

The generated RTL is located under the lab5/afu directory.

For the automatic flow to integrate the generated RTL, extra information about the design is needed. It is captured in the hld_defines.v file, which is also located under the afu directory.

− HLD_ACC_CLK determines the frequency of the accelerator

− HLD_AFU_ID* is needed for SW to locate the accelerator service on the FPGA

− HLD_APP_CONFIG_WIDTH is the SW config size in bits

− HLD_AFU_MODULE_NAME is the name of the top Verilog module in the generated RTL

See the User Guide, section CCIP integration, for more details.

hld_defines.v

`define HLD_MEM_RD_PORTS 1
`define HLD_MEM_WR_PORTS 1
// in MHz: CLK_400=3'b000, CLK_200=3'b001, CLK_100=3'b010
`define HLD_ACC_CLK CLK_200
// ---- REGENERATE THE IDS BELOW FOR YOUR AFU -----
// xxd -l 8 -p /dev/random
`define HLD_AFU_ID_H 64'h30b1cf9b9bee84e7
`define HLD_AFU_ID_L 64'hfd2e242e7efcb9d8
// in bits
`define HLD_APP_CONFIG_WIDTH 128
`define HLD_AFU_MODULE_NAME cycledetection_acc_rtl
// dependent on the number of ACC buffers
`define HLD_REQ_ASYNC_FIFO_LOG2DEPTH 4
`define HLD_RESP_ASYNC_FIFO_LOG2DEPTH 4

Next, we will run RTL + SW simulation using the ASE environment.

LAB INSTRUCTIONS:

ASE with Verilog simulation runs in a separate shell. In the new shell, source the VCS and Quartus setup script, and set the HLD_ROOT and AALSDK variables according to the setup instructions (see the first section of this document).

Launching ASE RTL simulation:

− Go to SR-5.0.3-Release/Base/SW/aalsdk-5.0.3/ase

− Use scripts/generate_ase_environment.py to create vlog_files.list (pass the location of lab5/afu as the RTL source):

./scripts/generate_ase_environment.py $HLD_ROOT/labs/linkedlist-lab/lab5/afu/ -t VCS

− Replace the contents of vlog_files.list with the following (change harpX-user to match your home directory!):

-F /home/harpX-user/harpv2_mpf/BBB_cci_mpf/hw/sim/cci_mpf_sim_addenda.txt
-F /home/harpX-user/hld_fpga-sysc/rtl/5.0.3/ase_rtl_common_files.txt
+incdir+/home/harpX-user/hld_fpga-sysc/labs/linkedlist-lab/lab5/afu
/home/harpX-user/hld_fpga-sysc/labs/linkedlist-lab/lab5/afu/cycledetection_acc_rtl.v

− You also need to edit the Makefile to add the MPF_PLATFORM define:

SNPS_VLOGAN_OPT+= +define+MPF_PLATFORM_BDX

− Run: make;make sim

Running SW for FPGA:

− Export ASE_WORKDIR in the original shell, following the instructions printed on the screen after you ran “make sim”

− In lab5: make MODE=fpga prefix=$AALSDK

− ./cycle_detection_fpga

The simulation should run and produce output in both shell windows.

Lab 6: Performance Simulation

The performance simulator is built on top of the ASE environment. We provide a script to patch the default ASE code (you may want to create a backup copy of the original ASE first). The patched ASE runs in performance mode by default. You can enable functional-only mode (the original ASE mode) by setting the ASE_PERF flag to 0 in the Makefile.

LAB INSTRUCTIONS:

In ASE shell:

pushd $HLD_ROOT/ase_perf

./patch-ase.csh $HOME/SR-5.0.3-Release/Base/SW/aalsdk-5.0.3/ase

popd

Recompile and launch ASE

− make;make sim

Running SW for FPGA:

− In lab5: make MODE=fpga

− ./cycle_detection_fpga

After the test finishes, a timing report like the one below is printed in the ASE shell (you may need to press Ctrl-C in the ASE shell if it was started in the mode that runs multiple tests).

================== Stats of QA_FPGA_CACHE =================

Cache Size = 1024

Associativity = 1

Line Size = 1

Replacement Policy = 2

Word Size = 1

*mTotalWordUsed = 35 Words

*mTotalWordReadFromLLM = 55 Words

Word Waste Ratio = 0.363636

Base Address = 0

Total MSHR Hit Count = 0

Total Cache Access = 1375

Total Cache Miss = 59

Miss Rate = 0.0429091

Total Cache Pipeline Hazards = 27

Total Read Miss = 38

Total Write Miss = 21

Total Eviction = 18

================== CCI-P Stats =================

Total Data Accessed = 88000 Bytes

Min Latency of VL0 Access = 12500 ps

Max Latency of VL0 Access = 505 ns

Avg Latency of VL0 Access = 33811 ps

Total Requests at CCI-P = 1375

Total Bandwidth at CCI-P interface = 0.35329 GB/sec

Total Read Bandwidth at CCI-P interface = 0.345581 GB/sec

Total Write Bandwidth at CCI-P interface = 0.00770813 GB/sec

Total VL0 Bandwidth = 0.35329 GB/sec

Total VH0 Bandwidth = -0 GB/sec

Total VH1 Bandwidth = -0 GB/sec

================== Stat of DRAM =================

Avg Latency of a DRAM Request = 85 ns

Total DRAM Requests = 23

ase_top_ccip_emulator_i_qa_fpga_mem_subsystem.mDRAM : Total DRAM Request = 23

ase_top_ccip_emulator_i_qa_fpga_mem_subsystem.mDRAM : DRAM BandWidth = 0.0463082 GB/sec
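Several derived fields in the sample report can be cross-checked against the raw counters. The sketch below recomputes three of them; the formulas are inferred from the printed numbers rather than taken from the ASE sources, so treat them as assumptions, and all names are made up for illustration.

```cpp
#include <cmath>

// Values copied from the sample report above.
const double kCacheAccesses = 1375, kCacheMisses = 59;
const double kReadMisses = 38, kWriteMisses = 21;
const double kWordsUsed = 35, kWordsRead = 55;
const double kCcipRequests = 1375, kBytesAccessed = 88000;

// Assumed definitions of the derived fields; they are inferred from the
// numbers in the report, not from the simulator's source code.
double miss_rate() { return kCacheMisses / kCacheAccesses; }
double word_waste_ratio() { return (kWordsRead - kWordsUsed) / kWordsRead; }
double bytes_per_request() { return kBytesAccessed / kCcipRequests; }
double read_plus_write_misses() { return kReadMisses + kWriteMisses; }

// True when a recomputed value agrees with the printed one within tol.
bool matches(double computed, double printed, double tol = 1e-6) {
    return std::fabs(computed - printed) < tol;
}
```

Under these assumed formulas the recomputed miss rate and word waste ratio agree with the printed values, the read and write misses add up to the total miss count, and the data volume works out to 64 bytes per CCI-P request.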
