Feb 2020: I am currently looking for work in the San Francisco Bay Area in RF & Optical Communications. Contact me on LinkedIn
Part 2 and Part 3 of this blog described an incremental procedure for improving the performance of an FPGA kernel. Here is a brief recap of the procedure for a sample <kernel_name>:
dev/<category_name>/v_<kernel_name>
XCL_OPT_INDEX = 1: buffered memory between PS and PL.
XCL_OPT_INDEX = 2: dataflow/pipeline computation in the PL.
XCL_OPT_INDEX = 3: distribute computation across multiple CUs from the PS.
XCL_OPT_INDEX = 4: vectorize computation in the PL (XCL_CU = 1).
In this blog, we implemented the v_add kernel in the BinaryOpsKernel category.
Kernel (PL) Deployment
Kernels are deployed in aten/src/ATen/fpga/kernel/<category_name>/<kernel_name>, while kernel templates live in the aten/src/ATen/fpga/kernel/<category_name>/template directory. The kernel template is invoked from the top-level CMakeLists.txt file.
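For illustration, here is a minimal sketch of what a generated v_add PL kernel could look like; the argument types, bit widths, and pragma choices are assumptions for this example, not the actual template.

```cpp
// Hypothetical v_add PL kernel (Vitis HLS C++); the real template in
// aten/src/ATen/fpga/kernel/<category_name>/template is more general.
extern "C" void v_add(const int* a, const int* b, int* out, int n) {
#pragma HLS INTERFACE m_axi port=a   offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=b   offset=slave bundle=gmem1
#pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem0
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = a[i] + b[i];
    }
}
```

In the Vitis flow, a source file like this is compiled to a precompiled object (v_add.xo) with v++ -c and later linked into torch.xclbin, as discussed in the linking section below.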
Host (PS) Deployment
To maximize code re-use, we replace the run_kernel() function with the fpga_kernel() function, which has been generalized to support the following (see the sketch after this list):
TensorIterator& iter: A collection of tensors with the same shape, dtype and device.
std::vector<VecHost<T, LOG2BW>>& scalar_vec: A collection of vectorized scalars with the same dtype as iter.
Args&&... scalars: A variable number of arguments of any dtype.
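A possible shape for this interface is sketched below; the template parameters, the kernel_name argument, and the body comments are assumptions based on the description above, not the actual implementation.

```cpp
// Hypothetical signature for the generalized fpga_kernel(); apart from
// TensorIterator, VecHost, T and LOG2BW, the names are illustrative.
template <typename T, int LOG2BW, typename... Args>
void fpga_kernel(const std::string& kernel_name,
                 at::TensorIterator& iter,
                 std::vector<VecHost<T, LOG2BW>>& scalar_vec,
                 Args&&... scalars) {
    // 1. Look up the OpenCL kernel named kernel_name in torch.xclbin.
    // 2. Stage the tensors in iter and the vectorized scalars into device buffers.
    // 3. Pass the remaining scalars as plain kernel arguments.
    // 4. Enqueue the kernel and copy the result back into iter's output tensor.
}
```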
Host (PS) code is deployed in ; the BinaryArithmeticKernels are implemented below.
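Since the original listing is not reproduced here, the following is only a hypothetical illustration of how a binary op such as add might be routed through fpga_kernel(); the constant kLog2BW and the exact call shape are assumptions.

```cpp
// Hypothetical host (PS) dispatch of the add operation to the v_add PL kernel.
constexpr int kLog2BW = 5;  // assumed log2 of the PL data-path bit width

void add_kernel_fpga(at::TensorIterator& iter, const at::Scalar& alpha) {
  AT_DISPATCH_ALL_TYPES(iter.dtype(), "add_fpga", [&] {
    std::vector<VecHost<scalar_t, kLog2BW>> scalar_vec;  // no vectorized scalars needed here
    fpga_kernel<scalar_t, kLog2BW>("v_add", iter, scalar_vec,
                                   alpha.to<scalar_t>());
  });
}
```

The same pattern would repeat for sub, mul, div, and the unary ops, which is what makes the template re-use described next possible.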
All binary and unary point-wise tensor operations have been implemented from the same template with a few minor exceptions.
Link-Time Computational Graph (Future Work)
PyTorch traces the desired computational graph and uses a Just-In-Time (JIT) compiler to off-load the computation onto the selected processor. In Vitis, this is achieved during the v++ linking step, which links precompiled object files (*.xo) into a binary file (torch.xclbin). Unlike the CPU case, this involves reprogramming the PL on the FPGA, so the following workflow is proposed:
Unlike the CPU, which handles arbitrary-length tensors, the FPGA restricts tensor lengths according to design constraints (usually to a multiple of 2).
Host (PS) code is always compiled even if the kernel (PL) code is not. OpenCL creates a client-server relationship between the PS and PL and gracefully handles the case where a kernel does not exist (see the sketch after this list).
The PyTorch JIT must be reconfigured to export the computational graph to the Vitis v++ linker.
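As a rough illustration of the second point, the snippet below checks for a kernel with the standard OpenCL C++ bindings and falls back when it is absent; the surrounding names (program, try_fpga_kernel) are assumptions, not part of the actual host code.

```cpp
// Hypothetical graceful fallback when a kernel is missing from torch.xclbin.
#include <CL/cl2.hpp>

bool try_fpga_kernel(cl::Program& program, const char* kernel_name) {
    cl_int err = CL_SUCCESS;
    cl::Kernel kernel(program, kernel_name, &err);
    if (err != CL_SUCCESS) {
        // Kernel was never linked into torch.xclbin; the caller should fall
        // back to the CPU implementation instead of aborting.
        return false;
    }
    // ... set arguments and enqueue the kernel here ...
    return true;
}
```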