Feb 2020: I am currently looking for work in the San Francisco Bay Area in RF & Optical Communications. Contact me on LinkedIn
Part 2 and Part 3 of this blog described an incremental procedure for improving the performance of an FPGA kernel. Here is a brief recap of the procedure for a sample <kernel_name>:
dev/<category_name>/v_<kernel_name>
XCL_OPT_INDEX = 1: buffered memory between PS and PL.
XCL_OPT_INDEX = 2: dataflow/pipeline computation in the PL.
XCL_OPT_INDEX = 3: distribute computation across multiple CUs from the PS.
XCL_OPT_INDEX = 4: vectorize computation in the PL (XCL_CU = 1).
In this blog, we implemented the v_add kernel in the BinaryOpsKernel category.
Kernel (PL) Deployment
Kernels are deployed in aten/src/ATen/fpga/kernel/<category_name>/<kernel_name>, while kernel templates live in the aten/src/ATen/fpga/kernel/<category_name>/template directory. The kernel template is invoked from the top-level CMakeLists.txt file.
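For illustration, here is a minimal sketch of what a generated v_add PL kernel could look like; the argument types, bit widths, and pragma choices are assumptions for this example, not the actual template.

```cpp
// Hypothetical v_add PL kernel (Vitis HLS C++); the real template in
// aten/src/ATen/fpga/kernel/<category_name>/template is more general.
extern "C" void v_add(const int* a, const int* b, int* out, int n) {
#pragma HLS INTERFACE m_axi port=a   offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=b   offset=slave bundle=gmem1
#pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem0
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = a[i] + b[i];
    }
}
```

In the Vitis flow, a source file like this is compiled to a precompiled object (v_add.xo) with v++ -c and later linked into torch.xclbin, as discussed in the linking section below.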
Host (PS) Deployment
To maximize code re-use, we replace the run_kernel() function with the fpga_kernel() function, which has been generalized to support the following (see the sketch after this list):
TensorIterator& iter: A collection of tensors with the same shape, dtype and device.
std::vector<VecHost<T, LOG2BW>>& scalar_vec: A collection of vectorized scalars with the same dtype as iter.
Args&&... scalars: A variable number of arguments of any dtype.
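A possible shape for this interface is sketched below; the template parameters, the kernel_name argument, and the body comments are assumptions based on the description above, not the actual implementation.

```cpp
// Hypothetical signature for the generalized fpga_kernel(); apart from
// TensorIterator, VecHost, T and LOG2BW, the names are illustrative.
template <typename T, int LOG2BW, typename... Args>
void fpga_kernel(const std::string& kernel_name,
                 at::TensorIterator& iter,
                 std::vector<VecHost<T, LOG2BW>>& scalar_vec,
                 Args&&... scalars) {
    // 1. Look up the OpenCL kernel named kernel_name in torch.xclbin.
    // 2. Stage the tensors in iter and the vectorized scalars into device buffers.
    // 3. Pass the remaining scalars as plain kernel arguments.
    // 4. Enqueue the kernel and copy the result back into iter's output tensor.
}
```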
Host (PS) code is deployed in ; the BinaryArithmeticKernels are implemented below.
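Since the original listing is not reproduced here, the following is only a hypothetical illustration of how a binary op such as add might be routed through fpga_kernel(); the constant kLog2BW and the exact call shape are assumptions.

```cpp
// Hypothetical host (PS) dispatch of the add operation to the v_add PL kernel.
constexpr int kLog2BW = 5;  // assumed log2 of the PL data-path bit width

void add_kernel_fpga(at::TensorIterator& iter, const at::Scalar& alpha) {
  AT_DISPATCH_ALL_TYPES(iter.dtype(), "add_fpga", [&] {
    std::vector<VecHost<scalar_t, kLog2BW>> scalar_vec;  // no vectorized scalars needed here
    fpga_kernel<scalar_t, kLog2BW>("v_add", iter, scalar_vec,
                                   alpha.to<scalar_t>());
  });
}
```

The same pattern would repeat for sub, mul, div, and the unary ops, which is what makes the template re-use described next possible.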
All binary and unary point-wise tensor operations have been implemented from the same template with a few minor exceptions.
Link-Time Computational Graph (Future Work)
PyTorch traces the desired computational graph and uses a Just-In-Time (JIT) compiler to off-load the computation onto the selected processor. In Vitis, this is achieved during the v++ linking step, which links precompiled object files (*.xo) into a binary file (torch.xclbin). Unlike the CPU case, this involves reprogramming the PL on the FPGA, so the following workflow is proposed:
Unlike the CPU, which handles arbitrary-length tensors, the FPGA restricts tensor lengths according to design constraints (usually to a multiple of 2).
Host (PS) code is always compiled even if the kernel (PL) code is not. OpenCL creates a client-server relationship between the PS and PL and gracefully handles the case where a kernel does not exist (see the sketch after this list).
The PyTorch JIT must be reconfigured to export the computational graph to the Vitis v++ linker.
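As a rough illustration of the second point, the snippet below checks for a kernel with the standard OpenCL C++ bindings and falls back when it is absent; the surrounding names (program, try_fpga_kernel) are assumptions, not part of the actual host code.

```cpp
// Hypothetical graceful fallback when a kernel is missing from torch.xclbin.
#include <CL/cl2.hpp>

bool try_fpga_kernel(cl::Program& program, const char* kernel_name) {
    cl_int err = CL_SUCCESS;
    cl::Kernel kernel(program, kernel_name, &err);
    if (err != CL_SUCCESS) {
        // Kernel was never linked into torch.xclbin; the caller should fall
        // back to the CPU implementation instead of aborting.
        return false;
    }
    // ... set arguments and enqueue the kernel here ...
    return true;
}
```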