Dylan T Bespalko

PyTorch For FPGA Part 4: Deploying PyTorch Kernels

Updated: Feb 24, 2020



Feb 2020: I am currently looking for work in the San Francisco Bay Area in RF & Optical Communications. Contact me on LinkedIn.


Part 2 and Part 3 of this blog described an incremental procedure for improving the performance of an FPGA kernel. Here is a brief recap of the procedure for a sample kernel_name:

dev/<category_name>/v_<kernel_name>

  • XCL_OPT_INDEX = 1: buffered memory between the PS (processing system) and the PL (programmable logic).

  • XCL_OPT_INDEX = 2: dataflow/pipeline computation in the PL.

  • XCL_OPT_INDEX = 3: distribute computation across multiple compute units (CUs) from the PS.

  • XCL_OPT_INDEX = 4: vectorize computation in the PL (XCL_CU = 1).

In this post, we deploy the v_add kernel from the BinaryOpsKernel category.
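For context, here is a minimal sketch of what the v_add PL kernel might look like at each optimization level, assuming a float datapath and simple pointer arguments; the actual signature, bundle names, and pragma placement in the repository may differ:

extern "C" void v_add(const float* a, const float* b, float* out, int n) {
#pragma HLS INTERFACE m_axi port=a bundle=gmem0    // level 1: buffered AXI transfers between PS and PL
#pragma HLS INTERFACE m_axi port=b bundle=gmem1
#pragma HLS INTERFACE m_axi port=out bundle=gmem0
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1                          // level 2: pipeline the loop body
        out[i] = a[i] + b[i];                      // level 4 would widen float to a vector type
    }
}

Level 3 (multiple CUs) does not appear in the kernel source; in Vitis it is typically configured at link time and exploited by the host.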


 


Kernel (PL) Deployment


Kernels are deployed in aten/src/ATen/fpga/kernel/<category_name>/<kernel_name>, while kernel templates live in the aten/src/ATen/fpga/kernel/<category_name>/template directory. Each kernel template is invoked from the top-level CMakeLists.txt file.
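To make the layout concrete, here is the same v_add body restated as a shared template plus a per-kernel instantiation; the header organization, template signature, and lambda-based op are assumptions for the sketch:

// Hypothetical shared template, e.g. in aten/src/ATen/fpga/kernel/<category_name>/template:
template <typename T, typename Op>
void binary_op(const T* a, const T* b, T* out, int n, Op op) {
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = op(a[i], b[i]);
    }
}

// Hypothetical per-kernel source, e.g. in .../BinaryOpsKernel/v_add:
extern "C" void v_add(const float* a, const float* b, float* out, int n) {
    binary_op(a, b, out, n, [](float x, float y) { return x + y; });
}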


 

Host (PS) Deployment


To maximize code re-use, we need to replace the run_kernel() function with the fpga_kernel() function, which has been generalized to support the following arguments (a signature sketch follows the list):

  1. TensorIterator& iter: A collection of tensors with the same shape, dtype and device.

  2. std::vector<VecHost<T, LOG2BW>>& scalar_vec: A collection of vectorized scalars with the same dtype as iter.

  3. Args&&... scalars: A variable number of trailing arguments of any dtype.
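A declaration matching this description might look roughly as follows; the template parameters and their ordering are assumptions, and only the three argument groups above come from the text:

// Hypothetical declaration of the generalized host-side entry point.
template <typename T, int LOG2BW, typename... Args>
void fpga_kernel(at::TensorIterator& iter,                      // 1. tensors sharing shape/dtype/device
                 std::vector<VecHost<T, LOG2BW>>& scalar_vec,   // 2. vectorized scalars matching iter's dtype
                 Args&&... scalars);                            // 3. arbitrary trailing scalars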

Host (PS) code is deployed in , and the BinaryArithmeticKernels are implemented as sketched below.
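Since the original listing did not survive, here is a hedged sketch of what an add stub might look like when routed through fpga_kernel; the function name, the LOG2BW constant, and the dispatch-macro usage are assumptions:

constexpr int LOG2BW = 9;  // hypothetical: log2 of a 512-bit AXI bus width

void add_kernel_fpga(at::TensorIterator& iter, at::Scalar alpha) {
    AT_DISPATCH_FLOATING_TYPES(iter.dtype(), "add_fpga", [&] {
        // add carries no vectorized scalars; alpha is forwarded as a trailing scalar
        std::vector<VecHost<scalar_t, LOG2BW>> scalar_vec;
        fpga_kernel<scalar_t, LOG2BW>(iter, scalar_vec, alpha.to<scalar_t>());
    });
}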

All binary and unary pointwise tensor operations have been implemented from the same template, with a few minor exceptions.


 

Link-Time Computational Graph (Future Work)


PyTorch traces the desired computational graph and uses a Just-In-Time (JIT) compiler to off-load the computation onto the selected processor. In Vitis, this is achieved during the v++ linking step, which links precompiled object files (*.xo) into a binary file (torch.xclbin). Unlike the CPU, this involves reprogramming the PL on the FPGA, so the following workflow is proposed:

  • Unlike the CPU, which handles arbitrary-length tensors, FPGA tensor lengths must be restricted based on design constraints (usually to a multiple of 2).

  • Host (PS) code is always compiled even if the kernel (PL) code is not. OpenCL creates a client-server relationship between the PS and PL and gracefully handles situations where the kernel does not exist (see the sketch after this list).

  • The PyTorch JIT must be reconfigured to export the computational graph to the Vitis v++ linker.
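To illustrate that graceful handling, the host could probe for a kernel before enqueueing work and fall back to the CPU path when the linked PL image lacks it. This is a minimal sketch using the standard OpenCL C++ bindings; the program variable and kernel name are assumptions:

#include <CL/cl2.hpp>

// Returns true if the named kernel was linked into the loaded binary
// (e.g. torch.xclbin); false means the host should fall back to the CPU path.
bool has_pl_kernel(const cl::Program& program, const char* name) {
    cl_int err = CL_SUCCESS;
    cl::Kernel kernel(program, name, &err);
    return err == CL_SUCCESS;
}

// Usage: if (!has_pl_kernel(program, "v_add")) { /* dispatch to the CPU implementation */ }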

