Feb 2020: I am currently looking for work in the San Francisco Bay Area in RF & Optical Communications. Contact me on LinkedIn
4 a). Muti-Core For CU Bound Ops (XCL_OPT_INDEX = 3 # distributed)
The FPGA can also use multiple compute units (CUs) to achieve the equivalent of multi-processing. Unlike the CPU which has physical number of cores, the number of FPGA CUs is determined at compile time. The host code has been modified to dispatch the add kernel to multiple CUs. For pointwise functions such as add, multi-core parallelization is achieved by splitting the tensors sent to each CU using a different memory offset and stride.
Now you can build this code and test it as follows:
sw_emu:
hw_emu:
Although this multi-core processing uses 4x CU, we only get 2x performance because the add operation is so simple that we become I/O bound as shown in the timeline above. I should also note that this optimization multiplies the layout area and the number of memory ports.
4 b). SIMD For I/O Bound Ops (XCL_OPT_INDEX = 4 # vec)
If the computation is I/O bound we can take advantage of wide bit-width memory paralization to pass more data to the FPGA at once. Using Single-Instruction, Multiple-Data (SIMD), we can stream a wide-bit vector data-type called Vec<T, LOG2BW>, where
typename T: represents the data type.
unsigned int LOG2BW: is log2(Bit Width/sizeof(T)).
Eg. To create a 512 bit-width float vector we use Vec<float, 4>. Now lets modify the kernel code to support Vec<T, LOG2BW> data containers.
Vec<T, LOG2BW> is a generalization of at::vec::Vec256<T> in PyTorch which is hardcoded for 256 bit-width vectors. It is also very similar to WideType used in the Vitis Libraries. Now you can build this code and test it as follows:
sw_emu:
hw_emu:
These results are pretty strange. While I might expect a trade-off between I/O bound and CU-bound performance, changing the bit-width of the SIMD vector doesn’t produce a predictable result in the hardware emulation in Vitis 2019.2.
I have also noticed that Muli-Core (XCL_CU > 1) and SIMD (Vec<T, LOG2BW>) are mutually exclusive optimizations that are incompatible with each other. I will open an issue to track this.
ความคิดเห็น