Dylan T Bespalko

PyTorch For FPGA Part 1: Heterogeneous Processing

Updated: Feb 24, 2020



Feb 2020: I am currently looking for work in the San Francisco Bay Area in RF & Optical Communications. Contact me on LinkedIn.


What is PyTorch?


PyTorch is an open-source machine-learning tensor library that allows contributors to dispatch math functions to multiple processors (devices) using multiple memory layouts and multiple data types (dtypes), as described in PyTorch Internals.
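As a minimal sketch (assuming a PyTorch build with CUDA support; the shapes are arbitrary), the tensor's device, layout, and dtype select which backend kernel runs:

    import torch

    # The (device, dtype) pair selects the backend kernel at runtime.
    a = torch.randn(4, 4, device="cpu", dtype=torch.float64)
    b = torch.randn(4, 4, device="cuda", dtype=torch.float16)  # requires a CUDA build

    # Sparse COO is an alternative memory layout for the same logical tensor.
    s = torch.eye(4).to_sparse()
    print(s.layout)  # torch.sparse_coo

    # The same Python expression dispatches to a CPU kernel for `a`
    # and a CUDA kernel for `b`.
    print(a + a)
    print(b + b)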

I have bolded the devices, layouts, and dtypes that I have worked on. Each device has its own incremental approach to squeezing more computations out of a processor at the cost of software complexity. Some of the optimizations I have encountered are summarized below:


 

What is Heterogeneous Processing?


I have bolded the optimization techniques that are best (or first) implemented on a given processor. The CPU has the fastest clock and mature multi-core processing. The GPU has a moderate clock speed and excels at simple parallelization using Single-Instruction, Multiple-Data (SIMD). The FPGA has the slowest clock, but can perform pipeline/dataflow digital signal processing (DSP) optimizations that stream data in and out of memory as results become available.
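Because PyTorch hides the processor behind the device argument, the same code can be prototyped on the CPU and then re-dispatched to an accelerator. A minimal sketch (the device string and problem size are placeholders):

    import torch

    def kernel(x):
        # An elementwise kernel: roughly one SIMD lane per element on the
        # GPU, multicore/vectorized loops on the CPU.
        return torch.sin(x) * x

    x = torch.randn(1 << 20)          # prototype on the CPU first
    y_cpu = kernel(x)

    if torch.cuda.is_available():
        y_gpu = kernel(x.to("cuda"))  # same code, explicit host-to-device copy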


When designing an algorithm, you typically prototype the solution on the CPU because it has the most comprehensive math library. When optimizing an algorithm, you are typically limited by one of two problems:

  1. Computation Unit (CU) bound: Limited by the speed of the processor.

  2. Input/Output (I/O) bound: Limited by the read/write speed between processors.

Processors that are not built into a System on Chip (SoC) are penalized because copying data between processors takes too long. Even if a given math kernel runs fastest on the GPU, the overall application may favor the FPGA when data-transfer bottlenecks in the hardware architecture dominate.
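To tell which case you are in, you can time the device copy and the kernel separately. A hedged sketch (assuming a CUDA device; torch.cuda.synchronize() is needed because GPU kernel launches are asynchronous):

    import time
    import torch

    x = torch.randn(1024, 1024)

    # I/O cost: host-to-device copy over the bus (avoided on an SoC).
    t0 = time.perf_counter()
    x_gpu = x.to("cuda")
    torch.cuda.synchronize()
    io_time = time.perf_counter() - t0

    # CU cost: the matrix-multiply kernel itself.
    t0 = time.perf_counter()
    y = x_gpu @ x_gpu
    torch.cuda.synchronize()
    cu_time = time.perf_counter() - t0

    # When io_time dominates, a slower on-chip device can beat a
    # faster discrete one for the overall application.
    print(f"copy: {io_time:.6f} s  compute: {cu_time:.6f} s")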


 

