TY - JOUR
T1 - Coarse Grained FPGA Overlay for Rapid Just-In-Time Accelerator Compilation
AU - Jain, Abhishek Kumar
AU - Maskell, Douglas L
AU - Fahmy, Suhaib Ahmed
N1 - KAUST Repository Item: Exported on 2021-10-05
PY - 2021
Y1 - 2021
AB - Coarse-grained FPGA overlays built around the runtime-programmable DSP blocks in modern FPGAs can achieve high throughput and improved scalability compared to traditional overlays built without detailed consideration of FPGA architecture. Higher-level compilers can target these overlays, achieving fast compilation, software-like programmability and runtime management, and high-level design abstraction. OpenCL allows programs running on a host computer to launch accelerator kernels that can be compiled at runtime for a specific architecture, thus enabling portability. However, prohibitive hardware compilation times in traditional design flows mean that the tools cannot effectively use just-in-time (JIT) compilation or runtime performance scaling on FPGAs. We present an architecture-optimised FPGA overlay that exploits the capabilities of DSP blocks to maximise throughput, and an associated design methodology for runtime compilation of dataflow graphs expressed as OpenCL kernels onto the overlays. The methodology benefits from the high level of abstraction afforded by the OpenCL programming model, while the mapping to the overlay significantly reduces compilation and load times. Key characteristics of this work include highly performant DSP-optimised functional units that scale to large overlays on modern devices and the ability to perform automatic resource-aware kernel replication up to the size of the overlay for performance scaling. We demonstrate place-and-route times orders of magnitude better than traditional HLS flows, even when place and route runs on an embedded processor in the Xilinx Zynq.
UR - http://hdl.handle.net/10754/672090
UR - https://ieeexplore.ieee.org/document/9555373/
U2 - 10.1109/TPDS.2021.3116859
DO - 10.1109/TPDS.2021.3116859
M3 - Article
SN - 2161-9883
SP - 1
EP - 1
JO - IEEE Transactions on Parallel and Distributed Systems
JF - IEEE Transactions on Parallel and Distributed Systems
ER -