
Implementing Deep Learning into FPGAs: A Gentle Introduction to Architectures

Luis G. Leon-Vega

Accelerating Deep Learning with FPGAs in a nutshell

Some frameworks allow accelerating inference from a model or a graph. In particular, it is possible to accelerate small networks on FPGAs. Two popular techniques are used to implement networks on FPGAs: 1) network mapping and 2) co-processing.


Network Mapping Architecture


Network mapping translates an entire network into a hardware implementation based on the layers and/or the arithmetic operations performed by the model.


The picture below illustrates how the hardware is implemented on the FPGA in such a way that the network behaves like a circuit rather than a group of macro-operations. In this case, we illustrate a multi-layer perceptron network, where each perceptron can be seen as a column of multipliers followed by a tree of adders that feeds an activation function f(x).
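To make the idea concrete, below is a minimal HLS-style C++ sketch of a single perceptron, assuming a Vitis HLS-like toolchain; the input width, data type, and pragmas are illustrative assumptions rather than the output of any specific framework.

```cpp
// Minimal sketch of one perceptron mapped to hardware (network mapping style).
// N_INPUTS and the use of float are illustrative; real designs typically use
// fixed-point types (e.g. ap_fixed) and tune the pragmas for the target FPGA.
constexpr int N_INPUTS = 8;

float perceptron(const float x[N_INPUTS], const float w[N_INPUTS], float bias) {
#pragma HLS PIPELINE II=1
  float acc = bias;
  // Fully unrolled loop: one multiplier per input; the accumulation is
  // synthesised as a tree of adders feeding the activation.
  for (int i = 0; i < N_INPUTS; i++) {
#pragma HLS UNROLL
    acc += x[i] * w[i];
  }
  // Activation f(x): ReLU chosen as a simple example.
  return (acc > 0.0f) ? acc : 0.0f;
}
```

When every layer is instantiated this way, the whole model becomes a dedicated circuit on the fabric, which is what gives network mapping its low latency.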



Network Mapping Architecture for FPGAs

The network mapping technique excels in latency without requiring batches of inputs. However, its implementation footprint is significant, and the technique becomes challenging when large and complex models with tens of layers are used, depending on the target FPGA and its available resources.


The most common frameworks that use this technique are:


  • HLS4ML: An Open-Source Framework from CERN for models based on TensorFlow

  • FINN-R: An Open-Source Framework from AMD for models based on PyTorch


We cover more about the differences on our developer wiki page.


Co-Processing Architecture


Co-processing is a lightweight technique that accelerates models by focusing on the operations rather than mapping the whole network onto the FPGA. For instance, it is possible to accelerate operations such as matrix-matrix multiplications, matrix-vector multiplications, convolutions, and matrix binary/unary element-wise operations. Other approaches focus on developing execution units that accelerate operations using SIMD or vector instructions.


We can classify the co-processing hardware into the following categories:


Level 1


This level accelerates operations through vector instructions. The resulting co-processor is similar to a vector unit such as AVX or AMX, or a DSP, with fused multiply-add (FMA) or multiply-accumulate (MAC) operations on vectors.
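As a rough sketch, a Level 1 unit boils down to a hardware loop that performs a MAC per vector lane; the vector width and data type below are assumptions and not tied to any particular ISA or vendor core.

```cpp
// Sketch of a Level 1 vector MAC unit: acc[i] += a[i] * b[i] over a fixed
// vector width, similar to one FMA/MAC lane per element. VEC_WIDTH and the
// float type are illustrative assumptions.
constexpr int VEC_WIDTH = 16;

void vector_mac(const float a[VEC_WIDTH], const float b[VEC_WIDTH],
                float acc[VEC_WIDTH]) {
#pragma HLS PIPELINE II=1
  for (int i = 0; i < VEC_WIDTH; i++) {
#pragma HLS UNROLL
    acc[i] += a[i] * b[i];  // one multiply-accumulate lane per element
  }
}
```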


Level 2


This level accelerates matrix operations such as matrix-matrix multiplications and matrix element-wise operations. For instance, with these units, it is possible to accelerate:


  • Matrix Multiplication: dense, convolution, attention layers

  • Matrix Element-Wise Add/Multiplication: batch normalisation, concatenation, addition

  • Matrix Element-Wise Unary/Mapping: activations, scaling


This level also includes per-layer acceleration, where each layer is allocated and executed by an accelerator specialised in one of the aforementioned operations.
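A minimal sketch of the core of a Level 2 unit, a matrix-matrix multiplication over a small tile, could look as follows; the tile sizes and data type are assumptions, and a production accelerator would add tiling, buffering, and streaming interfaces.

```cpp
// Sketch of a Level 2 matrix-matrix multiplication kernel: C = A * B over a
// small, fixed tile. M, N, K and the float type are illustrative; a real
// accelerator would tile larger matrices and stream data through on-chip buffers.
constexpr int M = 8, N = 8, K = 8;

void matmul_tile(const float A[M][K], const float B[K][N], float C[M][N]) {
  for (int i = 0; i < M; i++) {
    for (int j = 0; j < N; j++) {
#pragma HLS PIPELINE II=1
      float acc = 0.0f;
      for (int k = 0; k < K; k++) {
        acc += A[i][k] * B[k][j];
      }
      C[i][j] = acc;
    }
  }
}
```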


Level 3


Level 3 accelerators are more advanced and complex than the previous architectures. They usually feature scheduling logic on top of operation accelerators focused on deep learning tasks, such as matrix multiplication, convolution, activations, and attention. There are commercial alternatives, such as the MathWorks Deep Learning HDL Toolbox, the AMD DPU, and the AMD AIE.


Usually, they show lower performance in terms of latency, but they are more flexible and can run larger models thanks to hardware reusability.
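The sketch below illustrates the scheduling idea in generic C++: the model is lowered into a list of operation descriptors that a scheduler dispatches to specialised engines. All names and fields are hypothetical and do not correspond to the programming interface of any of the products mentioned above.

```cpp
// Generic sketch of how a Level 3 co-processor is typically driven: the model
// is lowered into operation descriptors, and a scheduler dispatches each one
// to the engine specialised in that operation. Illustrative model only.
#include <cstdint>
#include <vector>

enum class OpType : uint8_t { MatMul, Conv2D, ElementWise, Activation };

struct OpDescriptor {
  OpType type;       // which engine should execute this operation
  uint32_t src0;     // offset of the first input tensor (assumed memory layout)
  uint32_t src1;     // offset of the second input tensor, if any
  uint32_t dst;      // offset of the output tensor
  uint32_t dims[4];  // shape information consumed by the engine
};

// The scheduler walks the program and hands each operation to its engine.
void run_program(const std::vector<OpDescriptor>& program) {
  for (const OpDescriptor& op : program) {
    switch (op.type) {
      case OpType::MatMul:      /* dispatch to the matrix engine */       break;
      case OpType::Conv2D:      /* dispatch to the convolution engine */  break;
      case OpType::ElementWise: /* dispatch to the element-wise engine */ break;
      case OpType::Activation:  /* dispatch to the activation engine */   break;
    }
  }
}
```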


The following picture illustrates the differences at the architectural level.


Co-Processing Architecture

On the left, we have Level 1 acceleration, built on top of vector units such as AVX512-VNNI, DSP-like processing, and other extensions. Level 2 involves matrix operations (centre). Level 3 is a more complex structure with one or several cores specialised in deep learning operations, like Matrix Multiplication, Convolution, Matrix Operations and Activations.


Conclusions on Deep Learning on FPGAs

In this blog post, we have covered the common architectures found in Deep Learning solutions with FPGAs. Network Mapping delivers the highest performance in exchange for high resource utilisation, whereas Co-Processing tends to be lightweight, trading performance for a smaller resource footprint. We encourage you to inspect your requirements to make the better choice. For a complete version of this post, visit our developer wiki.


More posts like this are coming. Stay tuned!


RidgeRun Is Here To Help You Reach The Next Level


RidgeRun has expertise in offloading processing algorithms using FPGAs, from Image Signal Processing to AI offloading. Our services include:


  • Algorithm Acceleration using FPGAs.

  • Image Signal Processing IP Cores.

  • Linux Device Drivers.

  • Low Power AI Acceleration using FPGAs.

  • Accelerated C++ Applications using AMD XRT.


And much more. Contact us at https://www.ridgerun.com/contact.



