Artificial Intelligence (AI) is a hot topic nowadays thanks to the versatility and generalisation capacity of modern techniques such as deep learning, transformers, and large language models. Choosing an architecture is a key part of prototyping and deploying a model for inference, and depending on the requirements, some architectures offer more advantages than others.
This blog post compares hardware architectures for the AI inference process, highlighting the advantages of each one and providing an overall landscape for further decisions. We invite you to review our other blog post: Hardware Acceleration: CPU, GPU or FPGA?
An Overview of Deep Learning and AI Inference Operations
Deep Learning techniques are widely popular in modern AI because they approximate complex functions and learn hidden patterns (also known as features). Among the most popular techniques are:
Multi-Layer Perceptron Neural Networks
Recurrent Neural Networks
Convolutional Neural Networks
Transformer-Based Networks
Hybrid Architectures
The aforementioned techniques share a common structure: multiple layers separated by a non-linearity, often called an activation function, which prevents consecutive layers from mathematically simplifying (or collapsing) into a single operation and enhances the learning process.
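As a rough illustration of this layered structure, here is a minimal sketch using PyTorch (chosen purely as an example framework; the sizes and names are placeholders). It builds a small multi-layer perceptron where a ReLU activation separates the linear layers; without it, the two layers would collapse into a single linear transformation.

```python
import torch
import torch.nn as nn

# Minimal multi-layer perceptron: two linear layers separated by a ReLU.
# Without the non-linearity, W2 @ (W1 @ x) is just another linear map,
# so the two layers would mathematically collapse into one.
mlp = nn.Sequential(
    nn.Linear(16, 32),  # input features -> hidden features
    nn.ReLU(),          # activation function (the non-linearity)
    nn.Linear(32, 4),   # hidden features -> output features
)

x = torch.randn(1, 16)        # one sample with 16 features
with torch.inference_mode():  # inference only, no gradients
    y = mlp(x)
print(y.shape)  # torch.Size([1, 4])
```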
Hardware Architectures for Deep Learning and AI Inference Acceleration
Four popular architectures are used for Deep Learning and AI inference acceleration: CPUs, GPUs, ASICs and FPGAs. By default, most frameworks can run AI workloads on CPUs. Still, CPUs are often replaced by other accelerators because their architecture does not execute batched samples efficiently in a world driven by raw performance. On the other hand, GPUs have emerged as the most popular solution for accelerating AI and Deep Learning: they are well suited for batched execution thanks to their massive parallelism based on SIMT (Single Instruction, Multiple Threads).
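To make the batching idea concrete, here is a minimal sketch (PyTorch again, purely for illustration, with a toy model and placeholder sizes) contrasting per-sample execution with a single batched call. On a GPU, the batched call is where the SIMT parallelism pays off.

```python
import time
import torch
import torch.nn as nn

# Toy model and data, purely for illustration.
model = nn.Linear(256, 64).eval()
samples = [torch.randn(1, 256) for _ in range(512)]

with torch.inference_mode():
    # Latency-oriented: one sample per forward call.
    start = time.perf_counter()
    for s in samples:
        model(s)
    per_sample = time.perf_counter() - start

    # Throughput-oriented: all samples stacked into one batch,
    # executed with a single forward call.
    batch = torch.cat(samples, dim=0)  # shape: (512, 256)
    start = time.perf_counter()
    model(batch)
    batched = time.perf_counter() - start

print(f"per-sample loop: {per_sample:.4f}s, single batch: {batched:.4f}s")
```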
Specialised hardware such as ASICs and FPGAs is also well suited for AI, since it implements exactly the operations the AI tasks need, making better use of the available hardware and increasing performance and energy efficiency. Some units, such as the Neural Processing Unit (NPU), the Tensor Processing Unit (TPU) and Deep Learning Accelerators (DLA), are specialised but behave similarly to a GPU at the architecture level, with less flexibility.
FPGAs are becoming an interesting AI solution focused on execution latency. Thanks to their re-programmability and their ability to connect directly to peripherals such as cameras and radio transceivers, they can maintain low latency while sustaining high inference rates.
Choosing the Proper Hardware
Choosing hardware to accelerate AI inference depends on the use case and the performance goals. The following sections characterise each piece of hardware according to these criteria:
Performance Orientation: raw performance (throughput) or execution latency
Deployment Type: production, prototyping
Applications: low-latency or relaxed
Difficulty: how difficult is it to create a solution for the platform?
Support from AI frameworks.
Depending on the implementation architecture, FPGAs can exhibit multiple characteristics. For this section, we will assume that the model is entirely mapped into hardware.
CPU
CPUs can run most state-of-the-art models, given their versatility. Here are some characteristics:
Flexible in terms of performance orientation: if low execution latency is needed, they are sometimes better than GPUs.
Deployment is not their focus.
They are useful for relaxed applications and prototyping.
Support for high-precision models that use floating-point numbers.
Highly supported by AI frameworks.
CPUs are recommended for prototyping and relaxed applications that do not require high performance.
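As an example of how well supported CPU prototyping is, the sketch below exports a small PyTorch model to ONNX and runs it on the CPU with ONNX Runtime. The model, file name and shapes are placeholders chosen for illustration, and onnxruntime is assumed to be installed.

```python
import numpy as np
import torch
import torch.nn as nn
import onnxruntime as ort

# Placeholder model; in practice this would be your trained network.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).eval()

# Export to ONNX so it can run through a CPU inference engine.
dummy = torch.randn(1, 16)
torch.onnx.export(model, dummy, "model.onnx",
                  input_names=["input"], output_names=["output"])

# Run the exported model on the CPU execution provider.
session = ort.InferenceSession("model.onnx",
                               providers=["CPUExecutionProvider"])
outputs = session.run(None, {"input": np.random.randn(1, 16).astype(np.float32)})
print(outputs[0].shape)  # (1, 4)
```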
GPU
GPUs are the most popular devices for AI acceleration, and they support various operators through frameworks such as PyTorch and TensorFlow. Here are some characteristics:
Oriented to raw performance: the optimal way to execute a model is with sample batches.
They can be used for prototyping and production applications: NVIDIA Jetson is a clear example of a production-ready product.
The best use cases focus on batched execution and relaxed latency requirements.
Suitable for most applications, except those with strict low-latency requirements.
Support for high-precision models that use floating-point numbers. Some GPUs now also integrate reduced-precision types such as int4, float8, bfloat16 and float16.
Highly supported by AI frameworks.
Low time-to-market.
GPUs are recommended for most use cases where low latency or hard real-time response is not a constraint. These use cases include audio processing, computer vision, large language models, and others.
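As a rough sketch of throughput-oriented GPU inference (assuming PyTorch and a CUDA-capable GPU, with a CPU fallback otherwise; the model and sizes are placeholders), the example below runs a whole batch in one call under float16 autocast, one of the reduced-precision types mentioned above.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder model moved to the accelerator.
model = nn.Sequential(nn.Linear(1024, 2048), nn.ReLU(), nn.Linear(2048, 10))
model = model.to(device).eval()

# A batch of 256 samples executed with a single forward call.
batch = torch.randn(256, 1024, device=device)

with torch.inference_mode():
    if device == "cuda":
        # Mixed precision: run the heavy matrix multiplications in float16.
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            out = model(batch)
    else:
        out = model(batch)

print(out.shape, out.dtype)
```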
FPGA
FPGAs are rarely taken as an option for AI inference unless the environment requires hard real-time behaviour and ultra-low latency. One of their key advantages is the possibility of integrating the AI implementation with the signal pre-processing on the same device: for instance, anomaly detection on RF transceiver data, Smart NIC packet filtering, and basic computer vision. Here are some characteristics:
They are oriented towards ultra-low latency and energy efficiency.
Intended for production.
Applications often focus on hard real-time behaviour, determinism and low latency, such as RF processing, computer vision, and others.
Focused on quantised and optimised networks (see the quantisation sketch after this list).
Difficult to implement large models.
Limited support from AI frameworks.
Higher time-to-market due to difficulty.
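Because FPGA flows rely heavily on quantised networks, here is a generic illustration of the concept (this is the sketch referenced in the list above). FPGA vendor toolchains use their own quantisation and compilation steps, often based on quantisation-aware training, so the snippet below only shows the general idea using PyTorch's dynamic int8 quantisation on a placeholder model.

```python
import torch
import torch.nn as nn

# Placeholder floating-point model.
float_model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 8)).eval()

# Dynamic post-training quantisation: Linear weights are stored as int8,
# shrinking the model and lowering the compute precision. Real FPGA flows
# use vendor-specific quantisation instead of this generic PyTorch path.
quant_model = torch.quantization.quantize_dynamic(
    float_model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
with torch.inference_mode():
    print(float_model(x)[0, :4])
    print(quant_model(x)[0, :4])  # close to the float output, at lower precision
```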
FPGAs are recommended when determinism and low latency are required, especially when the AI inference must sit right next to the pre-processing stages. Some examples:
Video systems with ISP included in the FPGA.
RF systems with packet filtering.
Smart NICs.
Other applications, like algorithm acceleration and efficient computing, will be studied in the following posts.
Conclusions
In this blog post, we have covered some of the details of choosing an accelerator among CPUs, GPUs, and FPGAs. We encourage you to inspect your requirements carefully to make the best choice. For a complete version of this post, visit our developer wiki.
More posts like this are coming. Stay tuned!
RidgeRun Is Here To Help You Reach The Next Level
RidgeRun has expertise in offloading processing algorithms using FPGAs, from Image Signal Processing to AI offloading. Our services include:
Algorithm Acceleration using FPGAs.
Image Signal Processing IP Cores.
Linux Device Drivers.
Low Power AI Acceleration using FPGAs.
Accelerated C++ Applications using AMD XRT.
And much more. Contact us at https://www.ridgerun.com/contact.