In previous blog posts, we have covered the NVIDIA Holoscan Sensor Bridge, from a brief introduction to measuring the glass-to-glass latency of an example video capture implementation with live video display. This time, we propose an optimization that shortens the video pipeline by replacing the image signal processing stages with our in-house CUDA ISP solution.
For an overview of the platform and how to get started, please visit our other blog posts:
What is CUDA ISP?
CUDA ISP is an Image Signal Processing library that provides GPU-accelerated debayering, binary shifting, and white balancing, along with CUDA buffer allocators. This makes it easy to integrate into multimedia applications and computer vision algorithms. The library provides an intuitive C++ API that can be seamlessly integrated into existing workflows and is GStreamer-friendly, making it ideal for streaming applications, computer vision, and image analysis.
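To give a flavor of the kind of operation the library accelerates, here is a minimal standalone CUDA sketch of the binary-shifting step (fitting RAW10 samples stored in 16-bit containers into 8 bits). The kernel and its names are our own illustration, not the actual CUDA ISP API; see the developer wiki for the real interface.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Illustrative only: converts RAW10 samples (stored in 16-bit containers)
// to 8-bit by dropping the 2 least significant bits of each sample.
__global__ void shift_raw10_to_8bit(const uint16_t* in, uint8_t* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = static_cast<uint8_t>(in[i] >> 2);
    }
}

int main() {
    const int n = 3840 * 2160;  // one 4K Bayer frame
    uint16_t* d_in;
    uint8_t* d_out;
    cudaMalloc(&d_in, n * sizeof(uint16_t));
    cudaMalloc(&d_out, n * sizeof(uint8_t));

    const int block = 256;
    shift_raw10_to_8bit<<<(n + block - 1) / block, block>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```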
For more information, visit our other blog post: Introducing the RidgeRun CUDA ISP Library: Accelerate your Image Processing and our developer wiki.
What's the Baseline Performance of Holoscan Sensor Bridge?
The baseline performance of the Holoscan Sensor Bridge is taken from the Linux IMX274 example provided in the GitHub repository, which executes the following pipeline:
In this pipeline, the receiver operator runs entirely on the Holoscan Sensor Bridge, which transmits the data to the Jetson through Linux UDP sockets. The Image Processor is CUDA-accelerated through custom CUDA kernels and mainly performs white and black balancing. The Bayer Demosaic is an operator already present in the Holoscan Framework, based on the NPP library. The Gamma Correction is also a custom CUDA kernel, sketched below.
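For intuition, here is a hypothetical sketch of what such a gamma-correction kernel might look like; the example's actual kernel is not published in this post, so the names, the interleaved RGBA64 layout, and the rounding are our assumptions.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical gamma-correction kernel over interleaved RGBA64 data
// (four 16-bit components per pixel); not the actual example code.
__global__ void gamma_correct_rgba64(uint16_t* components, int n_components,
                                     float gamma) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_components) return;
    if ((i & 3) == 3) return;            // leave the alpha channel untouched
    float v = components[i] / 65535.0f;  // normalize to [0, 1]
    components[i] =
        static_cast<uint16_t>(powf(v, 1.0f / gamma) * 65535.0f + 0.5f);
}

// Launch over a frame already resident in GPU memory.
void apply_gamma(uint16_t* d_frame, int width, int height, float gamma) {
    int n = width * height * 4;
    int block = 256;
    gamma_correct_rgba64<<<(n + block - 1) / block, block>>>(d_frame, n, gamma);
}
```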
For this pipeline, the baseline performance results are:
Sensor | Image Dimensions (px) | CPU Usage (%) | RAM Usage (MiB) | GPU Usage (%) | Glass-to-Glass Latency (ms)
------ | --------------------- | ------------- | --------------- | ------------- | ---------------------------
IMX274 | 3840x2160 (4K) | 2.56 (total), 30.6 (core) | 347 | 14 | 96
These measurements were taken with the Jetson configured in maximum performance mode, resulting in 96 ms of glass-to-glass latency when capturing 4K at 60 fps.
You can read more details in our other blog post Glass to Glass Assessment of the Holoscan Sensor Bridge on an NVIDIA AGX Orin.
Optimizing the Pipeline with CUDA ISP
CUDA ISP integrates an outstanding algorithm for color correction and auto-white balancing in RGB space. It adjusts the histogram of each color channel within a confidence interval, leading to a more complete color balance.
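The following is a simplified sketch of the per-channel histogram stretch behind this idea. The `lo[c]` and `hi[c]` bounds would come from a prior histogram pass finding, say, the 1st and 99th percentile of each channel (the confidence interval); the names and the interleaved RGBA32 layout are our assumptions, not the CUDA ISP internals.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Simplified per-channel stretch: maps each channel's [lo, hi] percentile
// range onto the full [0, 255] range, clamping values outside it.
__global__ void stretch_channels_rgba32(uint8_t* pixels, int n_pixels,
                                        const float* lo, const float* hi) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_pixels) {
        for (int c = 0; c < 3; ++c) {  // R, G, B; alpha left untouched
            float v = pixels[4 * i + c];
            float scaled = (v - lo[c]) * 255.0f / (hi[c] - lo[c] + 1e-6f);
            pixels[4 * i + c] =
                static_cast<uint8_t>(fminf(fmaxf(scaled, 0.0f), 255.0f));
        }
    }
}
```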
Recalling the baseline pipeline, our optimization consists of dropping the ISP Processor block and replacing the Gamma Correction block with the CUDA ISP block. Internally, the CUDA ISP block performs the following operations:
1. Downsample the image from RGBA64 to RGBA32
2. Auto-white balancing
3. Upsample the image from RGBA32 to RGBA64
Operations 1 and 3 are required by the neighboring processing blocks: Holoviz requires RGBA64, and the Bayer Demosaic is configured to output RGBA64 rather than RGBA32. For simplicity, we propose removing the ISP Processor, replacing the Gamma Correction, and performing the resampling, as sketched below. We will cover RGBA64 compatibility with CUDA ISP in future blog posts.
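For reference, this is a minimal sketch of how the two conversions could be implemented, assuming interleaved layouts (RGBA64 = four 16-bit components per pixel, RGBA32 = four 8-bit components); it is our illustration, not the actual CUDA ISP internals.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Operation 1: downsample by keeping the 8 most significant bits of each
// 16-bit component. The lower 8 bits are discarded, which is the
// image-quality cost discussed later in this post.
__global__ void rgba64_to_rgba32(const uint16_t* in, uint8_t* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = static_cast<uint8_t>(in[i] >> 8);
}

// Operation 3: upsample by replicating the 8-bit value into both bytes,
// so 0xFF maps back to full scale 0xFFFF (v * 257 == (v << 8) | v).
__global__ void rgba32_to_rgba64(const uint8_t* in, uint16_t* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = static_cast<uint16_t>(in[i]) * 257;
}
```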
The optimized pipeline is shown above, highlighting the removal of the ISP Processor block and the replacement of the Gamma Correction block. The blocks execute in parallel, giving pipeline-style acceleration: removing a block shortens the frame processing time (latency), and further optimizing any remaining block decreases the latency as well.
For more information about CUDA ISP, visit our developer wiki.
Final Results
For this experiment, we use the Holoscan Sensor Bridge and the NVIDIA Orin with the IMX274 Sensor. For the entire setup, you can visit our previous blog: Glass to Glass Assessment of the Holoscan Sensor Bridge on an NVIDIA AGX Orin.
Pipeline | Image Dimensions (px) | CPU Usage (%) | RAM Usage (MiB) | GPU Usage (%) | Glass-to-Glass Latency (ms)
-------- | --------------------- | ------------- | --------------- | ------------- | ---------------------------
Baseline | 3840x2160 (4K) 60 fps | 2.56 (total), 30.6 (core) | 347 | 14 | 96
CUDA ISP RGBA32 | 3840x2160 (4K) 60 fps | 2.96 (total), 35.6 (core) | 401 | 35 | 58
Again, the Jetson was configured in maximum performance mode. The optimization lowered the latency from 96 ms to 58 ms, a 39.5% reduction.
Further Improvement
The downsampling and upsampling for the RGBA64-RGBA32 conversion are a workaround for a CUDA ISP compatibility limitation, and they may sacrifice image quality, since the lower bits of each channel are discarded. The next step is to add RGBA64 support to CUDA ISP, which will come in the near future.
Further improvement can be achieved by offloading the image signal processing to an FPGA, reducing the load on the Jetson system. An FPGA can potentially reduce the latency thanks to the dataflow execution pattern offered by FPGA hardware acceleration. RidgeRun is exploring new ways to reduce the latency to 50 ms or less for critical applications by optimizing ISP algorithms and using FPGAs.
Read more about our FPGA exploration in Selecting the Proper Device for AI and Deep Learning Inference Acceleration and Hardware Acceleration: CPU, GPU or FPGA?
Important Remarks
For stereo capture, the processing platform (i.e., the NVIDIA Jetson) must have two separate network interfaces, since each camera stream is delivered on a separate port.
The NVIDIA Jetson AGX Orin developer kit does not include a DPDK-compatible network card, so the example falls back to the Linux socket system, which increases the glass-to-glass latency. To use DPDK, it is necessary to connect a DPDK-compatible card or use a custom carrier board with a compatible NIC.
Expect More Information From Us
If you want to know more about how to leverage this technology in your project: Contact Us.