By Luis G. Leon-Vega

Improving Latency on the Holoscan Sensor Bridge with CUDA ISP

Image with CUDA ISP on the Jetson connected to a Holoscan Sensor Bridge
Lower the glass-to-glass latency by using CUDA ISP on your Holoscan Sensor Bridge video pipeline

In previous blog posts, we have covered the NVIDIA Holoscan Sensor Bridge, from a brief introduction to measuring the glass-to-glass latency of an example video-capture implementation with live video display. This time, we propose an optimisation that shortens the video pipeline by replacing the image signal processing stages with our in-house CUDA ISP solution.


If you want an overview of the platform and how to get started, please visit our other blog posts:



What is CUDA ISP?



RidgeRun CUDA Image Signal Processing Library Logo
RidgeRun CUDA Image Signal Processing Library

CUDA ISP is an Image Signal Processing library that provides GPU-accelerated debayering, binary shifting, and white balancing, along with CUDA buffer allocators. This makes it easy to integrate into multimedia applications and computer vision algorithms. The library provides an intuitive C++ API that can be seamlessly integrated into existing workflows and is GStreamer-friendly, making it ideal for streaming applications, computer vision, and image analysis.
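To make these operations concrete, here is a CPU-side sketch (using NumPy, not the CUDA ISP API itself) of two of the per-pixel transforms the library accelerates on the GPU: binary shifting of raw sensor samples and per-channel white balancing. The function names and gains below are illustrative only.

```python
import numpy as np

# Conceptual illustration (NOT the CUDA ISP API): the kinds of per-pixel
# operations CUDA ISP accelerates on the GPU, shown here with NumPy.

def shift_raw10_to_8bit(raw10: np.ndarray) -> np.ndarray:
    """Binary shift: drop the 2 least-significant bits of a 10-bit sample."""
    return (raw10 >> 2).astype(np.uint8)

def white_balance(rgb: np.ndarray, gains=(1.0, 1.0, 1.0)) -> np.ndarray:
    """Apply per-channel gains, clipping to the 8-bit range."""
    out = rgb.astype(np.float32) * np.asarray(gains, dtype=np.float32)
    return np.clip(out, 0, 255).astype(np.uint8)

raw = np.array([[1023, 512], [0, 256]], dtype=np.uint16)  # 10-bit samples
print(shift_raw10_to_8bit(raw))  # [[255 128] [  0  64]]
```

On the GPU, the same element-wise logic maps naturally to one CUDA thread per pixel, which is why these stages parallelize so well.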



What's the Baseline Performance of Holoscan Sensor Bridge?


The baseline performance of the Holoscan Sensor Bridge is taken from the Linux IMX274 example provided in the GitHub repository. It executes the following pipeline:


Baseline Image Signal Processing Pipeline
Baseline Image Signal Processing Pipeline

In this pipeline, the receiver operator's capture side runs entirely on the Holoscan Sensor Bridge, which transmits the data to the Jetson through UDP Linux sockets. The Image Processor is CUDA-accelerated through custom CUDA kernels, mainly performing white and black balancing. The Bayer Demosaic is an operator already present in the Holoscan Framework, based on the NPP library. The Gamma Correction is also a custom CUDA kernel.
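As a reference for the Gamma Correction stage, the standard power-law transform can be sketched as follows (a NumPy illustration of the math, not the example's actual CUDA kernel; the gamma value is an assumption):

```python
import numpy as np

def gamma_correct(img: np.ndarray, gamma: float = 2.2) -> np.ndarray:
    """Power-law gamma: normalize to [0, 1], raise to 1/gamma, rescale."""
    norm = img.astype(np.float32) / 255.0
    return np.clip((norm ** (1.0 / gamma)) * 255.0, 0, 255).astype(np.uint8)
```

Because the transform is independent per pixel, the CUDA version is a straightforward element-wise kernel (often implemented as a lookup table for 8-bit data).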


For this pipeline, the baseline performance results are:


| Sensor | Image Dimensions (px) | CPU Usage (%) | RAM Usage (MiB) | GPU Usage (%) | Glass-to-Glass Latency (ms) |
|--------|-----------------------|---------------|-----------------|---------------|-----------------------------|
| IMX274 | 3840x2160 (4K) | 2.56 (total), 30.6 (core) | 347 | 14 | 96 |

These results were obtained with the Jetson configured in maximum-performance mode, leading to 96 ms of glass-to-glass latency when capturing 4K at 60 fps.



Optimizing the Pipeline with CUDA ISP


CUDA ISP integrates an effective algorithm for colour correction and auto-white balancing in RGB space. It adjusts the histogram of each colour channel within a confidence interval, leading to more complete colour balancing.
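A histogram-based auto-white balance of this kind can be sketched as a per-channel contrast stretch over a confidence interval (this is an assumed simplification for illustration, not the exact CUDA ISP algorithm; the percentile bounds are hypothetical):

```python
import numpy as np

# Sketch of histogram-based auto white balance: stretch each colour channel
# so that values inside a confidence interval (here the 1st-99th percentiles,
# an assumed choice) span the full 8-bit range.

def auto_white_balance(rgb: np.ndarray, lo_pct=1.0, hi_pct=99.0) -> np.ndarray:
    out = np.empty_like(rgb)
    for c in range(3):
        ch = rgb[..., c].astype(np.float32)
        lo, hi = np.percentile(ch, [lo_pct, hi_pct])
        scale = 255.0 / max(hi - lo, 1.0)  # guard against a flat channel
        out[..., c] = np.clip((ch - lo) * scale, 0, 255).astype(np.uint8)
    return out
```

Trimming the extreme percentiles keeps a few hot or dead pixels from dominating the stretch, which is the practical benefit of working within a confidence interval.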


Recalling the baseline pipeline, our optimization drops the ISP Processor block and replaces the Gamma Correction block with the CUDA ISP block. Internally, the CUDA ISP block performs the following operations:


  1. Downsample the image from RGBA64 to RGBA32

  2. Auto-White Balancing

  3. Upsample the image from RGBA32 to RGBA64


Steps 1 and 3 are needed because of the preceding and following processing blocks: Holoviz requires RGBA64, and the Bayer Demosaic is configured to output RGBA64 rather than RGBA32. For simplicity, we propose removing the ISP Processor, replacing the Gamma Correction, and performing the resampling. We will cover RGBA64 compatibility with CUDA ISP in future blog posts.
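The two conversion steps around the white balancing reduce to simple bit shifts, since RGBA64 packs 16 bits per channel and RGBA32 packs 8. A NumPy sketch of the idea (not the actual pipeline code):

```python
import numpy as np

# Steps 1 and 3 sketched with NumPy: downsampling drops the low byte of each
# 16-bit channel; upsampling shifts back and replicates the high byte so that
# full-scale white maps to 0xFFFF rather than 0xFF00.

def rgba64_to_rgba32(img16: np.ndarray) -> np.ndarray:
    return (img16 >> 8).astype(np.uint8)

def rgba32_to_rgba64(img8: np.ndarray) -> np.ndarray:
    v = img8.astype(np.uint16)
    return (v << 8) | v
```

The byte-replication trick on the way back up is a standard way to spread 8-bit values evenly over the 16-bit range, though the 8 bits of precision lost in step 1 are not recovered.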


CUDA ISP Based Image Signal Processing Pipeline
CUDA ISP Based Image Signal Processing Pipeline

The optimized pipeline is shown above, highlighting the removal of the ISP Processor block and the replacement of the Gamma Correction block. The blocks execute in parallel, in a pipelined fashion. Removing one of the blocks shortens the frame processing time (latency), and further optimizing any of the remaining blocks will decrease it further.
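The effect of dropping a stage can be illustrated with a toy latency model of the pipelined execution described above (the per-stage times below are made-up illustrative numbers, not measurements):

```python
# Toy model of a pipelined video path: end-to-end latency is the sum of the
# stage times, while sustained throughput is limited by the slowest stage.
# Stage times in ms are ILLUSTRATIVE ONLY, not measured values.

baseline = {"Receiver": 10.0, "ISP Processor": 12.0,
            "Bayer Demosaic": 8.0, "Gamma Correction": 6.0, "Holoviz": 5.0}

# Drop the ISP Processor and swap Gamma Correction for the CUDA ISP block.
optimized = {k: v for k, v in baseline.items() if k != "ISP Processor"}
optimized["CUDA ISP"] = optimized.pop("Gamma Correction")

def latency(stages):
    return sum(stages.values())        # a frame traverses every stage

def frame_interval(stages):
    return max(stages.values())        # pipelining hides all but the slowest

print(latency(baseline), latency(optimized))
```

This is why removing a block directly lowers latency even when the remaining stages are unchanged: the frame simply crosses fewer stages on its way to the display.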


For more information about CUDA ISP, visit our developer wiki.


Final Results


For this experiment, we use the Holoscan Sensor Bridge and the NVIDIA Orin with the IMX274 Sensor. For the entire setup, you can visit our previous blog: Glass to Glass Assessment of the Holoscan Sensor Bridge on an NVIDIA AGX Orin.


| Pipeline | Image Dimensions (px) | CPU Usage (%) | RAM Usage (MiB) | GPU Usage (%) | Glass-to-Glass Latency (ms) |
|----------|-----------------------|---------------|-----------------|---------------|-----------------------------|
| Baseline | 3840x2160 (4K) 60 fps | 2.56 (total), 30.6 (core) | 347 | 14 | 96 |
| CUDA ISP RGBA32 | 3840x2160 (4K) 60 fps | 2.96 (total), 35.6 (core) | 401 | 35 | 58 |

These results were also obtained with the Jetson in maximum-performance mode. The optimization lowered the latency from 96 ms to 58 ms, a 39.6% reduction.



Glass-to-Glass Capture Using The Chronometer
Glass-to-Glass Capture Using The Chronometer


Further Improvement


The downsampling and upsampling for the RGBA64-RGBA32 conversion are needed because of a compatibility limitation in CUDA ISP; however, they may sacrifice some image quality. The next step is to add RGBA64 support to CUDA ISP, which is coming in the near future.


On the other hand, further improvement can be achieved by offloading the image signal processing to an FPGA, reducing the load on the Jetson system. An FPGA can potentially reduce the latency thanks to the dataflow execution pattern offered by FPGA hardware acceleration. RidgeRun is exploring ways to reduce the latency to 50 ms or less for critical applications by optimizing ISP algorithms and using FPGAs.




Important Remarks


  • For stereo capture, the processing platform (i.e., the NVIDIA Jetson) must have two separate network interfaces, given that each camera stream is delivered on a separate port.

  • The NVIDIA Jetson AGX Orin Developer Kit does not have a DPDK-compatible NIC, so the pipeline falls back to the Linux socket stack, which increases the glass-to-glass latency. To use DPDK, connect a DPDK-compatible card or use a custom carrier board with a compatible NIC.


Expect More Information From Us


If you want to know more about how to leverage this technology in your project, Contact Us.


