Abstract
Portable hardware acceleration has become increasingly necessary with the rise of edge computing. Edge computing, the distributed computing paradigm in which data is processed and stored as close to its source as possible, is needed where bandwidth and latency are restricted and where network connectivity, privacy, or security cannot be guaranteed. One example is autonomous mobile robotics, such as autonomous tractors, which often carry numerous cameras whose streams must all be processed on the host in areas with no reliable connection to a cloud-based platform. Similarly, bridge-surveying drones, where mapping and path planning must run with low latency, benefit from a lightweight, compact, low-power device, especially under strict size and energy-consumption constraints.
Edge devices, which act as small but capable computers, therefore leverage onboard accelerators to tackle robotics, computer vision and AI tasks directly on the device without needing an external connection. These accelerators commonly take the form of a GPU, as in Nvidia's Jetson development kit series, which is driven by the same workflows as Nvidia's AI software and cloud-native frameworks while remaining lean, compact and less energy-demanding. With the growing popularity of FPGAs, in the future we could see more low-power edge devices such as AMD Xilinx's Kria KR260 robotics starter kit.
Hence, with the growing usefulness of edge devices and the variety of accelerator brands and types, the need for hardware portability on the edge grows as well. As we will show in this talk, SYCL, an open-standard, high-level parallel programming model, provides portability not only at the API level but also at the compiler level, enabling the same software to run on CPU-, GPU- and FPGA-based edge devices. Additionally, we will show how we maintain performance through device-specific kernel specialisation.
The Open Neural Network Exchange (ONNX) is an open-source artificial intelligence ecosystem of technology companies and research organisations that establishes open standards for representing machine learning algorithms and software tools; it is available on GitHub. This presentation will explain how we used DPC++, an open-source SYCL implementation, to compile the SYCL backend of the ONNX runtime and target Nvidia's Jetson series architecture. DPC++ allows us to compile the ONNX runtime SYCL backend for the Jetson's onboard GPU, while ComputeAorta, Codeplay's multi-target, multi-platform framework, serves as an OpenCL implementation targeting the Jetson's onboard CPU. We will show the performance we get using the ONNX runtime CPU backend and the SYCL backend targeting the Jetson's GPU and CPU. The ONNX runtime SYCL backend is implemented using the lightweight, templated SYCL-BLAS and SYCL-DNN libraries, whose kernels expose tuning parameters such as cache size, work-group size and local memory size chosen for the specific hardware. Once tuned for the Jetson, the SYCL backend showed performance comparable to the native CUDA backend used by ONNX.
Finally, using the ONNX runtime SYCL backend on an Nvidia Jetson Xavier NX edge device, we will discuss ongoing work on aerial classification using image/radar data. We will also present preliminary lab results showing how our stack affects latency and energy consumption, and why this matters for this use case.
For future work, we hope to enable and tune SYCL-DNN/SYCL-BLAS for other Jetson devices as well as FPGA- and RISC-V-based edge devices.
Original language | English |
---|---|
Title of host publication | IWOCL '23 |
Subtitle of host publication | Proceedings of the 2023 International Workshop on OpenCL |
Place of Publication | New York, NY |
Publisher | Association for Computing Machinery |
Number of pages | 1 |
ISBN (Print) | 9798400707452 |
DOIs | |
Publication status | Published - 18 Apr 2023 |
Event | IWOCL '23: International Workshop on OpenCL - Cambridge, United Kingdom (18 Apr 2023 → 20 Apr 2023)
Conference
Conference | IWOCL '23 |
---|---|
Country/Territory | United Kingdom |
City | Cambridge |
Period | 18/04/23 → 20/04/23 |