GPU performance or ease of use: why not both?

Track: Jupyter and Scientific Python
Type: Talk
Level: Intermediate
Duration: 30 minutes

Abstract

There are many ways to fully unleash the power of the GPU: math libraries specialized for GPU scientific computing, device kernel fusion, JIT compilation, or link-time optimization (LTO). Until recently, however, only C++ developers could fully benefit from these techniques. While Python libraries like CuPy have used at least some of them under the hood, the knobs and nuances crucial for performance tuning were not exposed to the Python community. Advanced users had to resort to custom integrations with the C and C++ stack whenever they needed fine-grained control.

The open-source nvmath-python library, released last year, aims to bridge this gap for core math operations. Instead of trading ease of use for performance, its design aims to provide both. While respecting Python principles, it offers the same flexibility and control that C++ developers have, including integration with Numba and the NVIDIA runtime compilation stack: NVRTC and nvJitLink. To fit seamlessly into the Python ecosystem, it handles all the necessary dependencies while interoperating effortlessly with popular tensor libraries.
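
As a flavor of that integration, the sketch below fuses a user-defined normalization epilog into an FFT: the Python callback is compiled to LTO IR through Numba and linked into the cuFFT kernel via nvJitLink. This is a minimal sketch following the callback API in the nvmath-python documentation; the helper name compile_epilog and the form of the epilog argument reflect our reading of the docs and should be verified against the installed version.

    import cupy as cp
    import nvmath

    B, N = 256, 1024
    a = cp.random.rand(B, N) + 1j * cp.random.rand(B, N)  # complex128 batch

    # Device callback applied to every output element of the FFT.
    # Here it folds the 1/N normalization into the transform itself.
    def rescale(data_out, offset, data, user_info, unused):
        data_out[offset] = data / N

    # Numba compiles the callback to LTO IR; nvJitLink links it into the
    # cuFFT kernel, so no extra kernel launch or memory round trip occurs.
    # (API names per the nvmath-python docs; check your release.)
    epilog = nvmath.fft.compile_epilog(rescale, "complex128", "complex128")
    r = nvmath.fft.fft(a, axes=[-1], epilog={"ltoir": epilog})

A single fused kernel replaces what would otherwise be an FFT followed by a separate elementwise scaling pass.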

The library targets Python developers, the scientific computing community, AI practitioners, and anyone striving for peak GPU kernel performance. Supporting both CPU and GPU execution spaces, nvmath-python can be easily integrated into existing workflows written with NumPy, CuPy, SciPy, scikit-learn, and others. It provides GEMMs and FFTs with prolog/epilog fusion and narrow-precision types, random number generation, and many other capabilities of the CUDA math libraries, all exposed directly to the user.
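
For example, the epilog mechanism lets a bias add and a ReLU run inside a single GEMM kernel. The snippet below follows the stateless matmul API from the nvmath-python documentation; the MatmulEpilog enum and the epilog_inputs argument are taken from the documented interface, but exact names should be confirmed against the release you use.

    import cupy as cp
    import nvmath

    m, n, k = 1024, 1024, 1024
    a = cp.random.rand(m, k)
    b = cp.random.rand(k, n)
    bias = cp.random.rand(m, 1)

    # One fused cuBLASLt call computing relu(a @ b + bias):
    # the epilog removes two extra kernel launches and the associated
    # round trips through global memory.
    result = nvmath.linalg.advanced.matmul(
        a,
        b,
        epilog=nvmath.linalg.advanced.MatmulEpilog.RELU_BIAS,
        epilog_inputs={"bias": bias},
    )

Because nvmath-python operates directly on the tensors it is given, the same call is designed to accept NumPy or PyTorch tensors as well, returning a result of the matching type.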

In this talk, we will share the design principles behind nvmath-python and demonstrate how it can be used to accelerate scientific computing workloads in Python, with special emphasis on the just-released sparsity support and distributed computing capabilities.