Kernel Tuner
Kernel Tuner greatly simplifies the development of highly-optimized and auto-tuned CUDA, OpenCL, and C code, supporting many advanced use-cases and optimization strategies that speed up the auto-tuning process.
Kernel Float is a header-only C++ library for CUDA that simplifies working with vector types and reduced-precision floating-point arithmetic in GPU code.
CUDA natively offers several reduced-precision floating-point types (__half, __nv_bfloat16, __nv_fp8_e4m3, __nv_fp8_e5m2) and vector types (e.g., __half2, __nv_fp8x4_e4m3, float3). However, working with these types is cumbersome: mathematical operations require intrinsics (e.g., __hadd2 performs addition for __half2), type conversion is awkward (e.g., __nv_cvt_halfraw2_to_fp8x2 converts float16 to float8), and some functionality is missing (e.g., one cannot convert a __half to __nv_bfloat16).
Kernel Float resolves this by offering a single data type, kernel_float::vec<T, N>, that stores N elements of type T. Internally, the data is stored as a fixed-size array of elements. Operators (such as +, *, and &&) are overloaded so that the best available intrinsic for the given types is selected automatically. Many mathematical functions (such as log, exp, and sin) and common operations (such as sum, range, and for_each) are also available.
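The idea of a fixed-size vector with overloaded operators can be illustrated with a minimal host-side C++ sketch. This is not Kernel Float's implementation or API: the namespace sketch, the helper zip, and all function bodies are hypothetical stand-ins, and plain loops take the place of the CUDA intrinsics that the real library would select for types such as __half2.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical illustration only; NOT the kernel_float implementation.
namespace sketch {

// Stores N elements of type T in a fixed-size array.
template <typename T, std::size_t N>
struct vec {
    std::array<T, N> data;

    T& operator[](std::size_t i) {
        assert(i < N);
        return data[i];
    }
    const T& operator[](std::size_t i) const {
        assert(i < N);
        return data[i];
    }
};

// Apply a binary functor element by element (assumed helper).
template <typename T, std::size_t N, typename F>
vec<T, N> zip(const vec<T, N>& a, const vec<T, N>& b, F f) {
    vec<T, N> out{};
    for (std::size_t i = 0; i < N; ++i) {
        out[i] = f(a[i], b[i]);
    }
    return out;
}

// Element-wise addition. A GPU library could dispatch to an
// intrinsic here instead of looping over the elements.
template <typename T, std::size_t N>
vec<T, N> operator+(const vec<T, N>& a, const vec<T, N>& b) {
    return zip(a, b, [](T x, T y) { return x + y; });
}

// Element-wise multiplication.
template <typename T, std::size_t N>
vec<T, N> operator*(const vec<T, N>& a, const vec<T, N>& b) {
    return zip(a, b, [](T x, T y) { return x * y; });
}

// Element-wise exponential.
template <typename T, std::size_t N>
vec<T, N> exp(const vec<T, N>& v) {
    vec<T, N> out{};
    for (std::size_t i = 0; i < N; ++i) {
        out[i] = std::exp(v[i]);
    }
    return out;
}

// Reduction: sum of all elements.
template <typename T, std::size_t N>
T sum(const vec<T, N>& v) {
    T total{};
    for (std::size_t i = 0; i < N; ++i) {
        total += v[i];
    }
    return total;
}

}  // namespace sketch
```

With a = {1, 2, 3} and b = {4, 5, 6} as sketch::vec<float, 3>, the expression sketch::sum(a * b) evaluates to 32. User code only ever sees the operators, so the choice between a loop and a hardware intrinsic stays hidden behind the same interface.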
By using this library, developers can avoid the complexity of working with reduced precision floating-point types in CUDA and focus on their applications.