CLBlast is a modern, lightweight, performant and tunable OpenCL BLAS library written in C++11. It is designed to leverage the full performance potential of a wide variety of OpenCL devices from different vendors, including desktop and laptop GPUs, embedded GPUs, and other accelerators. CLBlast implements BLAS routines: basic linear algebra subprograms operating on vectors and matrices.
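As a reference for what such routines compute, here is a plain-Python sketch of the level-3 GEMM operation (C := alpha·A·B + beta·C), the workhorse BLAS routine. This illustrates the mathematics only; it is not CLBlast code (CLBlast exposes GEMM and the other routines through its C++ and C APIs, executing on an OpenCL device).

```python
# Reference (non-OpenCL) sketch of what the BLAS GEMM routine computes:
#   C := alpha * A @ B + beta * C
# where A is m x k, B is k x n, and C is m x n.
# Illustration of the semantics only -- not CLBlast code.

def gemm(alpha, a, b, beta, c):
    m, k = len(a), len(a[0])
    n = len(b[0])
    for i in range(m):
        for j in range(n):
            # Dot product of row i of A with column j of B
            acc = sum(a[i][p] * b[p][j] for p in range(k))
            c[i][j] = alpha * acc + beta * c[i][j]
    return c

a = [[1.0, 2.0],
     [3.0, 4.0]]
b = [[5.0, 6.0],
     [7.0, 8.0]]
c = [[1.0, 1.0],
     [1.0, 1.0]]
gemm(2.0, a, b, 1.0, c)  # → [[39.0, 45.0], [87.0, 101.0]]
```

A tuned BLAS library performs exactly this computation, but with blocked, vectorized kernels whose parameters (tile sizes, vector widths, etc.) are chosen per device, which is what CLBlast's tuners do.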
The library is not pre-tuned for every possible OpenCL device: if out-of-the-box performance is poor, please run the tuners first. See the README on GitHub for a list of already-tuned devices and for instructions on how to run the tuners yourself and contribute the results to future releases of the CLBlast library.
View CLBlast on GitHub.
CLBlast: The tuned OpenCL BLAS library
Why CLBlast and not clBLAS or cuBLAS?
Use CLBlast instead of clBLAS:
- When you care about achieving maximum performance.
- When you want to be able to inspect the BLAS kernels or easily customize them to your needs.
- When you run on exotic OpenCL devices that you need to tune yourself.
- When you are still running on OpenCL 1.1 hardware.
- When you prefer a C++ API over a C API (C API also available in CLBlast).
- When you value an organized and modern C++ codebase.
- When you target Intel CPUs and GPUs or embedded devices.
- When you can benefit from the increased performance of half-precision fp16 data-types.
Use CLBlast instead of cuBLAS:
- When you want your code to run on devices other than NVIDIA CUDA-enabled GPUs.
- When you want to tune for a specific configuration (e.g. rectangular matrix-sizes).
- When you sleep better if you know that the library you use is open-source.
- When you are using OpenCL rather than CUDA.
When not to use CLBlast:
- When you run on NVIDIA's CUDA-enabled GPUs only and can benefit from cuBLAS's assembly-level tuned kernels.
Benchmark results
Several benchmarks have been performed using CLBlast's clients and benchmarking script. Below are results for various devices: