Tutorial: OpenCL SGEMM tuning for Kepler

Note: the complete source-code is available at GitHub.
Note: a tuned OpenCL BLAS library based on this tutorial is now available at GitHub.

Kernel 5: Transposed input matrix and rectangular tiles

Our first tiled version showed that a large tile size can greatly reduce off-chip memory accesses and can thus improve performance. However, if we want to go to even larger tiles (say 64 by 64) we'll run out of resources. The Tesla K40 GPU which we use has 48KB of local memory per SM, on which multiple work-groups can run. For a 32 by 32 tile, this means we consume 2*32*32*4 = 8KB per work-group, so there is some headroom left.

To get more flexibility and to be able to fine-tune the parameters, we'll want to go for rectangular tiles. Since both matrix A and B share the dimension K, we can actually create rectangular tiles, as is shown in the image below. In the example picture, we'll have to iterate over 3 tiles of 3x2 each, while we are still computing a 3x3 tile in the end.

Rectangular tiles

To get this idea implemented, we'll want to transpose one of the input matrices before starting the matrix-multiplication. This will save us a lot of trouble computing indices, as the K-sized dimension (which A and B share) will be the same dimension. We choose to transpose the B matrix.

The transpose kernel itself is actually quite trivial and is negligible in terms of computational cost compared to the main matrix-multiplication kernel. Here is the transpose kernel which we used:
// Simple transpose kernel for a P * Q matrix
__kernel void transpose(const int P, const int Q,
                        const __global float* input,
                        __global float* output) {
    // Thread identifiers
    const int tx = get_local_id(0);
    const int ty = get_local_id(1);
    const int ID0 = get_group_id(0)*TRANSPOSEX + tx; // 0..P
    const int ID1 = get_group_id(1)*TRANSPOSEY + ty; // 0..Q

    // Set-up the local memory for shuffling
    __local float buffer[TRANSPOSEX][TRANSPOSEY];

    // Swap the x and y coordinates to perform the rotation (coalesced)
    if (ID0 < P && ID1 < Q) {
        buffer[ty][tx] = input[ID1*P + ID0];

    // Synchronise all threads

    // We don't have to swap the x and y thread indices here,
    // because that's already done in the local memory
    const int newID0 = get_group_id(1)*TRANSPOSEY + tx;
    const int newID1 = get_group_id(0)*TRANSPOSEX + ty;

    // Store the transposed result (coalesced)
    if (newID0 < Q && newID1 < P) {
        output[newID1*Q + newID0] = buffer[tx][ty];
Now, let's see what the new matrix-multiplication kernel looks like. We'll take our earlier kernel with an increased work per thread count (WPT) as a starting point. First, let us define some new constants:
#define TSM 64                 // The tile-size in dimension M
#define TSN 64                 // The tile-size in dimension N
#define TSK 32                 // The tile-size in dimension K
#define WPTN 8                 // The work-per-thread in dimension N
#define RTSN (TSN/WPTN)        // The reduced tile-size in dimension N
#define LPT ((TSK*TSM)/(RTSM*RTSN)) // The loads-per-thread for a tile
Note that, although we do define separate tile-sizes in the M and N dimensions, we assume for now they are equal. As before, we use the WPT setting to reduce the amount of threads in the second thread dimension. Our launch-parameters are:
const size_t local[2] = { TSM, TSN/WPTN }; // Or { TSM, RTSN };
const size_t global[2] = { M, N/WPTN };
Our kernel looks a lot like our 3rd version. The most notable differences are:
// Pre-transpose the input matrix B and use rectangular tiles
__kernel void myGEMM5(const int M, const int N, const int K,
                      const __global float* A,
                      const __global float* B,
                      __global float* C) {
    // Thread identifiers
    const int row = get_local_id(0); // Local row ID (max: TSM)
    const int col = get_local_id(1); // Local col ID (max: TSN/WPTN)
    const int globalRow = TSM*get_group_id(0) + row; // 0..M
    const int globalCol = TSN*get_group_id(1) + col; // 0..N

    // Local memory to fit a tile of A and B
    __local float Asub[TSK][TSM];
    __local float Bsub[TSN][TSK];

    // Initialise the accumulation registers
    float acc[WPTN];
    for (int w=0; w<WPTN; w++) {
        acc[w] = 0.0f;
    // Loop over all tiles
    int numTiles = K/TSK;
    for (int t=0; t<numTiles; t++) {

        // Load one tile of A and B into local memory
        for (int l=0; l<LPT; l++) {
            int tiledIndex = TSK*t + col + l*RTSN;
            int indexA = tiledIndex*M + TSM*get_group_id(0) + row;
            int indexB = tiledIndex*N + TSN*get_group_id(1) + row;
            Asub[col + l*RTSN][row] = A[indexA];
            Bsub[row][col + l*RTSN] = B[indexB];
        // Synchronise to make sure the tile is loaded

        // Perform the computation for a single tile
        for (int k=0; k<TSK; k++) {
            for (int w=0; w<WPTN; w++) {
                acc[w] += Asub[k][row] * Bsub[col + w*RTSN][k];

        // Synchronise before loading the next tile

    // Store the final results in C
    for (int w=0; w<WPTN; w++) {
        C[(globalCol + w*RTSN)*M + globalRow] = acc[w];
Using the same settings as before (squared tiles of 32 by 32 and a WPT of 8) we achieve only 460 GFLOPS, less than the 689 of the earlier kernel. The decreased performance is caused by bank-conflicts in the local memory in the following line:
Bsub[row][col + w*RTSN] = B[indexB];
The bank-conflicts arise because threads within a warp (same 'col' value, but different 'row' value) access the same (or many of the same) local memory banks. The image below on the left visualises this: threads in a warp (4 for this toy-example) access data in a column-wise fashion. For example, the first warp will access all elements marked with an 'x'. On our Kepler GPU, local memory is arranged in 32 banks. If TSN is equal to 32, then every column in the picture represents a single bank (let's not consider the different addressing modes for simplicity). To repair this we have to make sure that threads in a warp do not access the same column. A well-known technique is to use padding:
__local float Bsub[TSN][TSK+2];
Padding by 2 will result in the picture on the right, in which we can see that bank conflicts no longer occur (or at least a lot less than before). Note that we pad the memory by 2 rather than 1 to align data to 64-bit (two floats) so that we can still benefit from 64-bit loads from local memory. You probably noted that the padding does consume extra local memory, so this technique might deteriorate performance if occupancy is reduced and latency can no longer be hidden.

Local memory padding

Now that we have all of this working, we have the freedom to select 64 by 32 blocks for example. This gives us a little bit of extra performance over the earlier kernel's 32 by 32 blocks, giving us 740 GFLOPS.

Performance of myGEMM

Tutorial written by Cedric Nugteren, (c) 2014 SURFsara
Last updated in 2014 - information might be outdated
- Introduction
- Matrix-multiplication
- Kernel 1
- Kernel 2
- Kernel 3
- Kernel 4
- Kernel 5
- Kernel 6
- Kernel 7
- Kernel 8
- Kernel 9
- Kernel 10
- What's next?

Extra pages:
- Inside clBlas
- clBlas on AMD GPUs