Shared Memory in Matrix Multiplication (C = AB)

Shared memory enables cooperation between threads in a block. When multiple threads in a block use the same data from global memory, shared memory can be used to access the data from global memory only once. Shared memory can also be used to avoid uncoalesced memory accesses by loading and storing data in a coalesced pattern from global memory and then reordering it in shared memory. Aside from memory bank conflicts, there is no penalty for non-sequential or unaligned accesses by a half warp in shared memory.

The use of shared memory is illustrated via the simple example of a matrix multiplication C = AB for the case with A of dimension M×16, B of dimension 16×N, and C of dimension M×N. To keep the kernels simple, M and N are assumed to be multiples of 16. A natural decomposition of the problem is to use a block and tile size of 16×16 threads. Therefore, in terms of 16×16 tiles, A is a column matrix, B is a row matrix, and C is their outer product (see Figure 1). A grid of N/16 by M/16 blocks is launched, where each thread block calculates the elements of a different tile in C from a single tile of A and a single tile of B.

Note that the example discussed throughout this section assumes compute capability 1.x; the example would be much the same for compute capability 2.x, except that it would use full warps of 32 threads rather than half warps of 16.

Figure 1. Block-Column Matrix (A) Multiplied by Block-Row Matrix (B) with Resulting Product Matrix (C)

The simpleMultiply kernel (Unoptimized matrix multiplication) calculates the output elements of a tile of matrix C.

Unoptimized matrix multiplication

__global__ void simpleMultiply(float *a, float *b, float *c,
                               int N)
{
 int row = blockIdx.y * blockDim.y + threadIdx.y;
 int col = blockIdx.x * blockDim.x + threadIdx.x;
 float sum = 0.0f;
 for (int i = 0; i < TILE_DIM; i++) {
   sum += a[row*TILE_DIM+i] * b[i*N+col];
 }
 c[row*N+col] = sum;
}

In Unoptimized matrix multiplication, a, b, and c are pointers to global memory for the matrices A, B, and C, respectively; blockDim.x, blockDim.y, and TILE_DIM are all 16. Each thread in the 16×16 block calculates one element in a tile of C. row and col are the row and column of the element in C being calculated by a particular thread. The for loop over i multiplies a row of A by a column of B; the resulting dot product is then written to C.

The effective bandwidth of this kernel is only 8.7 GBps on an NVIDIA GeForce GTX 280 and 0.7 GBps on an NVIDIA GeForce GTX 8800. To analyze performance, it is necessary to consider how half warps of threads access global memory in the for loop. Each half warp of threads calculates one row of a tile of C, which depends on a single row of A and an entire tile of B as illustrated in Figure 2.

Figure 2. Computing a Row (Half Warp) of a Tile In C Using One Row of A and an Entire Tile of B

For each iteration i of the for loop, all threads in a half warp read the same value from global memory (the index row*TILE_DIM+i is constant within a half warp), resulting in 16 transactions for compute capability 1.1 or lower, and 1 transaction for compute capability 1.2 or higher. Even though the operation requires only 1 transaction for compute capability 1.2 or higher, there is wasted bandwidth in the transaction because only 4 bytes out of a 32-byte transaction are used. For each iteration, the 16 threads in a half warp read a row of the B tile, which is a sequential and coalesced access for all compute capabilities.

The performance on a device of any compute capability can be improved by reading a tile of A into shared memory as shown in Using shared memory to improve the global memory load efficiency in matrix multiplication.

Using shared memory to improve the global memory load efficiency in matrix multiplication

__global__ void coalescedMultiply(float *a, float *b, float *c,
                                  int N)
{
 __shared__ float aTile[TILE_DIM][TILE_DIM];
 int row = blockIdx.y * blockDim.y + threadIdx.y;
 int col = blockIdx.x * blockDim.x + threadIdx.x;
 float sum = 0.0f;
 aTile[threadIdx.y][threadIdx.x] = a[row*TILE_DIM+threadIdx.x];
 for (int i = 0; i < TILE_DIM; i++) {
   sum += aTile[threadIdx.y][i] * b[i*N+col];
 }
 c[row*N+col] = sum;
}

In Using shared memory to improve the global memory load efficiency in matrix multiplication, each element in a tile of A is read from global memory only once, in a fully coalesced fashion (with no wasted bandwidth), to shared memory. Within each iteration of the for loop, a value in shared memory is broadcast to all threads in a half warp.

In Using shared memory to improve the global memory load efficiency in matrix multiplication, a __syncthreads() synchronization barrier call is not needed after reading the tile of A into shared memory because only threads within the half warp that write the data into shared memory read the data. This kernel has an effective bandwidth of 14.3 GBps on an NVIDIA GeForce GTX 280, and 8.2 GBps on an NVIDIA GeForce GTX 8800.

A further improvement can be made to how Using shared memory to improve the global memory load efficiency in matrix multiplication deals with matrix B. In calculating a tile's row of matrix C, the entire tile of B is read. The repeated reading of the B tile can be eliminated by reading it into shared memory once (Improvement by reading additional data into shared memory).

Improvement by reading additional data into shared memory

__global__ void sharedABMultiply(float *a, float *b, float *c,
                                 int N)
{
 __shared__ float aTile[TILE_DIM][TILE_DIM],
                  bTile[TILE_DIM][TILE_DIM];
 int row = blockIdx.y * blockDim.y + threadIdx.y;
 int col = blockIdx.x * blockDim.x + threadIdx.x;
 float sum = 0.0f;
 aTile[threadIdx.y][threadIdx.x] = a[row*TILE_DIM+threadIdx.x];
 bTile[threadIdx.y][threadIdx.x] = b[threadIdx.y*N+col];
 __syncthreads();
 for (int i = 0; i < TILE_DIM; i++) {
   sum += aTile[threadIdx.y][i] * bTile[i][threadIdx.x];
 }
 c[row*N+col] = sum;
}

Note that in Improvement by reading additional data into shared memory, a __syncthreads() call is required after reading the B tile because a warp reads data from shared memory that were written to shared memory by different warps. The effective bandwidth of this routine is 29.7 GBps on an NVIDIA GeForce GTX 280 and 15.7 GBps on an NVIDIA GeForce GTX 8800. Note that the performance improvement is not due to improved coalescing in either case, but to avoiding redundant transfers from global memory.

The results of the various optimizations are summarized in Table 1.

Table 1. Performance Improvements Optimizing C = AB Matrix Multiply
Optimization                                                      NVIDIA GeForce GTX 280   NVIDIA GeForce GTX 8800
No optimization                                                   8.7 GBps                 0.7 GBps
Coalesced using shared memory to store a tile of A                14.3 GBps                8.2 GBps
Using shared memory to eliminate redundant reads of a tile of B   29.7 GBps                15.7 GBps
Note: Medium Priority: Use shared memory to avoid redundant transfers from global memory.