Shared Memory in Matrix Multiplication (C = AAᵀ)

A variant of the previous matrix multiplication can be used to illustrate how strided accesses to global memory, as well as shared memory bank conflicts, are handled. This variant simply uses the transpose of A in place of B, or C = AAᵀ.

As in the previous section, this example assumes compute capability 1.x; on devices of compute capability 2.x the tile width would be 32 rather than 16.

A simple implementation for C = AAᵀ is shown in Unoptimized handling of strided accesses to global memory.

Unoptimized handling of strided accesses to global memory

__global__ void simpleMultiply(float *a, float *c, int M)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    // Dot product of the row-th and col-th rows of A.
    for (int i = 0; i < TILE_DIM; i++) {
        sum += a[row*TILE_DIM+i] * a[col*TILE_DIM+i];
    }
    c[row*M+col] = sum;
}

In Unoptimized handling of strided accesses to global memory, the row-th, col-th element of C is obtained by taking the dot product of the row-th and col-th rows of A. The effective bandwidth for this kernel is 1.1 GBps on an NVIDIA GeForce GTX 280 and 0.5 GBps on an NVIDIA GeForce GTX 8800. These results are substantially lower than the corresponding measurements for the C = AB kernel.

The difference lies in how threads in a half warp access elements of A in the second term, a[col*TILE_DIM+i], in each iteration i. For a half warp of threads, col represents sequential columns of the transpose of A, so col*TILE_DIM represents a strided access of global memory with a stride of 16 elements. This results in uncoalesced memory accesses on devices of compute capability 1.1 or lower and plenty of wasted bandwidth on devices of compute capability 1.2 or higher. The way to avoid strided access is to use shared memory as before, except that in this case a half warp reads a row of A into a column of a shared memory tile, as shown in Optimized handling of strided accesses using coalesced reads from global memory.

Optimized handling of strided accesses using coalesced reads from global memory

__global__ void coalescedMultiply(float *a, float *c, int M)
{
    __shared__ float aTile[TILE_DIM][TILE_DIM],
                     transposedTile[TILE_DIM][TILE_DIM];
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    // Coalesced read of a row of A into a row of aTile.
    aTile[threadIdx.y][threadIdx.x] = a[row*TILE_DIM+threadIdx.x];
    // Coalesced read of a row of A into a column of transposedTile.
    transposedTile[threadIdx.x][threadIdx.y] =
        a[(blockIdx.x*blockDim.x + threadIdx.y)*TILE_DIM + threadIdx.x];
    __syncthreads();
    for (int i = 0; i < TILE_DIM; i++) {
        sum += aTile[threadIdx.y][i] * transposedTile[i][threadIdx.x];
    }
    c[row*M+col] = sum;
}

Optimized handling of strided accesses using coalesced reads from global memory uses the shared transposedTile to avoid uncoalesced accesses in the second term in the dot product, and the shared aTile technique from the previous example to avoid uncoalesced accesses in the first term. The effective bandwidth of this kernel is 24.9 GBps on an NVIDIA GeForce GTX 280 and 13.2 GBps on an NVIDIA GeForce GTX 8800. These results are slightly lower than those obtained by the final kernel for C = AB. The cause of the difference is shared memory bank conflicts.

The reads of elements in transposedTile within the for loop are free of conflicts, because threads of each half warp read across rows of the tile, resulting in unit stride across the banks. However, bank conflicts occur when copying the tile from global memory into shared memory. To enable the loads from global memory to be coalesced, data are read from global memory sequentially. However, this requires writing to shared memory in columns, and because of the use of 16×16 tiles in shared memory, this results in a stride between threads of 16 banks. These 16-way bank conflicts are very expensive. The simple remedy is to pad the shared memory array so that it has an extra column, as in the following line of code.
 __shared__ float transposedTile[TILE_DIM][TILE_DIM+1];

This padding eliminates the conflicts entirely, because now the stride between threads is 17 banks (33 banks for compute capability 2.x), which, due to modular arithmetic used to compute bank indices, is equivalent to a unit stride. After this change, the effective bandwidth is 30.4 GBps on an NVIDIA GeForce GTX 280 and 15.6 GBps on an NVIDIA GeForce GTX 8800, which is comparable to the results from the last C = AB kernel.

The results of these optimizations are summarized in Table 1.

Table 1. Performance Improvements Optimizing C = AAᵀ Matrix Multiplication

Optimization                                    NVIDIA GeForce GTX 280    NVIDIA GeForce GTX 8800
No optimization                                 1.1 GBps                  0.5 GBps
Using shared memory to coalesce global reads    24.9 GBps                 13.2 GBps
Removing bank conflicts                         30.4 GBps                 15.6 GBps

These results should be compared with those in the corresponding table for the C = AB kernels in the previous section. As can be seen from these tables, judicious use of shared memory can dramatically improve performance.

The examples in this section have illustrated three reasons to use shared memory: