How to optimize two for-loops in cuda
Problem Description
Sometimes we need to calculate some pairs values. In our naive way, we maybe write these code:
for(int i = 0; i < N; i++) {
for (int j = 0; j < N; j++) {
sum += cal(a[i], b[j]);
}
}
It's very simple and also there are many methods using cuda to achieve this. Firstly, I will introduce one effective method to achieve this and then I will use some tricks to optimize this method.
a
in continuous memory
First Version - access I want to make threads do much calcuation not one, So in one thread it will calculate N * 1 elements. Let's see the code.
__global__ void work(size_t N_ITERATIONS, size_t n, const float *a,
const float *b, float *ans) {
size_t x = blockDim.x * blockIdx.x + threadIdx.x;
size_t y = N_ITERATIONS * blockIdx.y;
float sum = 0;
for (int i = 0; i < N_ITERATIONS; i++) {
float val = (x < n && y + i < n) ? cal(a[x], b[y + i]) : 0;
sum += val;
}
atomicAdd(ans, sum);
}
In this code, every thread gets one element in a
and N_ITERATIONS
elements in b
, it is continuous when getting element in a
. That is an advantage of this method. So let's focus on b
. There is one problem in here. Every thread will access the same elements in b
so many times. To optimize this, we can use shared memory
to optimize. This technology is called Tiling
.
Tiling
to reduce global memory accesses.
Second Version - Use We can load b
elements in shared memory in the begining of the program. We can make the block size equals to N_ITERATIONS
so that in every threads, it loads one element in b
. Let's see the code.
__global__ void work(size_t N_ITERATIONS, size_t n, const float *a,
const float *b, float *ans) {
size_t x = blockDim.x * blockIdx.x + threadIdx.x;
size_t y = N_ITERATIONS * blockIdx.y;
__shared__ float _b[32];
if (y + x % 32 < n) {
_b[x % 32] = b[y + x % 32];
}
float sum = 0;
for (int i = 0; i < N_ITERATIONS; i++) {
float val = (x < n && y + i < n) ? cal(a[x], _b[i]) : 0;
sum += val;
}
atomicAdd(ans, sum);
}
You will notice that the size of _b
is 32. But in your code, you need to make this size equals to N_ITERATIONS
.
Every threads load one element in b
, and it's continuous too. Perfect! But now there is another problem: Bank conflict
. I am not familiar about Bank
. In some other blogs, It says when some threads in the same warp
access the same address in the shared memory, it will occur Bank conflict
. So how to avoid this? we need to make sure that in every iteration, every threads need to access different val in _b
. We can interate it diagonally.

There is an example, a~h
are threads. i~p
are the elements saved in _b
. The number in the matrix is the order for the thread access the elements. For example, the thread a
will access i, p, o, n...j
one by one, and the thread b
will access j, i, p, o, n, m...k
one by one. We can notice that, there are not elements that accessed by more than one thread in the same time! What's more, to make all threads in the same warp. We need to make the block size equals to 32 too. Let's see the code:
__global__ void work(size_t N_ITERATIONS, size_t n, const float *a,
const float *b, float *ans) {
size_t x = blockDim.x * blockIdx.x + threadIdx.x;
size_t y = N_ITERATIONS * blockIdx.y;
__shared__ float _b[32];
if (y + x % 32 < n) {
_b[x % 32] = b[y + x % 32];
}
float sum = 0;
for (int i = 0; i < N_ITERATIONS; i++) {
int idx = (32 - i + threadIdx.x) % 32;
float val = (x < n && y + idx < n) ? cal(a[x], _b[idx]) : 0;
sum += val;
}
atomicAdd(ans, sum);
}
Warp shuffle
to decrease the times of atomicAdd
Third Version - Use Because our block size is 32, we can use warp shuffle
to add all sum
into one element to decrease the times of atomicAdd
. Warp shuffles
are a faster mechanism for moving data between threads in the same warp. Let's see the code:
__global__ void work(size_t N_ITERATIONS, size_t n, const float *a,
const float *b, float *ans) {
size_t x = blockDim.x * blockIdx.x + threadIdx.x;
size_t y = N_ITERATIONS * blockIdx.y;
__shared__ float _b[32];
if (y + x % 32 < n) {
_b[x % 32] = b[y + x % 32];
}
float sum = 0;
for (int i = 0; i < N_ITERATIONS; i++) {
int idx = (32 - i + threadIdx.x) % 32;
float val = (x < n && y + idx < n) ? cal(a[x], _b[idx]) : 0;
sum += val;
}
for (unsigned w = 16; w >= 1; w /= 2) {
sum += __shfl_down_sync(0xffffffff, sum, w);
}
if (threadIdx.x == 0) {
atomicAdd(ans, sum);
}
}
That is all. we use shared memory
, tiling
, warp shuffle
to optimize this program and we also focus on the order to access data in memory. Maybe there are more tricks to continue to optimize this program. Let's see in the next time:)