2011年6月26日 星期日

[Parallel Computing] Sorting

I adopted the odd even sort for sorting on GPU. I use NVIDIA GeForece 9400M, which has the spec (only list related):
CUDA capability 1.1:
1. Max number of threads per block: 512
2. Max blocks in x direction of a grid: 512
3. Shared memory size: 16K
4. Global memory size: 256M

For my application:
Each item in the list: 16 bytes
Usually the size of the list is larger than 1024.

In my implementation, a thread is assigned with two elements in the list. A block containing maximum number of threads (512) can manage 1024 items. In my implementation, the whole list doesn't fit in one block, and inside a block, the elements don't fit in the shared memory. Therefore, I had to use multiple blocks (usually 30~60 blocks) to accomodate the list. Unfortunately, CUDA only provides with us thread synchronization within a block. For interblock synchronization, I have no choice to relaunch the kernel multiple times.


Result:
Sorting 438270 items
CPU version: 86s (This version is not optimized, just want to verify the correctness of GPU version)
GPU version: 504ms

2011年6月22日 星期三

[Parallel Computing] Indexing tricks

When making device kernels (shaders,) how to assign threads to some tasks is very important.
Since the GPU has really little support of recursion, how to turn your recursive program into a non-recursive one becomes very important as well.

I saw this approach when I was working on GPU sorting.
The original Odd Even Merge Sort is described here

I'm not gonna describe the iterative version in detail but mention some tricks to accustom your code to the GPU kernel.

1. When your code has the thread assignment with a circular pattern:
For example:
8 Threads:
Thread Id 0 1 2 3 4 5 6 7
Activate/no n n y y n n y y

Solution:
Here we have a circular pattern with a period of 4.
Of course we can use something like if(Idx.x==xxx), but if the case is making an iterative execution, there will be too many cases to be considered.
A better way is to use (Idx & (period-1)) and set a threshold.
In our example, we can use (Idx&3) and set a threshold equal to 2.
So the values become:

Thread Id 0 1 2 3 4 5 6 7
Activate/no n n y y n n y y
Value 0 1 2 3 0 1 2 3
Thresholded n n y y n n y y

So it's exactly what we want for our program.