I adopted the odd even sort for sorting on GPU. I use NVIDIA GeForece 9400M, which has the spec (only list related):
CUDA capability 1.1:
1. Max number of threads per block: 512
2. Max blocks in x direction of a grid: 512
3. Shared memory size: 16K
4. Global memory size: 256M
For my application:
Each item in the list: 16 bytes
Usually the size of the list is larger than 1024.
In my implementation, a thread is assigned with two elements in the list. A block containing maximum number of threads (512) can manage 1024 items. In my implementation, the whole list doesn't fit in one block, and inside a block, the elements don't fit in the shared memory. Therefore, I had to use multiple blocks (usually 30~60 blocks) to accomodate the list. Unfortunately, CUDA only provides with us thread synchronization within a block. For interblock synchronization, I have no choice to relaunch the kernel multiple times.
Result:
Sorting 438270 items
CPU version: 86s  (This version is not optimized, just want to verify the correctness of GPU version)
GPU version: 504ms
 
沒有留言:
張貼留言