Tuesday, June 22, 2010

SSE Implementation

This is the introduction from Toshiya, which explains how to implement the ray tracer with SSE.

My experiment:

Add two float[4] arrays together 10e8 times, and compare that with adding two __m128 values with _mm_add_ps().
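A minimal sketch of the two loops being compared could look like the following. This is not the original test code: the function names add_scalar and add_sse are made up for illustration, and it assumes the SSE side uses _mm_add_ps on values loaded with _mm_loadu_ps.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps, _mm_loadu_ps, _mm_storeu_ps */

/* Scalar baseline (hypothetical name): add b into acc one element
   at a time, repeated `iters` times. */
void add_scalar(float acc[4], const float b[4], long iters)
{
    for (long n = 0; n < iters; ++n)
        for (int i = 0; i < 4; ++i)
            acc[i] += b[i];
}

/* SSE version (hypothetical name): the same accumulation, but each
   _mm_add_ps adds all four lanes at once. */
void add_sse(float acc[4], const float b[4], long iters)
{
    __m128 va = _mm_loadu_ps(acc);  /* unaligned load, works for any float* */
    __m128 vb = _mm_loadu_ps(b);
    for (long n = 0; n < iters; ++n)
        va = _mm_add_ps(va, vb);
    _mm_storeu_ps(acc, va);         /* write the four lanes back */
}
```

To time them, one option is to call each function with a large iteration count and wrap the call in clock() from <time.h>. The numbers will depend heavily on the compiler and flags, since an optimizing compiler may vectorize or simplify the scalar loop on its own.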

Here is what I found through my experiment:
1. For the SSE implementation to really pay off, you have to compile with optimization (set an optimization level when you compile with gcc). With that, the SSE version can be at most nearly 4 times as fast as the implementation without SSE.

2. In my experiment, there was no big difference between using __m128 directly and using __m128* with memory allocated by _mm_malloc(sizeof(__m128),16).
(From Toshiya: there is no difference between those two. The only thing that matters is how you load the data (non-__m128 data). Using _mm_load + _mm_malloc will be compiled to faster instructions.)
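As a rough illustration of Toshiya's point about loading non-__m128 data, the sketch below copies plain floats into 16-byte-aligned buffers from _mm_malloc and reads them with the aligned load _mm_load_ps. I am assuming "_mm_load" refers to _mm_load_ps here; the function name add_from_aligned is made up for this example.

```c
#include <string.h>     /* memcpy */
#include <xmmintrin.h>  /* SSE intrinsics; with gcc this also declares _mm_malloc/_mm_free */

/* Hypothetical helper: computes out[i] = a[i] + b[i] for four floats
   by going through 16-byte-aligned buffers.
   Returns 0 on success, -1 if an allocation fails. */
int add_from_aligned(const float a[4], const float b[4], float out[4])
{
    float *pa = (float *)_mm_malloc(4 * sizeof(float), 16);
    float *pb = (float *)_mm_malloc(4 * sizeof(float), 16);
    if (pa == NULL || pb == NULL) {
        if (pa != NULL) _mm_free(pa);
        if (pb != NULL) _mm_free(pb);
        return -1;
    }

    memcpy(pa, a, 4 * sizeof(float));
    memcpy(pb, b, 4 * sizeof(float));

    /* _mm_load_ps requires a 16-byte-aligned address, which
       _mm_malloc(..., 16) guarantees. */
    __m128 va = _mm_load_ps(pa);
    __m128 vb = _mm_load_ps(pb);

    /* out comes from the caller and may not be aligned, so use the
       unaligned store here. */
    _mm_storeu_ps(out, _mm_add_ps(va, vb));

    _mm_free(pa);
    _mm_free(pb);
    return 0;
}
```

For data whose alignment is not known, _mm_loadu_ps is the safe choice, since _mm_load_ps on an unaligned address can fault.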

