My experiment:
Add two float[4] arrays together 10e8 times and compare that with adding two __m128 values with _mm_add_ps().
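A minimal sketch of the two variants being compared, using `_mm_add_ps` for the packed float add. The function names (`add4_scalar`, `add4_sse`, `run_scalar`, `run_sse`) and the step values are hypothetical and not from the original test code. Timing is left out, and a real benchmark would also need to make sure the compiler does not optimize the loops away.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_add_ps, _mm_set_ps, _mm_storeu_ps */

/* Scalar version: add b into a, one float at a time. */
void add4_scalar(float a[4], const float b[4])
{
    for (int i = 0; i < 4; ++i)
        a[i] += b[i];
}

/* SSE version: a single _mm_add_ps adds all four lanes at once. */
__m128 add4_sse(__m128 a, __m128 b)
{
    return _mm_add_ps(a, b);
}

/* Hypothetical driver: repeat the scalar add and return the sum of the lanes. */
float run_scalar(long iterations)
{
    assert(iterations >= 0);
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    const float step[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    for (long n = 0; n < iterations; ++n)
        add4_scalar(acc, step);
    return acc[0] + acc[1] + acc[2] + acc[3];
}

/* Hypothetical driver: the same work done with __m128 values. */
float run_sse(long iterations)
{
    assert(iterations >= 0);
    __m128 acc = _mm_setzero_ps();
    /* _mm_set_ps takes the highest lane first, so lane 0 holds 1.0f. */
    const __m128 step = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    for (long n = 0; n < iterations; ++n)
        acc = add4_sse(acc, step);
    float out[4];
    _mm_storeu_ps(out, acc);  /* unaligned store, safe for any float array */
    return out[0] + out[1] + out[2] + out[3];
}
```

Calling `run_scalar(n)` and `run_sse(n)` with the same `n` and timing each call would reproduce the shape of the comparison.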
Here is what I found:
1. For the SSE implementation to really pay off, you have to enable compiler optimization (set an optimization level, for example -O2, when you compile with gcc). With optimization on, the SSE version can be at most nearly 4 times as fast as the implementation without SSE.
2. In my experiment, there was no big difference between using __m128 directly and using a __m128* whose memory is allocated with _mm_malloc(sizeof(__m128),16).
(From Toshiya: there is no difference between those two. The only thing that matters is how the non-__m128 data is loaded. Using _mm_load together with _mm_malloc compiles to faster instructions.)
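To illustrate Toshiya's point about loading plain float data, here is a rough sketch that contrasts an aligned load from `_mm_malloc` memory with an unaligned load from an ordinary array. It assumes that "_mm_load" refers to `_mm_load_ps`. The helper names (`add4_aligned`, `add4_unaligned`, `loads_agree`) are made up for this example. It only checks that both paths give the same result and does not measure speed.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics; with gcc this also declares _mm_malloc/_mm_free */

/* Aligned path: all three pointers must be 16-byte aligned,
   otherwise _mm_load_ps/_mm_store_ps may fault. */
void add4_aligned(float *dst, const float *a, const float *b)
{
    _mm_store_ps(dst, _mm_add_ps(_mm_load_ps(a), _mm_load_ps(b)));
}

/* Unaligned path: works for any float pointer. */
void add4_unaligned(float *dst, const float *a, const float *b)
{
    _mm_storeu_ps(dst, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* Hypothetical check: returns 1 if both paths produce identical results. */
int loads_agree(void)
{
    /* 16-byte aligned buffers, as required by the aligned path. */
    float *a = (float *)_mm_malloc(4 * sizeof(float), 16);
    float *b = (float *)_mm_malloc(4 * sizeof(float), 16);
    float *r = (float *)_mm_malloc(4 * sizeof(float), 16);
    assert(a && b && r);

    /* Plain stack arrays with no alignment guarantee beyond that of float. */
    float plain_a[4], plain_b[4], plain_r[4];
    for (int i = 0; i < 4; ++i) {
        a[i] = plain_a[i] = (float)i;
        b[i] = plain_b[i] = 10.0f * (float)i;
    }

    add4_aligned(r, a, b);
    add4_unaligned(plain_r, plain_a, plain_b);

    int same = 1;
    for (int i = 0; i < 4; ++i)
        if (r[i] != plain_r[i])
            same = 0;

    _mm_free(a);
    _mm_free(b);
    _mm_free(r);
    return same;
}
```

Memory obtained from `_mm_malloc` has to be released with `_mm_free`, not `free`.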