Enabling SSE2 makes the program slower

In Visual Studio 2010, when I enable an enhanced instruction set (/arch:SSE2) for the following code, the execution time actually increases.

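#include <cstdlib>  // std::rand, system
#include <ctime>    // clock, clock_t, CLOCKS_PER_SEC

// Element-wise product of input1 and input2 (despite the name).
void add(float * input1, float * input2, float * output, int size)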
{
    for(int iter = 0; iter < size; iter++)
    {
        output[iter] = input1[iter] * input2[iter];
    }
}

int main()
{

    const int SIZE = 10000000;
    float *in1 = new float[SIZE];
    float *in2 = new float[SIZE];
    float *out = new float[SIZE];
    for(int iter = 0; iter < SIZE; iter++)
    {
        in1[iter] = std::rand();
        in2[iter] = std::rand();
        out[iter] = std::rand();
    }
    clock_t start = clock();
    for(int iter = 0; iter < 100; iter++)
    {
        add(in1, in2, out, SIZE);
    }
    clock_t end = clock();
    // clock() returns clock_t ticks; difftime is meant for time_t values.
    double time = (double)(end - start) / CLOCKS_PER_SEC;

    system("PAUSE");
    return 0;
}

      

I keep getting 2.0 seconds for the variable time with SSE2 enabled, but about 1.7 seconds when it is "Not Set". I am on Windows 7 64-bit, VS 2010 Professional, Release configuration, optimized for speed.

Are there any explanations why enabling SSE results in a longer runtime?

+3




2 answers


SSE code has overhead from moving values into and out of SSE registers, and that overhead can outweigh the benefit of SSE when you do as little computation per element as in your example.



Also note that this overhead gets significantly larger if your data is not 16-byte aligned.
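To make the alignment point concrete, here is a minimal sketch of allocating float arrays on a 16-byte boundary and checking a pointer's alignment. The helper names (is_aligned16, alloc_floats16, free_floats16) are hypothetical and not from the question. It assumes _mm_malloc and _mm_free are reachable through <xmmintrin.h>, which is the case on the x86 compilers I am aware of; plain new float[] gives no such guarantee on a 32-bit build.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // _mm_malloc, _mm_free (x86 only)

// Returns true if p sits on a 16-byte boundary, which aligned
// SSE loads and stores (_mm_load_ps, _mm_store_ps) require.
inline bool is_aligned16(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Hypothetical helper: allocate `count` floats on a 16-byte boundary.
// Returns a null pointer on failure. Release with free_floats16, not delete[].
inline float* alloc_floats16(std::size_t count)
{
    float* p = static_cast<float*>(_mm_malloc(count * sizeof(float), 16));
    assert(p == 0 || is_aligned16(p));
    return p;
}

// Hypothetical helper: release memory obtained from alloc_floats16.
inline void free_floats16(float* p)
{
    _mm_free(p);
}
```

In the question's main, the three new float[SIZE] allocations could be swapped for alloc_floats16(SIZE), with a matching free_floats16 call at the end.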

+2




IMO, it is often not a good idea to rely on the compiler to perform these optimizations. Your code should run faster with SSE (unless the compiler already vectorizes it for you, which doesn't seem to be the case). I suggest that you:

1. make sure your arrays are 16-byte aligned

2. use SSE intrinsics in your add function, and make it inline:



#include <xmmintrin.h>
inline void add(const float * input1, const float * input2, float * output, int size)
{
   // assuming here that 
   // - all 3 arrays are 16-byte aligned
   // - size is a multiple of 4
   for(int iter = 0; iter < size; iter += 4)
     _mm_store_ps( output+iter, _mm_mul_ps( _mm_load_ps(input1+iter),
                                            _mm_load_ps(input2+iter) ) );
}
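If the two assumptions in the comments above cannot be guaranteed, a variant along the following lines may help. This is only a sketch: mul_unaligned is a hypothetical name, and it swaps in the unaligned intrinsics _mm_loadu_ps and _mm_storeu_ps, which accept any address but can be slower than the aligned forms on older CPUs. A scalar loop handles the leftover elements when size is not a multiple of 4.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics (x86 only)

// Hypothetical variant: element-wise product with no alignment
// requirement and no multiple-of-4 requirement on size.
inline void mul_unaligned(const float* input1, const float* input2,
                          float* output, int size)
{
    assert(size >= 0);
    int iter = 0;

    // Vector part: four floats per step, using unaligned loads and stores.
    for (; iter + 4 <= size; iter += 4)
    {
        _mm_storeu_ps(output + iter,
                      _mm_mul_ps(_mm_loadu_ps(input1 + iter),
                                 _mm_loadu_ps(input2 + iter)));
    }

    // Scalar tail: the 0 to 3 elements left over.
    for (; iter < size; ++iter)
    {
        output[iter] = input1[iter] * input2[iter];
    }
}
```

The aligned version above remains the better fit when the arrays are known to be 16-byte aligned; this one trades that requirement for generality.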

      

If that doesn't produce faster code, then the loads and stores create too much overhead for a single multiplication per element.
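One way to check which case applies is to time a scalar loop and an intrinsic loop over the same data. The harness below is a rough sketch, not a rigorous benchmark: MulFn, mul_scalar, mul_sse and time_calls are hypothetical names, mul_sse assumes size is a multiple of 4, and clock() has coarse resolution, so the repeat count needs to be large enough to give measurable times.

```cpp
#include <cassert>
#include <ctime>
#include <xmmintrin.h>  // SSE intrinsics (x86 only)

// Common signature so both versions can be passed to the timer.
typedef void (*MulFn)(const float*, const float*, float*, int);

// Plain scalar element-wise product.
void mul_scalar(const float* a, const float* b, float* out, int size)
{
    for (int i = 0; i < size; ++i)
        out[i] = a[i] * b[i];
}

// SSE element-wise product; assumes size is a multiple of 4.
// Unaligned loads/stores are used so any pointer is acceptable.
void mul_sse(const float* a, const float* b, float* out, int size)
{
    assert(size % 4 == 0);
    for (int i = 0; i < size; i += 4)
        _mm_storeu_ps(out + i,
                      _mm_mul_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
}

// Calls fn `repeats` times and returns the elapsed CPU time in seconds.
double time_calls(MulFn fn, const float* a, const float* b, float* out,
                  int size, int repeats)
{
    std::clock_t start = std::clock();
    for (int r = 0; r < repeats; ++r)
        fn(a, b, out, size);
    std::clock_t end = std::clock();
    return static_cast<double>(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_calls(mul_scalar, in1, in2, out, SIZE, 100) and time_calls(mul_sse, in1, in2, out, SIZE, 100) with the question's arrays, and printing an element of out afterwards so the optimizer cannot discard the work, gives two numbers to compare.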

+2

