C versus vDSP versus NEON. How could NEON be as slow as C?

I am trying to write a quick histogram function that bins input values into ranges by assigning each one the threshold of the range it is closest to. This will be applied to images, so it has to be fast (a 640x480 image array is about 300,000 elements). The histogram bucket values are multiples of 25 (0, 25, 50, 75, 100). The inputs are floats and the final outputs will obviously be integers.

I tested the following versions in Xcode, starting from a new empty project (no app delegate) and using just the main.m file. I removed all linked libraries except Accelerate.

Here is the C implementation. An older version had a lot of if/thens, but this is the final optimized logic. It took 11 s 300 ms.

#import <Foundation/Foundation.h> // NSLog
#include <math.h>                 // roundf
#include <stdlib.h>               // malloc

int main(int argc, char *argv[])
{
  NSLog(@"starting");

  int sizeOfArray=300000;

  float* inputArray=(float*) malloc(sizeof(float)*sizeOfArray);
  int* outputArray=(int*) malloc(sizeof(int)*sizeOfArray);

  for (int i=0; i<sizeOfArray; ++i)
  {
    inputArray[i]=88.5;
  }

  //Assume range is [0,25,50,75,100]
  int lcd=25;

  for (int j=0; j<1000; ++j)// just to get some good time interval
  {
    for (int i=0; i<sizeOfArray; ++i)
    {
        //a 60.5 would give a 50. An 88.5 would give 100
        outputArray[i]=roundf(inputArray[i]/lcd)*lcd;
    }
  }
NSLog(@"done");
}

Here is the vDSP implementation. Even with some tedious float/integer conversions back and forth, it took only 6 seconds! An improvement of almost 50%!

//vDSP implementation
#import <Foundation/Foundation.h>
#include <Accelerate/Accelerate.h> // vDSP_*
#include <stdlib.h>

int main(int argc, char *argv[])
 {
   NSLog(@"starting");

   int sizeOfArray=300000;

   float* inputArray=(float*) malloc(sizeof(float)*sizeOfArray);
   float* outputArrayF=(float*) malloc(sizeof(float)*sizeOfArray);//vDSP requires input and output types to match
   int* outputArray=(int*) malloc(sizeof(int)*sizeOfArray); //value rounded to the nearest integer
   float* finalOutputArrayF=(float*) malloc(sizeof(float)*sizeOfArray);
   int* finalOutputArray=(int*) malloc(sizeof(int)*sizeOfArray); //integer output, to compare apples to apples


   for (int i=0; i<sizeOfArray; ++i)
   {
     inputArray[i]=37.0; //this will produce a final value of 25; 37.5, on the other hand, would produce 50.
   }


   for (int j=0; j<1000; ++j)// just to get some good time interval
   {
     //Assume range is [0,25,50,75,100]
     float lcd=25.0f;

     //divide by lcd
     vDSP_vsdiv(inputArray, 1, &lcd, outputArrayF, 1,sizeOfArray);

     //Round to nearest integer
     vDSP_vfixr32(outputArrayF, 1,outputArray, 1, sizeOfArray);

     // MUST convert int to float (cannot just cast), then multiply by the scalar - this step has the effect of rounding the number to the nearest lcd.
    vDSP_vflt32(outputArray, 1, outputArrayF, 1, sizeOfArray);
    vDSP_vsmul(outputArrayF, 1, &lcd, finalOutputArrayF, 1, sizeOfArray);
    vDSP_vfix32(finalOutputArrayF, 1, finalOutputArray, 1, sizeOfArray);
   }
  NSLog(@"done");
}

Here is the NEON implementation. This is my first one, so be gentle! It was slower than vDSP, taking 9 s 300 ms, which didn't make sense to me. Either vDSP is better optimized than NEON or I am doing something wrong.

//NEON implementation
#import <Foundation/Foundation.h>
#include <arm_neon.h> // NEON intrinsics
#include <stdlib.h>

int main(int argc, char *argv[])
{
NSLog(@"starting");

int sizeOfArray=300000;

float* inputArray=(float*) malloc(sizeof(float)*sizeOfArray);
float* finalOutputArrayF=(float*) malloc(sizeof(float)*sizeOfArray);

for (int i=0; i<sizeOfArray; ++i)
{
    inputArray[i]=37.0; //this will produce a final value of 25; 37.5, on the other hand, would produce 50.
}



for (int j=0; j<1000; ++j)// just to get some good time interval
{
    float32x4_t c0,c1,c2,c3;
    uint32x4_t e0,e1,e2,e3; // vcgtq_f32 returns comparison masks, so these must be uint32x4_t
    float32x4_t f0,f1,f2,f3;

    //ranges of histogram buckets
    float32x4_t buckets0=vdupq_n_f32(0);
    float32x4_t buckets1=vdupq_n_f32(25);
    float32x4_t buckets2=vdupq_n_f32(50);
    float32x4_t buckets3=vdupq_n_f32(75);
    float32x4_t buckets4=vdupq_n_f32(100);

    //midpoints of ranges
    float32x4_t thresholds1=vdupq_n_f32(12.5);
    float32x4_t thresholds2=vdupq_n_f32(37.5);
    float32x4_t thresholds3=vdupq_n_f32(62.5);
    float32x4_t thresholds4=vdupq_n_f32(87.5);


    for (int i=0; i<sizeOfArray;i+=16)
    {
        c0= vld1q_f32(&inputArray[i]);//load
        c1= vld1q_f32(&inputArray[i+4]);//load
        c2= vld1q_f32(&inputArray[i+8]);//load
        c3= vld1q_f32(&inputArray[i+12]);//load


        f0=buckets0;
        f1=buckets0;
        f2=buckets0;
        f3=buckets0;

        //register0
        e0=vcgtq_f32(c0,thresholds1);
        f0=vbslq_f32(e0, buckets1, f0);

        e0=vcgtq_f32(c0,thresholds2);
        f0=vbslq_f32(e0, buckets2, f0);

        e0=vcgtq_f32(c0,thresholds3);
        f0=vbslq_f32(e0, buckets3, f0);

        e0=vcgtq_f32(c0,thresholds4);
        f0=vbslq_f32(e0, buckets4, f0);



        //register1
        e1=vcgtq_f32(c1,thresholds1);
        f1=vbslq_f32(e1, buckets1, f1);

        e1=vcgtq_f32(c1,thresholds2);
        f1=vbslq_f32(e1, buckets2, f1);

        e1=vcgtq_f32(c1,thresholds3);
        f1=vbslq_f32(e1, buckets3, f1);

        e1=vcgtq_f32(c1,thresholds4);
        f1=vbslq_f32(e1, buckets4, f1);


        //register2
        e2=vcgtq_f32(c2,thresholds1);
        f2=vbslq_f32(e2, buckets1, f2);

        e2=vcgtq_f32(c2,thresholds2);
        f2=vbslq_f32(e2, buckets2, f2);

        e2=vcgtq_f32(c2,thresholds3);
        f2=vbslq_f32(e2, buckets3, f2);

        e2=vcgtq_f32(c2,thresholds4);
        f2=vbslq_f32(e2, buckets4, f2);


        //register3
        e3=vcgtq_f32(c3,thresholds1);
        f3=vbslq_f32(e3, buckets1, f3);

        e3=vcgtq_f32(c3,thresholds2);
        f3=vbslq_f32(e3, buckets2, f3);

        e3=vcgtq_f32(c3,thresholds3);
        f3=vbslq_f32(e3, buckets3, f3);

        e3=vcgtq_f32(c3,thresholds4);
        f3=vbslq_f32(e3, buckets4, f3);


        vst1q_f32(&finalOutputArrayF[i], f0);
        vst1q_f32(&finalOutputArrayF[i+4], f1);
        vst1q_f32(&finalOutputArrayF[i+8], f2);
        vst1q_f32(&finalOutputArrayF[i+12], f3);
    }
}
NSLog(@"done");
}

PS: this is my first time benchmarking at this scale, so I tried to keep it simple (large loops, setup code held constant, using NSLog to print start and end times, timing only the bucketing work). If any of these assumptions significantly affects the result, please criticize.

Thanks!



3 answers


First, this isn't really "NEON" per se. These are intrinsics. It is almost impossible to get good NEON performance using intrinsics under clang or gcc. If you think you need intrinsics, you should write the assembler by hand.

vDSP is not better optimized than NEON. vDSP on iOS uses the NEON processor. vDSP's use of NEON is much better optimized than your use of NEON.

I haven't dug through your intrinsics code yet, but the most likely (in fact, almost certain) cause of the problem is that you are creating wait states. Writing in assembler (and intrinsics are just assembler written with welding gloves on) is nothing like writing in C. You don't loop the same way. You don't compare the same way. You need a new way of thinking. In assembly you can do more than one thing at a time (because you have different logic units), but you absolutely must schedule things so that all of them can actually run in parallel. Good assembly keeps all of those pipelines full. If you can read your code and it makes obvious sense, it's probably lousy assembly code. If you never repeat yourself, it's probably lousy assembly code. You need to carefully consider what is in which register, and how many cycles remain until you are allowed to read it.

If it were as easy as transliterating the C, then the compiler would just do it for you. The moment you say "I'm going to write this in NEON," you are saying "I can write better NEON than the compiler," because the compiler uses NEON too. That said, it often is possible to write better NEON than the compiler (especially gcc and clang).



If you're ready to dive into this world (and it's a pretty cool world), you have some reading ahead of you.

ALL OF THAT SAID... always, always start by rethinking your algorithm. Often the answer is not how to compute the loop faster, but how to avoid calling the loop so often.



ARM NEON has 32 registers, each 64 bits wide (alternatively viewed as 16 registers, each 128 bits wide). Your NEON implementation already keeps at least 18 128-bit values alive at once, so the compiler will generate code to move them back and forth to the stack, and that is not good: too much extra memory access.

If you plan on playing with assembly, I've found it helpful to use a tool that dumps the instructions in object files. On Linux one such tool is objdump; I believe the Apple-world equivalent is otool. This way you can see what the resulting machine code looks like and what the compiler did with your functions.

Below is a portion of the disassembly of your NEON implementation, from gcc 4.7.1 (-O3). Notice a quad register being reloaded from the stack via vldmia sp, {d8-d9}.



1a6:    ff24 cee8   vcgt.f32    q6, q10, q12
1aa:    ff64 4ec8   vcgt.f32    q10, q10, q4
1ae:    ff2e a1dc   vbit    q5, q15, q6
1b2:    ff22 ceea   vcgt.f32    q6, q9, q13
1b6:    ff5c 41da   vbsl    q10, q14, q5
1ba:    ff20 aeea   vcgt.f32    q5, q8, q13
1be:    f942 4a8d   vst1.32 {d20-d21}, [r2]!
1c2:    ec9d 8b04   vldmia  sp, {d8-d9}
1c6:    ff62 4ee8   vcgt.f32    q10, q9, q12
1ca:    f942 6a8f   vst1.32 {d22-d23}, [r2]

It all depends on the compiler, of course; a better compiler could avoid this situation by using the available registers more cleverly.

So in the end you are at the mercy of the compiler, unless you write assembly (inline or standalone), or keep checking the compiler's output until you get what you want from it.



As a follow-up to Rob's answer (that writing NEON is a discipline of its own; thanks for plugging my Wandering Coder posts, by the way) and auselen's answer (that you really have too many registers live at any given time, resulting in spills), I must add that your intrinsics algorithm is more general than the other two: it allows arbitrary thresholds, not just multiples of a common value, so you are trying to compare things that aren't comparable. Always compare oranges to oranges; the one fair exception is comparing a custom, more specific algorithm against an off-the-shelf generic one, when all you need are the specific features of the custom one. So this is just another way NEON code can end up as slow as C: when they are not the same algorithm.

Regarding your histogram needs: use what you created with vDSP, and only if the performance does not suit your application should you investigate optimizing further. Doing so, besides using NEON instructions, would involve avoiding so much memory traffic (the likely bottleneck in the vDSP implementation), for instance by incrementing a counter for each bucket as you go over the pixels, instead of producing this intermediate output of snapped values. Efficient DSP code is not only about the computation itself, but also about the most efficient use of memory bandwidth, etc. Even more so on mobile devices: memory I/O, except in caches, is more power-hungry than CPU operations, so memory buses tend to be run at a fraction of the CPU clock. You don't have much memory bandwidth, and you should use what you have wisely, since any use of it costs power.

