How could I vectorize this loop?

I have this loop:

void f1(unsigned char *data, unsigned int size) {
    unsigned int A[256] = {0u};
    for (register unsigned int i = 0u; i < size; i++) {
        ++A[data[i]];
    }
    ...
}

Is there a way to vectorize it manually?

+3




2 answers


Since multiple entries in data[i] can contain the same value, I don't see how this could be vectorized simply: there would be race conditions on the counters (for example, two lanes that both see the byte value 3 would both need to increment A[3]). The point of vectorization is that each element is independent of the others and can therefore be computed in parallel, but your algorithm doesn't allow that. "Vectorizing" is not the same as "making it faster".

What you're building here is a histogram, and iOS has built-in, optimized support for this in the Accelerate framework. You can treat the buffer as a single-channel, one-row image and use vImageHistogramCalculation_Planar8 like this:

#include <Accelerate/Accelerate.h>

void f1(unsigned char *data, unsigned int size) {
    unsigned long A[256] = {0u};   /* vImagePixelCount is unsigned long */

    /* Treat the buffer as a one-row, single-channel (planar 8-bit) image. */
    vImage_Buffer src = { data, 1, size, size };
    vImage_Error err = vImageHistogramCalculation_Planar8(&src, A, kvImageDoNotTile);
    if (err != kvImageNoError) {
        /* handle the error */
    }
    ...
}

      



Be careful: this isn't always a win. It depends on the size of your data. The function call itself is expensive, so it can take several million bytes of data before it pays off. If you're computing this on smaller sets than that, a simple compiler-optimized loop is often the better approach. You need to measure on real devices to find out which is faster for your purposes.
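
One way to do that measurement (this timing harness is my own sketch, not part of the answer; f1_loop and f1_vimage are placeholder names for the plain loop from the question and the vImage version above) looks like this:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void f1_loop(unsigned char *data, unsigned int size);     /* plain loop from the question */
void f1_vimage(unsigned char *data, unsigned int size);   /* vImage version above */

int main(void) {
    const unsigned int size = 4u * 1024u * 1024u;          /* 4 MB of sample data */
    unsigned char *data = malloc(size);
    if (!data) return 1;
    for (unsigned int i = 0u; i < size; i++) data[i] = (unsigned char)rand();

    clock_t t0 = clock();
    f1_loop(data, size);
    clock_t t1 = clock();
    f1_vimage(data, size);
    clock_t t2 = clock();

    printf("loop:   %.3f ms\n", 1000.0 * (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("vImage: %.3f ms\n", 1000.0 * (double)(t2 - t1) / CLOCKS_PER_SEC);

    free(data);
    return 0;
}

Repeating each call several times and taking the median gives more stable numbers; a single run over a buffer this size can be dominated by cache and page-fault effects.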

Just make sure the compiler has all the vectorization optimizations available by compiling with -Ofast (Fastest, Aggressive Optimizations). It doesn't matter in this particular case, because your loop can't be simply vectorized, but in general -Ofast lets the compiler apply vectorization optimizations even when they slightly increase code size (which the default -Os does not allow). -Ofast also allows some sloppiness in how floating-point math is performed, so it should not be used where strict IEEE floating-point conformance is required (but that is almost never the case for iOS apps, so -Ofast is almost always the right setting).

+4




The optimization the compiler would have to perform here is to parallelize ++A[data[i]]. It cannot do this, because the contents of A depend on the previous iteration of the loop.

You can break this dependency by using one frequency array (one A) per lane of parallelism, and then summing them at the end. I'll assume two lanes of parallelism and that size is even.

void f1(const unsigned char * const data, unsigned int size) {
    unsigned int A0[256] = {0u};
    unsigned int A1[256] = {0u};

    /* Each lane counts into its own array, so the increments are independent. */
    for (unsigned int i = 0u; i < size / 2u; i++) {
        ++A0[data[2u * i]];
        ++A1[data[2u * i + 1u]];
    }

    /* Merge the per-lane counts (if size were odd, the last byte would
       need one extra increment here). */
    for (unsigned int i = 0u; i < 256u; ++i) {
        A0[i] = A0[i] + A1[i];
    }
}

      



Does that gain you much? There's only one way to find out: try it and measure the results. I suspect the Accelerate framework will do much better than this, even for relatively small values of size. It is also optimized at run time for the target architecture.

Compilers are pretty smart, but there are some things you can do in C or C++ to help the compiler:

  • Apply const where possible: it makes it obvious which data is invariant.

  • Declare pointers to non-overlapping memory regions with restrict (__restrict in C++), as in the sketch after this list. Without that information, the compiler has to assume that a write through one pointer might change data read through another. clang will actually generate runtime checks and separate code paths for the overlapping and non-overlapping cases, but there are limits to this, and you can probably reduce the code size by being explicit.
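
For example, here is a minimal sketch (my own, not from the answer) of the question's loop with const and restrict applied:

/* restrict promises the compiler that nothing else aliases the memory
   reached through data, so it does not have to guard against overlap
   with A or re-load data after each store. */
void f1(const unsigned char * restrict data, unsigned int size) {
    unsigned int A[256] = {0u};
    for (unsigned int i = 0u; i < size; i++) {
        ++A[data[i]];
    }
    /* ... use A ... */
}

Here there is only one pointer parameter, so the main benefit is documenting that data does not alias A; with multiple pointer parameters the effect on the generated code is usually larger.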

I doubt the register qualifier on i matters.

+3

