I am trying to optimize this c code using 4-way reversal

What I'm trying to do is take this C code and optimize it using a technique called loop unrolling, but in this case, I want to use a 4-way reversal. Now I understand the technique and I understand a concept that I just don't know how to apply it to this code. Do I need to add additional variables? Should I have code after every loop or only at the end of all loops? This code is an 8x8 block code that processes the pixels and rotates it 90 degrees counterclockwise. Any help would be much appreciated. Thank.

/* 
 * rotate8 - rotate with 8x8 blocking
 */

char rotate8_descr[] = "rotate8: rotate with 8x8 blocking";

void rotate8(int dim, pixel *src, pixel *dst) 
{

int i, j, ii, jj;

for(ii = 0; ii < dim; ii += 8)
       for(jj = 0; jj < dim; jj += 8)
              for (i = ii; i < ii + 8; i++)   
                  for (j = jj; j < jj + 8; j++)
                      dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}

      

+2


source to share


4 answers


You can replace the inner loop with 8 explicit lines of code

          dst[RIDX(dim-1-jj, i, dim)] = src[RIDX(i, jj, dim)];
          dst[RIDX(dim-1-(jj+1), i, dim)] = src[RIDX(i, (jj+1), dim)];
          ...
          dst[RIDX(dim-1-(jj+7), i, dim)] = src[RIDX(i, (jj+7), dim)];

      

so that you replace the loop variable by explicitly specifying a string for each value you require.



Now you can repeat that for 8 values ​​of the next loop, you will have 8 x 8 lines of code, etc.

As nothing more than an exercise in understanding, this seems completely pointless to me, compilers do this stuff really efficiently, they optimize where it makes sense. Hand rolling rarely creates an optimal code.

+5


source


I wanted to say a profile, but then I did it myself. The amazing part is that the inner loop is fastest with layout - manual unwrapping is slower.

However - the real catch is the RIDX macro. Switching memory layout and switching external loops have a significant impact.

Here's my fastest version indented to show where it differs from your version. The RIDX macro is considered as defined.



#define RIDX(x,y,d) (x+(y)*(d))
typedef unsigned char pixel;
void rotate8(int dim, pixel *src, pixel *dst)
{
    int i, j, ii, jj;
        for(jj = 0; jj < dim; jj += 8)
    for(ii = 0; ii < dim; ii += 8)
              for (i = ii; i < ii + 8; i++)
                  for (j = jj; j < jj + 8; j++)
                      dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}

      

... lesson learned: always profile :-)

+4


source


gcc -funrull-loops

You shouldn't unwrap the loops yourself unless GCC can do it (look at the assembly) and you have proven using a profiler that you need to speed up that part of the code.

In this code example, you look like an ideal candidate for automatic loop reversal.

Some other useful flags:

-O3 // turns on a lot of optimizations (almost all)
-ftree-vectorize -msse2 // vectorizes automatically some loops
+3


source


http://www.relisoft.com/book/lang/pointer/2ptrarr.html

If your compiler can't optimize the user-friendly, maintainable version of the algorithm and you need to double up as a human compiler - buy a new compiler! Nobody else can afford human compilers. So, have mercy on yourself and your fellow programmers who will have to look at their code.

0


source







All Articles