I am trying to optimize this c code using 4-way reversal
What I'm trying to do is take this C code and optimize it using a technique called loop unrolling, but in this case, I want to use a 4-way reversal. Now I understand the technique and I understand a concept that I just don't know how to apply it to this code. Do I need to add additional variables? Should I have code after every loop or only at the end of all loops? This code is an 8x8 block code that processes the pixels and rotates it 90 degrees counterclockwise. Any help would be much appreciated. Thank.
/*
* rotate8 - rotate with 8x8 blocking
*/
char rotate8_descr[] = "rotate8: rotate with 8x8 blocking";
void rotate8(int dim, pixel *src, pixel *dst)
{
int i, j, ii, jj;
for(ii = 0; ii < dim; ii += 8)
for(jj = 0; jj < dim; jj += 8)
for (i = ii; i < ii + 8; i++)
for (j = jj; j < jj + 8; j++)
dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}
source to share
You can replace the inner loop with 8 explicit lines of code
dst[RIDX(dim-1-jj, i, dim)] = src[RIDX(i, jj, dim)]; dst[RIDX(dim-1-(jj+1), i, dim)] = src[RIDX(i, (jj+1), dim)]; ... dst[RIDX(dim-1-(jj+7), i, dim)] = src[RIDX(i, (jj+7), dim)];
so that you replace the loop variable by explicitly specifying a string for each value you require.
Now you can repeat that for 8 values of the next loop, you will have 8 x 8 lines of code, etc.
As nothing more than an exercise in understanding, this seems completely pointless to me, compilers do this stuff really efficiently, they optimize where it makes sense. Hand rolling rarely creates an optimal code.
source to share
I wanted to say a profile, but then I did it myself. The amazing part is that the inner loop is fastest with layout - manual unwrapping is slower.
However - the real catch is the RIDX macro. Switching memory layout and switching external loops have a significant impact.
Here's my fastest version indented to show where it differs from your version. The RIDX macro is considered as defined.
#define RIDX(x,y,d) (x+(y)*(d))
typedef unsigned char pixel;
void rotate8(int dim, pixel *src, pixel *dst)
{
int i, j, ii, jj;
for(jj = 0; jj < dim; jj += 8)
for(ii = 0; ii < dim; ii += 8)
for (i = ii; i < ii + 8; i++)
for (j = jj; j < jj + 8; j++)
dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}
... lesson learned: always profile :-)
source to share
gcc -funrull-loops
You shouldn't unwrap the loops yourself unless GCC can do it (look at the assembly) and you have proven using a profiler that you need to speed up that part of the code.
In this code example, you look like an ideal candidate for automatic loop reversal.
Some other useful flags:
-O3 // turns on a lot of optimizations (almost all) -ftree-vectorize -msse2 // vectorizes automatically some loops
source to share
http://www.relisoft.com/book/lang/pointer/2ptrarr.html
If your compiler can't optimize the user-friendly, maintainable version of the algorithm and you need to double up as a human compiler - buy a new compiler! Nobody else can afford human compilers. So, have mercy on yourself and your fellow programmers who will have to look at their code.
source to share