Loss of performance if the function is not defined in the class declaration
I have some high performance code that includes a class that is at the bottom of this post. My problem is that once I do not define the advecu function in the class declaration, but instead separate the declaration and implementation (as I prefer), I lose significant performance in both Intel C ++ compilers and Clang. However, I don't understand why. When I remove templates, the performance is the same for both ways for both ways.
template<bool dim3>
struct Advec4Kernel
{
static void advecu(double * restrict ut, double * restrict u, double * restrict v, double * restrict w, double * restrict dzi4, const Grid &grid)
{
int ijk,kstart,kend;
int ii1,ii2,ii3,jj1,jj2,jj3,kk1,kk2,kk3;
double dxi,dyi;
ii1 = 1;
ii2 = 2;
ii3 = 3;
jj1 = 1*grid.icells;
jj2 = 2*grid.icells;
jj3 = 3*grid.icells;
kk1 = 1*grid.ijcells;
kk2 = 2*grid.ijcells;
kk3 = 3*grid.ijcells;
kstart = grid.kstart;
kend = grid.kend;
dxi = 1./grid.dx;
dyi = 1./grid.dy;
for(int k=grid.kstart; k<grid.kend; k++)
for(int j=grid.jstart; j<grid.jend; j++)
for(int i=grid.istart; i<grid.iend; i++)
{
ijk = i + j*jj1 + k*kk1;
ut[ijk] -= ( cg0*((ci0*u[ijk-ii3] + ci1*u[ijk-ii2] + ci2*u[ijk-ii1] + ci3*u[ijk ]) * (ci0*u[ijk-ii3] + ci1*u[ijk-ii2] + ci2*u[ijk-ii1] + ci3*u[ijk ]))
+ cg1*((ci0*u[ijk-ii2] + ci1*u[ijk-ii1] + ci2*u[ijk ] + ci3*u[ijk+ii1]) * (ci0*u[ijk-ii2] + ci1*u[ijk-ii1] + ci2*u[ijk ] + ci3*u[ijk+ii1]))
+ cg2*((ci0*u[ijk-ii1] + ci1*u[ijk ] + ci2*u[ijk+ii1] + ci3*u[ijk+ii2]) * (ci0*u[ijk-ii1] + ci1*u[ijk ] + ci2*u[ijk+ii1] + ci3*u[ijk+ii2]))
+ cg3*((ci0*u[ijk ] + ci1*u[ijk+ii1] + ci2*u[ijk+ii2] + ci3*u[ijk+ii3]) * (ci0*u[ijk ] + ci1*u[ijk+ii1] + ci2*u[ijk+ii2] + ci3*u[ijk+ii3])) ) * cgi*dxi;
if(dim3)
{
ut[ijk] -= ( cg0*((ci0*v[ijk-ii2-jj1] + ci1*v[ijk-ii1-jj1] + ci2*v[ijk-jj1] + ci3*v[ijk+ii1-jj1]) * (ci0*u[ijk-jj3] + ci1*u[ijk-jj2] + ci2*u[ijk-jj1] + ci3*u[ijk ]))
+ cg1*((ci0*v[ijk-ii2 ] + ci1*v[ijk-ii1 ] + ci2*v[ijk ] + ci3*v[ijk+ii1 ]) * (ci0*u[ijk-jj2] + ci1*u[ijk-jj1] + ci2*u[ijk ] + ci3*u[ijk+jj1]))
+ cg2*((ci0*v[ijk-ii2+jj1] + ci1*v[ijk-ii1+jj1] + ci2*v[ijk+jj1] + ci3*v[ijk+ii1+jj1]) * (ci0*u[ijk-jj1] + ci1*u[ijk ] + ci2*u[ijk+jj1] + ci3*u[ijk+jj2]))
+ cg3*((ci0*v[ijk-ii2+jj2] + ci1*v[ijk-ii1+jj2] + ci2*v[ijk+jj2] + ci3*v[ijk+ii1+jj2]) * (ci0*u[ijk ] + ci1*u[ijk+jj1] + ci2*u[ijk+jj2] + ci3*u[ijk+jj3])) ) * cgi*dyi;
}
ut[ijk] -= ( cg0*((ci0*w[ijk-ii2-kk1] + ci1*w[ijk-ii1-kk1] + ci2*w[ijk-kk1] + ci3*w[ijk+ii1-kk1]) * (ci0*u[ijk-kk3] + ci1*u[ijk-kk2] + ci2*u[ijk-kk1] + ci3*u[ijk ]))
+ cg1*((ci0*w[ijk-ii2 ] + ci1*w[ijk-ii1 ] + ci2*w[ijk ] + ci3*w[ijk+ii1 ]) * (ci0*u[ijk-kk2] + ci1*u[ijk-kk1] + ci2*u[ijk ] + ci3*u[ijk+kk1]))
+ cg2*((ci0*w[ijk-ii2+kk1] + ci1*w[ijk-ii1+kk1] + ci2*w[ijk+kk1] + ci3*w[ijk+ii1+kk1]) * (ci0*u[ijk-kk1] + ci1*u[ijk ] + ci2*u[ijk+kk1] + ci3*u[ijk+kk2]))
+ cg3*((ci0*w[ijk-ii2+kk2] + ci1*w[ijk-ii1+kk2] + ci2*w[ijk+kk2] + ci3*w[ijk+ii1+kk2]) * (ci0*u[ijk ] + ci1*u[ijk+kk1] + ci2*u[ijk+kk2] + ci3*u[ijk+kk3])) )
* dzi4[k];
}
}
};
The split version looks like this:
// header
template<bool dim3>
struct Advec4Kernel
{
static void advecu(double *, double *, double *, double *, double *, const Grid &);
}
// source
template<bool dim3>
void Advec4Kernel<dim3>::advecu(double * restrict ut, double * restrict u, double * restrict v, double * restrict w, double * restrict dzi4, const Grid &grid)
{
//...
}
source to share
Apparently the compiler is doing some optimizations with the keyword restrict
. To take advantage of these optimizations, the function declaration must contain a keyword restrict
. This has been determined empirically ; I don't know if this is a compiler flaw or a law.
code:
// header
template<bool dim3>
struct Advec4Kernel
{
static void advecu(double *restrict, double *restrict, double *restrict, double *restrict, double *restrict, const Grid &);
}
source to share