Performance issues with simple computing
EDIT 2: Reduce program computation time by 16%! See bottom for calculation
I wrote an N-body simulator that implements the Barnes-Hut algorithm. Now I have an innocent function called CheckNode
. It's simple and doesn't take a lot of computation time, but the problem is that it is called millions of times, so it takes most of the computation time between each frame.
I have profiled the code and this function is responsible for 84.58%
the total computation time, and this is only with 10K particles, when I do it with 10x precision, this function uses more and more percentage.
Now here's a function, with the percentage of time spent on the right in red.
Now, there are some disturbing things here. As a simple if statement taking 9.17%
, and another if statement is over 20% of the computation time! Is there even the slightest optimization that can be done here that will be multiplied by millions of function calls to make my program run faster?
EDIT:
Here's the function CalculateForceNode
:
void CalculateForceNode(Body* bi, Node* bj) //bi is being attracted to bj. 15 flops of calculation
{
//vector from the body to the center of mass
double vectorx = bj->CenterOfMassx - bi->posX;
double vectory = bj->CenterOfMassy - bi->posY;
//c^2 = a^2 + b^2 + softener^2
double distSqr = vectorx * vectorx + vectory * vectory + Softener * Softener;
// ivnDistCube = 1/distSqr^(3/2)
double distSixth = distSqr * distSqr * distSqr;
double invDistCube = 1.0f / (sqrt(distSixth));
double Accel = (bj->TotalMass * invDistCube * _GRAV_CONST);
bi->AccelX += vectorx * Accel;
bi->AccelY += vectory * Accel;
}
EDIT 2:
Optimization results
Now the function CheckNode
takes 82.03%
all the computation time (measured during 1 min. 37 s of the sample), in contrast to the previously taken one 84.58%
.
Now the logic says that the remaining 15%
computation time coincides with the remaining computation time of the 18%
second program. Thus, these identical periods (the same code) took the 15%
first program and the 18%
second program. Allowing time for this other code to complete x
, the first program took 1/0.15
= 6.666x
and the second took 1/0.18
= 5.555x
. Then you can find the fraction that 5.555x
has 6.666x
, which is calculated as ~0.83
, and for that there was ( 1 - 0.83 = 0.16
) 16% reduction in program calculation time!
The first thing I'll try is to change the elements in one of your conditions, replace:
if(withSqr / distanceSqr < nodeThresholdSqr || pNode->HasChildren == false)
from:
if(pNode->HasChildren == false || (withSqr / distanceSqr < nodeThresholdSqr))
If the first part of the condition is true pNode->HasChildren == false
and the second (withSqr / distanceSqr < nodeThresholdSqr)
will never be met (read: evaluated). Simple conditional checking is much faster than floating point operations (division in your case). You can even go to the next level: * you need to calculate distanceSqr
AT ALL when pNode->HasChildren == false
?
EDIT: even better:
if(pNode->HasChildren == false)
{
CalculateForceNode(pBody,pNode);
}
else
{
double distanceSqr = ((diffX * diffX) + (diffY * diffY));
double withSqr = pNode->width * pNode->width;
if(withSqr / distanceSqr < nodeThresholdSqr)
{
CalculateForceNode(pBody,pNode);
}
else
{//if not, repeat function with child
if(pNode->Child[0]->Bodies.size() > 0)
CheckNode(pNode->Child[0],pBody);
//..... - all the rest of your code
}
}
Time-based profiling is not enough, you need to know what time wasted - in other words, use a more advanced profiler.
It also doesn't mention any information about the compiler or platform you are using.
For an if statement using 9% of the time, I don't think it's being compared, it's wasted fetching data. You have multiple levels of indirection (accessing data using a pointer that takes you to another pointer, etc.). This is bad for caching and branch prediction and I think you are wasting time fetching data from memory or useless computation due to branch miss prediction without doing the actual comparison.
another note I noticed: if (pNode-> HasChildren == false) you don't need all the calculations you did to find the widthSqr. I think you should rebuild your logic to check this first, if the condition is false, then you can calculate the widthSqr and continue your logic.
- You have to
inline
function first,Bodies.size()
or accesssize
directly so that there is no overhead when calling the function (it takes a while to push all the necessary information onto the stack and pop out). - I don't see all the code, but it looks like you can do a preliminary calculation
widthSqr
. It can be computed ifwidth
not assigned in a function. - You are using a lot of pointers here and it looks like your structures are scattered all over the memory. This will result in a large cache dip in the processor. Make sure that all of the data required for the calculation is in one long, contiguous and compact memory area.
- The
CalculateForceNode
check may be whether the pre-calculatedSoftener*Softener
. The function issqrt
very time consuming. The algorithmsqrt
is iterative, so you can sacrifice precision for speed by doing fewer iterations, or you can use lookup tables. -
You do the same calculations twice in
CalculateForceNode
.void CalculateForceNode(Body* bi, Node* bj) { //vector from the body to the center of mass double vectorx = bj->CenterOfMassx - bi->posX; double vectory = bj->CenterOfMassy - bi->posY; //c^2 = a^2 + b^2 + softener^2 double distSqr = vectorx * vectorx + vectory * vectory...
vectorx,vectory and distSqr
have already been calculated in CheckNode
how diffX, diffY and distanceSqr
. Manually inline whole function CalculateForceNode
.
Since the function is called many times, you should get rid of the call overhead by CalculateForceNode(...)
manually entering the code. In this case, you will notice other tricks:
void CheckNode(Node* pNode, Body* pBody)
{
double diffX = (pNode->CenterOfMass - pBody->posX);
double diffY = (pNode->CenterOfMass - pBody->posY);
double distanceSqr = ((diffX * diffX) + (diffY * diffY));
double widthSqr = pNode->width * pNode->width;
if (widthSqr / distanceSqr < NodeThresholdSqr || pNode->hasChildren == false)
{
//vector from the body to the center of mass
double vectorx = pNode->CenterOfMassx - pBody->posX;
double vectory = pNode->CenterOfMassy - pBody->posY;
//c^2 = a^2 + b^2 + softener^2
double distSqr = vectorx * vectorx + vectory * vectory + Softener * Softener;
// ivnDistCube = 1/distSqr^(3/2)
double distSixth = distSqr * distSqr * distSqr;
double invDistCube = 1.0f / (sqrt(distSixth));
double Accel = (pNode->TotalMass * invDistCube * _GRAV_CONST);
pBody->AccelX += vectorx * Accel;
pBody->AccelY += vectory * Accel;
}
else
{
CheckChildren(pNode, pBody);
}
}
Now you can see that diffX = vectorx
, diffY = vectory
, distSqr = distanceSqr*Softner*Softner
. Reusing some of the calculations already done and pre-calculating what is possible should save you a few cycles:
void CheckNode(Node* pNode, Body* pBody)
{
double diffX = (pNode->CenterOfMass - pBody->posX);
double diffY = (pNode->CenterOfMass - pBody->posY);
double distanceSqr = ((diffX * diffX) + (diffY * diffY));
double widthSqr = pNode->width * pNode->width;
double SoftnerSq = Softener * Softener; //precompute this value
if (widthSqr / distanceSqr < NodeThresholdSqr || pNode->hasChildren == false)
{
//c^2 = a^2 + b^2 + softener^2
double distSqr = distanceSqr + SoftnerSq;
// ivnDistCube = 1/distSqr^(3/2)
double distSixth = distSqr * distSqr * distSqr;
double invDistCube = 1.0f / (sqrt(distSixth));
double Accel = (pNode->TotalMass * invDistCube * _GRAV_CONST);
pBody->AccelX += diffX * Accel;
pBody->AccelY += diffY * Accel;
}
else
{
CheckChildren(pNode, pBody);
}
}
Hope this works for you.
Change the if statement and move all your calculations inside the part pNode->hasChildren == false
:
void CheckChildren(Node* pNode, Body* pBody)
{
if (pNode->Child[0]->Bodies.size() > 0)
CheckNode(...
}
void CheckNode(Node* pNode, Body* pBody)
{
if (pNode->hasChildren != false)
{
double diffX = (pNode->CenterOfMass - pBody->posX);
double diffY = (pNode->CenterOfMass - pBody->posY);
double distanceSqr = ((diffX * diffX) + (diffY * diffY));
double widthSqr = pNode->width * pNode->width;
if (widthSqr / distanceSqr < NodeThresholdSqr)
{
CalculateForceNode(pBody, pNode);
}
else
{
CheckChildren(pNode, pBody);
}
}
else
{
CheckChildren(pNode, pBody);
}
}