Performance difference between a member function and a global function in a release build

I have implemented two functions that compute the outer product of two vectors (not std::vector): one is a member function and the other is global. Here is the key code (other parts omitted):

//for member function
template <typename Scalar>
SquareMatrix<Scalar,3> Vector<Scalar,3>::outerProduct(const Vector<Scalar,3> &vec3) const
{
    SquareMatrix<Scalar,3> result;
    for(unsigned int i = 0; i < 3; ++i)
        for(unsigned int j = 0; j < 3; ++j)
            result(i,j) = (*this)[i]*vec3[j];
    return result;
}

//for global function: Dim = 3
template<typename Scalar, int Dim>
void outerProduct(const Vector<Scalar, Dim> & v1 , const Vector<Scalar, Dim> & v2, SquareMatrix<Scalar, Dim> & m)
{
    for (unsigned int i=0; i<Dim; i++)
        for (unsigned int j=0; j<Dim; j++)
        {
            m(i,j) = v1[i]*v2[j];
        }
}

      

They are almost the same, except that one is a member function that returns a value, while the other is a global function that writes the computed values directly into a square matrix passed by reference, so it needs no return value.
In fact, I intended to replace the member function with the global one to improve performance, since the former involves copy operations. The strange thing, however, is that the global function takes almost twice as long as the member function. Also, it seems that executing

m(i,j) = v1[i]*v2[j]; // in global function

      

takes much longer than

result(i,j) = (*this)[i]*vec3[j]; // in member function

      

So the question is, how does this performance difference between member and global function come about?

Can anyone explain the reasons?
I hope I have clearly stated my question and apologize for my poor English!

// --------------------------------------------- --- ----------------------------------------
More info added:
Below is the code I use to test performance:

    //the code below runs inside a loop
    Vector<double, 3> vec1;
    Vector<double, 3> vec2;
    Timer timer;
    timer.startTimer();
    for (unsigned int i=0; i<100000; i++)
    {
        SquareMatrix<double,3> m = vec1.outerProduct(vec2);
    }
    timer.stopTimer();
    std::cout<<"time cost for member function: "<< timer.getElapsedTime()<<std::endl;

    timer.startTimer();
    SquareMatrix<double,3> m;
    for (unsigned int i=0; i<100000; i++)
    {
        outerProduct(vec1, vec2, m);
    }
    timer.stopTimer();
    std::cout<<"time cost for global function: "<< timer.getElapsedTime()<<std::endl;
    std::system("pause");

      

and the result is:
[Image: screenshot of the timing results]

You can see that the member function is almost twice as fast as the global one. Also, my project is built on a 64-bit Windows system, and the code is actually used to generate static lib files with the Scons build tool, along with generated VS2010 project files.

I should point out that this strange performance difference only occurs in the release build. In the debug build, the global function is almost five times faster than the member one (about 0.10 s for the member function vs 0.02 s for the global one).


3 answers


One possible explanation:

With inlining, in the first case the compiler can know that result(i, j) (a local variable) does not alias (*this)[i] or vec3[j], so none of the Scalar arrays of this and vec3 are modified by the writes.

In the second case, from the function's point of view, the arguments may alias, so each write to m might modify the Scalars of v1 or v2, and therefore neither v1[i] nor v2[j] can be cached.

You can try the restrict keyword extension to test whether my hypothesis is correct.
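As a rough illustration of that experiment, here is a minimal sketch that uses raw arrays in place of Vector and SquareMatrix. The function name outerProductRestrict and the row-major layout are assumptions made for this example, not part of the original code. `__restrict` is a compiler extension (accepted by GCC, Clang and MSVC), not standard C++.

```cpp
#include <cassert>
#include <cstddef>

// Sketch only: raw arrays stand in for Vector and SquareMatrix.
// `__restrict` promises that, inside this function, `m` is the only way to
// reach the memory it points to, so writes to `m` cannot change `v1` or `v2`.
// Breaking that promise is undefined behavior.
template <typename Scalar, std::size_t Dim>
void outerProductRestrict(const Scalar* v1, const Scalar* v2,
                          Scalar* __restrict m)
{
    // Cheap sanity check of the promise: the output must not be an input.
    assert(m != v1 && m != v2);

    for (std::size_t i = 0; i < Dim; ++i)
        for (std::size_t j = 0; j < Dim; ++j)
            m[i * Dim + j] = v1[i] * v2[j];  // row-major storage assumed
}
```

Calling it as `outerProductRestrict<double, 3>(a, b, m)` with three separate arrays and comparing the generated assembly against a version without `__restrict` would show whether aliasing is what keeps v1[i] and v2[j] from being held in registers.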



EDIT: Fixed loop elision in the original assembly.

[paraphrased] Why does performance differ between the member function and the static function?

I'll start with the simplest things mentioned in your question and work my way to the finer points of performance testing / analysis.

It is a bad idea to measure the performance of debug builds. Compilers take liberties in many places, such as zeroing arrays that are uninitialized and generating extra code that is not strictly necessary, and (obviously) they do not perform any optimization beyond trivial ones such as constant propagation. This leads to the next point ...

Always look at the assembly. C and C++ are high-level languages when it comes to the intricacies of performance. Many even consider x86 assembly to be a high-level language, since each instruction is decomposed into several micro-ops during decoding. You can't tell what the computer is doing just by looking at C++ code. For example, depending on how you implemented SquareMatrix, the compiler may or may not be able to perform copy elision when optimizing.
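One way to check whether a by-value return really costs a deep copy is to count constructor calls. The sketch below assumes C++11 (for the move constructor) and uses a hypothetical CountingMatrix type in place of the real SquareMatrix, whose implementation is not shown here.

```cpp
#include <cassert>

// Hypothetical stand-in for SquareMatrix<double,3> that counts copies/moves.
struct CountingMatrix {
    double data[9];
    static int copies;
    static int moves;

    CountingMatrix() : data() {}  // zero-initializes all nine entries
    CountingMatrix(const CountingMatrix& other) {
        for (int k = 0; k < 9; ++k) data[k] = other.data[k];
        ++copies;
    }
    CountingMatrix(CountingMatrix&& other) {
        for (int k = 0; k < 9; ++k) data[k] = other.data[k];
        ++moves;
    }
};
int CountingMatrix::copies = 0;
int CountingMatrix::moves = 0;

// Same shape as the member outerProduct: build a local and return it by value.
CountingMatrix makeOuter(const double (&a)[3], const double (&b)[3])
{
    CountingMatrix result;
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            result.data[i * 3 + j] = a[i] * b[j];
    return result;  // may be elided (NRVO); if not, it is moved, not copied
}

void demoReturnByValue()
{
    CountingMatrix::copies = 0;
    CountingMatrix::moves = 0;
    const double a[3] = {1.0, 2.0, 3.0};
    const double b[3] = {4.0, 5.0, 6.0};
    CountingMatrix m = makeOuter(a, b);
    assert(CountingMatrix::copies == 0);  // the copy constructor never runs
    (void)m;
}
```

The number of moves depends on the compiler and optimization level, which is exactly why the assembly (or a counter like this) is more reliable than reading the source.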

Moving on to somewhat more subtle topics in performance testing ...

Make sure the compiler is actually generating the loops. With your example test code, g++ 4.7.2 doesn't actually generate loops with my implementation of SquareMatrix and Vector. I implemented them to initialize all components to 0.0, so the compiler can statically determine that the values never change, and therefore generates only a single set of mov instructions instead of a loop. In my example code, I use COMPILER_NOP, which (with gcc) is __asm__ __volatile__("":::), inside the loop to prevent this (since compilers cannot predict the side effects of hand-written assembly and therefore cannot elide the loop).
Edit: I did use COMPILER_NOP, but since the outputs of the functions are never used, the compiler can still remove most of the work from the loop and reduce the loop to this:

.L7:
   subl $1, %eax
   jne .L7

      

I fixed this by doing additional operations inside the loop. The loop now assigns a value from the output to the inputs, which prevents this optimization and forces the loop to cover the work that was originally intended.
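A portable variation on the same idea is sketched below. It is not the code from this answer: instead of an inline-assembly barrier, it changes an input on every iteration and accumulates every output into a volatile sink. The names outer3 and runBenchmark are made up for the example, and `<chrono>` requires C++11, so it would not build with VS2010.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical kernel standing in for outerProduct, on plain arrays.
inline void outer3(const double* v1, const double* v2, double* m)
{
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            m[i * 3 + j] = v1[i] * v2[j];
}

// Runs the kernel `iterations` times, stores the elapsed time, and returns a
// checksum that depends on every output of every iteration, so the optimizer
// cannot drop the loop or hoist the work out of it.
double runBenchmark(unsigned iterations, double* elapsedSeconds)
{
    assert(elapsedSeconds != nullptr);

    double v1[3] = {0.0, 2.0, 3.0};
    const double v2[3] = {1.0, 1.0, 1.0};
    double m[9];
    volatile double sink = 0.0;  // volatile accesses are observable behavior

    const auto start = std::chrono::steady_clock::now();
    for (unsigned i = 0; i < iterations; ++i) {
        v1[0] = static_cast<double>(i);  // the input changes each iteration
        outer3(v1, v2, m);
        double total = 0.0;
        for (int k = 0; k < 9; ++k) total += m[k];
        sink = sink + total;  // consume all nine outputs
    }
    const auto stop = std::chrono::steady_clock::now();

    *elapsedSeconds = std::chrono::duration<double>(stop - start).count();
    return sink;
}
```

The summation adds a little work to each iteration, so absolute numbers are inflated, but both variants under test pay the same overhead and the comparison stays fair. Checking the assembly is still the only way to confirm that the loop body survived.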



To (finally) come to an answer to your question: once I had filled in everything else needed to run your code, and verified by checking the assembly that the loops are actually generated, the two functions run in the same amount of time. They even have nearly identical implementations in assembly.

Here's the assembly for the member function:

movsd   32(%rsp), %xmm7
movl    $100000, %eax
movsd   24(%rsp), %xmm5
movsd   8(%rsp), %xmm6
movapd  %xmm7, %xmm12
movsd   (%rsp), %xmm4
movapd  %xmm7, %xmm11
movapd  %xmm5, %xmm10
movapd  %xmm5, %xmm9
mulsd   %xmm6, %xmm12
mulsd   %xmm4, %xmm11
mulsd   %xmm6, %xmm10
mulsd   %xmm4, %xmm9
movsd   40(%rsp), %xmm1
movsd   16(%rsp), %xmm0
jmp .L7
.p2align 4,,10
.p2align 3
.L12:
movapd  %xmm3, %xmm1
movapd  %xmm2, %xmm0
.L7:
movapd  %xmm0, %xmm8
movapd  %xmm1, %xmm3
movapd  %xmm1, %xmm2
mulsd   %xmm1, %xmm8
movapd  %xmm0, %xmm1
mulsd   %xmm6, %xmm3
mulsd   %xmm4, %xmm2
mulsd   %xmm7, %xmm1
mulsd   %xmm5, %xmm0
subl    $1, %eax
jne .L12

      

and the assembly for the static function:

movsd   32(%rsp), %xmm7
movl    $100000, %eax
movsd   24(%rsp), %xmm5
movsd   8(%rsp), %xmm6
movapd  %xmm7, %xmm12
movsd   (%rsp), %xmm4
movapd  %xmm7, %xmm11
movapd  %xmm5, %xmm10
movapd  %xmm5, %xmm9
mulsd   %xmm6, %xmm12
mulsd   %xmm4, %xmm11
mulsd   %xmm6, %xmm10
mulsd   %xmm4, %xmm9
movsd   40(%rsp), %xmm1
movsd   16(%rsp), %xmm0
jmp .L9
.p2align 4,,10
.p2align 3
.L13:
movapd  %xmm3, %xmm1
movapd  %xmm2, %xmm0
.L9:
movapd  %xmm0, %xmm8
movapd  %xmm1, %xmm3
movapd  %xmm1, %xmm2
mulsd   %xmm1, %xmm8
movapd  %xmm0, %xmm1
mulsd   %xmm6, %xmm3
mulsd   %xmm4, %xmm2
mulsd   %xmm7, %xmm1
mulsd   %xmm5, %xmm0
subl    $1, %eax
jne .L13

      

In conclusion: you probably need to tighten up the test code a bit before you can determine whether the implementations differ on your system. Make sure your loops are actually generated (look at the assembly) and check whether the compiler has managed to elide the copy of the member function's return value.

If these things are true and you still see a difference, can you post the implementations of SquareMatrix and Vector here so we can give you more information?

The complete code, makefile, and generated assembly for my working example are available as a GitHub gist.



Does explicit instantiation of a template function make a performance difference?

Here are some experiments I have done to track down the performance difference:

1.
First, I suspected that the performance difference might be caused by the implementation itself. In fact, we have two implementations: one written by ourselves (very similar to @black's code), and the other a wrapper around Eigen::Matrix, selected by an on/off macro. But switching between the two makes no difference: the global function is still slower than the member one.

2.
Since this code (the classes Vector<Scalar, Dim> and SquareMatrix<Scalar, Dim>) is part of a large project, I guessed that the performance difference might be affected by other code (I thought this unlikely, but still worth a try). So I extracted all the required code (using our own implementation) and put it in a hand-made VS2010 project. Surprisingly, I found that the global function is slightly faster than the member one, which is the same result that @black and @Myles Hathcock obtained, even though I left the implementation unchanged.

3.
In our project, the outerProduct functions are compiled into release lib files, whereas my hand-made project produces .obj files directly and links them into the .exe. To rule this out, I took the extracted code, built a lib file with VS2010, and used that lib file from another VS project to test the performance difference, but the global function was still slightly faster than the member one. So both sets of code have the same implementation and both are placed in lib files, one built by Scons and the other by a VS project, yet they perform differently. Is Scons causing this problem?

4.
For the code shown in my question, the global function outerProduct is declared and defined in a .h file, which is then #included by a .cpp file. Therefore, compiling this .cpp file instantiates outerProduct. But suppose I do it another way.
(I should point out that this code is now compiled by Scons into a lib file, not in the hand-made VS2010 project.)
First, I declare the global function outerProduct in the .h file:

//outerProduct.h
template<typename Scalar, int Dim>
void outerProduct(const Vector<Scalar, Dim> & v1 , const Vector<Scalar, Dim> & v2, SquareMatrix<Scalar, Dim> & m);

      

then, in the .cpp file:

//outerProduct.cpp
template<typename Scalar, int Dim>
void outerProduct(const Vector<Scalar, Dim> & v1 , const Vector<Scalar, Dim> & v2, SquareMatrix<Scalar, Dim> & m)
{
    for (unsigned int i=0; i<Dim; i++)
        for (unsigned int j=0; j<Dim; j++)
        {
            m(i,j) = v1[i]*v2[j];
        }
}

      

Since this is a template function, it requires explicit instantiations:

//outerProduct.cpp
template void outerProduct<double, 3>(const Vector<double, 3> &, const Vector<double, 3> &, SquareMatrix<double, 3> &);
template void outerProduct<float, 3>(const Vector<float, 3> &, const Vector<float, 3> &, SquareMatrix<float, 3> &);

      

Finally, in the .cpp file that calls this function:

//use_outerProduct.cpp
#include "outerProduct.h" //note: outerProduct.cpp does not need to be included.
...
outerProduct(v1, v2, m);
...

      

The weird thing is that the global function is now finally slightly faster than the member one, as shown in the following figure:

[Image: screenshot of the timing results]

But this change only matters in the Scons environment. In the hand-made VS2010 project, the global function is always slightly faster than the member one. So does this performance difference come only from the Scons environment? And does it become normal once the template function is explicitly instantiated?

Still weird! It seems like Scons would do something I didn't expect.

// --------------------------------------------- --- ------------------------
In addition, the test code has now been changed to the following to avoid loop elision:

    Vector<double, 3> vec1(0.0);
    Vector<double, 3> vec2(1.0);
    Timer timer;
    while(true)
    {
        timer.startTimer();
        for (unsigned int i=0; i<100000; i++)
        {
            vec1 = Vector<double, 3>(i);
            SquareMatrix<double,3> m = vec1.outerProduct(vec2);
        }
        timer.stopTimer();
        cout<<"time cost for member function: "<< timer.getElapsedTime()<<endl;

        timer.startTimer();
        SquareMatrix<double,3> m;
        for (unsigned int i=0; i<100000; i++)
        {
            vec1 = Vector<double, 3>(i);
            outerProduct(vec1, vec2, m);
        }
        timer.stopTimer();
        cout<<"time cost for global function: "<< timer.getElapsedTime()<<endl;
        system("pause");
    }

      

@black @Myles Hathcock, many thanks to you kind-hearted people!
@Myles Hathcock, your explanation is really subtle and abstruse, but I think I will benefit a lot from it. Finally, the entire implementation is at
https://github.com/FeiZhu/Physika
This is the physics engine we are developing, and you can find more information there, including all the source code. Vector and SquareMatrix are defined in the folder Physika_Src/Physika_Core. However, the global function outerProduct has not been uploaded; you can add it somewhere appropriate.
