How to analyze the performance of cpp / assembly code

Question

I am trying to find out more about how to analyze the performance of my more commonly used methods.

So far I have measured performance by timing a large number of calls to my functions with rand() inputs, but I also want to learn how to reason about performance by understanding what the assembly code is doing.

For example, I read about people trying to optimize the sgn function (Is there a standard sign function (signum, sgn) in C/C++?), so I thought this would be a good place to start. I went to http://gcc.godbolt.org and generated assembly for the following code (ICC with -march=core-avx2 -fverbose-asm -Ofast -std=c++11):

int sgn_v1(float val)
{
    return (float(0) < val) - (val < float(0));
}

and

int sgn_v2(float val)
{
  if (float(0) < val)      return  1;
  else if (val < float(0)) return -1;
  else                     return  0;
}

This produced the following assembly:

L__routine_start__Z6sgn_v1f_0:
sgn_v1(float):
        vxorps    %xmm2, %xmm2, %xmm2                           #3.38
        vcmpgtss  %xmm2, %xmm0, %xmm1                           #3.38
        vcmpgtss  %xmm0, %xmm2, %xmm3                           #3.38
        vmovd     %xmm1, %eax                                   #3.38
        vmovd     %xmm3, %edx                                   #3.38
        negl      %eax                                          #3.38
        negl      %edx                                          #3.38
        subl      %edx, %eax                                    #3.38
        ret                                                     #3.38

and

L__routine_start__Z6sgn_v2f_1:
sgn_v2(float):
        vxorps    %xmm1, %xmm1, %xmm1                           #8.3
        vcomiss   %xmm1, %xmm0                                  #8.18
        ja        ..B2.3        # Prob 28%                      #8.18
        vcmpgtss  %xmm0, %xmm1, %xmm0                           #
        vmovd     %xmm0, %eax                                   #
        ret                                                     #
..B2.3:                         # Preds ..B2.1
        movl      $1, %eax                                      #9.12
        ret                                                     #9.12

My analysis starts with the instruction counts: sgn_v1 has 9 instructions and sgn_v2 has 6 or 5 instructions, depending on the outcome of the jump. The previous post says that sgn_v1 is branchless and that this seems to be a good thing; I am guessing this means that multiple instructions in sgn_v1 can execute concurrently. I went to http://www.agner.org/optimize/instruction_tables.pdf and I couldn't find most of these instructions in the Haswell section (p187-p202).

How can I analyze this?

Edit:

Responding to @Raxvan's comments, I ran the following test program

extern "C" int sgn_v1(float);
__asm__(
"sgn_v1:\n"
"  vxorps    %xmm2, %xmm2, %xmm2\n"
"  vcmpgtss  %xmm2, %xmm0, %xmm1\n"
"  vcmpgtss  %xmm0, %xmm2, %xmm3\n"
"  vmovd     %xmm1, %eax\n"
"  vmovd     %xmm3, %edx\n"
"  negl      %eax\n"
"  negl      %edx\n"
"  subl      %edx, %eax\n"
"  ret\n"
);

extern "C" int sgn_v2(float);
__asm__(
"sgn_v2:\n"
"  vxorps    %xmm1, %xmm1, %xmm1\n"
"  vcomiss   %xmm1, %xmm0\n"
"  ja        ..B2.3\n"
"  vcmpgtss  %xmm0, %xmm1, %xmm0\n"
"  vmovd     %xmm0, %eax\n"
"  ret\n"
"  ..B2.3:\n"
"  movl      $1, %eax\n"
"  ret\n"
);

#include <cstdlib>
#include <ctime>
#include <iostream>

int main()
{
  size_t N = 50000000;
  std::clock_t start = std::clock();
  for (size_t i = 0; i < N; ++i)
  {
    sgn_v1(float(std::rand() % 3) - 1.0);
  }
  std::cout << "v1 Time: " << (std::clock() - start) / (double)(CLOCKS_PER_SEC / 1000) << " ms " << std::endl;

  start = std::clock();
  for (size_t i = 0; i < N; ++i)
  {
    sgn_v2(float(std::rand() % 3) - 1.0);
  }
  std::cout << "v2 Time: " << (std::clock() - start) / (double)(CLOCKS_PER_SEC / 1000) << " ms " << std::endl;

  start = std::clock();
  for (size_t i = 0; i < N; ++i)
  {
    sgn_v2(float(std::rand() % 3) - 1.0);
  }
  std::cout << "v2 Time: " << (std::clock() - start) / (double)(CLOCKS_PER_SEC / 1000) << " ms " << std::endl;

  start = std::clock();
  for (size_t i = 0; i < N; ++i)
  {
    sgn_v1(float(std::rand() % 3) - 1.0);
  }
  std::cout << "v1 Time: " << (std::clock() - start) / (double)(CLOCKS_PER_SEC / 1000) << " ms " << std::endl;
}

And I got the following result:

g++-4.8 -std=c++11 test.cpp && ./a.out
v1 Time: 423.81 ms
v2 Time: 657.226 ms
v2 Time: 666.233 ms
v1 Time: 436.545 ms

Thus, the branchless version is clearly faster; @Jim suggested that I look into how branch predictors work, but I still can't seem to find a way to calculate how "full" the pipeline is ...

c++ performance assembly c++11

Jon Sep 18 14 at 16:07

1 answer

computador7 · Accepted Answer · 2014-09-25T15:55:22+0000

In general, this is a rather noisy measurement, especially when you measure things sequentially in a single run of one process: the events of one measurement interleave with those of the next, which can add noise. As you mention, branches have a big impact on the pipeline, and as a rule of thumb code with fewer branches should perform better. In general, the two main things that play a role in performance are locality of reference and branch prediction, while in more complex cases, such as when using multithreading, there are additional factors.

To answer your question, I'd say it is better to use a tool such as perf, which can report the number of cache misses and branch mispredictions and should give a good indication. In general, depending on the platform you are developing for, you may be able to find a suitable tool that can query the CPU performance counters. You should also generate a set of random values up front and use the same values with both functions, so that you eliminate the noise from calling std::rand().

Finally, keep in mind that the code will perform differently depending on the compiler, the compilation options (obviously), and the target architecture. However, some of the logic you can apply should hold independently of these: in your example, the code without conditional branches should pretty much always perform better. If you want to get your head around this, you should really read the Intel manuals (specifically for AVX).
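A minimal sketch of the "same random values for both functions" idea, under some assumptions: the helper names make_inputs and time_ms are hypothetical (they are not from the question), the seed is arbitrary, and std::chrono::steady_clock stands in for std::clock. The inputs are generated once, outside the timed region, and each function's results are summed so the calls are not discarded as dead code.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>

// The two variants from the question.
int sgn_v1(float val) { return (float(0) < val) - (val < float(0)); }

int sgn_v2(float val)
{
  if (float(0) < val)      return  1;
  else if (val < float(0)) return -1;
  else                     return  0;
}

// Hypothetical helper: build the inputs once, outside any timed region.
// A fixed seed (arbitrary choice) makes both functions see identical values.
std::vector<float> make_inputs(std::size_t n, unsigned seed = 42)
{
  std::mt19937 gen(seed);
  std::uniform_int_distribution<int> dist(-1, 1);  // yields -1, 0 or 1
  std::vector<float> v(n);
  for (float& x : v) x = static_cast<float>(dist(gen));
  return v;
}

// Hypothetical helper: time one function over the shared inputs, in ms.
// The results are summed into `sink` so the loop has an observable effect.
template <typename F>
double time_ms(F f, const std::vector<float>& in, long long& sink)
{
  assert(!in.empty());  // timing an empty input set would be meaningless
  auto start = std::chrono::steady_clock::now();
  long long sum = 0;
  for (float x : in) sum += f(x);
  auto stop = std::chrono::steady_clock::now();
  sink = sum;
  return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

To use it, call make_inputs once, pass the same vector to time_ms(sgn_v1, ...) and time_ms(sgn_v2, ...), and check that the two sinks are equal before comparing the times. Note that with optimization enabled the compiler may inline or vectorize these loops, so this measures the compiled C++ rather than the hand-written assembly from the question. Running the resulting binary under something like perf stat -e branch-misses,cache-misses should then show whether the difference comes from mispredicted branches.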
