GCC / Clang. If some -O flags are optimal on a particular machine, will they also be optimal on another machine?

I went looking for the optimal optimization flags for my specific code, and after some time I discovered that there is no golden rule for choosing the best optimizations. The answer depends on the specific code, compiler and machine.

The recommended optimization flag is -O2, although in some cases -Os (which generates a smaller binary) may produce a faster executable. I prefer to avoid -O3 and more aggressive optimizations because they can be risky in some situations. In some cases, combining -O2 with -Os flags gives better results. In other situations, compiling with -march=native yields a binary optimized for a particular machine (and hence one that may run in less time).

For my specific code (measured with valgrind --tool=callgrind and perf stat) I found that -march=native does not produce the lowest runtime.

Now, my question is: if, for my specific code, I find that the optimal binary (I mean, the one with the fastest runtime) is generated using -Os and/or -O2, would that also be optimal on other machines?

I would like to determine the optimal optimization flags on just one computer, but the code needs to run on different machines (some running macOS, others Linux, and all with different OS versions).

Thanks in advance for any suggestions or ideas.

2 answers


TL:DR: No, different processors like different things. Auto-vectorization can be a win on one machine but a loss on another, if the compiler was only able to vectorize inefficiently for it.

gcc -O2 -march=native or gcc -O3 -march=native are not bad options. Or better, those plus link-time optimization and profile-guided optimization, so the compiler knows which loops are hot, which branches usually go only one way, and which branches are unpredictable.

IDK if you've ever tried gcc -march=native without any -O option, but it won't be helpful; -march=native at the default -O0 will still be garbage.


-O3 is worth a try. The clang man page says it can sometimes produce more code, so benchmark carefully to make sure your code is actually faster. This is only a performance risk, not a correctness risk: the compiler will not bend the language rules unless you pass other options that specifically enable unsafe optimizations.

Clang's docs say that -O4 is currently equivalent to -O3. For gcc at least, auto-vectorization is only enabled at -O3. Clang probably also has other goodies that only kick in at -O3.

I'm not sure whether the "general use" recommendation is still just -O2, or whether -O3 is usually conservative enough to use all the time. With -fprofile-generate / -fprofile-use, the compiler should avoid bloating the code size of rarely-run paths, and only unroll loops that are really hot.

If perf shows any I-cache misses, then -Os is worth a try. Perhaps -Os for most of your source files and -O3 for the source file with your hottest function.

clang -O3 does some loop unrolling, but gcc -O3 doesn't do loop unrolling without -fprofile-use (or -funroll-loops, of course).

There's also -Ofast, which enables potentially dangerous optimizations. If your code still works correctly with it, go for it. (I think "unsafe" here mainly refers to floating point: if you don't care how your code behaves when there are NaN / Inf, or about the exact order of FP operations, you can use -Ofast (or just -O3 -ffast-math).)



If I wanted to spend some time finding the best options for compiling something I was going to spend a lot of CPU time running, my list of things to test would definitely include:

  • clang -O3 -march=native
  • clang -O3 -march=native -fprofile-use (after a run with ... -fprofile-generate)
  • the above with -flto -emit-llvm (link-time optimization, to optimize the whole program: inlining functions between source files, or at least seeing whether they have side effects, even when not inlined); gcc has -flto too
  • gcc -O3 -march=native -fno-stack-protector -fprofile-use
  • the above with -ffast-math or even -Ofast

I think -Os is worth trying too. Its alignment choices can matter even when I-cache / uop-cache misses aren't an issue.

If -fomit-frame-pointer is not already the default in your compiler, use it too, to free up an extra integer register and shorten function prologues / epilogues.

If you want to run the same binary on all machines, then use -march=some_baseline -mtune=something. (Assuming clang takes the same arch / tune options as gcc.) Or just -msse4.2 -mtune=sandybridge or something like that. As long as you're building an x86-64 binary, I think the newer SSE instruction sets are the main thing the compiler is interested in (not popcnt, BMI, etc.).

An alternative is to check out the source code on each computer and build your program with the local compiler. But if one of the machines has a much newer version of gcc or clang or the Intel compiler, then it makes sense to just use that one.

You could also look at automatic parallelization: gcc -ftree-parallelize-loops=n, where n is the number of threads to use.


The caveat to all of this is that I have heard of code breaking with -O3 because it depended on behavior not guaranteed by the language rules, and aggressive optimization transformed the code so that it no longer did the same thing. If you want your code to run fast, make sure you avoid undefined behavior, so you can let the optimizer loose. (IIRC, there was a recent SO question where the compiler optimized something away based on an assumption it was allowed to make because the alternative would have been undefined behavior.)



You actually point to the answer yourself:

"The recommended optimization flag is -O2, although in some cases -Os ..."

We just need to add that this variation depends not only on the source code (one codebase gives better results with -Os, another with -O2), but also on the machine the code runs on.



Imagine two different processors with the same instruction set (so no recompilation is needed). One might have a small cache; there, -Os produces a smaller executable that avoids many of the cache misses that would hurt the performance of an -O2 build. The other might have a huge cache, so the -O2-compiled code doesn't suffer many cache misses, and the extra optimizations let it run faster.

This is of course just a naively oversimplified example; the real-world interaction of parameters is quite complicated. But it gives you a hint of why it is so difficult to determine the optimal compilation options in advance.

What some projects do: they build several binaries for different instruction-set extensions, and then determine which binary to actually execute when the application loads (querying the runtime platform's properties first to make an educated guess).
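A minimal sketch of that load-time dispatch, with hypothetical binary names (prog.avx2, prog.baseline) and a Linux-style /proc/cpuinfo feature check; the stub prog.baseline stands in for a real baseline build:

```shell
# A tiny launcher that execs the most capable pre-built binary available.
cat > launcher.sh <<'EOF'
#!/bin/sh
dir=$(dirname "$0")
if grep -qw avx2 /proc/cpuinfo 2>/dev/null && [ -x "$dir/prog.avx2" ]; then
    exec "$dir/prog.avx2" "$@"      # built with e.g. -march=x86-64 -mavx2
fi
exec "$dir/prog.baseline" "$@"      # built with the baseline -march
EOF
chmod +x launcher.sh

# Stand-in for the baseline build; since no prog.avx2 exists here,
# the launcher always falls through to it.
printf '#!/bin/sh\necho baseline build\n' > prog.baseline
chmod +x prog.baseline
./launcher.sh
```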


