Compiler optimization of predicates

Consider the following examples of conditions / predicates:

  1. x > 10 and x > 20

  2. (x > 10 or x == 10) and (x < 10 or x == 10)

     i.e. x >= 10 and x <= 10

Predicate 1 can be simplified to x > 20, and predicate 2 can be simplified to x == 10.

Will a compiler optimize such (or more complex) predicates, and if so, what algorithms are used to do this?

What are some common optimization techniques for predicates?

2 answers


It depends on the compiler, but clang and gcc do this optimization:

#include <stdio.h>

void foo(int x) {
  if (x > 10 && x > 20)
    puts("foo");
}

void foo2(int x) {
  if ((x > 10 || x == 10) && (x < 10 || x == 10))
    puts("foo2");
}

      



Looking at the generated assembly, each function compiles down to a single comparison.

Clang (which uses LLVM) does this in the instruction-combining pass ('instcombine'). You can see the relevant transformations in the source code of InstructionSimplify.cpp.
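Conceptually, folds like these amount to intersecting the value ranges that each comparison allows and emitting a single comparison for the resulting range; real compilers implement this as pattern matching over the IR combined with constant-range reasoning. The following toy C program is a sketch of that idea only, not how LLVM's code is actually structured:

#include <limits.h>
#include <stdio.h>

/* Toy model of range-based predicate folding (illustrative only, not
 * LLVM code): each comparison against a constant defines an inclusive
 * range of values of x, and "&&" intersects the ranges. */
typedef struct { long lo, hi; } range;

static range gt(long c) { return (range){ c + 1, LONG_MAX }; }  /* x > c  */
static range ge(long c) { return (range){ c,     LONG_MAX }; }  /* x >= c */
static range le(long c) { return (range){ LONG_MIN, c };     }  /* x <= c */

static range intersect(range a, range b) {
    return (range){ a.lo > b.lo ? a.lo : b.lo,
                    a.hi < b.hi ? a.hi : b.hi };
}

int main(void) {
    /* x > 10 && x > 20  ->  [11, MAX] meet [21, MAX] = [21, MAX]  ->  x > 20 */
    range r1 = intersect(gt(10), gt(20));
    printf("fold 1: x >= %ld (i.e. x > 20)\n", r1.lo);

    /* x >= 10 && x <= 10  ->  [10, MAX] meet [MIN, 10] = [10, 10]  ->  x == 10 */
    range r2 = intersect(ge(10), le(10));
    printf("fold 2: x in [%ld, %ld] (i.e. x == 10)\n", r2.lo, r2.hi);
    return 0;
}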


Looking at the IL code that the C# compiler emits for the following method, the compiler at least doesn't seem to be smart enough in this case. I'm not sure, however, what happens later, when the IL is JIT-compiled to native code or even further down in the processor pipeline, where more optimizations may still be applied:

private static bool Compare(int x)
{
   return (x > 10 || x == 10) && (x < 10 || x == 10);
}

      

Corresponding IL:

IL_0000: ldarg.0      // x
IL_0001: ldc.i4.s     10 // 0x0a
IL_0003: bgt.s        IL_000a
IL_0005: ldarg.0      // x
IL_0006: ldc.i4.s     10 // 0x0a
IL_0008: bne.un.s     IL_0017
IL_000a: ldarg.0      // x
IL_000b: ldc.i4.s     10 // 0x0a
IL_000d: blt.s        IL_0015
IL_000f: ldarg.0      // x
IL_0010: ldc.i4.s     10 // 0x0a
IL_0012: ceq          
IL_0014: ret          
IL_0015: ldc.i4.1     
IL_0016: ret          
IL_0017: ldc.i4.0     
IL_0018: ret

      

Here's the second (optimized) version:

private static bool Compare(int x)
{
   return x >= 10 && x <= 10;
}

      

Again, the corresponding IL code:

IL_0000: ldarg.0      // x
IL_0001: ldc.i4.s     10 // 0x0a
IL_0003: blt.s        IL_000e
IL_0005: ldarg.0      // x
IL_0006: ldc.i4.s     10 // 0x0a
IL_0008: cgt          
IL_000a: ldc.i4.0     
IL_000b: ceq          
IL_000d: ret          
IL_000e: ldc.i4.0     
IL_000f: ret          

      

Since the second version is clearly shorter, it has a better chance of being inlined at run time, so we would expect it to be faster.

Finally, the third version, call it the "best" one (x == 10):

private static bool Compare(int x)
{
    return x == 10;
}

      



And its IL:

IL_0000: ldarg.0      // x
IL_0001: ldc.i4.s     10 // 0x0a
IL_0003: ceq          
IL_0005: ret          

      

Nice and concise.

Running a benchmark using BenchmarkDotNet with [MethodImpl(MethodImplOptions.NoInlining)] shows that the run-time behavior of the implementations still differs significantly:

Case 1: testing with values that are not 10 (negative case).

     Method |       Jit | Platform |     Mean 
----------- |---------- |--------- |----------
   TestBest | LegacyJit |      X64 | 2.329 ms
    TestOpt | LegacyJit |      X64 | 2.704 ms
 TestNonOpt | LegacyJit |      X64 | 3.324 ms
   TestBest | LegacyJit |      X86 | 1.956 ms
    TestOpt | LegacyJit |      X86 | 2.178 ms
 TestNonOpt | LegacyJit |      X86 | 2.796 ms
   TestBest |    RyuJit |      X64 | 2.480 ms
    TestOpt |    RyuJit |      X64 | 2.489 ms
 TestNonOpt |    RyuJit |      X64 | 3.101 ms
   TestBest |    RyuJit |      X86 | 1.865 ms
    TestOpt |    RyuJit |      X86 | 2.170 ms
 TestNonOpt |    RyuJit |      X86 | 2.853 ms

      

Case 2: testing with the value 10 (positive case).

     Method |       Jit | Platform |     Mean
----------- |---------- |--------- |---------
   TestBest | LegacyJit |      X64 | 2.396 ms
    TestOpt | LegacyJit |      X64 | 2.780 ms
 TestNonOpt | LegacyJit |      X64 | 3.370 ms
   TestBest | LegacyJit |      X86 | 2.044 ms
    TestOpt | LegacyJit |      X86 | 2.199 ms
 TestNonOpt | LegacyJit |      X86 | 2.533 ms
   TestBest |    RyuJit |      X64 | 2.470 ms
    TestOpt |    RyuJit |      X64 | 2.532 ms
 TestNonOpt |    RyuJit |      X64 | 2.552 ms
   TestBest |    RyuJit |      X86 | 1.911 ms
    TestOpt |    RyuJit |      X86 | 2.210 ms
 TestNonOpt |    RyuJit |      X86 | 2.753 ms

      

It is interesting to see that in both cases the new JIT (RyuJIT) on X64 produces fairly similar times for the optimized and non-optimized versions.

The question remains: why doesn't the compiler optimize these patterns? My guess is that it is because of things like operator overloading, which would make it impossible for the compiler to safely draw such inferences, but I could be completely wrong here... Still, for built-in value types it should be possible. Oh well...

Finally, here's a good article on optimizing boolean expressions for speed: https://hbfs.wordpress.com/2008/08/26/optimizing-boolean-expressions-for-speed/
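One technique in that general vein (a common trick, not necessarily taken from that article) is replacing short-circuit && / || with bitwise & / | when both operands are cheap and side-effect free, which often lets the compiler emit branch-free code. A minimal C sketch with hypothetical function names:

#include <stdbool.h>

/* Illustrative sketch only. The bitwise form evaluates both comparisons
 * unconditionally, which often lets the compiler emit branch-free code
 * (e.g. setcc + and on x86) instead of two conditional branches.
 * Always measure: modern compilers may already generate identical code
 * for both forms. */
bool in_range_branchy(int x, int lo, int hi) {
    return x >= lo && x <= hi;     /* short-circuit: may branch twice */
}

bool in_range_branchless(int x, int lo, int hi) {
    return (x >= lo) & (x <= hi);  /* both sides always evaluated */
}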
