Nesting private and secured virtual function calls

Consider the following piece of C++ code:

#include <iostream>
#include <memory>

class IFoo {
 public:
  virtual ~IFoo() = default;  // needed: FooImpl is deleted through an IFoo*
  virtual void Bar() const = 0;
};

template <typename Derived>
class AbstractFoo : public IFoo {
 public:
  void Bar() const override {
    int i = 0;
    auto derived = static_cast<const Derived *>(this);
    while (derived->ShouldBar(i++)) {
      derived->DoBar();
    }
  }
};

class FooImpl : public AbstractFoo<FooImpl> {
 private:
  bool ShouldBar(int i) const {
    return i < 10;
  }

  void DoBar() const {
    std::cout << "Bar!" << std::endl;
  }

  friend class AbstractFoo<FooImpl>;
};

int main() {
  std::unique_ptr<IFoo> foo(new FooImpl());
  foo->Bar();
}

      

This is, of course, the curiously recurring template pattern with a slight twist: after the virtual method Bar is dispatched polymorphically once through the interface IFoo, the calls to ShouldBar and DoBar remain static and may even be inlined. If this were done by other means (no common AbstractFoo, with ShouldBar and DoBar as private virtual methods), each iteration would incur two virtual calls.

Situations where such an optimization matters include iteration schemes such as depth-first search and saturation of huge state spaces. At some point in these algorithms, a particular implementation must choose in which direction to continue searching, whether to add the state to the result set, etc. Implemented polymorphically, this can lead to millions of virtual calls to relatively small functions (some of them might even be empty!) that show up as a measurable cost in profiling. (Keep in mind that these iterative algorithms usually do no I/O, unlike the toy example above.)

In languages without CRTP, the only alternative is to duplicate the skeleton of the iteration scheme. In C#, for example, this is not too painful because we have partial methods:

interface IFoo {
  void Bar();
}

// This is copy-pasted for every IFoo implementation.
partial class FooImpl : IFoo {
  public void Bar() {
    int i = 0;
    bool shouldBar = false;
    ShouldBar(i++, out shouldBar);
    while (shouldBar) {
      DoBar();
      ShouldBar(i++, out shouldBar);
    }
  }

  partial void ShouldBar(int i, out bool result);

  partial void DoBar();
}

partial class FooImpl {
  partial void ShouldBar(int i, out bool result) {
    result = i < 10;
  }

  partial void DoBar() {
    Console.WriteLine("Bar!");
  }
}

      

As you can see, there is still some awkwardness: the partial methods must return void (hence the out parameter), and the abstract class code must be duplicated for every implementation.

Are there languages / runtimes that can perform this optimization on plain protected virtual methods?

I think the problem boils down to the fact that the machine code for public virtual methods should be generated not once per abstract implementation, but once per concrete class. Thinking in terms of a simple vtable: the IFoo#Bar slot in FooImpl's vtable should not point to AbstractFoo#Bar, but to a JIT-generated, specialized FooImpl#Bar with non-virtual / inlined calls to ShouldBar and DoBar.

Are there any frameworks that are capable of doing this optimization, or at least some research along these lines?


1 answer


Don't use a JIT; use the processor's branch predictor. Any decent processor will try to cache the target of every indirect branch instruction, so the cost of a correctly predicted indirect branch is the same as that of a conditional branch: usually close to zero.

Optimizing this pattern is no different from the normal optimization process. Your profiler should point out specific indirect call instructions as the bottleneck. Optimize by splitting each slow call site into several more predictable ones, like

if ( likely_to_be_FooImpl ) {
    foo->Bar();  // this call site's indirect branch learns to predict FooImpl
} else {
    foo->Bar();  // this one learns the remaining implementations
}

      



Preventing the compiler from eliminating the obviously redundant branch is left as an exercise ;). Or, ideally, one branch doesn't need to be dispatched indirectly at all:

if ( certain_to_be_FooImpl ) {
    static_cast< FooImpl * >( foo )->FooImpl::Bar();  // qualified call: bypasses the vtable
} else {
    foo->Bar();
}

      

In any case, it would be a tall order for a JIT to find correlations between local program state and branch targets. The JIT may notice that a branch tends to go to some specific destination, but the CPU already optimizes that case in hardware. Conversely, as long as the number of branch sites does not exceed the predictor's memory limit, even the indirect branches will be predicted correctly.
