CPU-Dependent Code: How to Avoid Function Pointers?

I have performance-critical code written for multiple processors. I detect the processor at runtime and, based on that, call the appropriate function for the detected CPU. So right now I have to use function pointers and call the functions through them:

void do_something_neon(void);
void do_something_armv6(void);

void (*do_something)(void);

if (cpu == NEON) {
    do_something = do_something_neon;
} else {
    do_something = do_something_armv6;
}

// Use the function pointer:
do_something();
...


Not that it matters much, but I'll mention that I have optimized routines for different processors: armv6, and armv7 with NEON support. The problem is that going through function pointers makes the code slower in many places, and I would like to avoid that.

Basically, at load time the dynamic linker resolves relocations and patches the code with the final function addresses. Is there a way to control this behavior?

Personally, I can think of two ways to avoid function pointers. The first: build two separate .so files (or DLLs) for the processor-dependent functions, put them in different folders, and based on the detected CPU add one of those folders to the search path (or LD_LIBRARY_PATH). The main code loads normally, and the dynamic linker picks up the required library from the search path. The second: compile two complete copies of the whole library :)

The disadvantage of the first method is that it forces me to have at least three shared objects (DLLs): two for the processor-dependent functions and one for the main code that uses them. I need three because CPU detection has to happen before loading the code that uses these CPU-dependent functions. The good part of the first method is that the application will not load multiple copies of the same code for different processors; it loads only the copy that will actually be used. The disadvantage of the second method is obvious enough that there is no need to discuss it.

I would like to know if there is a way to do this without shared objects loaded manually at runtime. One way would be a hack: patching the code at runtime, which is probably too hard to get right. :) Is there a better way to control relocation at load time? Maybe placing the processor-specific functions in different sections and then somehow indicating which section takes priority? I think the Mach-O format has something similar.

An ELF-only solution (for ARM targets) is enough for me; I don't care about PE (DLLs).

Thanks.

+3




4 answers


Here is the exact answer I was looking for.

 GCC's __attribute__((ifunc("resolver")))




This requires fairly recent binutils.
There's a good article describing this extension: Gnu support for cpu dispatching - kind ...

+1




You might want to look at an extension to the GNU dynamic linker, STT_GNU_IFUNC. From Drepper's blog post when it was added:

Therefore, I've developed an ELF extension which allows making the decision about which implementation to use once per process run. It is implemented using a new ELF symbol type (STT_GNU_IFUNC). Whenever the symbol lookup resolves to a symbol of this type, the dynamic linker does not immediately return the found value. Instead, it interprets the value as a function pointer to a function that takes no argument and returns a pointer to the real function to use. The code called can be under control of the implementer, and it can choose, based on whatever information the implementer wants to use, which of two or more implementations to use.



Source: http://udrepper.livejournal.com/20948.html

However, as others have said, I think you are mistaken about the performance impact of indirect calls. All code in shared libraries is already called through a (hidden) function pointer: the GOT entry, reached via a PLT stub that loads and calls that pointer.

+3




For maximum performance, you need to minimize the number of indirect calls (through pointers) made per second and let the compiler optimize your code better (DLLs make this harder, because there has to be a hard boundary between the DLL and the main executable, and no optimization happens across that boundary).

I would suggest doing the following:

  • move the parts of the main executable's code that frequently call DLL functions into the DLLs themselves. This keeps the number of indirect calls per second to a minimum and allows better compile-time optimization.
  • move almost all of your code into the separate processor-specific libraries and leave main() only the task of loading the proper DLL, OR build separate processor-specific executables with no DLLs at all.
+2




Lazy loading of ELF symbols from shared libraries is described in section 1.5.5 of Ulrich Drepper's DSO How To (updated 2011-12-10). For ARM it is described in section 3.1.3 of ELF for ARM.

EDIT: see the STT_GNU_IFUNC extension mentioned by R.. — I forgot it was an extension. GNU Binutils has supported it for ARM apparently since March 2011, according to the changelog.

If you want to call functions without PLT indirection, I suggest either function pointers, or intra-library calls inside a shared library, which do not go through the PLT (careful: a call to an exported function goes through the PLT even from within the same library).

I would not patch the code at runtime. I mean, you could: add a build step that, after compilation, parses your binaries, finds all call sites of functions that have multiple alternatives, builds a patch-location table, and links it into your code. At startup, remap the text segment writable, fix up the call offsets according to the prepared table, map it back read-only, flush the instruction cache, and continue. I'm pretty sure this could be made to work. But how much gain do you expect from this approach? I think loading different shared libraries at runtime is simpler, and function pointers are simpler still.

0








