Unmanaged-to-managed interop performance: x86 vs x64

In my tests, I see the cost of an unmanaged-to-managed interop call roughly double when compiling for x64 instead of x86. What is causing this slowdown?

I am testing Release builds, run outside the debugger. The loop is 100,000,000 iterations.

On x86, I average 8 ns per interop call, which seems in line with numbers I've seen elsewhere: Unity quotes 8.2 ns for x86 interop, and Microsoft and a Hans Passant article mention 7 ns. 8 ns is 28 clock cycles on my machine, which at least seems reasonable, though I do wonder if it's possible to go faster.

On x64, I average 17 ns per interop call. I can't find anyone mentioning a difference between x86 and x64, or even stating which one they're measuring when quoting times. Unity quotes about 5.9 ns for x64 interop.

Regular function calls (including into an unmanaged C++ DLL) average 1.3 ns. This does not change significantly between x86 and x64.

Below is my minimal C++/CLI code to measure this, although in my actual project I see the same numbers with a native C++ project calling into the managed side of a C++/CLI library.

#pragma managed
void
ManagedUpdate()
{
}


#pragma unmanaged
#include <wtypes.h>
#include <cstdint>
#include <cwchar>

struct ProfileSample
{
    static uint64_t frequency;
    uint64_t startTick;
    const wchar_t* name;
    int count;

    ProfileSample(const wchar_t* name_, int count_)
    {
        name = name_;
        count = count_;

        LARGE_INTEGER win32_startTick;
        QueryPerformanceCounter(&win32_startTick);
        startTick = win32_startTick.QuadPart;
    }

    ~ProfileSample()
    {
        LARGE_INTEGER win32_endTick;
        QueryPerformanceCounter(&win32_endTick);
        uint64_t endTick = win32_endTick.QuadPart;

        uint64_t deltaTicks = endTick - startTick;
        double nanoseconds = (double) deltaTicks / (double) frequency * 1000000000.0 / count;

        wchar_t buffer[128];
        swprintf(buffer, _countof(buffer), L"%s - %.4f ns\n", name, nanoseconds);
        OutputDebugStringW(buffer);

        if (!IsDebuggerPresent())
            MessageBoxW(nullptr, buffer, nullptr, 0);
    }
};

uint64_t ProfileSample::frequency = 0;

int CALLBACK
WinMain(HINSTANCE, HINSTANCE, PSTR, INT)
{
    LARGE_INTEGER frequency;
    QueryPerformanceFrequency(&frequency);
    ProfileSample::frequency = frequency.QuadPart;

    //Warm stuff up
    for ( size_t i = 0; i < 100; i++ )
        ManagedUpdate();

    const int num = 100000000;
    {
        ProfileSample p(L"ManagedUpdate", num);

        for ( size_t i = 0; i < num; i++ )
            ManagedUpdate();
    }

    return 0;
}

      

1) Why does x64 interop cost 17 ns when x86 interop costs 8 ns?

2) Is 8 ns the fastest that I can reasonably expect?

Edit 1

More info: CPU is an i7-4770K @ 3.5 GHz.
The test case is a single C++/CLI project in VS2017.
Default Release configuration.
Full Optimization (/O2).
I've toyed with settings like Favor Size or Speed, Omit Frame Pointers, Enable C++ Exceptions, and Security Check, and none of them changed the x86/x64 discrepancy.

Edit 2

I dug through the disassembly (not something I know very well yet).

On x86, it looks something like:

call    ManagedUpdate
jmp     ptr [__mep@?ManagedUpdate@@$$FYAXXZ]
jmp     _IJWNOADThunkJumpTarget@0

      

On x64 I see:

call    ManagedUpdate
jmp     ptr [__mep@?ManagedUpdate@@$$FYAXXZ]
        //Some jumping around that quickly leads to IJWNOADThunk::MakeCall:
call    IJWNOADThunk::FindThunkTarget
        //MakeCall uses the result from FindThunkTarget to jump into UMThunkStub:

      

FindThunkTarget is fairly heavyweight, and it looks like most of the time is spent there. So my working theory is: on x86 the thunk target is known and execution can jump more or less straight to it, while on x64 the target is unknown and a search has to run to find it before jumping there. I wonder why that is?





1 answer


I have no recollection of ever promising numbers for code like this. 7 nanoseconds is the kind of number you'd expect from C++ interop code: managed code calling native code. This does the opposite, native code calling managed code, otherwise known as "reverse pinvoke".

You are definitely getting the slow flavor of this kind of interop. As far as I can see, the "No AD" in IJWNOADThunk is the nasty little detail. This code did not get the micro-optimization love that is common in interop stubs, and it is very specific to C++/CLI code. It's nasty because it cannot assume anything about the AppDomain in which the managed code needs to run. In fact, it cannot even assume that the CLR is loaded and initialized.

Is 8 ns the fastest that I can reasonably expect?

Yes, this measurement is at the low end of what I've seen. Your hardware is quite a bit better than mine; I test on a mobile Haswell, where I observe roughly 26 to 43 nanoseconds for x86 and 40 to 46 nanoseconds for x64. So you are getting about 3x better times, pretty impressive. Honestly a little too impressive, but you are seeing the same code I am, so we must be measuring the same scenario.

Why does x64 interop cost 17 ns when x86 interop costs 8 ns?

This is not optimal code; the Microsoft programmer was quite pessimistic about which corners could be cut. I have no real idea whether that was justified; the comments in UMThunkStub.asm explain nothing about the choices made.



There is nothing unusual about reverse pinvoke. It happens all the time, for example in a GUI app that processes Windows messages. But it is normally done in a completely different way: such code uses a delegate. And that is the way to get ahead and make this faster. Using Marshal::GetFunctionPointerForDelegate() is the key. I tried this approach:

using namespace System;
using namespace System::Runtime::InteropServices;


void* GetManagedUpdateFunctionPointer() {
    auto dlg = gcnew Action(&ManagedUpdate);
    auto tobereleased = GCHandle::Alloc(dlg);
    return Marshal::GetFunctionPointerForDelegate(dlg).ToPointer();
}

      

It is used in the WinMain() function like this:

typedef void(__stdcall * testfuncPtr)();
testfuncPtr fptr = (testfuncPtr)GetManagedUpdateFunctionPointer();
//Warm stuff up
for (size_t i = 0; i < 100; i++) fptr();

    //...
    for ( size_t i = 0; i < num; i++ ) fptr();

      

That made the x86 version a little faster, and the x64 version just as fast.

If you are going to use this approach, keep in mind that an instance method as the delegate target is faster than a static method in x64 code; the stub has less work to do rearranging the function arguments. And be careful: I took a shortcut with the tobereleased variable. There is a memory-management detail here, and in a plugin scenario it may be preferable or necessary to call GCHandle::Free().









