Optimized code in VC++ and ASM

Good evening. Sorry, I used Google Translate. I am using NASM with VC++ on x86, and I am learning how to use MASM on x64.

Is it possible to specify where each argument to an assembly function goes in and comes out, so that the compiler can place the data there in the fastest way possible? Can we also specify which registers will be used, so the compiler knows which data is still live and can make the best use of it?

For example, since there is no intrinsic that maps exactly to IDIV r/m64 (64-bit integer division in assembly language), we may need to implement it ourselves. IDIV requires the low part of the dividend/numerator in RAX, the high part in RDX, and the divisor/denominator in any register or memory location. At the end, the quotient is in RAX and the remainder is in RDX. So we might want to write functions like this (hypothetical syntax, with example usage):

void DivLongLongInt( long long NumLow , long long NumHigh , long long Den , long long *Quo , long long *Rem ){
    __asm(
        // Specify used register: [rax], specify pre location: NumLow --> [rax]
        reg(rax)=NumLow ,
        // Specify used register: [rdx], specify  pre location: NumHigh --> [rdx]
        reg(rdx)=NumHigh ,
        // Specify required memory: memory64bits [den], specify pre location: Den --> [den]
        mem[64](den)=Den ,
        // Specify used register: [st0], specify pre location: Const(12.5) --> [st0]
        reg(st0)=25*0.5 ,
        // Specify used register: [bh]
        reg(bh) ,
        // Specify required memory: memory64bits [nothing]
        mem[64](nothing) ,
        // Specify used register: [st1]
        reg(st1)
    ){
        // Specify code
        IDIV [den]
    }(
        // Specify post location: [rax] --> *Quo
        *Quo=reg(rax) ,
        // Specify post location: [rdx] --> *Rem
        *Rem=reg(rdx)
    ) ;
}


Is there anything that can be done that at least comes close to this? Thanks for the help.

If there is no way to do this, it's a shame, because it would be a great way to combine high-level code with assembly-level functionality. I think a simple interface like this between C++ and ASM should already exist, letting you write inline assembly that reads almost like plain C++ code.

2 answers


As others have mentioned, MSVC does not support any form of inline assembly when targeting x86-64.

Inline assembly is only supported in x86-32 builds, and even there it is rather limited in what you can do. In particular, you cannot specify inputs and outputs, so using inline assembly necessarily entails a lot of shuffling of values between registers and memory, which is exactly the opposite of what you want when writing high-performance code. Unless there is something you cannot do any other way than by manually emitting machine code, you should avoid inline assembly. Its original purpose was to do things like generate OUT instructions and ROM BIOS interrupts in legacy 8-bit and 16-bit programming environments. It made it into the 32-bit compiler for compatibility purposes, but the team drew the line at a 64-bit version.

Intrinsics are now the recommended solution because they play much better with the optimizer. Pretty much any SIMD code you need the compiler to generate can be accomplished using intrinsics, just as you would with most other compilers targeting x86, so not only do you get better code, you also get slightly more portable code.

Even with Gnu-style compilers that support extended asm blocks, which give you the kind of input/output operand control you are looking for, there are still many good reasons to avoid inline asm. Intrinsics are still a better solution, as is finding a way to represent what you want in C and convincing the compiler to generate the assembly code you want it to emit.
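For reference, the kind of input/output constraints being referred to look roughly like this in GNU C (a sketch only; this constraint syntax is a GCC/Clang feature and is not available in MSVC, and the wrapper name is made up for illustration):

#include <cstdint>

// Illustrative sketch: GNU extended asm lets you pin inputs and outputs to registers.
static inline int64_t Div128by64(int64_t numLow, int64_t numHigh,
                                 int64_t den, int64_t* rem)
{
    int64_t quot;
    __asm__("idivq %[den]"
            : "=a"(quot), "=d"(*rem)                      // outputs: RAX = quotient, RDX = remainder
            : "0"(numLow), "1"(numHigh), [den] "rm"(den)  // inputs: RAX = low half, RDX = high half
            : "cc");                                      // flags are clobbered
    return quot;
}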

The only exception is when there are no intrinsics available. The IDIV instruction, unfortunately, is one of those cases. (There are intrinsics for 128-bit multiplication. They go by different names: Windows-specific or compiler-specific.)
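For example, a minimal sketch using the MSVC-specific _umul128 intrinsic from <intrin.h> (the intrinsic name is real; the wrapper function is just for illustration):

#include <intrin.h>
#include <cstdint>

// Full 64x64 -> 128-bit unsigned multiply; the low half is returned by the
// intrinsic and the high half is written through the pointer argument.
uint64_t MulHigh64(uint64_t a, uint64_t b)
{
    uint64_t high;
    _umul128(a, b, &high);
    return high;   // on Gnu compilers: (uint64_t)(((unsigned __int128)a * b) >> 64)
}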

On Gnu compilers that support 128-bit integer types as an extension on 64-bit targets, you can simply get the compiler to generate the code for you:

#include <cstdint>

__int128_t dividend = 1234;
int64_t    divisor  = 64;
int64_t    quotient = (dividend / divisor);   // 128-bit dividend, 64-bit divisor


This currently compiles to a call to a library function that performs the 128-bit division, rather than to an inline IDIV instruction returning a 64-bit quotient. Presumably this is because of the need to handle overflow, as David mentioned. Actually, it's worse than that: no C or C++ implementation could use the DIV / IDIV instructions here, because they are non-conforming. Those instructions raise an overflow exception, whereas the standard says the result should be truncated. (With multiplication, you do get inline IMUL / MUL instructions, because they don't have the overflow problem: they return 128-bit results.)
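To make the overflow point concrete, here is a sketch (again using the Gnu __int128 extension) of a case that is perfectly legal in C++ but that a raw IDIV could not handle:

#include <cstdint>

__int128 dividend = (__int128)1 << 100;   // a dividend whose quotient will not fit in 64 bits
int64_t  divisor  = 3;

// The division is performed in 128 bits, so this is fine; only the final
// conversion narrows the value. A hardware IDIV with RDX:RAX holding this
// dividend would raise a divide-error (#DE) exception instead, which is one
// reason the compiler calls a library routine rather than emitting IDIV.
int64_t quotient = (int64_t)(dividend / divisor);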

This doesn't actually hurt as much as you might think. You seem to be assuming that the 64-bit IDIV instruction is very fast. It isn't. Although the actual numbers vary based on the number of significant bits in the absolute value of the dividend, your values are probably quite large if you really need the range of a 128-bit dividend, so the latency will be at the high end. A look at Agner Fog's instruction tables will give you some idea of the performance you can expect on various architectures. It keeps getting faster on newer architectures (especially on newer AMD processors; it is still rather sluggish on Intel), but it still has quite significant latencies. Just because it is one instruction doesn't mean it runs in one cycle or anything like that. A single instruction can be useful for code density when you are optimizing for size and worried about a call to a library function evicting other instructions from your cache, but division is a slow enough operation that this usually doesn't matter. In fact, division is so slow that compilers try very hard not to use it: whenever possible, they multiply by the reciprocal instead, which is significantly faster. And if you really need to do multiplications quickly, you should look into parallelizing them with SIMD instructions, which all have intrinsics available.
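For a taste of what that reciprocal trick looks like, here is a sketch for a compile-time-constant divisor (0xAAAAAAAB is the standard magic value for unsigned 32-bit division by 3; compilers derive such constants automatically):

#include <cassert>
#include <cstdint>

// Division by a constant replaced by a multiply and a shift: 0xAAAAAAAB is
// ceil(2^33 / 3), so the top bits of the 64-bit product give the quotient.
uint32_t DivideBy3(uint32_t x)
{
    return static_cast<uint32_t>((static_cast<uint64_t>(x) * 0xAAAAAAABu) >> 33);
}

int main()
{
    for (uint32_t x = 0; x < 1000000; ++x)
        assert(DivideBy3(x) == x / 3);   // matches real division for every tested value
}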



Back to MSVC (although of course everything I said in the last paragraph still applies): there are no 128-bit integer types, so if you need to implement this kind of division, you will need to write the code in an external assembly module and link it in. The code is pretty straightforward, and Visual Studio has excellent built-in support for assembling code with MASM and linking it directly into your project:

; Windows 64-bit calling convention passes parameters as follows:
; RCX == first  64-bit integer parameter (low bits of dividend)
; RDX == second 64-bit integer parameter (high bits of dividend)
; R8  == third  64-bit integer parameter (divisor)
; R9  == fourth 64-bit integer parameter (pointer to remainder)
Div128x64 PROC
    mov  rax, rcx
    idiv r8          ; 128-bit divide (RDX:RAX / R8)
    mov  [r9], rdx   ; store remainder
    ret              ; return, with quotient in RAX
Div128x64 ENDP


Then you just prototype this function in your C++ code as:

#include <cstdint>

// extern "C" prevents C++ name mangling so the linker finds the MASM symbol
extern "C" int64_t Div128x64(int64_t  loDividend,
                             int64_t  hiDividend,
                             int64_t  divisor,
                             int64_t* pRemainder);


and you're done. Call it wherever you need it.
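For illustration, a small usage sketch (assuming the MASM module above has been assembled and linked into the project):

#include <cstdint>
#include <cstdio>

extern "C" int64_t Div128x64(int64_t, int64_t, int64_t, int64_t*);

int main()
{
    int64_t remainder = 0;
    // Divide the 128-bit value 2^64 (high half = 1, low half = 0) by 3.
    // The quotient, 0x5555555555555555, still fits in 64 bits; if it did not,
    // IDIV would raise a divide-error exception, as discussed above.
    int64_t quotient = Div128x64(/*loDividend=*/0, /*hiDividend=*/1,
                                 /*divisor=*/3, &remainder);
    std::printf("quotient = %lld, remainder = %lld\n",
                (long long)quotient, (long long)remainder);
}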

The equivalent can be written for unsigned division using the DIV instruction.

No, you don't get fancy register allocation, but that isn't a big deal thanks to register renaming in the front end, which can often elide register-register moves entirely (in other words, MOV instructions are close to free). Besides, the IDIV instruction is so restrictive in terms of its operands, with them being hardcoded to RAX and RDX, that it is pretty unlikely the scheduler could have kept the values in those registers anyway, at least for any non-trivial piece of code.

Beware that once you write the necessary code to check for possible overflow, or worse, code to handle exceptions, this will very likely end up performing the same as or worse than a library function that does a proper 128-bit division, so you should probably just write and use that (until Microsoft sees fit to provide one). It can be written in C (see also the implementation of the __divti3 library function for Gnu compilers), which makes it a candidate for inlining and otherwise plays better with the optimizer.

No, it cannot be done. MSVC does not support inline assembly for x64 builds. You should use intrinsics instead; almost everything is available. The sad thing is that, as far as I know, the 128-bit idiv is missing from the available intrinsics.



Note: you can solve your problem with two movs (to put the inputs into the correct registers). And you shouldn't worry about that; current processors handle mov very well. Putting movs into the code may not slow it down at all. And div is very expensive compared to mov, so it doesn't really matter.
