Memcpy where size is known at compile time

I find myself tuning a piece of code where memory is copied with memcpy and the third parameter (the size) is known at compile time.

The consumer of the function calling memcpy does something similar to this:

template <size_t S>
void foo() {
    void* dstMemory = whateverA;
    void* srcMemory = whateverB;
    memcpy(dstMemory, srcMemory, S);
}

      

Now, I would have expected the memcpy intrinsic to be smart enough to realise that this:

foo<4>()

... can replace the memcpy in the function with a 32-bit integer assignment. However, I was surprised to see a >2x speedup by doing this:

template<size_t size>
inline void memcpy_fixed(void* dst, const void* src) {
    memcpy(dst, src, size);
}


template<>
inline void memcpy_fixed<4>(void* dst, const void* src) { *((uint32_t*)dst) =  *((uint32_t*)src); }

      

And rewriting foo to:

template <size_t S>
void foo() {
    void* dstMemory = whateverA;
    void* srcMemory = whateverB;
    memcpy_fixed<S>(dstMemory, srcMemory);
}

      

Both tests are on clang (OS X) with -O3. I really would have expected memcpy to be smarter when the size is known at compile time.

My compiler flags:

-gline-tables-only -O3 -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer

Am I asking too much of my C++ compiler, or is there some compiler flag I am missing?

+3




2 answers


memcpy is not the same as *((uint32_t*)dst) = *((uint32_t*)src).

memcpy can handle unaligned memory.
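To illustrate the alignment point, here is a minimal sketch of a fixed-size copy that stays valid for unaligned addresses. The helper names copy_fixed and load_u32 are hypothetical and not from the question; the idea is only that routing the bytes through memcpy avoids dereferencing a possibly misaligned uint32_t pointer, and an optimizing compiler will typically turn such a small constant-size memcpy into plain moves anyway.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: copies exactly Size bytes, where Size is a
// compile-time constant. memcpy makes no alignment assumptions about
// dst or src, unlike a cast to uint32_t* followed by a dereference.
template <std::size_t Size>
inline void copy_fixed(void* dst, const void* src) {
    assert(dst != nullptr && src != nullptr);
    std::memcpy(dst, src, Size);
}

// Hypothetical helper: reads a 32-bit value from an address that may
// not be 4-byte aligned, by copying the bytes into a local variable.
inline std::uint32_t load_u32(const void* src) {
    assert(src != nullptr);
    std::uint32_t value;
    std::memcpy(&value, src, sizeof value);
    return value;
}
```

Calling copy_fixed<4>(out, buf + 1) on a byte buffer is fine here, whereas the cast-based specialization from the question would read through a misaligned pointer in that case.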



By the way, most modern compilers replace a memcpy of known size with suitable inline code. For small sizes, the compiler usually emits things like rep movsb, which may not be the fastest option but is good enough in most cases.

If in your specific case you get a 2x speedup and you think you need it, go ahead and get your hands dirty (with clear comments).

+5




If both source and target buffers are provided as function parameters:

template <size_t S>
void foo(char* dst, const char* src) {
    memcpy(dst, src, S);
}

      



then clang++ 3.5.0 only calls memcpy when S is big, but it uses a movl instruction when S = 4.

However, your source and destination addresses are not parameters to this function, and this seems to prevent the compiler from doing this aggressive optimization.
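One way to check this on your own compiler is to force the template to be instantiated for a small and a large size and then read the generated assembly. The following is only a sketch: the size 4096 is an arbitrary choice for "big", and the explicit instantiations are there just so that code is emitted for both variants even without a caller.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Same shape as the function above: both buffers arrive as parameters
// and the size is a compile-time constant.
template <std::size_t S>
void foo(char* dst, const char* src) {
    assert(dst != nullptr && src != nullptr);
    std::memcpy(dst, src, S);
}

// Explicit instantiations so the compiler emits both versions.
// 4096 is an arbitrary "large" size picked for illustration.
template void foo<4>(char*, const char*);
template void foo<4096>(char*, const char*);
```

Compiling this file with -O3 -S and comparing the two instantiations shows whether the small copy was lowered to a single move and whether the large one still calls memcpy.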

+1

