Memcpy where size is known at compile time
I find myself tuning a piece of code where memory is copied using memcpy and the third parameter (the size) is known at compile time.
The caller of the function that calls memcpy does something similar to this:
template <size_t S>
void foo() {
    void* dstMemory = whateverA;
    void* srcMemory = whateverB;
    memcpy(dstMemory, srcMemory, S);
}
Now I would have expected the compiler's inlined memcpy to be smart enough to realize that this:
foo<4>()
... can replace the memcpy in the function with a 32-bit integer assignment. However, I was surprised to find myself seeing a >2x speedup by doing this:
template<size_t size>
inline void memcpy_fixed(void* dst, const void* src) {
    memcpy(dst, src, size);
}
template<>
inline void memcpy_fixed<4>(void* dst, const void* src) {
    *((uint32_t*)dst) = *((uint32_t*)src);
}
And rewriting foo to:
template <size_t S>
void foo() {
    void* dstMemory = whateverA;
    void* srcMemory = whateverB;
    memcpy_fixed<S>(dstMemory, srcMemory);
}
Both tests were run with clang (OS X) at -O3. I would really have expected memcpy to be smarter when the size is known at compile time.
My compiler flags:
-gline-tables-only -O3 -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer
Am I asking too much of my C++ compiler, or is there some compiler flag I am missing?
memcpy is not the same as *((uint32_t*)dst) = *((uint32_t*)src).
memcpy can handle unaligned memory.
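To make the alignment point concrete, here is a minimal sketch of reading and writing a 32-bit value at an arbitrary byte offset. The helper names load_u32 and store_u32 are hypothetical and exist only for illustration. memcpy has no alignment requirement on its pointers, whereas dereferencing a cast uint32_t* at an odd address is undefined behavior in C++ and may fault on some architectures.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: read a 32-bit value starting at any byte offset.
// memcpy works on bytes, so buf + offset does not need to be a multiple
// of alignof(std::uint32_t).
inline std::uint32_t load_u32(const unsigned char* buf, std::size_t size,
                              std::size_t offset) {
    assert(offset + sizeof(std::uint32_t) <= size);  // stay inside the buffer
    std::uint32_t value;
    std::memcpy(&value, buf + offset, sizeof value);
    return value;
}

// Hypothetical helper: write a 32-bit value at any byte offset.
inline void store_u32(unsigned char* buf, std::size_t size,
                      std::size_t offset, std::uint32_t value) {
    assert(offset + sizeof(std::uint32_t) <= size);  // stay inside the buffer
    std::memcpy(buf + offset, &value, sizeof value);
}
```

For example, store_u32(buf, sizeof buf, 1, v) followed by load_u32(buf, sizeof buf, 1) returns v even though offset 1 is not 4-byte aligned.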
By the way, most modern compilers replace a memcpy of known size with a suitable code sequence. For small sizes, they usually emit something like rep movsb, which may not be the fastest but is good enough in most cases.
If you find that in your specific case you get a 2x speedup and you think you need it, you can get your hands dirty (with clear comments).
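If you do hand-specialize, one possible variant (a sketch, not something measured here) keeps the memcpy_fixed shape from the question but routes the 4-byte case through a local uint32_t instead of a pointer cast. This keeps the copy valid for unaligned addresses. Whether it matches the speed of the cast version depends on the compiler and has to be checked in the generated code.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Generic case: defer to memcpy with a compile-time size.
template <std::size_t Size>
inline void memcpy_fixed(void* dst, const void* src) {
    std::memcpy(dst, src, Size);
}

// 4-byte case: copy through a local temporary. The assumption is that
// the optimizer can turn each fixed-size memcpy into a single 32-bit
// load or store; that must be verified per compiler. Unlike
// *((uint32_t*)dst) = *((uint32_t*)src), this does not require dst or
// src to be 4-byte aligned.
template <>
inline void memcpy_fixed<4>(void* dst, const void* src) {
    std::uint32_t tmp;
    std::memcpy(&tmp, src, sizeof tmp);
    std::memcpy(dst, &tmp, sizeof tmp);
}
```

The call site stays the same as in the question: memcpy_fixed<S>(dstMemory, srcMemory).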
If both source and target buffers are provided as function parameters:
template <size_t S>
void foo(char* dst, const char* src) {
memcpy(dst, src, S);
}
then clang++ 3.5.0 calls memcpy only when S is large, but emits a movl instruction when S = 4.
However, your source and destination addresses are not parameters to this function, and this seems to prevent the compiler from doing this aggressive optimization.