This article describes a fast and portable memcpy implementation that can replace the standard library version of memcpy when higher performance is needed. This implementation has been used successfully in several project where performance needed a boost, including the iPod Linux port, the xHarbour Compiler, the pymat python-Matlab interface, the Inspire IRCd client, and various PSP games.
The story began when a co-worker of mine made an implementation of memcpy that he was very proud of. His implementation was faster than many standardized C library routines found in the embedded market. When looking at his code, I found several places where improvements could be made. I made an implementation, which was quite a lot faster than my co-vorker's and this started a friendly competition to make the fastest portable C implementation of memcpy(). Both our implementations got better and better and looked more alike and finally we had an implementation that was very fast and that beats both the native library routines in Windows and Linux, especially when the memory to be copied is not aligned on a 32 bit boundary.
The following paragraphs contain descriptions to some of the techniques used in the final implementation.
The story began when a co-worker of mine made an implementation of memcpy that he was very proud of. His implementation was faster than many standardized C library routines found in the embedded market. When looking at his code, I found several places where improvements could be made. I made an implementation, which was quite a lot faster than my co-vorker's and this started a friendly competition to make the fastest portable C implementation of memcpy(). Both our implementations got better and better and looked more alike and finally we had an implementation that was very fast and that beats both the native library routines in Windows and Linux, especially when the memory to be copied is not aligned on a 32 bit boundary.
The following paragraphs contain descriptions to some of the techniques used in the final implementation.
2. Mimic the CPU's Architecture
One of the biggest advantages in the original implementation was that the C code was made to imitate the instruction sets on the target processor. My co-work discovered that different processors had different instructions for handling memory pointers. On a Motorola 68K processor the code
*dst8++ = *src8++;
that copies one byte from the address src8 to the address dst8 and increases both pointers, compiled into a single instruction:
MOV.B (A0)+, (A2)+
This piece of code can be put into a while loop and will copy memory from the address src to the address dest:
void* memcpy(void* dest, const void* src, size_t count) {
char* dst8 = (char*)dest;
char* src8 = (char*)src;
while (count--) {
*dst8++ = *src8++;
}
return dest;
}
While this is pretty good for the Motorola processor, it is not very efficient on a PowerPC that does not have any instructions for post incrementing pointers. The PowerPC uses four instructions for the same task that only required one instruction on the Motorola processor. However, the PowerPC has a set of instructions to load and store data that utilize pre increment of pointers which means that the following code only results in two instructions when compiled on the PowerPC:
*++dst8++ = *++src8;
In order to use this construction, the pointers have to be decreased before the loop begins and the final code becomes:
void* memcpy(void* dest, const void* src, size_t count) {
char* dst8 = (char*)dest;
char* src8 = (char*)src;
--src8;
--dst8;
while (count--) {
*++dst8 = *++src8;
}
return dest;
}
Unfortunately some processors have no instructions for either pre increment or post increment of pointers. This means that the pointer needs to be incremented manually. If the example above is compiled to an processor without pre and post increment instructions, the while loop would actually look something like:
while (count--) {dst8[0] = src8[0];
++dst8;
++src8;
}
There are luckily other ways of accessing memory. Some processors can read and write to memory at a fixed offset from a base pointer. This resulted in a third way of implementing the same task:
void *memcpy(void* dest, const void* src, size_t count) {
char* dst8 = (char*)dest;
char* src8 = (char*)src;
if (count & 1) {
dst8[0] = src8[0];
dst8 += 1;
src8 += 1;
}
count /= 2;
while (count--) {
dst8[0] = src8[0];
dst8[1] = src8[1];
dst8 += 2;
src8 += 2;
}
return dest;
}
Here the number of turns the loop has to be executed is half of what it was in the earlier examples and the pointers are only updated half as often.
3. Optimizing memory accesses
In most systems, the CPU clock runs at much higher frequency than the speed of the memory bus. My first improvement to my co-worker's original was to read 32 bits at the time from the memory. It is of course possible to read larger chunks of data on some targets with wider data bus and wider data registers. The goal with the C implementation of memcpy() was to get portable code mainly for embedded systems. On such systems it is often expensive to use data types like double and some systems doesn't have a FPU (Floating Point Unit).
By trying to read and write memory in 32 bit blocks as often as possible, the speed of the implementation is increased dramatically, especially when copying data that is not aligned on a 32-bit boundary.
It is however quite tricky to do this. The accesses to memory need to be aligned on 32-bit addresses. The implementation needs two temporary variables that implement a 64-bit sliding window where the source data is kept temporary while being copied into the destination. The example below shows how this can be done when the destination buffer is aligned on a 32-bit address and the source buffer is 8 bits off the alignment:
By trying to read and write memory in 32 bit blocks as often as possible, the speed of the implementation is increased dramatically, especially when copying data that is not aligned on a 32-bit boundary.
It is however quite tricky to do this. The accesses to memory need to be aligned on 32-bit addresses. The implementation needs two temporary variables that implement a 64-bit sliding window where the source data is kept temporary while being copied into the destination. The example below shows how this can be done when the destination buffer is aligned on a 32-bit address and the source buffer is 8 bits off the alignment:
srcWord = *src32++;
while (len--) {
dstWord = srcWord << 8;
srcWord = *src32++;
dstWord |= srcWord >> 24;
*dst32++ = dstWord;
}
4. Optimizing branches
Another improvement is to make it easier for the compiler to generate code that utilizes the processors compare instructions in an efficient way. This means creating loops that terminates when a compare value is zero. The loop
while (++i > count)
often generates more complicated and inefficient code than
while (count--)
Another thing that makes the code more efficient is when the CPU's native loop instructions can be used. A compiler often generates better code for the loop above than for the following loop expression
while (count -= 2)
5. Conclusion
The techniques described here makes the C implementation of memcpy() a lot faster and in many cases faster than commercial ones. The implementation can probably be improved even more, especially by using wider data types when available. If the target and the compiler supports 64-bit arithmetic operations such as the shift operator, these techniques can be used to implement a 64-bit version as well. I tried to find a compiler with this support for SPARC but I didn't find one. If 64-bit operations can be made in one instruction, the implementation will be faster than the native Solaris memcpy() which is probably written in assembly.
The version available for download in the end of the article, extends the algorithm to work on 64-bit architectures. For these architectures, a 64-bit data type is used instead of a 32-bit type and all the methods described in the article is applied to the bigger data type.
The version available for download in the end of the article, extends the algorithm to work on 64-bit architectures. For these architectures, a 64-bit data type is used instead of a 32-bit type and all the methods described in the article is applied to the bigger data type.
6. The complete source code
The complete source code for a memcpy() implementation using all the tricks described in the article can be downloaded from
http://www.vik.cc/daniel/memcpy.zip
The code is configured for an intel x86 target but it is easy to change configuration as desired.

4 comments:
I'm amazed that you haven't mentioned Duff's device anywhere. Am I missing something? http://en.wikipedia.org/wiki/Duff%27s_device
I actually used early on, but Duff's device only works well for pre and post increment. Most targets seem to work best with indexed copy, and then you can't use Duff's device like presented on the wiki. It is easy though to replace the while (length & 7) in COPY_SHIFT with a switch statement, but iirc, it didn't really give a performance boost. But such switch statement isn't a DUff's device since it doesn't have the while loop, but the idea is similar.
I tried this as-is in a c++ project compiled with Xcode for x86_64 on a Mac Mini with the following code:
int copy1[50];
int copy2[50];
for (int i = 0; i < 500000000; ++i)
{
memcpy(copy1, copy2, 50 * sizeof(int));
}
The standard memcpy took about 7 seconds, the new memcpy took about 14 seconds. Am I using it in the wrong environment, or with the wrong data set to take advantage of its speed?
I think you are using it correct. Not sure what optimization you turned on, but I don't think you'll beat the standard memcpy when copying aligned data on a standard processor. Nowadays, most standard library memcpy's are pretty good, especially on established processors.
This implementation is mainly beneficial if you are running a new embedded processor and sometimes on other systems (like initial versions of clib for PSP).
Occasionally this implementation works better than standard implementations on unaligned data, even on more established processors, but I wouldn't bother testing it on e.g. x86-64 unless speed is really a big problem.
Post a Comment