Wednesday, February 10, 2010

Fast memcpy in C

1. Introduction

This article describes a fast and portable memcpy implementation that can replace the standard library version of memcpy when higher performance is needed. This implementation has been used successfully in several projects where performance needed a boost, including the iPod Linux port, the xHarbour Compiler, the pymat python-Matlab interface, the Inspire IRCd client, and various PSP games.

The story began when a co-worker of mine made an implementation of memcpy that he was very proud of. His implementation was faster than many standard C library routines found in the embedded market. When looking at his code, I found several places where improvements could be made. I made an implementation which was quite a lot faster than my co-worker's, and this started a friendly competition to make the fastest portable C implementation of memcpy(). Both our implementations got better and better and looked more alike, and finally we had an implementation that was very fast and that beat the native library routines in both Windows and Linux, especially when the memory to be copied is not aligned on a 32-bit boundary.


The following paragraphs describe some of the techniques used in the final implementation.


2. Mimic the CPU's Architecture

One of the biggest advantages of the original implementation was that the C code was made to imitate the instruction sets of the target processor. My co-worker discovered that different processors have different instructions for handling memory pointers. On a Motorola 68K processor the code

    *dst8++ = *src8++;

that copies one byte from the address src8 to the address dst8 and increases both pointers, compiled into a single instruction:

    MOVE.B (A0)+, (A2)+

This piece of code can be put into a while loop and will copy memory from the address src to the address dest:

    void* memcpy(void* dest, const void* src, size_t count) {
        char* dst8 = (char*)dest;
        const char* src8 = (const char*)src;

        while (count--) {
            *dst8++ = *src8++;
        }
        return dest;
    }

While this is pretty good for the Motorola processor, it is not very efficient on a PowerPC, which does not have any instructions for post-incrementing pointers. The PowerPC uses four instructions for the same task that required only one instruction on the Motorola processor. However, the PowerPC has a set of load and store instructions that pre-increment pointers, which means that the following code results in only two instructions when compiled for the PowerPC:

    *++dst8 = *++src8;

In order to use this construction, the pointers have to be decreased before the loop begins and the final code becomes:

    void* memcpy(void* dest, const void* src, size_t count) {
        char* dst8 = (char*)dest;
        const char* src8 = (const char*)src;
        --src8;
        --dst8;
        
        while (count--) {
            *++dst8 = *++src8;
        }
        return dest;
    }

Unfortunately, some processors have instructions for neither pre-increment nor post-increment of pointers. This means that the pointers need to be incremented manually. If the example above is compiled for a processor without pre- and post-increment instructions, the while loop would actually look something like:

    while (count--) {
        dst8[0] = src8[0];
        ++dst8;
        ++src8;
    }


There are luckily other ways of accessing memory. Some processors can read and write to memory at a fixed offset from a base pointer. This resulted in a third way of implementing the same task:

    void *memcpy(void* dest, const void* src, size_t count) {
        char* dst8 = (char*)dest;
        const char* src8 = (const char*)src;
        
        if (count & 1) {
            dst8[0] = src8[0];
            dst8 += 1;
            src8 += 1;
        }
        
        count /= 2;
        while (count--) {
            dst8[0] = src8[0];
            dst8[1] = src8[1];
        
            dst8 += 2;
            src8 += 2;
        }
        return dest;
    }

Here the loop executes only half as many times as in the earlier examples, and the pointers are updated only half as often.
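
The same indexed technique can be unrolled further. Below is a hedged sketch (the name memcpy_unrolled is illustrative, not from the article's download) that copies four bytes per iteration, so the pointers are updated only once every fourth byte:

```c
#include <stddef.h>

/* Indexed copy unrolled to four bytes per iteration.
 * The leading loop copies 0-3 leftover bytes so that the
 * remaining count is an exact multiple of four. */
void* memcpy_unrolled(void* dest, const void* src, size_t count) {
    char* dst8 = (char*)dest;
    const char* src8 = (const char*)src;

    while (count & 3) {            /* handle the remainder first */
        *dst8++ = *src8++;
        --count;
    }

    count /= 4;
    while (count--) {              /* four bytes per iteration */
        dst8[0] = src8[0];
        dst8[1] = src8[1];
        dst8[2] = src8[2];
        dst8[3] = src8[3];

        dst8 += 4;
        src8 += 4;
    }
    return dest;
}
```

The same pattern generalizes to any power-of-two unroll factor; the gain tapers off once the loop overhead is small compared to the copies themselves.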


3. Optimizing memory accesses

In most systems, the CPU clock runs at a much higher frequency than the speed of the memory bus. My first improvement to my co-worker's original was to read 32 bits at a time from memory. It is of course possible to read larger chunks of data on some targets with a wider data bus and wider data registers. The goal of the C implementation of memcpy() was portable code, mainly for embedded systems. On such systems it is often expensive to use data types like double, and some systems don't have an FPU (Floating Point Unit).

By reading and writing memory in 32-bit blocks as often as possible, the speed of the implementation increases dramatically, especially when copying data that is not aligned on a 32-bit boundary.
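
For the fully aligned case, the word-at-a-time idea looks roughly like the sketch below (memcpy_aligned is an illustrative name; the article's actual implementation combines this path with the unaligned ones):

```c
#include <stddef.h>
#include <stdint.h>

/* Word-at-a-time copy for the case where both dest and src are
 * 32-bit aligned. Any trailing 0-3 bytes are copied one at a time. */
void* memcpy_aligned(void* dest, const void* src, size_t count) {
    uint32_t* dst32 = (uint32_t*)dest;
    const uint32_t* src32 = (const uint32_t*)src;
    size_t words = count / 4;

    while (words--) {
        *dst32++ = *src32++;      /* one 32-bit bus access per word */
    }

    /* copy the remaining 0-3 tail bytes */
    char* dst8 = (char*)dst32;
    const char* src8 = (const char*)src32;
    count &= 3;
    while (count--) {
        *dst8++ = *src8++;
    }
    return dest;
}
```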

It is, however, quite tricky to do this. The accesses to memory need to be aligned on 32-bit addresses. The implementation needs two temporary variables that implement a 64-bit sliding window where the source data is kept temporarily while being copied into the destination. The example below shows how this can be done when the destination buffer is aligned on a 32-bit address and the source buffer is 8 bits off the alignment (the shift directions shown correspond to a big-endian byte order; they are reversed on a little-endian target):

    srcWord = *src32++;

    while (len--) {
        dstWord = srcWord << 8;
        srcWord = *src32++;
        dstWord |= srcWord >> 24;
        *dst32++ = dstWord;
    }
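
Putting the fragment above into a complete routine might look like the sketch below. The shifts here are written for a little-endian CPU (the opposite directions from the big-endian fragment above), and copy_shift1 is an illustrative name. The function assumes the source pointer is exactly one byte past a 32-bit boundary and, like the sliding-window technique requires, reads one whole aligned word beyond the last byte it copies:

```c
#include <stddef.h>
#include <stdint.h>

/* Sliding-window copy for a source one byte past a 32-bit boundary
 * and a 32-bit-aligned destination. Little-endian shifts: the low
 * byte of a word is the one at the lowest address.
 * Note: every 32-bit load is from a 4-byte-aligned address, and the
 * loop reads one aligned word past the last source byte copied. */
void copy_shift1(uint32_t* dst32, const uint8_t* src8, size_t words) {
    /* src8 == aligned_base + 1, so back up to the aligned word */
    const uint32_t* src32 = (const uint32_t*)(src8 - 1);
    uint32_t srcWord = *src32++;   /* window holds bytes base..base+3 */

    while (words--) {
        uint32_t dstWord = srcWord >> 8;   /* bytes base+1..base+3 */
        srcWord = *src32++;                /* slide the window */
        dstWord |= srcWord << 24;          /* byte base+4 */
        *dst32++ = dstWord;
    }
}
```

A real implementation repeats this loop for each possible misalignment (8, 16 and 24 bits), picking the pair of shift amounts that always sums to 32.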


4. Optimizing branches

Another improvement is to make it easier for the compiler to generate code that uses the processor's compare instructions efficiently. This means creating loops that terminate when a compare value is zero. The loop

    while (++i < count)

often generates more complicated and inefficient code than

    while (count--)

The code also becomes more efficient when the CPU's native loop instructions can be used. A compiler often generates better code for the loop above than for the following loop expression

    while (count -= 2)
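
As a small illustration of the difference, here are the two loop shapes side by side (fill_up and fill_down are illustrative names, not from the article). Both fill a buffer with the same byte, but the second tests against zero, which most CPUs get for free from the decrement itself:

```c
#include <stddef.h>

/* Up-counting version: the compiler must keep both i and count
 * live and compare them on every iteration. */
void fill_up(char* p, char value, size_t count) {
    for (size_t i = 0; i < count; ++i) {
        p[i] = value;
    }
}

/* Down-counting version: the loop condition is a test against
 * zero, often folded into the decrement or a native loop
 * instruction by the compiler. */
void fill_down(char* p, char value, size_t count) {
    while (count--) {
        *p++ = value;
    }
}
```

With modern optimizing compilers the two often end up identical, but on simpler embedded toolchains the count-down form is the safer bet.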


5. Conclusion

The techniques described here make the C implementation of memcpy() a lot faster, and in many cases faster than commercial ones. The implementation can probably be improved even more, especially by using wider data types when available. If the target and the compiler support 64-bit arithmetic operations such as the shift operator, these techniques can be used to implement a 64-bit version as well. I tried to find a compiler with this support for SPARC, but I didn't find one. If 64-bit operations can be made in one instruction, the implementation will be faster than the native Solaris memcpy(), which is probably written in assembly.

The version available for download at the end of the article extends the algorithm to work on 64-bit architectures. For these architectures, a 64-bit data type is used instead of a 32-bit type, and all the methods described in the article are applied to the bigger data type.
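
On targets with cheap 64-bit operations, the inner loop simply uses a wider window type. A minimal sketch of the aligned 64-bit path, assuming dest, src and count are all multiples of 8 (copy64_aligned is an illustrative name, not the download's actual function):

```c
#include <stddef.h>
#include <stdint.h>

/* 64-bit variant of the aligned inner loop: identical structure
 * to the 32-bit version, just a wider type, so each iteration
 * moves eight bytes per bus access. */
void copy64_aligned(uint64_t* dst64, const uint64_t* src64, size_t count) {
    size_t words = count / 8;
    while (words--) {
        *dst64++ = *src64++;
    }
}
```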


6. The complete source code

The complete source code for a memcpy() implementation using all the tricks described in the article can be downloaded from

http://www.vik.cc/daniel/memcpy.zip

The code is configured for an Intel x86 target, but it is easy to change the configuration as desired.

19 comments:

  1. I'm amazed that you haven't mentioned Duff's device anywhere. Am I missing something? http://en.wikipedia.org/wiki/Duff%27s_device

    ReplyDelete
  2. I actually used it early on, but Duff's device only works well for pre and post increment. Most targets seem to work best with indexed copy, and then you can't use Duff's device as presented on the wiki. It is easy, though, to replace the while (length & 7) in COPY_SHIFT with a switch statement, but IIRC it didn't really give a performance boost. Such a switch statement isn't a Duff's device since it doesn't have the while loop, but the idea is similar.

    ReplyDelete
  3. I tried this as-is in a C++ project compiled with Xcode for x86_64 on a Mac Mini with the following code:

    int copy1[50];
    int copy2[50];
    for (int i = 0; i < 500000000; ++i) {
        memcpy(copy1, copy2, 50 * sizeof(int));
    }

    The standard memcpy took about 7 seconds, the new memcpy took about 14 seconds. Am I using it in the wrong environment, or with the wrong data set to take advantage of its speed?

    ReplyDelete
  4. I think you are using it correctly. Not sure what optimization you turned on, but I don't think you'll beat the standard memcpy when copying aligned data on a standard processor. Nowadays, most standard library memcpy implementations are pretty good, especially on established processors.

    This implementation is mainly beneficial if you are running a new embedded processor and sometimes on other systems (like initial versions of clib for PSP).

    Occasionally this implementation works better than standard implementations on unaligned data, even on more established processors, but I wouldn't bother testing it on e.g. x86-64 unless speed is really a big problem.

    ReplyDelete
  5. Just wanted to let you know that I'm seeing compiler warnings about unreachable statements from the 'break;' statements after the COPY_XXX() macros because the macro is returning.

    ReplyDelete
  6. I've been testing your memcpy() on an embedded system, an ARM7 under the IAR compiler.

    I tried several configurations of the memcpy(), with or without indexes and pre-incremented pointers, and I'm only getting about 65% of the performance of the compiler's built-in version. The performance of copying aligned to unaligned or unaligned to aligned buffers is much quicker with your memcpy(), however: about 2x faster.

    Chris

    ReplyDelete
  7. It's probably more common for data to be aligned, and nowadays built-in memcpy implementations are usually faster for these cases. If you have a mix of unaligned and aligned copies and speed is important, you can always use the built-in memcpy for aligned copies and this one for unaligned. Otherwise I would stick with the built-in. Processors nowadays are so fast that time spent in memcpy isn't a big issue, but there are cases where it matters, and then it's good to have options...

    ReplyDelete
  8. Sorry, I don't understand. Isn't it true that "memcmp" and "memcpy" are implemented using only one x86 instruction? (Only one x86 machine instruction is executed to perform the whole copy in one shot.)
    In that case, how can these functions be optimized by any algorithm using for loops?

    ReplyDelete
  9. I am a beginner who has not quite understood this. Just greetings from tyang @ codeprogram.blogspot.com, just a beginner programmer's blog.

    ReplyDelete
  10. It's my understanding that things like glibc and related libs are not using any worthwhile SIMD-optimised instructions at this time for several routines; perhaps you should investigate and try some options.

    see:
    http://www.freevec.org/content/commentsconclusions

    "... Finally, with regard to glibc performance, even if we take into account that some common routines are optimised (like strlen(), memcpy(), memcmp() plus some more), most string functions are NOT optimised. Not only that, glibc only includes reference implementations that perform the operations one-byte-at-a-time! How's that for inefficient? We're not talking about dummy unused joke functions here like memfrob(), but really important string and memory functions that are used pretty much everywhere, like strcmp(), strncmp(), strncpy(), etc.
    In times where power consumption has become so much important, I would think that the first thing to do to save power is optimise the software, and what better place to start than the core parts of an operating system? I can't speak for the kernel -though I'm sure it's very optimised actually- but having looked at the glibc code extensively the past years, I can say that it's grossly unoptimised, so much it hurts."

    ReplyDelete
  11. Your code does not account for cases when the memory region pointed by the source overlap the memory region pointed by the destination.

    ReplyDelete
  12. Indeed. Typically memcpy doesn't. The standard function that does is memmove. It's probably easy to add a test to check whether the overlap causes problems with the algorithm. Since the algorithm uses 32/64-bit blocks, the special cases depend on alignment, but it shouldn't be hard to add them in order to get the characteristics of memmove.

    ReplyDelete
  13. Technically, --src8; and --dst8; invoke undefined behavior when src and dest point to the beginning of their respective blocks. See 6.5.6.8 in the C99 standard. And indeed some segmented architectures will trap on these instructions.

    So it's not portable, in the sense that it isn't guaranteed to work when compiled by compliant compilers on these architectures.

    ReplyDelete
  14. Adding an integer to a pointer does not, I believe, cause undefined behavior if there is no overflow in the operation. The rule states that if Q = P + 1, then Q - 1 == P, even if Q is outside the boundary of the array.
    However, an indirection of Q (e.g. v = *Q) will cause a memory access violation if the memory is not accessible.

    ReplyDelete
  15. No, it does not state that Q-1==P when Q is outside the boundary of the array. It states: "If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined."

    There is a provision for t+n, but none for t-1 (t being an array of size n).

    I gave the number of the paragraph earlier. Look it up: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf

    ReplyDelete
  16. I talked with my C standard guru friends, and the behavior is indeed undefined, but most architectures and compilers do behave as the snippet intends.

    From a portability point of view, it's not a concern though. The full memcpy implementation contains three different implementations:

    1. Post increment
    2. Pre increment
    3. Indexed copy

    The undefined behavior only applies to option 2, and the others should be fine to use on architectures that don't handle the undefined behavior in the common benign way.

    Thanks for pointing it out.

    ReplyDelete
  17. There is a (mostly benign) bug in the code.

    #if TYPE_WIDTH >= 4

    The code in the #if block will always be included. That should be ">", not ">=".

    ReplyDelete
    Replies
    1. Thanks for finding the bug. I've updated the zip to reflect the change.

      Delete
  18. I tried simpler loops on x86_64 and ARM v7 with GCC.
    There were differences in times, but when I used -O9, all variations gave the same result; this is probably because the loops are really simple.

    https://gist.github.com/nmmmnu/9833691

    ReplyDelete