Wednesday, October 20, 2010

Calculating Reverse CRC

Background

    Cyclic Redundancy Check (CRC) is a commonly used method for performing integrity checks on data. It is a simple hash function that provides assurance that data is not corrupted, and it is widely used in data communication and data storage applications.

    CRC is not suitable for protecting against intentional modification of data. It is a cryptographically weak algorithm and can easily be reversed. This article describes a simple algorithm to generate a four byte data sequence that can be added to a message to make the resulting CRC any value you want.


CRC algorithm

    The mathematics behind CRC is described in many articles on the internet and I will not go into any details about the theory. The algorithm to calculate a ‘reverse CRC’ described here is based on the 32-bit polynomial, CRC-32-IEEE, most commonly used by standards bodies, but can easily be adapted to other CRC types.

    The algorithm described in this article uses a table driven CRC implementation. A CRC lookup table contains 256 pre-calculated values, derived from the CRC polynomial. When the CRC of a message is calculated, each byte in the message modifies a 32 bit hash:

     crc = (crc >> 8) ^ crc32Table_[*ptr++ ^ (crc & 0xff)];

    One small thing to note is that the final CRC hash is the inverse of the calculated hash. A typical method to calculate CRC of a message or to append more data to a CRC would look something like:


  UInt32 append(const void* buffer, UInt32 size, UInt32 oldCrc)
  {
     const UInt8* ptr = (const UInt8*)buffer;
     UInt32 crc = ~oldCrc;

     while (size--) {
        crc = (crc >> 8) ^ crc32Table_[*ptr++ ^ (crc & 0xff)];
     }

     UInt32 newCrc = ~crc;
     return newCrc;
  }


Reversing CRC

    One of the properties of CRC is that a four byte sequence can be found and appended to any message to produce any desired resulting CRC hash. In other words, a message can be patched with four bytes to give it any hash you want.

    Let us define our desired CRC as AABBCCDD where AA is the most significant byte and DD is the least significant byte. For each byte in the message, the CRC algorithm shifts the hash one byte to the right and XORs it with a value from the lookup table. This means that in each step a new byte enters the most significant position of the hash, and in consecutive calculations it is shifted toward less significant positions. This behavior can be used to quite easily create a sequence of bytes that results in our desired CRC.

    The table below shows how the CRC algorithm transforms the hash into the desired CRC. The xx values are remnants of the CRC of the message preceding the patch bytes, and the numbered values (e.g. DD3) are intermediate values that will still be modified by the remaining table lookups.

        CRC after 1st patch byte  = DD3 xx  xx  xx
        CRC after 2nd patch byte  = CC2 DD2 xx  xx
        CRC after 3rd patch byte  = BB1 CC1 DD1 xx
        CRC after 4th patch byte  = AA  BB  CC  DD


Getting more into the concrete details of the algorithm, we can easily see that the byte AA is the high byte of a CRC table lookup value XORed with 0. The CRC table is called T in the following equations.


      AA = ( T[ i3 ] ^ 0 ) >> 24


What we need to achieve is to find the appropriate index i3 that makes this equation true. This can be rewritten as a Boolean expression

      ( AABBCCDD ^ T[ i3 ]) >> 24 == 0, where 0 <= i3 < 256


    So the challenge is to find the index i3 that satisfies the expression, which shouldn't be hard for anyone with basic programming skills. It is not strictly necessary to spell out the whole hash, AABBCCDD, in the expression above, but doing so simplifies the successive calculations.

Moving forward to the second highest byte, we can unroll the CRC algorithm to get


      BB = ( ( ( BB1 ^ 0 ) >> 8 ) ^ T[ i3 ] ) >> 16


Where BB1 is calculated the same way as AA, i.e.


      BB1 = ( T[ i2 ] ^ 0 ) >> 24


Now we can create a Boolean expression for calculating the table index for the 3rd patch byte. By putting the equations together and rearranging the shifts a bit, we can see that the index of the table lookup for calculating BB1 needs to satisfy the Boolean expression


      (((AABBCCDD ^ T[ i3 ]) << 8) ^ T[ i2 ]) >> 24 == 0, where 0 <= i2 , i3 < 256


Note that the value BB is fetched from the full desired CRC value and shifted up to become the most significant byte, just like AA was the most significant byte in the expression to calculate i3.

Similarly, we can derive the table indexes for calculating the first and second patch bytes, and we end up with four nice equations to solve:


          (AABBCCDD ^ T[ i3 ]) >> 24 == 0
        (((AABBCCDD ^ T[ i3 ]) << 8) ^ T[ i2 ]) >> 24 == 0
      (((((AABBCCDD ^ T[ i3 ]) << 8) ^ T[ i2 ]) << 8) ^ T[ i1 ]) >> 24 == 0
  (((((((AABBCCDD ^ T[ i3 ] ) << 8) ^ T[ i2 ]) << 8) ^ T[ i1 ]) << 8) ^ T[ i0 ]) >> 24 == 0


   So by finding the indexes of the table entries, it is possible to end up with the desired CRC. Solving these equations is easy; just do a reverse lookup on the CRC table. As you can see, the first equation gives the table index for the last patch byte, and that value can be used to calculate the earlier table indexes.
    The code below shows the algorithm in practice. The calculated indexes are stored in the tableIdx array. Note that the desired CRC is inverted before the algorithm starts, since the CRC algorithm operates on inverted values as described in the beginning of the article.


  int tableIdx[4];

  UInt32 iterCrc = ~desiredCrc;
  for (int j = 3; j >= 0; j--) {
     for (int i = 0; i < 256; i++) {
        if (((iterCrc ^ crc32Table_[i]) >> 24) == 0) {
           tableIdx[j] = i;
           iterCrc = (iterCrc ^ crc32Table_[i]) << 8;
           break;
        }
     }
  }


    If performance is a concern, the inner for loop can easily be converted into a table lookup: a 256-entry reverse table indexed by the high byte of the CRC table entries (which are all unique). The code would then look like


  int tableIdx[4];

  UInt32 iterCrc = ~desiredCrc;
  for (int j = 3; j >= 0; j--) {
     /* crc32RevTable_ maps the high byte of a table entry back to its index */
     tableIdx[j] = crc32RevTable_[iterCrc >> 24];
     iterCrc = (iterCrc ^ crc32Table_[tableIdx[j]]) << 8;
  }


    So far we haven't looked at either the CRC of the message we are patching or the actual values of the patch bytes; neither was needed to find the table indexes. The next step in the algorithm is to find the values that generate the indexes needed to produce the final desired CRC. This is done by running the normal CRC algorithm from the end of the original message and appending the four patch bytes. In each step, we need to calculate the actual patch byte before calculating the CRC. Since we know what table index to expect, we simply need to make sure that

  crc32Table_[tableIdx] == crc32Table_[patch ^ (crc & 0xff)]

for each patch byte. So with msgCrc being the CRC hash of the message, we get the following code to finally find the four patch bytes:


  UInt32 crc = ~msgCrc;
  for (int j = 0; j < 4; j++) {
     patch[j] = (crc ^ tableIdx[j]) & 0xff;
     crc = (crc >> 8) ^ crc32Table_[patch[j] ^ (crc & 0xff)];
  }
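
    Putting it all together, usage could look like the sketch below. The findPatch() wrapper is hypothetical; it simply combines the two loops shown above, first finding the table indexes and then the patch bytes.

  UInt8  msg[]      = "Hello, world!";
  UInt32 desiredCrc = 0x12345678;
  UInt8  patch[4];

  /* CRC of the original message (starting hash 0) */
  UInt32 msgCrc = append(msg, sizeof(msg) - 1, 0);

  /* Hypothetical wrapper around the two loops shown above */
  findPatch(desiredCrc, msgCrc, patch);

  /* Appending the patch bytes now yields the desired CRC */
  UInt32 finalCrc = append(patch, 4, msgCrc);   /* == desiredCrc */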


Usage

    Reverse CRC may be useful when a message needs to be modified but the CRC of the message cannot be changed. The message can then be patched as desired, and a four byte patch block inserted to correct the CRC.

    Other uses may include cases where you want to CRC a message but for some reason have constraints on what the generated CRC can be. I've seen cases where this actually was needed, and the best solution was to append a patch to the message to force the final CRC to be within a certain range.


Source Code

    The complete source code of a CRC implementation that includes a method for finding a four byte patch for the reverse CRC, as well as some test cases can be found here.


Monday, August 30, 2010

Overcoming hardware limitations – iPhone accelerometer

Background

    It is not too uncommon to find that the hardware you are given for a project does not meet your expectations and limits the design possibilities. Still, you are required to finish the job and produce a product that is acceptable to the customers. The limitations can come from insufficient processing power, memory constraints, lack of GPIO, bandwidth limitations, inaccurate sensors, and many other factors.

    This article talks about a quite specific issue that I encountered when developing my very first iPhone application. I've been using accelerometers at work for various purposes and am quite familiar with their strengths and limitations, so I figured I would do something fun with the accelerometer in the iPhone. I didn't want to spend a whole lot of time developing the application, but I wanted to create something that was pleasant and fun for the users. So I did a little bit of research and found that no one had made an application that measures the speed of a throw, in particular that of a baseball pitcher. This turned out to be untrue, but I now understand why there aren't any best selling iPhone apps measuring throwing motions.


Measuring speed…    …or not

    At first it sounds like a pretty simple deal: calculate the velocity of a fictitious ball thrown by the user. Just measure the acceleration of the iPhone device and integrate, then show the maximum speed to the user. Of course there is the problem that accelerometer readings include the gravitational force, but I thought that could easily be filtered out using a high pass filter, or better yet, by combining the accelerometer readings with readings from the gyro available in the iPhone 4.

    What I forgot to think about is that the accelerometer only measures acceleration forces up to three times the gravitational force. A throwing motion generates forces up to 10 times the gravitational force, so the acceleration readings cap at 3G and aren't of much use for integrating to get speed.

    Another thing I didn't know about the iPhone is that it is not easy for an application to take high resolution samples from the accelerometer, or any other sensor for that matter. It turns out that although the APIs support high resolution timers, the operating system doesn't give your application unlimited access to the processor. The operating system actually locks out the application for quite significant amounts of time, at least in the context of high frequency sampling. So although it theoretically would be easy to just poll the timer and read accelerometer samples, say, every millisecond, it turned out to not work that well.

    So after a couple of depressing findings I was wondering whether this was the end of my simple app idea. Since I had already done all the graphics and coded the controls, I didn't want to let that all go to waste. So I came up with something that could be done to at least approximately measure the speed.

    After doing some tests and plotting the graphs of different accelerometer readings, I found that it was fairly easy to detect the duration of the throwing motion. The acceleration during the throwing motion was always greater than 3G, so by checking the start and end times of when the accelerometer readings cap at 3G, I got a pretty good estimate of the duration of the throw.

    Then I estimated the length of the throwing motion to be around 0.5 meter. From there it is straightforward to estimate the acceleration using the equation s = v0·t + a·t²/2. Some compensation had to be added for the actual twisting motion of the wrist, as well as for the fact that the acceleration is not linear. Calculating the speed from the approximated acceleration is then trivial.
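
    As a rough sketch, with v0 = 0 and the assumed 0.5 meter arm travel, the estimate boils down to a few lines (the names are mine, not from the actual app):

    /* Sketch: speed estimate from the measured throw duration t (seconds),
       assuming the throw starts from rest and covers s = 0.5 m. */
    double estimateSpeed(double t)
    {
        const double s = 0.5;          /* assumed length of throwing motion */
        double a = 2.0 * s / (t * t);  /* from s = a*t^2/2 with v0 = 0 */
        return a * t;                  /* v = a*t, before wrist compensation */
    }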


Spin Approximation

    Another thing I was hoping to include in the motion calculation was the amount of spin the user adds to the ball. The iPhone is obviously not a baseball, so techniques to add spin could not entirely mimic a real world scenario. But at least I was hoping to simulate it by calculating the rotational speed and then estimating the spin of the ball.

    This, of course, also turned out to be a hopeless task given the limitations of the accelerometer. So to work around the issues I set up a simple model, where the program checks the acceleration vector when the device is at rest just prior to the throwing motion, as well as just after. By calculating the angle between the vectors, it is possible to approximate how the hand was turned during the throw. This is obviously even less accurate than the speed approximation, but testing showed that it was more accurate than I first thought.
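
    A minimal sketch of that angle calculation, using the dot product of the two resting acceleration vectors (function and variable names are illustrative):

    #include <math.h>

    /* Angle in radians between the gravity vectors sampled at rest
       just before and just after the throw. */
    double restAngle(double x1, double y1, double z1,
                     double x2, double y2, double z2)
    {
        double dot = x1*x2 + y1*y2 + z1*z2;
        double n1  = sqrt(x1*x1 + y1*y1 + z1*z1);
        double n2  = sqrt(x2*x2 + y2*y2 + z2*z2);
        return acos(dot / (n1 * n2));
    }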

    Based on the change in angle, it is possible to estimate whether the user put e.g. top spin, clockwise or counter-clockwise spin on the throw. This can roughly be correlated to the different types of pitches in baseball, and even left and right handed pitching can be simulated.





Conclusion

    Although the result is not as great as I was hoping for before I started the project, it turned out to not be too bad either. The biggest issue is that the approximations require the user to follow the models when performing the throwing motion; if the user doesn't move the iPhone according to the model, the results may be quite arbitrary. The problem of getting good high frequency samples also affects the result, and sometimes the output does not accurately reflect the throw.

    Regardless of whether you like the end result or not, it was a pretty exciting challenge to get something reasonably accurate considering the pretty tough constraints of the iPhone accelerometer.

    You can download the application at the Apple App Store, or visit the Baseball iPitcher homepage.

Saturday, May 1, 2010

C Language Quirks

Background

    Over the years, I've had the privilege to work with prototypes of processors still under development that sometimes existed in only a handful of copies in the world. My task has been to port operating systems to these new processors. Like the processors, the accompanying tools were also early beta versions. As with most beta software, the compilers I used were somewhat buggy, and they required quite deep knowledge of the C language in order to debug issues introduced by incorrect compilations.

    Learning the ins and outs of the C language was something I really enjoyed, and at the time I knew more or less the entire C language standard, with all its weird behaviors that are pretty unknown to most regular developers. In this article I'll try to describe some of the more interesting things I learned over the years.


Shift Operator Oddities

    Many developers interested in the details of the C language know that shift operators don't always have an intuitive behavior. For example, right-shifting a negative number may be a division on some targets while it isn't on others. This behavior is implementation-defined, and the result of the operation depends on whether the CPU fills the high bits with the sign bit or not. Basically it is strongly recommended not to use shift operators on signed operands (unless you are absolutely sure you are only working with positive numbers, which you probably can't be).

    My favorite quirks on shift operators are on unsigned operands, for which shift operations generally work fine. Consider the following example and try to guess the output:

    unsigned char c = 2;
    unsigned int  i = 2;

    printf("%d\n", c >> 9);
    printf("%d\n", i >> 33);
    printf("%d\n", c << 9);
    printf("%d\n", i << 33);

    The example performs left and right shifts on a byte and a word, where the amount shifted is bigger than the width of the type, in this case 9 for the byte and 33 for the word. The idea is to perform the same operation on two different types and highlight some artifacts. The example actually shows a couple of different quirks of the C language, and the result is also dependent on the processor it is executing on.

    First off, the result of a shift operation is undefined if the right hand operand is greater than or equal to the width of the left hand operand. This means that if the amount shifted is 32 or more on a 32 bit type, the result depends on the compiler and the processor. PowerPC and ARM processors typically yield the result 0 for both left and right shifts of more than 32 bits, which is what most humans would think is logical. But most Intel processors since the 80286 use only the low five bits of the right hand operand, so the expressions above in reality shift the integers only one bit.

    The second issue is that shift operations always promote the left hand operand to a (16 or 32 bit) integer. Although it looks like the byte in the example is shifted 9 bits, which is more than its width, it is not affected by the undefined behavior described above. The reason is that the byte is promoted (i.e. converted) to an integer before the shift operation is executed, and 9 bits is of course a perfectly valid amount to shift an integer value without any side effects.

    The result of the small program is shown in the table below. As we are dealing with undefined behavior, I captured the results on the most common architectures to highlight the differences:

                   Intel    PPC & ARM 
      c >> 9         0          0
      i >> 33        1          0
      c << 9         1024       1024
      i << 33        4          0

More Shift Operator Confusion

    Shift operators also have another unexpected behavior. They work logically much like multiplications or divisions, and most programmers would probably expect the same precedence as the multiplicative operators. That is, the expectation may be that shift operations are executed before additions. This is however a big mistake. The shift operators have in fact very low precedence, and most arithmetic operators, including the additive operators, have higher precedence. This can easily lead to bugs, for example:

    int i = a << 2 + 1;  /* Calculate a * 4 + 1 */

But because of the weaker precedence of the shift operator, the expression above is actually equivalent to

    int i = a << (2 + 1);

which is far from what the programmer intended. 

Bitwise Operator Mistakes

    Many developers programming embedded systems and applications close to hardware are well aware of the powers of bitwise operators and use them frequently. There is really nothing strange about these operators, but sometimes bugs are caused by not understanding their behavior, or just by too speedy programming. The example below shows one case where the behavior isn't what the programmer intended, and it is quite well hidden:

/*****************************************************
** Outputs state of individual bits in input value
*****************************************************/
void printBits(unsigned char val)
{
    if (val & 1 != 0) printf("Bit 0 is set\n");
    else              printf("Bit 0 is cleared\n");
    
    if (val & 2 != 0) printf("Bit 1 is set\n");
    else              printf("Bit 1 is cleared\n");
    
    if (val & 4 != 0) printf("Bit 2 is set\n");
    else              printf("Bit 2 is cleared\n");
}

    The function is nicely written, well commented, and the logic is straightforward. The developer designed it to test one bit at a time by masking out the other bits; if the result is non-zero, the bit is set. So for example invoking the function like

    printBits(0x05);

should ideally show that bits 0 and 2 are set and bit 1 is cleared. Somewhat surprisingly though, the output is

    Bit 0 is set
    Bit 1 is set
    Bit 2 is set

    The reason is however quite simple. Most mistakes with bitwise operators are caused by not understanding the precedence rules. In this case the inequality operator has higher precedence than the bitwise AND operator, which means the expressions in fact look like:

    if (val & (1 != 0))

    A logical expression X != Y returns 0 if X and Y are equal and non-zero if they are different. Most compilers generate 1 as the result if the test is true, which means that all the if statements in the function turn out to be the same:

    if (val & 1)

and basically test whether bit 0 is set or cleared, regardless of the numeric value, which wasn't at all what the developer had in mind when writing the function.
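
The fix is simply to parenthesize the masks so the bitwise AND is evaluated first:

    if ((val & 1) != 0) printf("Bit 0 is set\n");
    else                printf("Bit 0 is cleared\n");

    /* ...and likewise for (val & 2) and (val & 4) */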

Bitwise Not Operator

    The bitwise not is a unary operator that calculates the bitwise complement of the operand and is handy when performing bitwise arithmetic. Below is an actual real life example of when the behavior may not be exactly what one would expect. The code sets the bitwise complement of two bytes into an integer:

    unsigned char a = 0x44;
    unsigned char b = 0x88;

    unsigned short val = (((~a) << 8) | (~b));

    This looks like a safe operation; the code carefully left shifts one of the values to avoid overlap, and then ORs the two bitwise complements into the integer. The expression is well parenthesized to ensure there are no precedence issues. The expected value is naturally 0xBB77, with the high 8 bits being the bitwise complement of variable a, and the low 8 bits being the bitwise complement of variable b.

    There is a small catch though; the developer writing this program overlooked that the bitwise complement operator performs integer promotion of the operand. In the example above, this means that the bitwise complement of 0x88 is not 0x77 as the developer assumed. The integer promotion is done before the operator is executed, and the result is actually 0xFFFFFF77. When OR-ing with the shifted bitwise complement of the other byte, and after truncation to unsigned short, the resulting value turns out to be 0xFF77.
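
    One way to get the intended 0xBB77 is to mask each complement back down to eight bits before combining, which discards the promoted high bits; a sketch of the fix:

    unsigned short val = (unsigned short)(((~a & 0xff) << 8) | (~b & 0xff));
    /* val == 0xBB77 */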

Bizarre Promotions

    Some of the examples so far have shown unexpected consequences of automatic integer promotion. Sometimes these cause bugs that are quite hard to find. The following example shows a commonly used test for whether an addition of two positive integers overflows (strictly speaking, signed overflow is itself undefined behavior in C, but the idiom is widespread).

    void testOverflow(int a, int b)
    {
        if (a + b < a) printf("overflow\n");
        else           printf("no overflow\n");
    }

This function could for example be called with two large numbers, and the function would print “overflow”, e.g.

    testOverflow(INT_MAX, INT_MAX);

    It is easy to think that this piece of code can be reused with different data types. Assume that someone needs to do the exact same test but is working with signed bytes instead. It may seem straightforward to just take the code and change the types:

    void testOverflow2(char a, char b)
    {
        if (a + b < a) printf("overflow\n");
        else           printf("no overflow\n");
    }

But when invoking it with the signed byte equivalent of INT_MAX:

    testOverflow2(CHAR_MAX, CHAR_MAX);

The output surprisingly enough is “no overflow”.

    Considering the only change that was made was to the types, the different output may be a bit confusing. But there is a very good reason for the behavior, and you guessed it: integer promotion. The variables a and b are promoted to int before the addition is executed, and thus the result is always greater than or equal to the right hand operand in the comparison.
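
    For completeness, a version of the byte test that behaves like the int original needs to force the sum back into the char range before comparing. The narrowing conversion is implementation-defined, but the sketch below works on common two's-complement targets:

    void testOverflow2(char a, char b)
    {
        if ((char)(a + b) < a) printf("overflow\n");
        else                   printf("no overflow\n");
    }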

More Bizarre Promotions

    Another example where the promotion rules play tricks is shown below. Consider the code:

    signed char a = 127;

    printf("%d\n", (a + 1));
    printf("%d\n", ++a);

    In this example it looks like both print statements would show the same output. Since the variable a is a signed char with the value 127, the natural expectation is that when adding one, the value would overflow, wrap around, and become -128.

    This is however only true for one of the statements. The ++ operator adds one to the operand and stores the result back in the operand. Since the variable is a char, the addition will overflow and give the result -128. The printf function takes an integer argument, so the char value is promoted to integer. Promoting a negative char value to integer keeps the sign and value, and thus the output of the printf call will be -128.

    For the first statement the integer promotion works differently. The variable a is promoted to integer prior to executing the addition, which results in the integer addition 127 + 1, which equals 128. The result is already an integer, so no further promotion is needed for the printf call, and the output is 128.

Even More Bizarre Promotions

    Integer promotion can also be a problem in conditions. The following example illustrates this:

    signed char   cs = -1;
    unsigned char cu = cs;

    signed int   is = -1;
    unsigned int iu = is;

    if (is == iu)     printf("equal\n");
    else              printf("not equal\n");

    if (cs == cu)     printf("equal\n");
    else              printf("not equal\n");

    Again, we are comparing the behavior of the same operations performed on an integer and on a byte type, and as you may expect we get different results; the cause is, not surprisingly, integer promotion.

    The first if statement compares a signed and an unsigned version of the same integer value. Both operands are promoted to the biggest involved type that is larger than or equal to int. In this case unsigned int is considered biggest, and both operands are thus converted to unsigned int. The conversion is trivial, and since we assigned the variables the same value, the comparison will show the values are equal.

    More interesting is the second comparison. The same rules apply: both operands are promoted to the biggest type larger than or equal to int. In this case both types are smaller than int, so both operands are promoted to int before the comparison is executed. Here comes the interesting part. The variable cs is a signed char with the value -1.

    Promoting it to integer preserves the sign, and the resulting integer is thus -1. The variable cu on the other hand has the value 255 (the same binary representation), and promoting an unsigned char to int preserves its value, so the resulting integer will be 255. It is clear then that the test will show that the two values are not equal.

Lexical Ambiguities

    The C language has many lexical and syntactical issues, including case statements without breaks, if-else statements without brackets, and other weak constructions that allow for mistakes. There are however a couple of cases where the language actually is ambiguous and you can write a statement that doesn't do what you expect. One of my favorites is shown in the following example:

    #define inverse(a)  1.0/a

    double d[2] = { 2.0, 2.0 };

    printf("%.1f\n", inverse(*d)   /* print inverse */ );
    printf("%.1f\n", inverse(d[1]) /* print inverse */ );

    The program is very straightforward, with a method, inverse, that is implemented as a macro and does just what it is supposed to do: calculate the inverse of the input argument. The output is not as straightforward though. You would expect the two print statements to output the same value, 0.5, since the program is passing the value 2.0 to inverse in both cases. At least that's what it looks like, and the output is correct for the second statement. But the first statement outputs 1.0 (!). So did we exploit some weird floating point bug here, or what is going on?

    Well, the answer may get clearer if we perform macro expansion on the first statement:

    printf("%.1f\n", 1.0/*d   /* print inverse */ );

    Maybe it is not entirely clear yet; we all know that indirection has higher precedence than division, so the program should fetch the first value of the array, which is 2.0.

    The problem is however not at all related to any pointer indirection issues or floating point bugs. What happens here is that the division sign followed by the indirection is lexed as a multi-character token, in this case the start of a comment. As you know, a comment ends with the */ token, and everything in between is treated as a comment and not code. So the division operator and the divisor are interpreted as part of a comment and are ignored.

Fun with unary operators

    Although the following example is really not that weird, and it's a perfectly valid and well specified C expression, I find it somewhat amusing.

    int i = 2 - - - - - - 2;

    printf("i = %d\n", i);

    Most programmers know that unary operators have higher precedence than additive operators, which basically means that e.g. 2 + (-1) equals 1. Usually the unary operator is placed immediately before the operand, but this is not required by the language. Furthermore, the parentheses aren't needed either, because the unary operator has higher precedence than the additive operator. A more explicit version of the same expression would look like

    int i = 2 - (-(-(-(-(-2)))));

and it is clearer that this is a subtraction of two values, where the latter has a series of negations. The output of the small program would thus be 4.

Conclusion

    I have tried to show a couple of interesting cases where the C language may not be as clear and straightforward as it perhaps should be. These issues have resulted in many bugs, and I think a stricter definition would have helped a lot, but I'm sure there are good reasons for all the undefined and unintuitive behavior in the C language. The solution is rather simple: spend a lot of time learning the specification. This may sound harsh, but I'm really happy that I was forced to learn it the hard way. I've had plenty of fun with the knowledge over the years, and I know I share it with others who also had the privilege to dig deep into the language.

Sunday, April 25, 2010

User Interfaces and Toddlers

    It is amazing to see how quickly young children adopt computers, whether for playing simple games with their parents, watching movies on YouTube, or Skyping with grandparents. I have a daughter who is three now, and I first got her interested in computers when she was around two months old. At that time I was working on a computer game for an old console, and I think what attracted her was the colorful big objects in the game and the square wave based music that sounded a bit like cheap toys.



    Later I bought an educational game collection with Sesame Street, and my daughter got hooked on a basic Elmo game where he shows a new photo every time she hits a key on the keyboard. I don't think she was more than nine months old when she first started playing it, and the game doesn't really require much in the way of motor or intellectual skills: just bang the keyboard and a funny picture appears. At least it was a first successful attempt to interact with the computer.

    The game collection also contains games for learning how to use the mouse. Nowadays my daughter has certainly figured out that there is some connection between moving the mouse and getting things to happen on the computer screen. But she hasn't really successfully moved the mouse pointer to an object and clicked the buttons to get something to happen. She'll of course click the button, since she's seen me do it, and occasionally something happens, but she gives up pretty quickly.

    A while back we went on a road trip and I brought my Nintendo DS as another activity alongside the usual books, crayons, and toys. My daughter was just over two at the time. I brought two games: Mario Kart DS and a pony game for six to ten year old kids. It was quite interesting to watch how my daughter approached these games and which parts she was able to understand and play.




    Mario Kart DS is a racing game with a pretty intuitive interface, at least for older players, with a joypad for steering and buttons for acceleration and brake. It didn't take long until my daughter was able to get the car moving backward and forward, and occasionally steer the car in one direction round and round. She didn't really master the steering, although she understood that the joypad was an important part of keeping the car on the road.

    More interesting was to see her interact with the pony game. The game is like a Tamagotchi, where the player needs to feed and pet a pony. The game is fully controlled by the stylus: the player uses drop down menus to open popup windows where she can pick different types of food to give the pony, or grab tools to clean and pet the pony. I was expecting that both using a stylus and accessing menus would be too hard for her, so I started driving the game myself and just had her watch. But soon she wanted to play herself, and I was quite surprised that operating the menus using the stylus was no problem at all.

    I had imagined that the idea of menus would be too abstract, especially after seeing her not figure out how to use a mouse at all. It seems like the biggest challenge with the mouse may not have been understanding what to do, but rather doing it.

    I guess it is hard to draw any general conclusions from observing one kid. But it was really interesting to see that navigating menus was no problem when the input device was something simple to use, like a stylus. I could imagine that a touchscreen would be just as easy. It is also interesting to see her struggle so much with the mouse, which is something most people take for granted. I'm looking forward to seeing other aspects of her computer interactions as she gets older, although I probably won't be around to see her playing Nintendo Wii in the retirement home...

Saturday, April 24, 2010

Cycle Accurate Computer Emulation

Background

    A computer emulator is a program that imitates the functionality of another computer. This article discusses techniques primarily addressing the needs of emulators that mimic old game consoles and computer systems. These emulators allow users to get close to the look and feel of an old computer using a modern PC.

 
Emulator running the game Metal Gear
 
    Computer emulation requires a fair amount of resources on the host computer. Early emulators had to implement several shortcuts to speed up the emulation in order to run games at normal speed on a PC.

    One common optimization was to break the emulated time into chunks that are small enough not to have a major impact on the appearance, but large enough to allow optimizing the implementation of each emulated subsystem. These chunks of time were called cyclic tasks, and they were often the length of a video scan line. On 80's computers, a video scan line is usually a few hundred CPU cycles. During each scan line cycle, the emulator advances time by emulating the behavior of each subsystem independently of the others.
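
    A sketch of such a scan-line driven main loop is shown below; the function names and cycle counts are illustrative only, not from any particular emulator:

  #define CYCLES_PER_LINE 228   /* illustrative; depends on the system */
  #define LINES_PER_FRAME 262

  for (int line = 0; line < LINES_PER_FRAME; line++) {
     cpuExecute(CYCLES_PER_LINE);  /* each subsystem advances independently */
     vdpRenderLine(line);          /* draw one full scan line at once */
     sndUpdate(CYCLES_PER_LINE);   /* catch the sound chip up */
  }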

    This approach works pretty well, and in many cases a user wouldn't be able to tell much difference between a real old system and the emulated system. But when looking more closely at the output of the emulated system, you would see several small graphical glitches where the wrong pixels are shown on the screen. The emulator basically fails to update the screen accurately if any change to the graphics is made in the middle of a scan line. Even worse cases occur as well, when a game relies on the synchronization between subsystems, such as the timing of status register changes. This may lead to a game or application failing to run at all.

Issues

    Some modern emulators try to address these issues and avoid artifacts caused by running subsystems independently for fixed, relatively large chunks of time. The most obvious way to address this would perhaps be to advance the emulated time one CPU cycle at a time and make sure all peripheral devices are updated each cycle. Although this is technically possible, it would require too many resources from the host computer, and not even a modern PC would be able to emulate an 80's computer at normal speed.

    Another approach to handle synchronization between subsystems without artifacts and glitches is to keep the idea of executing each subsystem independently from each other as much as possible, but recognize the points in time when subsystems interact.

    The basic idea is to identify synchronization points where two subsystems interact and advance the emulated time of the subsystems independently to that point in time. There are several possible synchronization points:
  • Writing to memory shared by subsystems (e.g. Video RAM)
  • Reading from memory mapped I/O
  • Reading from and writing to I/O ports
  • Interrupts
    All these synchronization points occur within the emulated system, but there is also some synchronization required between the emulated system and the host system, in particular:
  • Video rendering
  • Audio playback

Solution

    An easy way to accommodate the need for dynamic synchronization points is to build the synchronization around a timeout service. The timeout service allows subsystems to register timeouts at points in time when a subsystem will perform an operation that may affect other subsystems. For example, a video system may set up a timeout to occur when the horizontal blank status bit is modified, or a DMA device could set up a timeout for when the DMA transfer is complete.

    The timeout service basically manages all scheduled timeouts and picks the timeout that is nearest in the future. The timeout service then tells each subsystem that requires synchronization at that time to run its emulation forward to the point in time of the timeout. When the emulated time has advanced, the timeout service picks the next timeout and repeats the operation.
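
    A minimal sketch of such a timeout service is shown below. The names and the simple sorted list are mine, not the actual blueMSX implementation:

  typedef void (*TimeoutCb)(void* ref, unsigned long time);

  typedef struct Timeout {
     unsigned long   time;   /* emulated time when the event fires */
     TimeoutCb       cb;     /* runs the subsystem forward to 'time' */
     void*           ref;
     struct Timeout* next;
  } Timeout;

  static Timeout* queue;     /* sorted on 'time', earliest first */

  void timeoutAdd(Timeout* t) {
     Timeout** p = &queue;
     while (*p && (*p)->time <= t->time) p = &(*p)->next;
     t->next = *p;
     *p = t;
  }

  void emulate(void) {
     while (queue) {
        Timeout* t = queue;       /* nearest point in the future */
        queue = t->next;
        t->cb(t->ref, t->time);   /* subsystem catches up, may add timeouts */
     }
  }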

    With most peripheral devices, the next synchronization point is well known, but for the main CPU it may be harder to know when the CPU accesses a shared resource. The reason is of course that the CPU executes a program with alternative flows that are not known ahead of time. In a single CPU system this is not really an issue. The CPU can be the first subsystem to run to the next synchronization point, and if it needs to access shared resources, it inserts a new synchronization point at the time of the interaction. The timeout service then makes sure all devices are synchronized before the CPU accesses the shared resource.

Conclusion

    This approach is similar to the early emulators in that subsystems advance the emulated time independently. The big difference is that nowadays a host computer is powerful enough to dynamically set these synchronization points instead of using a few statically defined ones.

    These and other techniques are used in the blueMSX emulator (www.bluemsx.com) which is a cycle accurate emulator for Z80 based computer systems.

Friday, February 26, 2010

Optimizing Code For Fun

Introduction

     There are many ways of stimulating the brain: cooking, reading, crossword puzzles, sudoku, etc. For a handful of nerds, myself included, we can add writing small, obfuscated computer programs that hide the logic behind unreadable but artistically constructed source code.

The International Obfuscated C Code Contest

     To satisfy the urge to develop these obscure programs there is of course a solution: a programming contest that encourages creative developers to show off their programming skills. The contest is known as the International Obfuscated C Code Contest, and the entries are judged on how obfuscated they are and how novel the idea behind a program is.

     The goal is to write something clever and somewhat interesting in a small C program that hides the intent of the program. Developers use various techniques that typically are not used in normal software development, like abusing some of the C language constructs. The competition has strict rules on how much code an entry can have, and it's frankly not much: only 2048 characters. A normal program would only be able to do a couple of very basic operations within these limitations, such as some string management or some simple calculations.

     I've had the honor to win the competition three times, and in this article I'll describe my favorite entry, which I think also happens to be the easiest to explain. The entry really pushes the limit on the number of features that can be squeezed into 2kB of source code. It is a complete 3-D racing game with multiple tracks, hills, mountains, snow, opponent cars, different traction depending on weather conditions, and a status panel with lap time and speed. The game even includes high scores. The picture below shows a screenshot of the game in action.

        

        Obfuscated rally game in action


Technical challenges

     I will try to give some technical details that perhaps give an insight into how it was possible to include all features into only 2048 characters of C code.

     A lot of the obfuscation comes free just by fanatically optimizing for source code size. The initial program, with fairly readable logic and algorithms, was about five times as big as the final version, while still using short variable names etc. It took several hundred hours to squeeze the code down to the size limit, and many techniques were used.

     The program uses many temporary variables; more or less all lower and upper case single letter variables are used. Some have multiple purposes, such as representing both a color and the length of a track.

     The C language has some operators that typically aren't used that much, but they provide a way to create complex expressions in a small amount of code. My favorites are the ?: operator and the comma operator. In combination, these operators can be used to implement complicated logic that is both hard to read and very compact.
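
     A small illustration of the style (contrived, and certainly not from the actual entry):

     int a = 1, b = 2, x;

     /* a branch, two side effects, and a result in a single expression */
     x = a < b ? (a += b, b *= 2, a + b) : (b -= a, a - b);
     /* x == 7: a becomes 3, b becomes 4 */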

     Formatting is also an important aspect of an obfuscated program. Typically C source code is indented to improve readability and to group logical blocks of code, but there are no strict rules on how to format a program. Most entries to the IOCCC heavily abuse the flexibility of formatting, and so does this entry. By carefully formatting the source code, it's possible to hide some of the algorithms and produce something quite impossible to understand.

Conclusion

     It took quite some time to create the rally program, mainly because of the extreme optimization. But back to the opening statement of the article: it is just another way to stimulate the brain. Optimizing code is very much like solving sudoku or a crossword puzzle. It's a matter of finding pieces of the code that can be expressed in fewer characters. The more optimized the program is, the harder it is to find additional optimizations. Towards the end of the development, I was able to reduce the size by only a couple of characters an hour. But in the end it is a quite rewarding feeling to reach the point where you can proudly submit the entry to the competition and await the judges' response.

     Any fellow nerd who wants to dig deeper into the source code, or perhaps try the game, can find a link to the source code here.

Enjoy!

Wednesday, February 10, 2010

Fast memcpy in C

1. Introduction

This article describes a fast and portable memcpy implementation that can replace the standard library version of memcpy when higher performance is needed. This implementation has been used successfully in several projects where performance needed a boost, including the iPod Linux port, the xHarbour Compiler, the pymat python-Matlab interface, the Inspire IRCd client, and various PSP games.

The story began when a co-worker of mine made an implementation of memcpy that he was very proud of. His implementation was faster than many standardized C library routines found in the embedded market. When looking at his code, I found several places where improvements could be made. I made an implementation which was quite a lot faster than my co-worker's, and this started a friendly competition to make the fastest portable C implementation of memcpy(). Both our implementations got better and better and looked more alike, and finally we had an implementation that was very fast and that beat the native library routines in both Windows and Linux, especially when the memory to be copied is not aligned on a 32 bit boundary.


The following paragraphs contain descriptions to some of the techniques used in the final implementation.


2. Mimic the CPU's Architecture

One of the biggest advantages of the original implementation was that the C code was made to imitate the instruction sets of the target processor. My co-worker discovered that different processors had different instructions for handling memory pointers. On a Motorola 68K processor, the code

    *dst8++ = *src8++;

that copies one byte from the address src8 to the address dst8 and increases both pointers, compiled into a single instruction:

    MOV.B (A0)+, (A2)+

This piece of code can be put into a while loop and will copy memory from the address src to the address dest:

    void* memcpy(void* dest, const void* src, size_t count) {
        char*       dst8 = (char*)dest;
        const char* src8 = (const char*)src;

        while (count--) {
            *dst8++ = *src8++;
        }
        return dest;
    }

While this is pretty good for the Motorola processor, it is not very efficient on a PowerPC, which does not have any instructions for post incrementing pointers. The PowerPC uses four instructions for the same task that required only one instruction on the Motorola processor. However, the PowerPC has a set of load and store instructions that utilize pre increment of pointers, which means that the following code results in only two instructions when compiled on the PowerPC:

    *++dst8 = *++src8;

In order to use this construction, the pointers have to be decreased before the loop begins and the final code becomes:

    void* memcpy(void* dest, const void* src, size_t count) {
        char*       dst8 = (char*)dest;
        const char* src8 = (const char*)src;

        --src8;
        --dst8;

        while (count--) {
            *++dst8 = *++src8;
        }
        return dest;
    }

Unfortunately some processors have instructions for neither pre increment nor post increment of pointers. This means that the pointers need to be incremented manually. If the example above is compiled for a processor without pre and post increment instructions, the while loop would actually look something like:

    while (count--) {
        dst8[0] = src8[0];
        ++dst8;
        ++src8;
    }


There are luckily other ways of accessing memory. Some processors can read and write to memory at a fixed offset from a base pointer. This resulted in a third way of implementing the same task:

    void* memcpy(void* dest, const void* src, size_t count) {
        char*       dst8 = (char*)dest;
        const char* src8 = (const char*)src;

        if (count & 1) {
            dst8[0] = src8[0];
            dst8 += 1;
            src8 += 1;
        }

        count /= 2;
        while (count--) {
            dst8[0] = src8[0];
            dst8[1] = src8[1];

            dst8 += 2;
            src8 += 2;
        }
        return dest;
    }

Here the loop executes only half as many times as in the earlier examples, and the pointers are updated only half as often.


3. Optimizing memory accesses

In most systems, the CPU clock runs at a much higher frequency than the speed of the memory bus. My first improvement to my co-worker's original was to read 32 bits at a time from memory. It is of course possible to read larger chunks of data on some targets with a wider data bus and wider data registers. The goal with the C implementation of memcpy() was portable code, mainly for embedded systems. On such systems it is often expensive to use data types like double, and some systems don't have an FPU (Floating Point Unit).

By trying to read and write memory in 32 bit blocks as often as possible, the speed of the implementation is increased dramatically, especially when copying data that is not aligned on a 32-bit boundary.

It is however quite tricky to do this. The accesses to memory need to be aligned on 32-bit addresses. The implementation needs two temporary variables that implement a 64-bit sliding window where the source data is kept temporarily while being copied to the destination. The example below shows how this can be done when the destination buffer is aligned on a 32-bit address and the source buffer is 8 bits off the alignment (the shift directions shown assume a big-endian target):

    srcWord = *src32++;             /* prime the sliding window */

    while (len--) {
        dstWord = srcWord << 8;     /* use the three remaining bytes */
        srcWord = *src32++;         /* read the next aligned source word */
        dstWord |= srcWord >> 24;   /* add the first byte of the new word */
        *dst32++ = dstWord;         /* write an aligned destination word */
    }
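
The same window generalizes to any source misalignment. A sketch, with the shift amounts derived from the offset (variable names assumed, same big-endian byte order as above):

    int shl = 8 * offset;      /* offset: source misalignment in bytes, 1..3 */
    int shr = 32 - shl;

    srcWord = *src32++;
    while (len--) {
        dstWord = srcWord << shl;
        srcWord = *src32++;
        dstWord |= srcWord >> shr;
        *dst32++ = dstWord;
    }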


4. Optimizing branches

Another improvement is to make it easier for the compiler to generate code that utilizes the processor's compare instructions in an efficient way. This means creating loops that terminate when the compare value is zero. The loop

    while (++i < count)

often generates more complicated and inefficient code than

    while (count--)

Another thing that makes the code more efficient is when the CPU's native loop instructions can be used. A compiler often generates better code for the loop above than for the following loop expression

    while (count -= 2)


5. Conclusion

The techniques described here make the C implementation of memcpy() a lot faster, and in many cases faster than commercial ones. The implementation can probably be improved even more, especially by using wider data types when available. If the target and the compiler support 64-bit arithmetic operations such as the shift operators, these techniques can be used to implement a 64-bit version as well. I tried to find a compiler with this support for SPARC but didn't find one. If the 64-bit operations can be made in one instruction, the implementation will be faster than the native Solaris memcpy(), which is probably written in assembly.

The version available for download at the end of the article extends the algorithm to work on 64-bit architectures. For these architectures, a 64-bit data type is used instead of a 32-bit type, and all the methods described in the article are applied to the bigger data type.


6. The complete source code

The complete source code for a memcpy() implementation using all the tricks described in the article can be downloaded from

http://www.vik.cc/daniel/memcpy.zip

The code is configured for an Intel x86 target, but it is easy to change the configuration as desired.