Ok, this is neat - the identifier lookup routine in a Microsoft compiler. It is tres fast, and tres clever, and worth sharing for a few reasons. Good design is very important, and clarity of code is what you really want out of life, but! But there are some things that will just never be fast enough, and string lookup might just be one of those things.
The problem is you have a bunch of strings (variable identifiers in this case), and you want to look them up in a table quite often, so you want this lookup to be quite fast. The solution here was to step back and notice that it was not important at all how those strings were stored, they don't just have to be chucked into an array or hashtable - that isn't always an easy leap. Instead of being stored in the order created, or alphabetically, or in a hash, they were stored in an array of arrays -- based on the string length. So first, there is an array with all the single letter strings, then the two letter strings, then three letter ones etc. Then inside of each of those arrays, the strings are sorted alphabetically. This allows skipping over each one testing only the first letter, and when a match is found the second letter, then third letter and so forth. As soon as the tested letter comes later in the alphabet than the source letter, then you are sure there is no match (because of the data structure). Wow, quick : ).
Some things to take away here are:
1) Don't assume 'normal' data structures are static when optimizing code.
2) Each type of data/access will lend itself to a different optimized format.
3) Don't optimize for fun, but go to town when you need to.
4) Look at typical data samples - eg this would be slow on strings that had common prefixes.
5) Looking at well written code is pretty useful, and fun too : ).
Anyway, this got me to wondering how idents were looked up in swfs - I had always assumed changing all the idents to two character strings would reduce size, and also speed things up. I assumed swf just used hashtables or something similar to lookup strings. However if they use this method, all idents being two character strings would actually slow it down. So I did some tests. It seems two char idents are in fact faster, whew, so that will still speed things up a little in an swf (things that don't end up in registries that is). The thing is, I had never tested that before today even though I have written a lot of code based on that assumption - it just seemed so logical. I guess that leads to one more take away:
6) Always test your assumptions, especially the ones that aren't right.
posted on Tuesday, October 21, 2003 4:33 AM