As a side project I've recently been building a C# parser, and pass one is now complete. The previous parser I made used Antlr, and the generated code it requires was both painfully slow and sorely unmanageable (compiler compilers are a complete waste of time imo). This one is all handwritten, which ends up being a lot faster, much easier to work with, and honestly easier to do.
I have a very crude page up that will be its home for now, and it can be downloaded there. It is mostly complete; I think I have a few preprocessor things left, plus unsafe code (which I plan to fully implement). The parser currently builds an object 'tree' (or graph if you prefer) somewhat akin to the CodeDOM. The only output atm is source: it regenerates the source code from the object graph, and that can then be compared with the original source. The API will no doubt change a bit as bugs are found, and especially as I write real generators, but the current version is certainly the gist.
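The parse, regenerate, compare check described above can be sketched in miniature. This is not the C# parser itself: it is a toy written in C for a made-up grammar (integers, '+', and parentheses), and every name in it (Node, parse_expr, emit, round_trips) is hypothetical, chosen only to illustrate the idea.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy object tree: a number, an addition, or a parenthesized expression. */
typedef enum { NUM, ADD, PAREN } Kind;

typedef struct Node {
    Kind kind;
    long value;          /* NUM only */
    struct Node *left;   /* ADD: left operand; PAREN: inner expression */
    struct Node *right;  /* ADD only */
} Node;

static Node *new_node(Kind kind, long value, Node *left, Node *right) {
    Node *n = malloc(sizeof *n);
    if (!n) abort();
    n->kind = kind; n->value = value; n->left = left; n->right = right;
    return n;
}

static void free_tree(Node *n) {
    if (!n) return;
    free_tree(n->left);
    free_tree(n->right);
    free(n);
}

static void skip_ws(const char **p) {
    while (isspace((unsigned char)**p)) (*p)++;
}

static Node *parse_expr(const char **p);

/* term := number | '(' expr ')'   (error handling omitted for brevity) */
static Node *parse_term(const char **p) {
    skip_ws(p);
    if (**p == '(') {
        (*p)++;
        Node *inner = parse_expr(p);
        skip_ws(p);
        if (**p == ')') (*p)++;
        return new_node(PAREN, 0, inner, NULL);
    }
    char *end;
    long v = strtol(*p, &end, 10);
    *p = end;
    return new_node(NUM, v, NULL, NULL);
}

/* expr := term ('+' term)*   built as a left-leaning tree */
static Node *parse_expr(const char **p) {
    Node *left = parse_term(p);
    for (;;) {
        skip_ws(p);
        if (**p != '+') return left;
        (*p)++;
        left = new_node(ADD, 0, left, parse_term(p));
    }
}

/* Append the source text for a tree to out (a NUL-terminated buffer). */
static void emit(const Node *n, char *out, size_t cap) {
    size_t len = strlen(out);
    switch (n->kind) {
    case NUM:
        snprintf(out + len, cap - len, "%ld", n->value);
        break;
    case ADD:
        emit(n->left, out, cap);
        strncat(out, " + ", cap - strlen(out) - 1);
        emit(n->right, out, cap);
        break;
    case PAREN:
        strncat(out, "(", cap - strlen(out) - 1);
        emit(n->left, out, cap);
        strncat(out, ")", cap - strlen(out) - 1);
        break;
    }
}

/* Returns 1 if parsing and re-emitting reproduces the source exactly. */
int round_trips(const char *source) {
    const char *p = source;
    Node *tree = parse_expr(&p);
    char out[256] = "";
    emit(tree, out, sizeof out);
    free_tree(tree);
    return strcmp(out, source) == 0;
}
```

With this sketch, round_trips("1 + (2 + 3) + 40") returns 1, while round_trips("1+2") returns 0 because the toy tree does not record the original spacing. A real comparison would presumably need either to keep that kind of trivia in the tree or to normalize both sides before comparing.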
The next step will probably be to output ILASM text, in order to verify the parser at a more complete level. Combined with the ILASM tool, this will end up being a full compiler to .NET IL (although that isn't really the goal).
My personal goal is to build a generator that allows using C# to target microcontrollers, such as the SX and the very parallel Propeller chips from Parallax, and eventually PIC chips as well. This will require modifying the language a bit to account for things like interrupts, parallelization, the tiny amount of memory available, etc. I don't plan to use a VM in there, but we'll see how that goes. This language will be called CMicro (Cµ), partly because it targets microcontrollers, but mostly because C type languages need some form of funny letter or punctuation to feel complete (and yes, Greek is funny, just ask Comega).
The original motivation for getting into all this stuff was noticing how hardware has so few bugs and software has so many. Why is this? Bugs are one of the two giant 'need to solve soon' problems in software imo. Looking at the languages used to create hardware (Verilog, VHDL) tells you something, but the real news is the gigantically different approach to testing in hardware versus software. It would be really helpful to have some of these concepts at the language level. This will involve as much clamping of the language as extending, I'm sure.
The second giant issue in software is how hard it currently is to automatically parallelize code. In the last short while the number of processors available to a home PC has gone from one, to two, to four, to eight... Regardless of age, you are probably old enough to know where this trend is going. So what happens when there are 1024 processors? Current languages (other than maybe functional languages like Haskell) make it pretty hard for the compiler to automatically cut your code into chunks and run them on different processors (due to pointers, you can hardly even optimize C/C++, never mind parallelize it). So my guess is C(++) is going to become less and less rewarding to use, and languages that lend themselves to parallelization (which by nature lend themselves to analysis) will have growing advantages. Even with just optimization, over the years the price of accessing memory (cache miss, or gasp, disk access) has gone from a few cycles to 1000+. So all other optimization tends to hardly matter at all if the required memory isn't readily available. Again, C and its pointer use make guaranteeing this a crap shoot at best, where Java, C# or Fortran are much better (though there is still room to improve, to be sure).
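To make the pointer complaint concrete, here is a minimal sketch in C of the aliasing problem. The function names are made up for illustration. In the first version the compiler has to assume that factor might point into dst, so it cannot safely read *factor once and reuse it, and it cannot freely reorder or split the loop. The second version uses C99's restrict qualifier, which is the programmer promising that no such overlap exists.

```c
#include <stddef.h>

/* dst and factor may overlap: a store to dst[i] might change *factor,
   so *factor has to be treated as possibly different on every iteration. */
void scale_may_alias(int *dst, const int *factor, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] *= *factor;
}

/* restrict promises the two pointers never refer to the same memory,
   so the compiler is free to load *factor once and vectorize or
   reorder the loop. Calling this with overlapping pointers is
   undefined behavior. */
void scale_no_alias(int *restrict dst, const int *restrict factor, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] *= *factor;
}
```

Calling scale_may_alias on the array {2, 3, 4} with factor pointing at its first element gives {4, 12, 16} rather than {4, 6, 8}, because the first store changes the factor itself. That legal but surprising result is exactly what the compiler must preserve, and it is why it cannot assume the iterations are independent.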
Actually the original original motivation for this was the many great conversations with Kevin Lindsey (over many a beer) on computer language design and implementation. He is extremely sharp and creative, and brings both of these to language conversations. And often brings beer to them as well. There are few things better in this world than talking lexers, parsers, generators and language while tearing through the cold ones.
posted on Wednesday, December 06, 2006 7:16 AM