I've been working frantically on this swf>.Net>swf compiler (work on the
horizon), and right now I'm parsing swf bytecode into the CodeDom. The plan is
to be able to round trip from both the CodeDom (actually it will have to be a modified
CodeDom) and IL; each has its uses. The problem with the CodeDom is that it is
incomplete. There are two categories of things missing - some stuff at the 'top'
end, like nested classes etc, and some expression stuff from the 'bottom' end,
which is a much bigger pain. This deals with the bottom end stuff - mostly unary
operators and a few binary ones.
To be fair here, all schemes to generically map random high level languages
to random high level languages will almost certainly be incomplete. Here
is the problem - what do you do when language X has a feature that language
Y just doesn't have? For example Visual Basic can't do shifts (like >>,
although I've heard VB 2003 can, but anyway). So for a 'metalanguage' like the
CodeDom, you have three choices:
1) allow shift operators, VB programs will just have to deal with it.
2) don't allow shift operators, C#, JScript etc programs will not be allowed
to use them.
3) don't allow shift operators, and pretend that 'CodeSnippets' (literal text
snippets) are a nice compromise.
Personally I like the idea of number one. VB programmers are such smarmy bastards
anyway, let 'em hang I say. Ok, that's a joke. VB programmers seem to be very
sensitive people, especially sensitive to slights, so I thought I'd try one.
In fact I learned to program in BASIC on a Vic20 (great language for kids..
err, 'kids too' I mean). Anyway, I like the idea, especially in .Net, of having
a standard library that will emulate every CodeDom expression/tricky concept.
That way, languages that are mentally challenged can always fall back on
calling something like:
CodeDom.EmulationForDummies.RightShift(leftExpr, rightExpr);
Oh yeah, no semicolon, sorry. Not that I'm singling out VB here, I'm sure there
are many languages that can't shift bits. Spanish and English come immediately
to mind. Sure that call will be slower than a real >>, but it allows full
compatibility for a very common construct. Besides, if it is possible to write
programs for your entire life and never need to use a bit shift operator, then
there is nothing to worry about, as it won't ever come up.
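To make the emulation idea concrete, here is a minimal sketch of what such fallback helpers could look like. It is written in Python purely as a neutral stand-in (this is not .Net code), the function names are made up for illustration, and the 32-bit word size for the unsigned variant is an assumption. Only multiplication, floor division and modulo are used, which is the point: no shift operator is needed.

```python
# Hypothetical fallback helpers for a language with no shift operators.
# Only multiplication, floor division and modulo are used.

WORD = 2 ** 32  # assumed 32-bit word size for the unsigned variant


def left_shift(value, count):
    """Emulate value << count by multiplying by 2**count."""
    return value * (2 ** count)


def right_shift(value, count):
    """Emulate a sign-preserving value >> count with floor division."""
    return value // (2 ** count)


def unsigned_right_shift(value, count):
    """Emulate a 32-bit value >>> count: reinterpret as unsigned, then divide."""
    return (value % WORD) // (2 ** count)
```

Note that Python integers never overflow, so left_shift here does not wrap at 32 bits the way a fixed-width int would. A real helper library would have to pick a width and mask to it.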
Number two, lowest common denominator only, has obvious problems. This isn't
so bad for the higher end functionality - at the conceptual end you can often
juggle things around a bit, and you can also require 'CLS compatible' which is
a nice standard subset target. At the lower end though, it is real painful. You pretty
much have to just reject sections of code. Running a program with missing sections
of code tends to lead to problems (though that isn't mathematically provable).
Essentially number 2 means you can't (always/usually/ever) map valid code from
language X to language Y -- which we may remember, is the point.
Option number three is the CodeDom solution - CodeSnippets, and it is very
similar to number two in the end. The solution is to emit text, so instead of
(expr)(rightShiftOp)(expr)
you have
(expr)(">>")(expr)
Well, first thing to notice, VB is still hosed. Before your welling tears interfere
with reading, there is a second, important thing. Second thing is (">>")
doesn't have a lot of metadata, to say the least. What if you are going to a
third language that can do shifts, but it uses a different symbol, or a call
for it? It probably won't be attempting a parse of every snippet, just like
you wouldn't for their CodeCompileUnits. You can put your own metadata in the
CodeSnippetExpression.UserData property, but there is still no way another person's
code will ever digest your CodeCompileUnits without tweaking their code to fit
your ideas. So really you lose your original goal again, a portable description
of code. And you don't solve the VB red-headed bastard stepchild problem.
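The difference between a structured operator and a text snippet can be sketched in a few lines. This is a toy model in Python, not the CodeDom API: the node classes, the target names and the Emulation.RightShift call are all hypothetical. A structured RightShift node can be rendered differently per target, while a Snippet can only be passed through as it is.

```python
from dataclasses import dataclass


@dataclass
class Var:
    name: str


@dataclass
class RightShift:
    left: object
    right: object


@dataclass
class Snippet:
    text: str


def render(node, target):
    """Render a node as source text for a (hypothetical) target language."""
    if isinstance(node, Var):
        return node.name
    if isinstance(node, Snippet):
        # Nothing is known about the text, so it is passed through verbatim.
        return node.text
    if isinstance(node, RightShift):
        left = render(node.left, target)
        right = render(node.right, target)
        if target == "c_like":
            return f"({left} >> {right})"
        if target == "no_shift":
            # Fall back to a library call (hypothetical name).
            return f"Emulation.RightShift({left}, {right})"
    raise ValueError(f"cannot render {node!r} for {target}")
```

Rendering RightShift(Var("x"), Var("y")) for "no_shift" gives a call, but rendering Snippet("x >> y") for the same target still gives "x >> y", which that target cannot compile.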
There is a third more subtle problem here too. The missing binary expressions
should fit into the CodeDom's CodeBinaryOperatorExpression class. This essentially
is a LeftExpression, an operator, and a RightExpression. The operator is an
Enum, which is naturally sealed, so you won't be extending that to include your
missing operator. Instead you need LeftExpression, Snippet, RightExpression,
which of course is no longer compatible with what the CodeBinaryOperatorExpression
is expecting - so you have to convert all three to a snippet. However, when
you are generating your CompileUnit, you may need to fill things in later, swap
things, read metadata to derive types etc. Hard to do when all you really have
is the string "x >> y". It is worth noting that while you can
generate C#, VB or JS code from CodeCompileUnits, and you can compile and run
CodeCompileUnits, there is nothing in the .Net Framework that actually makes
CodeCompileUnits. I assume that is because pretty much every real program out
there wouldn't work in the current state.
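Here is a small sketch of that third problem, again as a Python toy model rather than real CodeDom code (BinaryOp, Binary, Snippet and make_binary are invented names, and the operator set is deliberately tiny). When the closed enum doesn't know the operator, the whole expression collapses into one flat string, and a later pass that walks the tree can no longer see the operands.

```python
from dataclasses import dataclass
from enum import Enum


class BinaryOp(Enum):
    # A deliberately incomplete operator set: no shift members.
    ADD = "+"
    BITWISE_AND = "&"


@dataclass
class Var:
    name: str


@dataclass
class Binary:
    left: object
    op: BinaryOp
    right: object


@dataclass
class Snippet:
    text: str


def to_text(node):
    """Flatten a node to source text."""
    if isinstance(node, Var):
        return node.name
    if isinstance(node, Snippet):
        return node.text
    return f"{to_text(node.left)} {node.op.value} {to_text(node.right)}"


def make_binary(left, symbol, right):
    """Build a Binary node if the enum knows the symbol, else one flat Snippet."""
    try:
        op = BinaryOp(symbol)
    except ValueError:
        # Unknown operator: all three parts have to become a single string.
        return Snippet(f"{to_text(left)} {symbol} {to_text(right)}")
    return Binary(left, op, right)


def variables(node):
    """Collect variable names reachable by walking the tree."""
    if isinstance(node, Var):
        return [node.name]
    if isinstance(node, Binary):
        return variables(node.left) + variables(node.right)
    return []  # a Snippet is opaque: there is nothing to walk
```

With "+", variables() finds both x and y. With ">>", it finds nothing, because all that is left is the string.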
The solution then, is to use your own AST (eg your own CodeDom, but made more
friendly to intermediate code representation). This would be complete regarding your
target language(s), and generally easier to work with anyway. You can then map
your AST to your target languages, as well as the CodeDom. You still have the
problem of an incomplete CodeDom here though, but you do have an easier structure
to map from at least. If you extend the CodeDom enough, it could probably become
usable as an AST.
What I've done so far in my work, is add the missing classes (derived from
CodeDom classes) in a separate namespace, and then just before generation, the
compile unit is cloned, and all the custom classes are replaced with CodeSnippets
for the (current) target language. It leads to the question, if you have an
AST you're happy with, and the CodeDom you generate isn't portable anyway, why
bother? Well, you sort of get portability - just you get multiple CodeCompileUnits
that are language specific. People can still edit and run that without needing
to know about your CodeDom extensions, so it is something. You can still round
trip with your own program. You also get the IL code generated by the Framework.
In this case I'm mostly interested in bytecode>IL>bytecode so that is
a pretty big consideration. Microsoft generates better IL than I do, hard to
believe, I know.
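The clone-then-replace step described above might look roughly like this. It is only a sketch of the idea in Python, not my actual code and not the CodeDom API: the node classes, the target names "has_shift" and "no_shift", and the Emulation.LeftShift call are all assumptions made for illustration. Add stands in for a node the standard tree already supports, and LeftShift for a custom extension node.

```python
import copy
from dataclasses import dataclass


@dataclass
class Var:
    name: str


@dataclass
class Add:  # stands in for a node the standard tree already supports
    left: object
    right: object


@dataclass
class Snippet:
    text: str


@dataclass
class LeftShift:  # custom extension node, unknown to the standard tree
    left: object
    right: object


# How each (hypothetical) target spells a left shift.
SHIFT_TEMPLATES = {
    "has_shift": "({0} << {1})",
    "no_shift": "Emulation.LeftShift({0}, {1})",  # hypothetical helper call
}


def to_text(node, target):
    """Flatten a node to source text for the given target."""
    if isinstance(node, Var):
        return node.name
    if isinstance(node, Snippet):
        return node.text
    left = to_text(node.left, target)
    right = to_text(node.right, target)
    if isinstance(node, Add):
        return f"({left} + {right})"
    return SHIFT_TEMPLATES[target].format(left, right)


def lower(node, target):
    """Return a copy of the tree with every custom node replaced by a Snippet."""
    if isinstance(node, LeftShift):
        return Snippet(to_text(node, target))
    node = copy.copy(node)  # shallow copy, so the original tree is untouched
    if isinstance(node, Add):
        node.left = lower(node.left, target)
        node.right = lower(node.right, target)
    return node
```

Lowering the same tree once per target gives one language-specific tree each, and the original, with its custom nodes intact, stays available for round tripping.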
The way .Net gets around this whole multi language problem is with the IL layer
(that is, lower level Intermediate Language). It is a pretty brilliant system
actually - you consume programs written in other languages via their interfaces,
which conform to fairly generic 'lowest common standards' (and no, I don't mean
VB here, cripes, don't be so sensitive!), the CLS. You only have to follow those
minimal standards (eg no publicly exposed uint's for example) if you want other
languages to consume your code (ok, and you have to rewrite VB to bring it up
to this minimal level, but I didn't say that). Better yet, you only need the
minimal standards on the face of it - inside a method you can shift left until
you run out of bits, and then some, because other languages only need to call
code, not run it. The IL that all languages compile to is a generic pcode kind
of thing, with tons of metadata. It gets compiled just before it's run, optimized
to your machine, so it is very fast. The IL has the ability to do most things
asked of it, a superset of most languages at a lower level (though it doesn't
inherently do everything - eg no multiple inheritance). Just because IL can
shift left, doesn't mean VB has to of course, just it has that option. So a
language can produce any IL it is comfortable with, and consume off generic
interfaces. It's like sex - the trick to having it with many different people
is to avoid any specific commitments. Well, there's also the issue of gaining
FullTrust for interop, but we can't cover everything here.
For the record, what are the missing 'low level' things from the CodeDom? The
following binary operator expressions:
LeftShift (<<)
RightShift (>>)
UnsignedRightShift (>>>)
ExclusiveOr (^)
All unary operator expressions (I have no idea why these aren't in, is there
a VB lite I don't know about?):
Increment (++)
Decrement (--)
UnaryPlus (+expr)
UnaryMinus (-expr)
LogicalNegation (!)
BitwiseComplement (~)
There are some higher level things too - nested classes, readonly etc, but
these mostly seem to have fairly simple workarounds. I can say this bravely
because I'm not doing that part yet.
If you want to read a most excellent book about .Net compiler construction,
I can't recommend ('enough' coming after title) John Gough's
"Compiling for the .NET Common Language Runtime (CLR)" enough. It is a fantastic
book. Most compiler books seem to have 13 chapters dedicated to scanning and
generating AST's, and then when it comes to actual design decisions, that is "left
as an exercise for the reader, but here is an example that is great for addition
of integers". This book however, covers everything important and skips
everything that (in fact) has almost nothing to do with writing a compiler.
It is based on writing a Pascal compiler, which I thought I wouldn't like (another
one of those god damn languages I don't use), but it is actually perfect. Writing
about a C# compiler wouldn't help much, because IL is so C# already. With Pascal
there are enough tricky mappings that you really get a feel for the art, as
well as the science of it. I'm assuming of course that if you read this far,
you are interested in the subject. If you are just scanning, hoping for one
more sex joke, well sorry I don't have one. But Redd Foxx does. A woman walking
with a friend sees her husband coming out of the florist's with a dozen roses.
"Damn, now I'm going to have my legs up in the air all week.", to
which her friend replies, "Why don't you just get a vase?".
PS I know there are many VB programmers out there wanting to comment on what
an idiot I am. For sure. VB is faster, it invented the word rad, it's used by
millions, even chickens, it can do all this stuff, and all that stuff too, my
facts are just wrong etc etc. I know that, I'm just being silly. I totally respect
VB, and VB programmers, really. I say this because I'm somewhat fearful of full
bore VB wrath overrunning the comment section in here. Fortunately I just installed
that spam guard thingy. The copy-the-number-into-the-textbox step should keep
the majority of them at bay.
posted on Thursday, November 06, 2003 2:36 AM