When I wrote about creating a GW-BASIC translator, I admit I didn’t fully grasp what I was getting myself into. Even a simple language like BASIC has a relatively involved syntax, which makes parsing a challenge (doubly so because I never did take that compilers class in school!). Add to that the difficulty of producing sensible C# code for each construct, using the unbelievably verbose Roslyn API. (On this journey, I have become fond of Kirill Osenkov‘s Roslyn Quoter to automate some of the arcana.)
My initial throwaway attempt involved a simplistic parsing scheme that started to break down as the expressions became more recursive; e.g. “DIM R$(NR),MA(NR,ND)” involves parsing a comma-separated list of variable and subscript pairs, where the subscripts can themselves be expressions, and also comma-separated (for multi-dimensional arrays). After grinding away, I came up with a strategy which boiled down to using a class to read parsed items from a line of text while keeping track of the unprocessed tail. Doing some more research, I realized I was essentially building a very primitive monadic parser. Ultimately it was too primitive, but along the way I found a great tool to crack this parsing problem without reinventing the wheel: Sprache.
Sprache has made it quite easy to code up the parsing rules directly in the target language — no messing with EBNF grammars and compiler-compiler toolchains. For example, here is a snippet of my code which shows how to define expressions referring to BASIC single-precision and string variables:
var dollar = Parse.Char('$'); var identifierPrefix = Parse.Identifier(Parse.Letter, Parse.LetterOrDigit); var singleIdentifier = from prefix in identifierPrefix select new BasicIdentifier(prefix, TypeCode.Single); var stringIdentifier = from prefix in identifierPrefix from d in dollar select new BasicIdentifier(prefix, TypeCode.String); var identifier = stringIdentifier.Or(singleIdentifier);
These fancy LINQ expressions ultimately build up sets of parser combinators — functions that produce higher level parsers composed from simpler ones. This should become clearer with a commented version of the code:
// Note: I use the 'var' syntax throughout, but be aware that the data // type for each parser is the special Sprache delegate type Parser<T>. // Most built-in Sprache parsers are of type Parser<char> or // Parser<string>, meaning that they can produce a char or string from // a given parsing input (generally a string, wrapped in an interface). // A parser that reads a single literal char '$'. var dollar = Parse.Char('$'); // Sprache has a built-in Identifier parser where you specify two inner // parsers, one for the 'head' and one for the 'tail'. Here we are saying // we want a parser that reads one or more characters where the start // must be a letter, followed by one or more letters or digits. var identifierPrefix = Parse.Identifier(Parse.Letter, Parse.LetterOrDigit); // This is where the fancy LINQ syntax comes in. These "list comprehensions" // as they are known allow a fairly readable way to describe a parse // sequence. // // Read this as, "produce a parser that first reads an identifier prefix // (as above) and returns an instance of my BasicIdentifier class // referring to a single-precision variable, accepting the prefix as a // string." var singleIdentifier = from prefix in identifierPrefix select new BasicIdentifier(prefix, TypeCode.Single); // This is very much like the above, with an intervening step which reads // a literal dollar sign using the 'dollar' parser defined above. The // dollar sign denotes a string variable (e.g. "ABC$"). var stringIdentifier = from prefix in identifierPrefix from d in dollar select new BasicIdentifier(prefix, TypeCode.String); // Finally, we compose the above parsers to produce one that can read // any identifier. This one uses the Sprache 'Or' extension method to // represent "alternation," i.e. the choice between more than one parser. // VERY IMPORTANT: Order matters here! You must always put the parser // which reads the longer prefix first. Otherwise, the final parser will // not understand that "ABC$" is really a string identifier and not just a // single-precision identifier followed by an orphaned '$' char. Think of // it as applying the next parser in the list only if the previous one // "fails." var identifier = stringIdentifier.Or(singleIdentifier);
About the most complicated thing I’ve had to do so far involved recursive rules. In Backus-Naur form, one could write the following to recognize a comma-separated list of zeroes and ones:
<bit> ::= "0" | "1" <list> ::= <bit> "," <list> | <bit>
The production rule for <list>
is “right recursive” since <list>
itself appears on the right side of its first expansion. (This is the common recursion case, as left recursion is generally a ticket to infinite descent for most parsers.) Trying to do the same directly in Sprache will lead to trouble since you cannot reference a variable in the declaration and assignment of that same variable. Instead, you would do something like this:
var comma = Parse.Char(','); var zero = Parse.Char('0'); var one = Parse.Char('1'); var bit = zero.Or(one); // Read: "Produce a parser that reads one bit, and then reads zero // or more occurrences of a comma followed by a bit." The extension // method 'Then' handles the "followed by" logic, while 'Many' handles // the "zero or more." The final step is to concatenate these terms // into a single sequence (see definition of 'Concat' below). var list = from head in bit from rest in comma.Then(_ => bit).Many() select Concat(head, rest); // ... private static IEnumerable<T> Concat<T>(T head, IEnumerable<T> rest) => new T[] { head }.Concat(rest);
Sadly, although Sprache has improved my productivity quite a bit, I’m not really that close to finishing my translator. I guess compilers are hard, after all.
Pingback: BASIC complications: expressions – WriteAsync .NET