BASIC complications: parser combinators

Spread the love

When I wrote about creating a GW-BASIC translator, I admit I didn’t fully grasp what I was getting myself into. Even a simple language like BASIC has a relatively involved syntax, which makes parsing a challenge (doubly so because I never did take that compilers class in school!). Add to that the difficulty of producing sensible C# code for each construct, using the unbelievably verbose Roslyn API. (On this journey, I have become fond of Kirill Osenkov‘s Roslyn Quoter to automate some of the arcana.)

My initial throwaway attempt involved a simplistic parsing scheme that started to break down as the expressions became more recursive; e.g. “DIM R$(NR),MA(NR,ND)” involves parsing a comma-separated list of variable and subscript pairs, where the subscripts can themselves be expressions, and also comma-separated (for multi-dimensional arrays). After grinding away, I came up with a strategy which boiled down to using a class to read parsed items from a line of text while keeping track of the unprocessed tail. Doing some more research, I realized I was essentially building a very primitive monadic parser. Ultimately it was too primitive, but along the way I found a great tool to crack this parsing problem without reinventing the wheel: Sprache.

Sprache has made it quite easy to code up the parsing rules directly in the target language — no messing with EBNF grammars and compiler-compiler toolchains. For example, here is a snippet of my code which shows how to define expressions referring to BASIC single-precision and string variables:

var dollar = Parse.Char('$');
var identifierPrefix = Parse.Identifier(Parse.Letter, Parse.LetterOrDigit);
var singleIdentifier =
    from prefix in identifierPrefix
    select new BasicIdentifier(prefix, TypeCode.Single);
var stringIdentifier =
    from prefix in identifierPrefix
    from d in dollar
    select new BasicIdentifier(prefix, TypeCode.String);
var identifier = stringIdentifier.Or(singleIdentifier);

These fancy LINQ expressions ultimately build up sets of parser combinators — functions that produce higher level parsers composed from simpler ones. This should become clearer with a commented version of the code:

// Note: I use the 'var' syntax throughout, but be aware that the data
// type for each parser is the special Sprache delegate type Parser<T>.
// Most built-in Sprache parsers are of type Parser<char> or
// Parser<string>, meaning that they can produce a char or string from
// a given parsing input (generally a string, wrapped in an interface).

// A parser that reads a single literal char '$'.
var dollar = Parse.Char('$');

// Sprache has a built-in Identifier parser where you specify two inner
// parsers, one for the 'head' and one for the 'tail'. Here we are saying
// we want a parser that reads one or more characters where the start
// must be a letter, followed by one or more letters or digits.
var identifierPrefix = Parse.Identifier(Parse.Letter, Parse.LetterOrDigit);

// This is where the fancy LINQ syntax comes in. These "list comprehensions"
// as they are known allow a fairly readable way to describe a parse
// sequence.
//
// Read this as, "produce a parser that first reads an identifier prefix
// (as above) and returns an instance of my BasicIdentifier class
// referring to a single-precision variable, accepting the prefix as a
// string."
var singleIdentifier =
    from prefix in identifierPrefix
    select new BasicIdentifier(prefix, TypeCode.Single);

// This is very much like the above, with an intervening step which reads
// a literal dollar sign using the 'dollar' parser defined above. The
// dollar sign denotes a string variable (e.g. "ABC$").
var stringIdentifier =
    from prefix in identifierPrefix
    from d in dollar
    select new BasicIdentifier(prefix, TypeCode.String);

// Finally, we compose the above parsers to produce one that can read
// any identifier. This one uses the Sprache 'Or' extension method to
// represent "alternation," i.e. the choice between more than one parser.
// VERY IMPORTANT: Order matters here! You must always put the parser
// which reads the longer prefix first. Otherwise, the final parser will
// not understand that "ABC$" is really a string identifier and not just a
// single-precision identifier followed by an orphaned '$' char. Think of
// it as applying the next parser in the list only if the previous one
// "fails."
var identifier = stringIdentifier.Or(singleIdentifier);

About the most complicated thing I’ve had to do so far involved recursive rules. In Backus-Naur form, one could write the following to recognize a comma-separated list of zeroes and ones:

<bit>  ::= "0" | "1"
<list> ::= <bit> "," <list> | <bit>

The production rule for <list> is “right recursive” since <list> itself appears on the right side of its first expansion. (This is the common recursion case, as left recursion is generally a ticket to infinite descent for most parsers.) Trying to do the same directly in Sprache will lead to trouble since you cannot reference a variable in the declaration and assignment of that same variable. Instead, you would do something like this:

var comma = Parse.Char(',');
var zero = Parse.Char('0');
var one = Parse.Char('1');

var bit = zero.Or(one);

// Read: "Produce a parser that reads one bit, and then reads zero
// or more occurrences of a comma followed by a bit." The extension
// method 'Then' handles the "followed by" logic, while 'Many' handles
// the "zero or more." The final step is to concatenate these terms
// into a single sequence (see definition of 'Concat' below).
var list =
    from head in bit
    from rest in comma.Then(_ => bit).Many()
    select Concat(head, rest);

// ...
private static IEnumerable<T> Concat<T>(T head, IEnumerable<T> rest) =>
    new T[] { head }.Concat(rest);

Sadly, although Sprache has improved my productivity quite a bit, I’m not really that close to finishing my translator. I guess compilers are hard, after all.

WriteAsync .NET

Testing, coding, in that order.

BASIC complications: parser combinators

One thought on “BASIC complications: parser combinators”

Leave a Reply Cancel reply