{"id":5438,"date":"2018-04-01T13:00:08","date_gmt":"2018-04-01T13:00:08","guid":{"rendered":"http:\/\/writeasync.net\/?p=5438"},"modified":"2018-03-27T06:56:45","modified_gmt":"2018-03-27T06:56:45","slug":"basic-complications-parser-combinators","status":"publish","type":"post","link":"http:\/\/writeasync.net\/?p=5438","title":{"rendered":"BASIC complications: parser combinators"},"content":{"rendered":"<p>When I wrote about <a href=\"http:\/\/writeasync.net\/?p=5408\">creating a GW-BASIC translator<\/a>, I admit I didn&#8217;t fully grasp what I was getting myself into. Even a simple language like BASIC has a relatively involved syntax, which makes parsing a challenge (doubly so because I never did take that compilers class in school!). Add to that the difficulty of producing sensible C# code for each construct, using the <a href=\"https:\/\/tomassetti.me\/code-generation-with-roslyn-a-skeleton-class-from-uml\/\">unbelievably verbose Roslyn API<\/a>. (On this journey, I have become fond of <a href=\"http:\/\/osenkov.com\/\">Kirill Osenkov<\/a>&#8216;s <a href=\"http:\/\/roslynquoter.azurewebsites.net\/\">Roslyn Quoter<\/a> to automate some of the arcana.)<\/p>\n<p>My initial throwaway attempt involved a simplistic parsing scheme that started to break down as the expressions became more recursive; e.g. &#8220;DIM R$(NR),MA(NR,ND)&#8221; involves parsing a comma-separated list of variable and subscript pairs, where the subscripts can themselves be expressions, and also comma-separated (for multi-dimensional arrays). After grinding away, I came up with a strategy which boiled down to using a class to read parsed items from a line of text while keeping track of the unprocessed tail. Doing some more research, I realized I was essentially building a very primitive <a href=\"https:\/\/lorgonblog.wordpress.com\/2007\/12\/02\/c-3-0-lambda-and-the-first-post-of-a-series-about-monadic-parser-combinators\/\">monadic parser<\/a>. Ultimately it was too primitive, but along the way I found a great tool to crack this parsing problem without reinventing the wheel: <a href=\"https:\/\/github.com\/sprache\/Sprache\">Sprache<\/a>.<\/p>\n<p>Sprache has made it quite easy to code up the parsing rules directly in the target language &#8212; no messing with <a href=\"https:\/\/en.wikipedia.org\/wiki\/Extended_Backus\u2013Naur_form\">EBNF grammars<\/a> and <a href=\"https:\/\/en.wikipedia.org\/wiki\/Compiler-compiler\">compiler-compiler<\/a> toolchains. For example, here is a snippet of my code which shows how to define expressions referring to BASIC single-precision and string variables:<\/p>\n<pre class=\"brush: csharp; title: ; notranslate\" title=\"\">\r\nvar dollar = Parse.Char('$');\r\nvar identifierPrefix = Parse.Identifier(Parse.Letter, Parse.LetterOrDigit);\r\nvar singleIdentifier =\r\n    from prefix in identifierPrefix\r\n    select new BasicIdentifier(prefix, TypeCode.Single);\r\nvar stringIdentifier =\r\n    from prefix in identifierPrefix\r\n    from d in dollar\r\n    select new BasicIdentifier(prefix, TypeCode.String);\r\nvar identifier = stringIdentifier.Or(singleIdentifier);\r\n<\/pre>\n<p>These fancy LINQ expressions ultimately build up sets of <a href=\"https:\/\/en.wikipedia.org\/wiki\/Parser_combinator\">parser combinators<\/a> &#8212; functions that produce higher level parsers composed from simpler ones. This should become clearer with a commented version of the code:<\/p>\n<pre class=\"brush: csharp; title: ; notranslate\" title=\"\">\r\n\/\/ Note: I use the 'var' syntax throughout, but be aware that the data\r\n\/\/ type for each parser is the special Sprache delegate type Parser&lt;T&gt;.\r\n\/\/ Most built-in Sprache parsers are of type Parser&lt;char&gt; or\r\n\/\/ Parser&lt;string&gt;, meaning that they can produce a char or string from\r\n\/\/ a given parsing input (generally a string, wrapped in an interface).\r\n\r\n\/\/ A parser that reads a single literal char '$'.\r\nvar dollar = Parse.Char('$');\r\n\r\n\/\/ Sprache has a built-in Identifier parser where you specify two inner\r\n\/\/ parsers, one for the 'head' and one for the 'tail'. Here we are saying\r\n\/\/ we want a parser that reads one or more characters where the start\r\n\/\/ must be a letter, followed by one or more letters or digits.\r\nvar identifierPrefix = Parse.Identifier(Parse.Letter, Parse.LetterOrDigit);\r\n\r\n\/\/ This is where the fancy LINQ syntax comes in. These &quot;list comprehensions&quot;\r\n\/\/ as they are known allow a fairly readable way to describe a parse\r\n\/\/ sequence.\r\n\/\/\r\n\/\/ Read this as, &quot;produce a parser that first reads an identifier prefix\r\n\/\/ (as above) and returns an instance of my BasicIdentifier class\r\n\/\/ referring to a single-precision variable, accepting the prefix as a\r\n\/\/ string.&quot;\r\nvar singleIdentifier =\r\n    from prefix in identifierPrefix\r\n    select new BasicIdentifier(prefix, TypeCode.Single);\r\n\r\n\/\/ This is very much like the above, with an intervening step which reads\r\n\/\/ a literal dollar sign using the 'dollar' parser defined above. The\r\n\/\/ dollar sign denotes a string variable (e.g. &quot;ABC$&quot;).\r\nvar stringIdentifier =\r\n    from prefix in identifierPrefix\r\n    from d in dollar\r\n    select new BasicIdentifier(prefix, TypeCode.String);\r\n\r\n\/\/ Finally, we compose the above parsers to produce one that can read\r\n\/\/ any identifier. This one uses the Sprache 'Or' extension method to\r\n\/\/ represent &quot;alternation,&quot; i.e. the choice between more than one parser.\r\n\/\/ VERY IMPORTANT: Order matters here! You must always put the parser\r\n\/\/ which reads the longer prefix first. Otherwise, the final parser will\r\n\/\/ not understand that &quot;ABC$&quot; is really a string identifier and not just a\r\n\/\/ single-precision identifier followed by an orphaned '$' char. Think of\r\n\/\/ it as applying the next parser in the list only if the previous one\r\n\/\/ &quot;fails.&quot;\r\nvar identifier = stringIdentifier.Or(singleIdentifier);\r\n<\/pre>\n<p>About the most complicated thing I&#8217;ve had to do so far involved recursive rules. In Backus-Naur form, one could write the following to recognize a comma-separated list of zeroes and ones:<\/p>\n<pre>&lt;bit&gt;  ::= \"0\" | \"1\"\r\n&lt;list&gt; ::= &lt;bit&gt; \",\" &lt;list&gt; | &lt;bit&gt;<\/pre>\n<p>The production rule for <code>&lt;list&gt;<\/code> is &#8220;right recursive&#8221; since <code>&lt;list&gt;<\/code> itself appears on the right side of its first expansion. (This is the common recursion case, as <a href=\"https:\/\/en.wikipedia.org\/wiki\/Left_recursion\">left recursion<\/a> is generally a ticket to <a href=\"https:\/\/www.cut-the-knot.org\/WhatIs\/WhatIsInfiniteDescent.shtml\">infinite descent<\/a> for most parsers.) Trying to do the same directly in Sprache will lead to trouble since you cannot reference a variable in the declaration and assignment of that same variable. Instead, you would do something like this:<\/p>\n<pre class=\"brush: csharp; title: ; notranslate\" title=\"\">\r\nvar comma = Parse.Char(',');\r\nvar zero = Parse.Char('0');\r\nvar one = Parse.Char('1');\r\n\r\nvar bit = zero.Or(one);\r\n\r\n\/\/ Read: &quot;Produce a parser that reads one bit, and then reads zero\r\n\/\/ or more occurrences of a comma followed by a bit.&quot; The extension\r\n\/\/ method 'Then' handles the &quot;followed by&quot; logic, while 'Many' handles\r\n\/\/ the &quot;zero or more.&quot; The final step is to concatenate these terms\r\n\/\/ into a single sequence (see definition of 'Concat' below).\r\nvar list =\r\n    from head in bit\r\n    from rest in comma.Then(_ =&gt; bit).Many()\r\n    select Concat(head, rest);\r\n\r\n\/\/ ...\r\nprivate static IEnumerable&lt;T&gt; Concat&lt;T&gt;(T head, IEnumerable&lt;T&gt; rest) =&gt;\r\n    new T&#x5B;] { head }.Concat(rest);\r\n<\/pre>\n<p>Sadly, although Sprache has improved my productivity quite a bit, I&#8217;m not really that close to finishing my translator. I guess <a href=\"https:\/\/www.quora.com\/Why-are-compilers-so-hard-to-write\">compilers are hard<\/a>, after all.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>When I wrote about creating a GW-BASIC translator, I admit I didn&#8217;t fully grasp what I was getting myself into. Even a simple language like BASIC has a relatively involved syntax, which makes parsing a challenge (doubly so because I&hellip; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[91],"tags":[],"class_list":["post-5438","post","type-post","status-publish","format-standard","hentry","category-design"],"_links":{"self":[{"href":"http:\/\/writeasync.net\/index.php?rest_route=\/wp\/v2\/posts\/5438","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/writeasync.net\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/writeasync.net\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/writeasync.net\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/writeasync.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=5438"}],"version-history":[{"count":5,"href":"http:\/\/writeasync.net\/index.php?rest_route=\/wp\/v2\/posts\/5438\/revisions"}],"predecessor-version":[{"id":5443,"href":"http:\/\/writeasync.net\/index.php?rest_route=\/wp\/v2\/posts\/5438\/revisions\/5443"}],"wp:attachment":[{"href":"http:\/\/writeasync.net\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=5438"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/writeasync.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=5438"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/writeasync.net\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=5438"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}