Parsing Clojure

I didn’t come to Clojure from a Java background, so the fact that it was on the JVM was a non-issue for me. I was taken by the language on it’s own merits. And while I’ve been more than happy with Clojure ever since, I’ve always wanted a variant that would compile to native executables. So I set out to build it and this post is the first one explaining what I’m doing.

Now, I certainly didn’t want to write my own compiler from the ground up, so this itch of mine went unscratched. But then I saw Tim Baldridge’s talk at Clojure/West about using LLVM to target the GPU with a Clojure-like language and I was hooked. His excellent Mjolnir library made it possible for me to consider writing a cross-compiler in Clojure that would target LLVM. So, starting with that, I embarked on my adventure.

And promptly realized that it would be awhile before I needed to touch anything related to LLVM. When you set out to create a new compiler/environment/whatever to run on a particular hardware/OS combination, one option is to bootstrap the process by writing everything in assembly and build up until you can write your compiler in the language that you’re building.

Or, you can use languages and tools that run on an environment that you already have operational to produce executables that run on your target system. This is called ‘cross compiling’ and Clojurescript is a perfect example. You write code in Clojurescript that is transformed by a compiler written in Clojure which produces Javascript.

Since you’re really starting from nothing, it’s hard to know where to start. You end up working on a lot of pieces simultaneously, adding to each as requirements are discovered the rest of the system. I’m not going to try to write a complete compiler in Clojure/JVM. My intention is to write the smallest cross compiler I can in order to compile the smallest implementation of the language needed to allow me to compile the compiler, at which point the compiler will be self hosting and the cross compiler can be abandoned.

Parsing a grammar

The first thing to define is the grammar of the language to compile and then write a parser that converts a body of text to a data structure. There are things like Instaparse, which allows you to specify a grammar as a string and it will generate your parser for you. But I don’t care for that approach. It seems to me something akin to building SQL commands as strings with all the pain therein.

So, if you’ve read previous posts of mine, you won’t be surprised to hear that I chose to use a monad. Specifically, the parser monad. I’ll cover the intricate details of the parser monad in a future post, but for now I want to stick to just how to use it. I chose to use the parser monad because:

Building blocks

When using the parser monad, you start with the simplest parsers possible and build up from there until you’ve declared you’re entire grammar. The simplest parser is next-char. It always succeeds and returns the next character in the text being parsed.

(defn prn-char []
  (p-for
     [c next-char]
     (prn c)))

The prn-char is a function that returns a parser which will pull one character from the text and print it. p-for is monad comprehension for the parser monad just like for is list comprehension for the list monad. And it has the same :when and :let forms. And using :when, you can write a function to produce parsers for characters that satisfy a predicate:

(defn char-test [pred]
  (p-do
   [c next-char
    :when (pred c)]
   c))

char-test takes a predicate and returns a parser that succeeds only when the next character satisfies that predicate. For instance, a parser that succeeds when it sees a digit could be:

(def digit (char-test #{\0 \1 \2 ... \9}))

Factoring out a useful bit gives us:

(defn one-of [coll]
  (char-test (set coll)))

(def alpha (one-of "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"))
(def digit (one-of "0123456789"))

Another useful parser is to succeed when the next character is a specific character:

(defn is-char [c]
  (char-test #(= c %)))

Or if you want to succeed when a particular string is seen:

(defn is-string [s]
  (if (empty? s)
    (parser s)
    (p-do
     [_ (is-char (first s))
      _ (is-string (subs s 1))]
     s)))

The parser function above is the simplest parser of all. It ignores the text being parsed and always succeeds, returning whatever value you pass it. This little function just recursively checks each character in s until s is exhausted. If all the characters succeed, the parser succeeds and returns the given string as the value parsed.

Alternatives

All that is well and good for parsing strings, but what happens when a parser fails. It’d be nice to try another one to see if it succeeds. m/plus to the rescue.

(require [monads.core :as m])
(def alpha-numeric (m/plus [digit
                            alpha]))

m/plus is a standard monad function that works on all monads like the parser monad. (a fuller explanation is here). In this case, alpha-numeric is a parser that tries to match a digit and if that fails, tries to match a letter. And it returns whichever character it matched. If the next character was neither a letter nor a number, the parser fails.

Combinations

In addition to trying multiple parsers, you also need the ability to repeatedly try a parser. There are two variants of this one-or-more and none-or-more.

(def number (one-or-more digit))

This is the first inkling of how the parser monad lets you declare your grammar. There are no loops or iterators to manage. You don’t have to keep track of a pointer into your string. The monad handles all that for you. All you have to do is declare how you want you’re text parsed.

You have options

Sometimes, your grammar has an optional element in it. For those times, you need the optional combinator. Here’s how you might define a parser for integers.

(def integer
  (p-do
   [negate? (optional (is-char \-))
    num-str (one-or-more digit)]
   (let [magnitude (str-to-int num-str)]
     (if negate?
       (* -1 magnitude)
       magnitude))))

Segue

I’ll be updating this repo with the code for this project so you can follow along and the code for the parser I’ve been talking about is there. Next up, I’ll show the actual grammar/parser for our cross compiler.

When I first started this, my idea was to do a faithful port of Clojure. But as I got into it, I decided to take the opportunity to try an improve on some things. So while this project will be ‘almost Clojure’, there’ll be some differences and I’d like to not dilute the Clojure name, so I’m looking for something to call this little project. If you have any suggestions, shoot them my way.

Jim Duey 05 October 2014
blog comments powered by Disqus