Applicative Parser

UPDATE: Right after I finished this post, I started working on the next one and promptly discovered some bugs in the code. I’ve updated this post with the corrected code.

In the next couple of posts, I’m going to show how to build a parser library based on the Free Applicative Functor and what you can do with it. To follow along, clone (or update) this repo. Then ‘lein repl’ and you should be able to copy the code to see what it does.

This post runs long (again). The code is really not that long and it’s mostly very small functions. But the amount of details hidden by the abstractions takes a lot of prose to explain. Which is actually one of the very real benefits of using these abstractions. They let you implement an enormous amount of functionality in very few lines of code with fewer bugs.

If you want to see what the point of all this prose is, skip to the “Other Interpretations” section at the bottom and then come back to see how it was done.

The story so far …

For a couple of years now, I’ve wanted a Clojure environment that compiled to native code. One of the first tasks is be able to parse a source file of text into an intermediate form. While I was looking at that, I came across references to Applicative Parsers. So my side project took a long detour that has finally wound it’s way back to actually implementing an applicative parser.

The foundation

First, the preamble

(require '[effects :refer :all]
         '[effects.free :refer :all]
         '[effects.maybe :refer :all]
         '[effects.state :refer :all :exclude [set-val update-state]])

Now, the core of any parser combinator library are the actual combinators used to compose simpler parsers into more complicated ones. And there are really only a few that are needed. We’ll define record types for two of them

(deftype Opt [expr])
(defn optional [expr]
  (free (Opt. expr)))

(deftype NoneOrMore [expr])
(defn none-or-more [parser]
  (free (NoneOrMore. parser)))

Opt makes the parser expr optional and NoneOrMore makes a parser match zero or multiple instances of expr in the source text.

Then we also define two record types for some special combinators.

(deftype Ignore [expr])
(defn ignore [expr]
  (free (Ignore. expr)))

Ignore is a special combinator that is unique to Applicative parsers. It goes ahead and matches the parser expr in the source text, but discards the result.

(deftype Rule [name expr handler-fn])
(defmacro defrule
  ([sym]
     `(def ~sym (effects.free/free (->Rule '~sym nil nil))))
  ([sym rule]
     `(let [~sym (effects.free/free (->Rule '~sym nil nil))]
        (def ~sym (effects.free/free (->Rule '~sym ~rule nil)))))
  ([sym rule handler-fn]
     `(let [~sym (effects.free/free (->Rule '~sym nil nil))]
        (def ~sym (effects.free/free (->Rule '~sym ~rule ~handler-fn))))))

Rule is how we’re going to assign names to parsers so we can refer to them later in other parsers. We’ll talk more about handler-fn later. defrule is just some syntax sugar to make things look nice. One interesting point to note about defrule is that it binds the name given to a Rule record that has nothing in it before expanding to a def of that name to the given parser expression. This lets you write recursive rules in a natural fashion as we’ll see in a little bit.

Building up

Now that we have the combinators, we need something to compose. The Free Applicative gives us the rest of the combinators we need. First off, terminal characters are the simplest parser possible. They succeed if they are the next character in the text being parsed. We can use the Pure type to represent those terminals.

(defn term [v]
  (pure (str v)))

Once we have terminals, we need a way to sequentially compose them. We’ll use the Free implementation of fapply* to do this. What all does is concat all the results of the given parsers together. This implies that a parser result is not just a single value, but a sequence of values.

(defn all [& parsers]
  (fapply* (pure concat) parsers))

For our final combinator, we need to be able to try a parser, and if it fails try another and another until we find one that matches. So far, we haven’t seen any Free structure that would fit the bill. But if you think about it, what we really want is just a Free structure that holds a list of parsers. And the simplest Free Monoid is just such a list. So we add the FreePlus record type.

(defn one-of [& parsers]
  (free-plus parsers))

And one more convenience combinator will round out our parser library.

(defn one-or-more [parser]
  (all parser (none-or-more parser)))

Using it in anger (or mild annoyance)

Here’s a small grammar to parse some very simple s-expressions. You’ll notice that some of the rules have handler functions specified.

(defrule upper-case (apply one-of (map term "ABCDEFGHIJKLMNOPQRSTUVWXYZ")))
(defrule lower-case (apply one-of (map term "abcdefghijklmnopqrstuvwxyz")))
(defrule letter (one-of upper-case lower-case))

(defrule integer (all (optional (one-of (term \+) (term \-)))
                      (apply one-of (map term "123456789"))
                      (none-or-more (apply one-of (map term "0123456789"))))
  (fn [& digits]
    (read-string (apply str digits))))

(defrule whitespace (apply one-of (map term " \n\t\f")))

(defrule word (one-or-more letter)
  (fn [& chars]
    (->> chars
         (apply str)
         symbol)))

(defrule form)

(defrule s-expr (all (ignore (term \())
                     (ignore (none-or-more whitespace))
                     (none-or-more
                      (all form
                           (ignore (none-or-more whitespace))))
                     (ignore (term \))))
  list)

(defrule form
  (one-of integer
          word
          s-expr))

One thing to notice is the way form and s-expr are mutually recursive and how that is described by pre-declaring an empty rule for form first, then supplying the definition later. Also note that form is self recursive.

At this point, the most important thing to keep in mind is that form can’t actually parse anything. It is only a data structure that encodes the structure of the parser. By adding some toString implementations to the various record types, you could explore what this data structure looks like.

Effects

Before we can convert form to an actual parsing function, we have to have some kind of Applicative data type that implements parsing. Parsing involves two things; a text that gets consumed and the possibility that a parser will fail. Each of these two aspects are represented by a different effect.

The repo I’ve been using for these investigations is built around the idea of effects which have different expressions. An effect may be any or all of; Endofunctor, Applicative, Monad, Comonad or Arrow. Each of these aspects are represented as a Clojure protocol and the various effects implement the protocols that are appropriate.

One of these effects is maybe, which implements the semantics of conditional success.

Another one is the state effect, which implements the semantics of mutable state.

The main idea I’m investigating with this repo is how these various effects can be composed to produce other effects. The idea is that when two effects are composed, the composite effects implement the expressions that are valid. For instance, since maybe and state both have Monad expressions (and by implication Endofunctor and Applicative), their composition would also have Endofunctor and Applicative expressions, since both of those are closed under composition. The composition of two effects that are Monads might also have a Monad expression. But that’s not guaranteed since Monad composition is not closed.

I have much more to learn about the composition of these kinds of effects, so stay tuned.

So, how do we compose state and maybe to make a parsing effect? It couldn’t be simpler since all effects implement the Monoid interface

(def state-maybe (plus state maybe))

But realize that in this case, plus is not commutative. That is (plus state maybe) is not the same as (plus maybe state). They both might be useful in different cases, but they’re not equivilent.

Also, much of the usefulness of an effect lies in additional functions. In particular, the state effect has two of these functions, update-state and set-val. If you look back up at the top of this post, you’ll see that we :excluded these two functions when we brought in the effects.state namespace. That’s because we’re going to have to reimplement them for the composed state-maybe effect. I haven’t figured out a way to do this cleanly yet, though I have some ideas. For now, we’ll just do it the hard way. This involves some implementation details that I’ll skip.

(defn update-state [f]
  (->State maybe
           (fn [s] (maybe (list s (f s))))
           (pr-str "update-state")))

(defn set-val [k v]
  (update-state #(assoc % k v)))

update-state lets us update the mutable state value by providing a function f that takes a state value and returns an updated one. set-val then uses update-state to implement a stateful key/value store with a hash-map.

An Actual Parser

The first step in evaluating Free structures is to write a function to evaluate the Pure values. In our case, our Pure values are either functions, which we’ll ignore, or terminal characters. This function must accept a value and return a state-maybe value. If the input value is a function, we’ll just wrap it in state-maybe.

If its a terminal character, we need to return a state-maybe value that will pull the first character from the state, which is going to be a string, and tests it against the terminal character. The rest of the string also needs to become the new value of the state which is accomplished using Clojure’s substring function subs.

(defn pure-parser [term]
  (if (fn? term)
      (state-maybe term)
      (let [term-len (count term)]
        (for [text (update-state identity)
              :when (.startsWith text term)
              _ (update-state #(subs % term-len))]
          [term]))))

In this function, we’re using the for macro from the effects repo which implements generalized monad comprehension, not the one from clojure.core which implements comprehension only for sequences. Also, update-state returns the previous state value and Clojure’s list destructuring syntax provides a succinct way to extract the first character.

Now we can get down to the business of converting form to an actual parser.

First we need a protocol function we can use to extend our basic combinators so they generate parsers.

(defprotocol GenParser
  (gen-parser [_]))

(extend-type Opt
  GenParser
  (gen-parser [p]
    (plus (evaluate (.expr p) pure-parser gen-parser)
          (state-maybe []))))

The first thing we do for the Opt combinator is extract the expr field from the Opt record value p. This parser expression is a Free structure, so we call evaluate on it to convert it to an actual parser. With that parser in hand, we then take advantage of the Monoid expression of the state-maybe effect by plusing it with a parser that always succeeds but doesn’t contribute any parsed value.

Next up, we need to generate a parser from a NoneOrMore record. But we need a little helper function. repeat-parser takes parser and attempts to match it, capturing the successful match in x. And then it calls itself recursively to attempt to match parser again. Eventually parser will fail and the stack built by the recursion unwinds with all the results concated together.

(defn repeat-parser [parser]
  (for [x parser
        xs (plus (repeat-parser parser)
                 (state-maybe []))]
    (concat x xs)))

(extend-type NoneOrMore
  GenParser
  (gen-parser [p]
    (plus (repeat-parser (evaluate (.expr p) pure-parser gen-parser))
          (state-maybe []))))

And then the implementation for NoneOrMore uses plus to either repeatedly match the parser constructed from (.expr p) or the empty-results parser.

Ignore has an interesting implementation of gen-parser. It uses the Applicative expression of state-maybe to first match a parser constructed from p. Then it throws the result away and always returns the empty results value.

(extend-type Ignore
  GenParser
  (gen-parser [p]
    (fapply (constantly []) (evaluate (.expr p) pure-parser gen-parser))))

Rules

Generating a parser from a rule is the most complex of all the combinators, but it’s still pretty easy. First, we have to extract the expr and handler-fn from the Free parser structure p.

Then, if expr is nil, we have to use Clojure’s resolve function to look up the symbol for p’s name to get the Free parser structure to evaluate. Otherwise, we just evaluate the Free structure expr to build a parser. In either case, we use the for macro to generate a parser that matches the just created parser and captures the successful match into parsed. Then, the handler function is applied to the parsed items so that other kinds of data values can be returned rather than just character sequences.

(extend-type Rule
  GenParser
  (gen-parser [p]
    (let [expr (.expr p)
          handler-fn (or (.handler-fn p)
                         identity)]
      (if (nil? expr)
        (for [parsed (evaluate (deref (resolve (symbol (.name p))))
                               pure-parser gen-parser)]
          [(apply handler-fn parsed)])
        (for [parsed (evaluate expr pure-parser gen-parser)]
          [(apply handler-fn parsed)])))))

And finally, a little function to generate a parser when given a Free parser expression.

(defn parser [expr]
  (evaluate expr pure-parser gen-parser))

Parse something!

(prn (extract ((parser form) "(str (add 15 92))")))
;; => ([(str (add 15 92))] "")

Lets break this down. form is Free parser expression as defined above. We create a parser function from it using parser. This function takes a string as input, so we call it with one. The value returned by the parser is a maybe effect value that contains a state effect value. To unwrap the maybe we use extract. This leaves a state effect value which is just a list of two items. The second item is the text remaining to parse. In this case, the entire input string was parsed so nothing is left.

The first item of the state value is a vector containing the final parsed value. Which in this case is a nested list structure.

Other interpretations

Since you’re still with me, I’ll show the real payoff of using the Free Applicative parser.

With a normal parser, all you have is a function that parses a string. The only thing you can do with it is use it to parse strings. A useful result, of course, but we can do much better. Here’s just one example. You should have enough understanding now to figure out how it works. But if not be sure and ask in the comments.

(defprotocol GenEBNF
  (gen-ebnf [_]))

(deftype EBNF [x]
  Applicative
  (fapply* [_ args]
    (EBNF. (fapply* (state-maybe (fn [& args]
                                   (apply str (interpose " , " args))))
                    (map #(.x %) args))))

  Monoid
  (plus* [v vs]
    (EBNF. (fapply* (state-maybe (fn [& args]
                                   (apply str (interpose " | " args))))
                    (map #(.x %) (cons v vs))))))


(defn pure-ebnf [v]
  (EBNF. (state-maybe (condp = v
                        "\n" "\"\\n\""
                        "\t" "\"\\t\""
                        " " "\" \""
                        "\f" "\"\\f\""
                        (format "\"%s\"" v)))))


(extend-type Opt
  GenEBNF
  (gen-ebnf [p]
    (EBNF. (for [expr-ebnf (.x (evaluate (.expr p) pure-ebnf gen-ebnf))]
             (str "[ " expr-ebnf " ]")))))

(extend-type NoneOrMore
  GenEBNF
  (gen-ebnf [p]
    (EBNF. (for [expr-ebnf (.x (evaluate (.expr p) pure-ebnf gen-ebnf))]
             (str "{ " expr-ebnf " }")))))

(extend-type Rule
  GenEBNF
  (gen-ebnf [p]
    (if (.expr p)
      (EBNF. (for [expr-ebnf (.x (evaluate (.expr p) pure-ebnf gen-ebnf))
                   _ (set-val (.name p) (str expr-ebnf " ;"))]
               (.name p)))
      (EBNF. (state-maybe (.name p))))))

(extend-type Ignore
  GenEBNF
  (gen-ebnf [p]
    (evaluate (.expr p) pure-ebnf gen-ebnf)))


(defn ebnf [rule]
  (second (extract ((.x (evaluate rule pure-ebnf gen-ebnf)) {}))))

(defn print-ebnf [rule]
  (doseq [[name rule] (reverse (ebnf rule))]
    (println name ":=" rule)))


(print-ebnf form)

Moar amaze

Generating an EBNF grammar from a parser is pretty cool. But that’s just the beginning of how you can interpret Free Applicative parsers. Next time I plan on showing some of those. But you might consider how you could interpret a Free parser structure to implement optimized parsers depending on the grammar specified.

Jim Duey 14 June 2014
blog comments powered by Disqus