Extensible Matching

The pattern matcher in match.lisp works for many purposes, but the simplicity of the code introduces some limitations. For example

(start (?and ?* ?x) end)

will return nil. It will not match a list starting with start and ending with end, and bind ?x to the list of elements in between.

The Lisp Critic uses a similar but somewhat more complicated pattern matcher, defined in extend-match.lisp.

An Extensible Pattern Syntax

The extensible matcher uses this format to specify all pattern elements:

(?pattern-type arg1 arg2 ...)

For example,

match any thing that does not match pattern becomes
(?not pattern)

using ?not to indicate "not this pattern"
match a list beginning with item1 followed by any number of other items followed by item2 becomes
```
(item1 (?* var) item2)
```
using ?* to indicate "match zero or more items"
match a number becomes
```
(?is predicate)
```
using ?is to indicate "match anything that satisfies this predicate

Two other useful extensions that fit this format are:

(?and pat1 pat2 ...)
to mean "match anything that can matches all these patterns with one set of bindings"
(?or pat1 pat2 ...)
to mean "match anything that matches any of the patterns, and each match can have different bindings"

This format has two nice properties:

It's easy to give any number of "arguments" to the pattern, as can be seen in the examples above.
It's a flexible format for future extensions that might come up.

It does have a downside however. The simple pattern variable form we used before, e.g., ?X, doesn't fit the above format, and in fact can cause patterns to be ambiguous and hard to maintain:

Is (?foo ...) a list beginning with the variable ?foo or is ?foo an extension to the pattern language?
If ?foo is not an extension when we wrote the pattern, but someone defines it as such later, our pattern that used to work will now behave differently, with no warning.

To avoid these problems, we drop the convenient simple symbolic form for pattern variables in favor of the list (? variable-name), e.g., (? X). Now there will be no ambiguity:

If we want the variable foo, we write (? foo).
If we want a pattern extension, we write (?foo ...).
If someone else defines an extended meaning for ?foo, it doesn't affect patterns with (? foo).

Implementing Extensible Patterns

One way to implement all these extensions to our matcher would be to use a conditional, e.g.,

(defun pat-match (pat form &optional (blists '(nil)))
      (cond ((eql pat form) blists)
            ((or (atom pat) (atom form)) nil)
            ((eql (first pat) '?) (match-variable pat form blists))
            ((eql (first pat) '?is) (match-predicate pat form blists))
            ((eql (first pat) '?and) (match-and pat form blists))
            (t match car and cdr)))

The problem is that pat-match will get pretty long, but never long enough. If we or some programmer using our matcher wants a new extension, they're going to have to edit this definition.

The right way to do this is to replace the above with

(defun pat-match (pat form &optional (blists '(nil)))
      (cond ((pat-extension-p pat)
            (match-extension pat form blists))
            ((eql pat form) blists)
            ((or (atom pat) (atom form)) nil)
            (t (pat-match (cdr pat) (cdr form)
              (pat-match (car pat) (car form) blists)))))

This code presumes that pat-extension-p can determine if an item is a defined pattern extension. This is the approach taken in extend-match.lisp.

Although pat-match is simpler, don't we have to implement pat-extension-p and match-extension with a conditional, in fact with two, one each function? Won't this actually be harder to maintain than the original version?

The key is to define pat-extension-p and match-extension to look in a table. The table associates a symbol, e.g., ?is, with a function that knows how to match patterns involving that symbol. How we implement this table isn't important. We could use a-lists or hashtables or whatever.

For example, if the function (pat-functionsymbol) returns the pattern matcher for the symbol, if any, or NIL, then pat-extension-p is pretty easy to define:

(defun pat-extension-p (name)
      (not (null (pat-function name)))

match-extension is only a little more complex:

(defun match-extension (pat form blists)
      (let ((fn (pat-function (first pat))))
          (when (null fn)
            (error "Undefined pattern extension ~S" pat))
          (funcall fn (rest pat) form blists)))

This will call the matching function attached to the pattern extension, e.g., ?is, passing it the arguments to the extension, e.g., the list (numberp) if the pattern were (?is numberp), the form to be matched against, and the current list of binding lists. This should be enough for the matching function to do its job.

Putting things in the table is also easy. We define a function add-extension to associate a pattern matching function with a pattern label. If match-predicate is the function that does the matching work for ?is, we'd write

(add-extension '?is :single 'match-predicate)

The :single says that an ?is pattern matches one input pattern. Most patterns are of this type.

This approach is called data-driven programming because how our code runs (which code gets called) is driven by the data (in this case, symbols in the pattern).

Data-driven programming is a very useful technique. Consider using it whenever you start writing a COND or CASE that chooses between various pieces of code based on some input data. It's a technique that simultaneously makes code simpler and more flexible.

Segment Matching Patterns

Segment matching is involved for any pattern that can match zero or more items. Two examples are:

(?* L) which can match any number of items
(?optional pat) which can zero or one occurrences of pat

Just adding segment variables, i.e, patterns of the form (?* L), changes how things work in several ways.

First, we can have more than one answer:

> (pat-match '((?* l) (?* r)) '(a b c))
    (((L) (R A B C)) ((L A) (R B C))
    ((L A B) (R C)) ((L A B C) (R)))

There are four binding lists returned, each representing a different way that the two segment variables can be matched to the input list.

Second, if a segment variable appears twice, it must match the same sequence of items as it did before, e.g.,

> (pat-match '((?* l) (?* r) (?* l)) '(a b c)))
    (((R A B C) (L)))

There is only one answer now, because the only way the first segment variable can match a segment before and after some other segment in (A B C) is if the first segment variable matches zero items. On the other hand,

> (pat-match '((?* l) (?* r) (?* l)) '(a b c a b)))
    (((R A B C A B) (L)) ((R C) (L A B)))

has two possible answers.

An important point about segment patterns is that they can only occur inside lists. The following call to pat-match makes no sense:

(pat-match '(?* l) '(a b c))

However, the following does:

(pat-match '((?* l)) '(a b c))

The latter example will bind L to the list (A B C). It's not a particularly efficient way to do it, but it does make sense conceptually.

Implementing Segment Matching

In order for the pattern matcher to match a segment pattern against zero or more input items, the matcher has to check to see if it has a pattern that is a list whose first element is a segment matching pattern.

Here's the code from extend-match.lisp:

(defun pat-match (pat form &optional (blists '(nil)))
      (cond ((pat-extension-p pat)
            (match-extension pat form blists))
            ((eql pat form) blists)
            ((atom pat) nil)
            ((segment-pat-extension-p (first pat))
            (match-segment-extension pat form blists))
            ((atom form) nil)
            (t (pat-match (cdr pat) (cdr form)
              (pat-match (car pat) (car form) blists)))))

The fourth branch is the important one. It checks for a list starting with a pattern labelled as a segment matching pattern. If found, it calls match-segment-extension, the segment counterpart to match-extension.

(defun match-segment-extension (pats form blists)
      (let ((pat (first pats))
            (rest-pats (rest pats)))
        (let ((fn (pat-function (first pat))))
            (when (null fn)
              (error "Undefined pattern extension ~S" pat))
            (funcall fn (rest pat) rest-pats form blists))))

The parameter pats is a list of a patterns, starting with a segment pattern. form is a list of input items. The code above gets the matching function and passes it

the segment pattern's arguments, e.g., (L) for (?* L),
the remaining patterns in the list
the list of input items
the list of binding lists

The list of remaining patterns is needed because a segment matcher like ?* will consider many different bindings, each differing in how many input items are matched, but only the bindings that allow the rest of the patterns to match will be kept.

Here's the code for the segment variable matcher:

(defun match-segment-variable (args pats input blists)
      (destructuring-bind (&optional name) args
        (loop for tail = input then (rest tail)
              append (match-tail tail name pats input blists)
              until (null tail))))

    (defun match-tail (tail name pats input blists)
      (let ((blists (pat-match pats tail blists)))
        (if (null blists)
          nil
          (bind-variable name (ldiff input tail) blists))))

Basically, match-segment-variable calls match-tail with the input list, then the input list minus the first element, then the input list minus the first two elements, and so on, up to and including the input list minus all its elements. Successive cdr's of a list are called the tails of the list.

match-tail simply calls pat-match to see if the remaining patterns match the given tail of the input list. If they do, then match-tail tries to bind the segment variable to the elements in the input list preceding the tail.

ldiff is one of the great unknown functions of Common Lisp. Given a list and a tail of that list (not a list that has the same elements, but an actual tail), ldiff returns a list of the items preceding the tail. This function turns out to be surprisingly handy but often overlooked, even by veteran Lisp hackers.

To see why this works, consider

(pat-match '((?* l) (? x)) '(a b c))

match-tail will be called with (A B C), (B C), (C), and (). It will try matching each of these with the remaining patterns, namely ((? X)). Following the normal rules of matching, the only match that succeeds is the one that binds X to C, i.e., where the tail was (C). Hence, L will be bound to (A B).

The importance of bind-variable becomes clear when we consider

(pat-match '((?* l) (?* r) (?* l)) '(a b c)))

The sequence of events is roughly this:

The first occurrence of (?* L) first tries passing the entire list (A B C) to the remaining patterns ((?* R) (?* L)).
The (?* R) in turn first tries passing (A B C) to ((?* L)).
After numerous tries, (?* L) binds itself to (A B C). The list of binding lists (((?* L) (A B C))) is returned to (?* R).
(?* R) adds the binding (R) to each of these binding lists.
(?* R) then tries (B C) to ((?* L)) and gets back the answer (((?* L) (B C))).
(?* R) adds the binding (R A) to each of these binding lists.
Eventually, (?* R) has tried every tail, and returns the list of binding lists (((R) (L A B C)) ((R A) (L B C)) ((R A B) (L C)) ((R A B C) (L))).
(?* L) now tries to add the binding (L) to each of these lists. This succeeds only on the last binding list.
The steps above now repeat with (?* L) passing (B C) to the remaining patterns ((?* R) (?* L)).
Eventually, the binding lists returned are (((R) (L B C)) ((R B) (L C)) ((R B C) (L))).
(?* L) is unable to add the binding (L A) to any of these lists. And similarly for the remaining passes.