The pattern matcher in match.lisp works for many purposes, but the simplicity of the code introduces some limitations. For example
(start (?and ?* ?x) end)
will return nil. It will not match a list starting with start
and ending with end
, and bind ?x
to the list of
elements in between.
The Lisp Critic uses a similar but somewhat more complicated pattern matcher, defined in extend-match.lisp.
An Extensible Pattern Syntax
The extensible matcher uses this format to specify all pattern elements:
(?pattern-type arg1 arg2 ...)
For example,
- match any thing that does not match pattern
becomes
(?not
pattern
)
using
?not
to indicate "not this pattern" - match a list beginning with item1 followed by any
number of other items followed by item2 becomes
(item1 (?* var) item2)
using
?*
to indicate "match zero or more items" - match a number becomes
(?is predicate)
using
?is
to indicate "match anything that satisfies this predicate
Two other useful extensions that fit this format are:
(?and pat1 pat2 ...)
to mean "match anything that can matches all these patterns with one set of bindings"
(?or pat1 pat2 ...)
to mean "match anything that matches any of the patterns, and each match can have different bindings"
This format has two nice properties:
- It's easy to give any number of "arguments" to the pattern, as can be seen in the examples above.
- It's a flexible format for future extensions that might come up.
It does have a downside however. The simple pattern variable form
we used before, e.g., ?X
, doesn't fit the above format,
and in fact can cause patterns to be ambiguous and hard to maintain:
- Is
(?foo ...)
a list beginning with the variable?foo
or is?foo
an extension to the pattern language? - If
?foo
is not an extension when we wrote the pattern, but someone defines it as such later, our pattern that used to work will now behave differently, with no warning.
To avoid these problems, we drop the convenient simple
symbolic form for pattern variables in favor of the list (?
variable-name)
, e.g.,
(? X)
. Now there will be no ambiguity:
- If we want the variable
foo
, we write(? foo)
. - If we want a pattern extension, we write
(?foo ...)
. - If someone else defines an extended meaning for
?foo
, it doesn't affect patterns with(? foo)
.
Implementing Extensible Patterns
One way to implement all these extensions to our matcher would be to use a conditional, e.g.,
(defun pat-match (pat form &optional (blists '(nil))) (cond ((eql pat form) blists) ((or (atom pat) (atom form)) nil) ((eql (first pat) '?) (match-variable pat form blists)) ((eql (first pat) '?is) (match-predicate pat form blists)) ((eql (first pat) '?and) (match-and pat form blists)) (t match car and cdr)))
The problem is that pat-match
will get pretty long,
but never long enough. If we or some programmer using our matcher
wants a new extension, they're going to have to edit this definition.
The right way to do this is to replace the above with
(defun pat-match (pat form &optional (blists '(nil))) (cond ((pat-extension-p pat) (match-extension pat form blists)) ((eql pat form) blists) ((or (atom pat) (atom form)) nil) (t (pat-match (cdr pat) (cdr form) (pat-match (car pat) (car form) blists)))))
This code presumes that pat-extension-p
can determine
if an item is a defined pattern extension. This is the approach taken
in
extend-match.lisp.
Although pat-match
is simpler, don't we have to
implement pat-extension-p
and
match-extension
with a conditional, in fact with two,
one each function? Won't this actually be harder to maintain than the
original version?
The key is to define pat-extension-p
and
match-extension
to look in a table. The table associates
a symbol, e.g., ?is
, with a function that knows how to
match patterns involving that symbol. How we implement this table
isn't important. We could use a-lists or hashtables or whatever.
For example, if the function (pat-function
symbol
)
returns the pattern
matcher for the symbol, if any, or NIL
, then
pat-extension-p
is pretty easy to define:
(defun pat-extension-p (name) (not (null (pat-function name)))
match-extension
is only a little more complex:
(defun match-extension (pat form blists) (let ((fn (pat-function (first pat)))) (when (null fn) (error "Undefined pattern extension ~S" pat)) (funcall fn (rest pat) form blists)))
This will call the matching function attached to the pattern
extension, e.g., ?is
, passing it the arguments to the
extension, e.g., the list (numberp)
if the pattern were
(?is numberp)
, the form to be matched against, and the
current list of binding lists. This should be enough for the matching
function to do its job.
Putting things in the table is also easy. We define a function add-extension to associate a pattern matching function with a pattern label. If match-predicate is the function that does the matching work for ?is, we'd write
(add-extension '?is :single 'match-predicate)
The :single says that an ?is pattern matches one input pattern. Most patterns are of this type.
This approach is called data-driven programming because how our code runs (which code gets called) is driven by the data (in this case, symbols in the pattern).
Data-driven programming is a very useful technique. Consider using it whenever you start writing a
COND
orCASE
that chooses between various pieces of code based on some input data. It's a technique that simultaneously makes code simpler and more flexible.
Segment Matching Patterns
Segment matching is involved for any pattern that can match zero or more items. Two examples are:
(?* L)
which can match any number of items(?optional
pat
)
which can zero or one occurrences of pat
Just adding segment variables, i.e, patterns of the form (?*
L)
, changes how things work in several ways.
First, we can have more than one answer:
> (pat-match '((?* l) (?* r)) '(a b c)) (((L) (R A B C)) ((L A) (R B C)) ((L A B) (R C)) ((L A B C) (R)))
There are four binding lists returned, each representing a different way that the two segment variables can be matched to the input list.
Second, if a segment variable appears twice, it must match the same sequence of items as it did before, e.g.,
> (pat-match '((?* l) (?* r) (?* l)) '(a b c))) (((R A B C) (L)))
There is only one answer now, because the only way the first
segment variable can match a segment before and after some other
segment in (A B C)
is if the first segment variable
matches zero items. On the other hand,
> (pat-match '((?* l) (?* r) (?* l)) '(a b c a b))) (((R A B C A B) (L)) ((R C) (L A B)))
has two possible answers.
An important point about segment patterns is that they can only
occur inside lists. The following call to
pat-match
makes no sense:
(pat-match '(?* l) '(a b c))
However, the following does:
(pat-match '((?* l)) '(a b c))
The latter example will bind L
to the list (A B
C)
. It's not a particularly efficient way to do it, but it
does make sense conceptually.
Implementing Segment Matching
In order for the pattern matcher to match a segment pattern against zero or more input items, the matcher has to check to see if it has a pattern that is a list whose first element is a segment matching pattern.
Here's the code from extend-match.lisp:
(defun pat-match (pat form &optional (blists '(nil))) (cond ((pat-extension-p pat) (match-extension pat form blists)) ((eql pat form) blists) ((atom pat) nil) ((segment-pat-extension-p (first pat)) (match-segment-extension pat form blists)) ((atom form) nil) (t (pat-match (cdr pat) (cdr form) (pat-match (car pat) (car form) blists)))))
The fourth branch is the important one. It checks for a list
starting with a pattern labelled as a segment matching pattern. If
found, it calls match-segment-extension
, the segment
counterpart to match-extension
.
(defun match-segment-extension (pats form blists) (let ((pat (first pats)) (rest-pats (rest pats))) (let ((fn (pat-function (first pat)))) (when (null fn) (error "Undefined pattern extension ~S" pat)) (funcall fn (rest pat) rest-pats form blists))))
The parameter pats
is a list of a patterns, starting
with a segment pattern. form
is a list of input items.
The code above gets the matching function and passes it
- the segment pattern's arguments, e.g.,
(L)
for(?* L)
, - the remaining patterns in the list
- the list of input items
- the list of binding lists
The list of remaining patterns is needed because a segment matcher
like ?*
will consider many different bindings, each
differing in how many input items are matched, but only the bindings
that allow the rest of the patterns to match will be kept.
Here's the code for the segment variable matcher:
(defun match-segment-variable (args pats input blists) (destructuring-bind (&optional name) args (loop for tail = input then (rest tail) append (match-tail tail name pats input blists) until (null tail)))) (defun match-tail (tail name pats input blists) (let ((blists (pat-match pats tail blists))) (if (null blists) nil (bind-variable name (ldiff input tail) blists))))
Basically, match-segment-variable
calls
match-tail
with the input list, then the input list
minus the first element, then the input list minus the first two
elements, and so on, up to and including the input list minus all its
elements. Successive cdr
's of a list are called the
tails of the list.
match-tail
simply calls pat-match
to see
if the remaining patterns match the given tail of the input list. If
they do, then match-tail
tries to bind the segment
variable to the elements in the input list preceding the tail.
ldiff
is one of the great unknown functions of Common
Lisp. Given a list and a tail of that list (not a list that
has the same elements, but an actual tail), ldiff
returns a list of the items preceding the tail. This function turns
out to be surprisingly handy but often overlooked, even by veteran
Lisp hackers.
To see why this works, consider
(pat-match '((?* l) (? x)) '(a b c))
match-tail
will be called with (A B C)
,
(B C)
, (C)
, and ()
. It will
try matching each of these with the remaining patterns, namely
((? X))
. Following the normal rules of matching, the
only match that succeeds is the one that binds X
to
C
, i.e., where the tail was (C)
. Hence,
L
will be bound to (A B)
.
The importance of bind-variable
becomes clear when we
consider
(pat-match '((?* l) (?* r) (?* l)) '(a b c)))
The sequence of events is roughly this:
- The first occurrence of
(?* L)
first tries passing the entire list(A B C)
to the remaining patterns((?* R) (?* L))
. - The
(?* R)
in turn first tries passing(A B C)
to((?* L))
. - After numerous tries,
(?* L)
binds itself to(A B C)
. The list of binding lists(((?* L)
(A B C)))
is returned to(?* R)
. (?* R)
adds the binding(R)
to each of these binding lists.(?* R)
then tries(B C) to ((?* L))
and gets back the answer(((?* L)
(B C)))
.(?* R)
adds the binding(R A)
to each of these binding lists.- Eventually,
(?* R)
has tried every tail, and returns the list of binding lists(((R) (L A B C)) ((R A) (L B C)) ((R A B) (L C)) ((R A B C) (L)))
. (?* L)
now tries to add the binding(L)
to each of these lists. This succeeds only on the last binding list.- The steps above now repeat with
(?* L)
passing(B C)
to the remaining patterns((?* R) (?* L))
. - Eventually, the binding lists returned are
(((R) (L B C)) ((R B) (L C)) ((R B C) (L)))
. (?* L)
is unable to add the binding(L A)
to any of these lists. And similarly for the remaining passes.