The Scanning Problem

Here's a very common programming problem:

Repeatedly collect items from an input stream into chunks, until the stream is empty.

For example, compilers for C and Pascal, and the Lisp reader, collect characters from the input stream to form tokens, where each token is either a name or number or punctuation element.

Another example is compress on page 37 of Graham's ANSI Common Lisp. This function collects items from a list into pairs of the form (number element), indicating how often each item is repeated.

Ideally, we'd like our code to break down into two pieces:

The key problem in defining these scanning functions is that the second function needs two values from the first function:

Doing this cleanly, following the Cardinal Rule of Functions, takes some thought.

Approach 1: Extra Parameters

Graham's definition of compress defined a new function, compr, that took three arguments:

The second and third arguments hold the two values needed, and recursion was used to repeat the process until the list was emptied.

Unfortunately, the function compr is not very clear. It has a terrible name, but it's hard to give it a good name because it violates the Cardinal Rule of Functions and mixes several tasks together.

Approach 2: Chunk then Trim

In this particular example, it turns out to not be too hard to use a scanning function that takes an object and a list and returns the length of the run of that element. We'll give a recursive definition of the scanning function here:

(defun get-run (x l)
  (cond ((null l) 0)
        ((not (eql x (first l))) 0)
        (t (1+ (get-run x (rest l))))))

An iterative definition would be better here, because it would handle runs of any length without overflowing the function call stack.

Even better would be to take advantage of Common Lisp's large library of sequence functions, e.g.,

(defun get-run (x l)
  (or (position x l :test-not #'eql))
      (length l)))

Or, if you want to avoid the "deprecated" :test-not feature:

(defun get-run (x l)
  (or (position x l :test (complement #'eql))
      (length l)))

compress can then be defined as a repeated call to get-run:

(defun compress (l)
  (cond ((null l) nil)
        (t (let ((run (1+ (get-run (first l) (rest l)))))
             (cons (list run (first l))
                   (compress (nthcdr run l)))))))

Unfortunately, this version of compress is still a little messy.

More importantly, it doesn't generalize well. After get-run scans the list, it returns how many items it found. This number is then used to trim the list. It works because it's easy and relatively cheap to use the run-length to trim off the items used so far, but this isn't always the case, plus it leads to the messy nesting.

Approach 3: Generators

Because chunking streams is a common problem, we could go for a general solution and define some functions that capture the procedural flow underlying the general solution to this kind of problem.

Note. generators.lisp contains everything given below, except for the compression-specific functions, plus a few extras.

First, we define a new concept, a generator. A generator is a function that you call to get elements from some source. That source can be an input stream, a list of elements, an incrementing counter, whatever.

A generator takes one argument:

(Odd as the last rule may seem, it tends to be the most useful of the obvious options, just as i++ is more commonly used than ++i in C and C++.)

We can define generators using closures. We can do this on as needed basis, but a couple of generator-creating functions are particularly useful and easy to define.

For example, make-number-gen makes generators that generate numbers from M to N. If no N is given, the generator never runs out, i.e, never returns nil to :test.

(defun make-number-gen (m &optional n)
  #'(lambda (cmd)
      (ecase cmd
        (:test (and (numberp n) (> m n)))
        (:get m)
        (:next (prog1 m
                    (unless (and (numberp n) (> m n))
                      (incf m)))))))

Note the use of prog1 to save and return the old value of m.

The function make-list-gen makes generators that generate elements from a list, from left to right.

(defun make-list-gen (l)
  #'(lambda (cmd)
      (ecase cmd
        (:test (null l))
        (:get (car l))
        (:next (pop l)))))

We could define more generator-making functions, but now let's define three simple functions that will make code using generators more readable, and less dependent on how they're implemented.

(defun empty-gen-p (gen)
  (funcall gen :test))
(defun gen-elt (gen)
  (funcall gen :get))
(defun advance-gen (gen)
  (funcall gen :next))

A good example of a generator-using function, and a handy one for testing generators, is extrude. (extrude n gen) returns a list of the first N elements generated by the generator gen.

(defun extrude (n gen)
  (do ((l nil (cons (advance-gen gen) l))
       (i 0 (1+ i)))
      ((or (empty-gen-p gen) (>= i n))
       (nreverse l))))

Here are some examples:

? (extrude 5 (make-number-gen 3))
(3 4 5 6 7)
? (extrude 5 (make-number-gen 3 6))
(3 4 5)
? (extrude 5 (make-list-gen '(a b c d e f)))
(A B C D E)
? (extrude 5 (make-list-gen '(a b c)))
(A B C)

Now, with all this background out of the way, here is how we would use generators for our compression problem.

First, we define get-run. It keeps calling the generator until an element different than the given element is returned. It returns the pair (number elt) indicating how many occurrences of the element it saw, counting the given one.

(defun get-run (elt gen)
  (do ((n 1 (1+ n)))
      ((or (empty-gen-p gen)
           (not (eql elt (gen-elt gen))))
       (list n elt))
    (advance-gen gen)))

Finally, compress just calls get-run until the generator is exhausted.

(defun compress (l)
  (let ((gen (make-list-gen l)))
    (do ((run-pairs nil
                    (cons (get-run (advance-gen gen) gen)
        ((empty-gen-p gen)
         (nreverse run-pairs)))))


Generators have a long history in Lisp and other functional programming languages. The Common Lisp standard defines a particularly sophisticated set of series, generators and gatherers. See Common Lisp The Language, 2nd edition, by Guy L. Steele Jr., Appendices A and B (Digital Press, 1990).

In Scheme, they're called streams. See Structure and Interpretation of Computer Programs, Harold Abelson and Gerald Jay Sussman with Julie Sussman, Section 3.4 (MIT Press, 1985)

C++ has the same thing, only they call them iterators. Mitchell Model explains quite nicely why C++ iterators are really not iterators, and why real iterators would be hard to do in C++. See Data Structures, Data Abstraction: A Contemporary Introduction Using C++, Mitchell L. Model, p. 57 and Appendix C (Prentice-Hall, 1994).

Comments? Send mail to Chris Riesbeck.