"~& ~:[?~;~:*~S~]: ~:[?~;~:*~S~] -> ~:[?~;~:*~S~]~%"

Some obscure things in Lisp, specifically Common Lisp.

The aim of this is to summarise some of the more obscure pitfalls which I and others have fallen into, and to provide some rationale for why (I think) things are the way they are in some cases. This document is growing slowly as I find new obscurities.

Printing unreadably

It's important that user-defined code that prints representations of objects should conform to the requirements of the system. The thing that really matters is listening to *print-readably*: if *print-readably* is true, then any printing must be done in such a way that it can be read back, and if this cannot be done then an error of type print-not-readable must be signalled.

It's tempting to write something like this:

(defclass mux ()
  ((connections :initform '()
		:accessor mux-connections)))


(defmethod print-object ((m mux) stream)
  (format stream "#<MUX with ~A connections>"
	  (length (mux-connections m))))
    

But this is broken: if *print-readably* is true this will happily generate unreadable output, causing later problems.

If you need to write completely user-defined print methods like this, you need to do something like this:

(defmethod print-object ((m mux) stream)
  (if *print-readably*
      (call-next-method)
      (format stream "#<MUX with ~A connections>"
	      (length (mux-connections m)))))

Here call-next-method arranges for the default method to be called, which will signal the appropriate error.
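
A quick way to convince yourself that the error really is signalled is sketched below. The class demo-mux and the helper unreadable-when-readably-p are names made up for this illustration (they are not part of the code above), and the sketch assumes the implementation's default print-object method signals print-not-readable, as the standard requires for objects it cannot print readably.

```lisp
;; DEMO-MUX is a hypothetical stand-in for the MUX class in the text.
(defclass demo-mux ()
  ((connections :initform '()
                :accessor demo-mux-connections)))

(defmethod print-object ((m demo-mux) stream)
  (if *print-readably*
      ;; Defer to the default method, which signals PRINT-NOT-READABLE.
      (call-next-method)
      (format stream "#<DEMO-MUX with ~A connections>"
              (length (demo-mux-connections m)))))

(defun unreadable-when-readably-p (object)
  "True if printing OBJECT with *PRINT-READABLY* true signals PRINT-NOT-READABLE."
  (handler-case
      (let ((*print-readably* t))
        (prin1-to-string object)
        nil)
    (print-not-readable () t)))
```

Calling (unreadable-when-readably-p (make-instance 'demo-mux)) should then return true, while ordinary printing still produces the custom representation.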

Better still, you can often use print-unreadable-object, which will arrange for an error to be signalled:

(defmethod print-object ((m mux) stream)
  (print-unreadable-object (m stream :type t :identity t)
    (format stream "with ~A connections"
	    (length (mux-connections m)))))

This method doesn't provide completely user-defined output, and so isn't always suitable. An example of where you might want more control than this is in printing condition objects.

Checking the value of *print-escape* is also sometimes useful, although unlike *print-readably* its effect is not mandatory.
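
As a rough sketch of how *print-escape* might be consulted alongside *print-readably*: the hypothetical class named-thing below (invented for this example) prints just its name for princ-style output and an unreadable representation for prin1-style output. This is one plausible arrangement, not the only reasonable one.

```lisp
;; NAMED-THING is a hypothetical class used only for illustration.
(defclass named-thing ()
  ((name :initarg :name :reader named-thing-name)))

(defmethod print-object ((thing named-thing) stream)
  (cond (*print-readably*
         ;; Mandatory: let the default method signal PRINT-NOT-READABLE.
         (call-next-method))
        (*print-escape*
         ;; PRIN1-style output: something visibly unreadable.
         (print-unreadable-object (thing stream :type t)
           (prin1 (named-thing-name thing) stream)))
        (t
         ;; PRINC-style output: just the name, for human consumption.
         (princ (named-thing-name thing) stream))))
```

With this method, princ prints only the name, while prin1 prints something beginning with #< that contains the name as a quoted string.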

Erik Naggum & Kent Pitman taught me this.

Snarfing files into strings

It's sad that this counts as obscure, but it does seem to be surprisingly little known.

People brought up on Unix/C-based systems have a tendency to treat characters and octets (often called `bytes') as the same thing: a character is simply an 8-bit number. This is hopelessly wrong, of course, but on Unix systems it's long been possible to get away with this confusion.

However, characters in Lisp are not 8-bit numbers, and on non-Unix systems, such as Windows, or on systems where the character set does not fit into 8 bits, you can't get away with this pretense. If we accept the Unix/Windows notion of a file as a sequence of octets, then these octets encode the characters that the file consists of, and that encoding may be non-trivial. This is what the `external format' argument to open is for - it provides a place to specify how the file is encoded. It is not safe to assume that the default encoding is one-octet-to-one-character.

This means that the length of a file in octets may be different from the length of the file when read into a string.

Why does this matter? Well, Common Lisp specifies a function, file-length, which is meant to tell you the length of a file. If you want to read a file into a string, this might seem like a good way to do it:

(defun snarf-file (file)
  ;; Read a file into a string
  ;; BROKEN
  (with-open-file (in file :direction :input)
    (let ((s (make-string (file-length in))))
      (read-sequence s in)
      s)))
    

But this almost certainly will not work reliably. file-length will almost certainly tell you the length of the file in octets, not characters, and if the encoding is not trivial, this will mean that the string allocated will be the wrong length (typically it will be too long). To see why this is likely to be true, consider how you would make things work `right' on a Unix-like system: since the file is actually just a sequence of octets - there is no useful metadata - then in order for file-length to calculate the character length of the file, it would have to read the whole file, decoding it into characters. So in order to work, this code has to read the whole file twice.
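
The two notions of length can be compared directly. The sketch below uses two helper names invented for this purpose, file-octet-length and file-character-length. The first opens the file as a stream of octets, where file-length is well defined. The second has to do exactly the work described above: read and decode the whole file. It assumes the default external format is the one you want to measure.

```lisp
(defun file-octet-length (file)
  "Length of FILE in octets, measured on a binary stream."
  (with-open-file (in file :direction :input
                           :element-type '(unsigned-byte 8))
    (file-length in)))

(defun file-character-length (file)
  "Number of characters in FILE under the default external format.
This decodes the entire file, which is the cost FILE-LENGTH avoids."
  (with-open-file (in file :direction :input)
    (loop for c = (read-char in nil nil)
          while c
          count t)))
```

For a file with a multi-octet encoding or CRLF line endings the two numbers will generally differ, with the octet length being the larger.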

This version of the function works, although it is probably possible to write it more efficiently than this:

(defun snarf-file (file)
  ;; Encoding-resistant file reader.  You can't use FILE-LENGTH
  ;; because in the presence of variable-length encodings (and DOS
  ;; line-ending conventions) the length of a file can bear little
  ;; resemblance to the length of the string it corresponds to.
  ;; Reading each line like this wastes a bunch of space but does
  ;; solve the encoding issues.  The result always ends with a
  ;; newline (if it is not empty), even if the file's last line did not.
  (with-open-file (in file :direction :input)
    (loop for read = (read-line in nil nil)
          while read
          collect read into lines
          ;; COUNT rather than a FOR clause: in conforming code a FOR
          ;; clause may not follow WHILE.
          count t into nlines
          sum (length read) into len
          finally (return
                   (let ((huge (make-string (+ len nlines))))
                     (loop with pos = 0
                           for line in lines
                           for line-length = (length line)
                           do (setf (subseq huge pos) line
                                    (aref huge (+ pos line-length)) #\Newline
                                    pos (+ pos line-length 1))
                           finally (return huge)))))))
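
One possible alternative, offered as a sketch rather than a tuned implementation, is to read the file in fixed-size chunks of characters and accumulate them in a string output stream. The name snarf-file/chunked and the chunk-size keyword are my own inventions for this example. Because read-sequence on a character stream counts characters, not octets, the encoding issue does not arise, and the file's final line is reproduced as it is, with or without a trailing newline.

```lisp
(defun snarf-file/chunked (file &key (chunk-size 4096))
  "Read FILE into a string, CHUNK-SIZE characters at a time."
  (with-open-file (in file :direction :input)
    (with-output-to-string (out)
      (let ((buffer (make-string chunk-size)))
        ;; READ-SEQUENCE returns the index of the first element it did
        ;; not fill, so a short read means end of file was reached.
        (loop for end = (read-sequence buffer in)
              do (write-string buffer out :end end)
              while (= end chunk-size))))))
```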
    

Of course this all seems like some hopeless obscurity that only people in countries with weird character sets need to worry about: 7-bit ASCII was good enough for my grandfather and it's good enough for me.

Well, no. This matters even in the US. If you ever use DOS or Windows machines, you'll have discovered that the convention there is to use a two character sequence - CRLF - to end lines. But Common Lisp (and Unix) conventionally uses a single character - LF or #\Newline. Decent CL implementations therefore need to translate DOS files as they are read into strings: replacing CRLF sequences by LF. This means that, on DOS, the octet-length of a file is longer than the character length by the number of lines in the file (-1 if the last line does not end with CRLF). And DOS is a Unix-like system - its files are just octet sequences, so file-length will almost certainly return the octet length.

A remarkable number of programs developed under Unix or Linux simply get this wrong, resulting in obscure and inexcusable bugs.

Default values for get &c

The CL property-list accessors get and getf don't provide an obvious way of telling whether a property is present or not. For code like this

(defvar *plist* '(:foo 1 :bar nil))

(getf *plist* ':bar)

you will get nil whether the property is present or not. get provides an optional default value, but this doesn't obviously help, because anything can be in a property list. get-properties does provide a way of telling whether something is present or not, but it is an over-elaborate solution.
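
For comparison, here is roughly what the get-properties route looks like. The function name getf/get-properties is made up for this sketch. get-properties returns three values (the indicator found, its value, and the tail of the list starting at that indicator), and the tail is non-nil exactly when something was found.

```lisp
(defun getf/get-properties (plist indicator)
  "Return the value for INDICATOR in PLIST, and T as a second value if present."
  (multiple-value-bind (found-indicator value tail)
      (get-properties plist (list indicator))
    (declare (ignore found-indicator))
    ;; TAIL is the part of PLIST beginning at the indicator, or NIL.
    (values value (not (null tail)))))
```

It works, but it conses a one-element list on each call and is more machinery than the problem deserves.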

Fortunately this is not a hard problem to solve. What needs to be done is to provide a default value which absolutely can not be in the property list - for instance a freshly-allocated cons. The following does this:

(defun getf/default (place indicator &optional (default nil))
  ;; first value is what was found, second is t if it was present.
  (let* ((magic (cons nil nil))
	 (r (getf place indicator magic)))
    (if (eq r magic)
	(values default nil)
      (values r t))))

;;; SETF method left as an exercise

This version allocates a cons for each call. This can quite easily be avoided though:

(defun getf/default (place indicator &optional (default nil))
  ;; first value is what was found, second is t if it was present.
  (let* ((magic (load-time-value (cons nil nil)))
	 (r (getf place indicator magic)))
    (if (eq r magic)
	(values default nil)
      (values r t))))

Similar versions of get can be defined.
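
A corresponding sketch for get might look like this. The name get/default is my own, chosen by analogy with getf/default, and the same unique-cons trick is used.

```lisp
(defun get/default (symbol indicator &optional (default nil))
  ;; First value is what was found (or DEFAULT), second is T if the
  ;; property was present on SYMBOL's property list.
  (let* ((magic (load-time-value (cons nil nil)))
         (r (get symbol indicator magic)))
    (if (eq r magic)
        (values default nil)
        (values r t))))
```

So after (setf (get 'some-symbol :colour) nil), the call (get/default 'some-symbol :colour :unknown) returns nil and t, while asking for a property that was never set returns :unknown and nil.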

This technique works in any case where you need to distinguish `not present' from `present with default value' in any kind of accessor that provides a default value option.

What read-delimited-list will and won't do, and changing the reader syntax in CL

read-delimited-list: what it will and won't do

A fairly common trick when defining `special' syntaxes, while remaining in the Lisp style, is to define a pair of delimiter characters which behave like ( and ). This is generally done with code that looks something like this:

(defvar *my-readtable* (copy-readtable))

(set-syntax-from-char #\] #\) *my-readtable* *readtable*)

(set-macro-character
 #\[
 #'(lambda (stream char)
     (declare (ignore char))
     (process-bracketed-form (read-delimited-list #\] stream t)))
 nil
 *my-readtable*)

This doesn't quite do what most people expect. read-delimited-list reads a list, and not a general form. In particular, with the above syntax, the string "[foo . bar]" is not legal read syntax, because it contains a single dot.

As far as I know there is simply no easy way to get at the general `form reader' which will handle consing dots. Fortunately this isn't very often needed, but occasionally it is a pain. It would be nice if there were a read-delimited-form function which would handle consing dots.
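
The limitation is easy to see in a small self-contained sketch. The names make-bracket-readtable and read-with-brackets are invented here, and the bracket reader simply returns the list it read, with no further processing.

```lisp
(defun make-bracket-readtable ()
  "Return a fresh readtable in which [ ... ] reads as a list."
  (let ((rt (copy-readtable nil)))
    ;; Give ] the syntax of ) so that it terminates tokens.
    (set-syntax-from-char #\] #\) rt)
    (set-macro-character #\[
                         (lambda (stream char)
                           (declare (ignore char))
                           (read-delimited-list #\] stream t))
                         nil
                         rt)
    rt))

(defun read-with-brackets (string)
  "Read one object from STRING using the bracket readtable."
  (let ((*readtable* (make-bracket-readtable)))
    (read-from-string string)))
```

Reading "[foo bar]" this way yields the list (foo bar), but reading "[foo . bar]" signals an error, because a lone dot is only acceptable where dotted-pair notation is being read.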

Changing the reader syntax in CL

Some people get magnificently confused about this. In particular they think that the problem in the preceding section can be overcome by a trick like this:

(set-syntax-from-char #\{ #\()
(set-syntax-from-char #\} #\))

However, this does not work. The readmacro for ( is defined to look for ), not for a character which merely has the same syntax. In effect, the readmacro for ( somewhere calls (read-delimited-form stream #\) ... ). This is quite clear from the specification. The first paragraph of section 2.4.1 states (in part):

The left-parenthesis initiates reading of a list. read is called recursively to read successive objects until a right parenthesis is found in the input stream. A list of the objects read is returned.

It's hard to be more explicit than that. The definition of set-syntax-from-char also states:

A macro definition from a character such as " can be copied to another character; the standard definition for " looks for another character that is the same as the character that invoked it. The definition of ( can not be meaningfully copied to {, on the other hand. The result is that lists are of the form {a b c), not {a b c}, because the definition always looks for a closing parenthesis, not a closing brace.

One reason why things are like this is because there isn't enough information in the syntax of a character in CL to say which character should match. If copying syntaxes like this worked, there would be no mechanism to prevent braces matching parens and vice versa. The function that implements the readmacro is given the character that caused it to be called, but there is no table of matching characters. Emacs' syntax tables (used for buffer contents, not for Emacs Lisp itself) do allow this - you can specify that a character is a delimiter and what other character it matches - see the Emacs Lisp function modify-syntax-entry if you want to see how this is done.
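
The behaviour the specification describes can be checked with a few lines. The function name read-with-copied-brace-syntax is made up for this sketch, which works on a private copy of the standard readtable so that nothing global is disturbed. It assumes an implementation whose ( readmacro behaves as the standard describes.

```lisp
(defun read-with-copied-brace-syntax (string)
  "Read one object from STRING after copying ( and ) syntax to { and }."
  (let ((*readtable* (copy-readtable nil)))
    ;; With no readtable arguments, SET-SYNTAX-FROM-CHAR copies from
    ;; the standard readtable into the current one (our private copy).
    (set-syntax-from-char #\{ #\()
    (set-syntax-from-char #\} #\))
    (read-from-string string)))
```

On such an implementation, reading "{a b c)" returns the list (a b c), just as the passage quoted from the specification says, while "{a b c}" does not read as a list at all.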

Claim

In order to correctly read a `bracketed form' (I made this term up) given only the opening character, you must know which character matches the opening character. The Common Lisp readtable does not contain this information, and therefore it is not possible to do this with character-syntax copying alone.

`Proof'

To see why this must be true, imagine I want to cause the system to read forms opening with a character a, and closing with another character b. I might try to do this as follows:

(set-syntax-from-char a #\()
(set-syntax-from-char b #\))

Now consider what happens when the reader sees a in an unquoted context. The reader function is called with two arguments: the stream being read, and the character a. It needs to somehow read a form delimited by b. But how can it know that it should read b and not, say #\)? It can't: there is no table of which pairs match anywhere in the system. Therefore this trick cannot work: there is simply no mechanism in the language which will allow correct matching to be enforced.

Note

To allow this trick to work as an extension to the standard it would be enough to cause the paren reader not to look for #\), but instead simply to look for a character with the same syntax. This is very undesirable, because there is still no mechanism of ensuring matching - any opening delimiter will match any closing delimiter.

So one way of allowing this kind of `nonstandard paren' in CL would be to change the language to support the notion of matching characters, and to allow users to define which pairs matched.
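
Short of changing the language, user code can carry the matching information itself. In the sketch below, install-delimiter-pair is a hypothetical helper of my own: the closing character is captured in the closure of the reader function, which is exactly the piece of information the readtable does not hold. Since it is built on read-delimited-list, it still does not handle consing dots.

```lisp
(defun install-delimiter-pair (open close &optional (readtable *readtable*))
  "Make OPEN ... CLOSE read as a list in READTABLE, which is returned."
  ;; CLOSE needs the syntax of ) so that it terminates tokens.
  (set-syntax-from-char close #\) readtable)
  (set-macro-character open
                       (lambda (stream char)
                         (declare (ignore char))
                         ;; The matching character lives in this closure.
                         (read-delimited-list close stream t))
                       nil
                       readtable)
  readtable)
```

For example, after binding *readtable* to a copy of the standard readtable and installing both { } and [ ] pairs, the text "{a [b c] d}" reads as (a (b c) d), with each opening character looking only for its own partner.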

However, a much simpler way is simply to have read-delimited-form. With the addition of this function, the current system is adequate. I think that read-delimited-form is implementable in portable CL, and I think I have such an implementation, but I'm not sure it works. If anyone would like to test it, please let me know.

[TFEB]