Some obscure things in Lisp, specifically Common Lisp.
The aim of this is to try and summarise some of the more obscure pitfalls which I and others have fallen into, and to try and provide some rationale for why (I think) things are the way they are in some cases. This document is growing slowly as I find new obscurities.
It's important that user-defined code that prints
representations of objects should conform to the requirements of
the system. The thing that really matters is listening to
*print-readably*: if *print-readably* is
true, then any printing must be done in such a way that
it can be read back, and if this cannot be done then an error of
type print-not-readable must be signalled.
It's tempting to write something like this:
(defclass mux ()
  ((connections :initform '()
                :accessor mux-connections)))

(defmethod print-object ((m mux) stream)
  (format stream "#<MUX with ~A connections>"
          (length (mux-connections m))))
But this is broken: if *print-readably* is true
this will happily generate unreadable output, causing later
problems.
If you need to write completely user-defined print methods like this, you need to do something like this:
(defmethod print-object ((m mux) stream)
  (if *print-readably*
      (call-next-method)
      (format stream "#<MUX with ~A connections>"
              (length (mux-connections m)))))
Here call-next-method arranges for the
default method to be called, which will signal the appropriate
error.
Better still, you can often use
print-unreadable-object, which will arrange for an
error to be signalled if *print-readably* is true:
(defmethod print-object ((m mux) stream)
  (print-unreadable-object (m stream :type t :identity t)
    (format stream "with ~A connections"
            (length (mux-connections m)))))
This method doesn't provide completely user-defined output, and so isn't always suitable. An example of where you might want more control than this is in printing condition objects.
Checking the value of *print-escape* is also
sometimes useful, although unlike *print-readably*
its effect is not mandatory.
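A minimal sketch of how you might check that a print method respects *print-readably*. The class and method repeat the print-unreadable-object version above so the example stands alone; readable-print-fails-p is a hypothetical helper name, not part of CL.

```lisp
(defclass mux ()
  ((connections :initform '()
                :accessor mux-connections)))

(defmethod print-object ((m mux) stream)
  (print-unreadable-object (m stream :type t :identity t)
    (format stream "with ~A connections"
            (length (mux-connections m)))))

;; Hypothetical helper: true if printing OBJECT with *PRINT-READABLY*
;; bound to true signals PRINT-NOT-READABLE, false if it prints.
(defun readable-print-fails-p (object)
  (handler-case
      (let ((*print-readably* t))
        (prin1-to-string object)
        nil)
    (print-not-readable () t)))
```

With these definitions, (readable-print-fails-p (make-instance 'mux)) should be true, while (readable-print-fails-p 42) should be false, since integers always have a readable representation.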
Erik Naggum & Kent Pitman taught me this.
It's sad that this is an obscure thing, but it seems to be at least little known, surprisingly enough.
People brought up on Unix/C-based systems have a tendency to treat characters and octets (often called `bytes') as the same thing: a character is simply an 8-bit number. This is hopelessly wrong, of course, but on Unix systems it's long been possible to get away with this confusion.
However, characters in Lisp are not 8-bit numbers, and on
non-Unix systems, such as Windows, or on systems where the
character set does not fit into 8 bits, you can't get away with
this pretense. If we accept the Unix/Windows notion of a file
as a sequence of octets, then these octets encode the
characters that the file consists of, and that encoding may be
non-trivial. This is what the `external format' option to
open is for - it provides a placeholder to specify
how the file is encoded. It is not safe to assume that the
default value of the encoding is one-octet-to-one-character.
This means that the length of a file in octets may be different than the length of the file when read into a string.
Why does this matter? Well, Common Lisp specifies a function,
file-length which is meant to tell you the length of
a file. If you want to read a file into a string, this might seem
like a good way to do it:
(defun snarf-file (file)
  ;; Read a file into a string
  ;; BROKEN
  (with-open-file (in file :direction :input)
    (let ((s (make-string (file-length in))))
      (read-sequence s in)
      s)))
But this almost certainly will not work reliably.
file-length will almost certainly tell you the
length of the file in octets, not characters, and if
the encoding is not trivial, this will mean that the string
allocated will be the wrong length (typically it will be too
long). To see why this is likely to be true, consider how you
would make things work `right' on a Unix-like system: since the
file is actually just a sequence of octets - there is no useful
metadata - then in order for file-length to
calculate the character length of the file, it would have to
read the whole file, decoding it into characters. So in order
to work, this code has to read the whole file
twice.
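To see the difference for a particular file, you can compare the two lengths directly. This is only a sketch: octet-and-character-lengths is a hypothetical name, and the character count depends on whatever external format your implementation uses by default, which is exactly the point.

```lisp
;; Hypothetical helper: returns two values, the length of FILE in
;; octets and the number of characters obtained by reading it with
;; the implementation's default external format.
(defun octet-and-character-lengths (file)
  (values
   ;; On a binary stream FILE-LENGTH counts stream elements, here octets.
   (with-open-file (in file :direction :input
                            :element-type '(unsigned-byte 8))
     (file-length in))
   ;; The only reliable way to count characters is to decode them all.
   (with-open-file (in file :direction :input)
     (loop for c = (read-char in nil nil)
           while c
           count t))))
```

For a plain ASCII file with Unix line endings the two values will usually agree; for a file with CRLF line endings or a multi-octet encoding they may well not.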
This version of the function works, although it is probably possible to write it more efficiently than this:
(defun snarf-file (file)
  ;; encoding-resistant file reader.  You can't use FILE-LENGTH
  ;; because in the presence of variable-length encodings (and DOS
  ;; linefeed conventions) the length of a file can bear little resemblance
  ;; to the length of the string it corresponds to.  Reading each line
  ;; like this wastes a bunch of space but does solve the encoding
  ;; issues.  Note that the result always ends with a newline, even if
  ;; the last line of the file did not.
  (with-open-file (in file :direction :input)
    (loop for read = (read-line in nil nil)
          while read
          collect read into lines
          sum (length read) into len
          ;; COUNT, rather than a FOR clause after WHILE, so that the
          ;; line count is zero for an empty file
          count t into i
          finally (return
                    (let ((huge (make-string (+ len i))))
                      (loop with pos = 0
                            for line in lines
                            for len = (length line)
                            do (setf (subseq huge pos) line
                                     (aref huge (+ pos len)) #\Newline
                                     pos (+ pos len 1))
                            finally (return huge)))))))
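One possible alternative, offered only as a sketch, is to read the file in fixed-size chunks of characters and accumulate them in a string stream. The name snarf-file/buffered and the buffer-size keyword are my assumptions here; the function still never asks how long the file is, so it has the same resistance to encodings, and it does not add a trailing newline.

```lisp
;; Hypothetical alternative to SNARF-FILE: read characters a chunk at
;; a time and let a string output stream do the accumulation.
(defun snarf-file/buffered (file &key (buffer-size 4096))
  (with-open-file (in file :direction :input)
    (with-output-to-string (out)
      (loop with buffer = (make-string buffer-size)
            ;; READ-SEQUENCE returns the index of the first element of
            ;; BUFFER it did not fill, i.e. the number of characters read.
            for end = (read-sequence buffer in)
            do (write-string buffer out :end end)
            ;; A short read means end of file has been reached.
            while (= end buffer-size)))))
```

Whether this is actually faster than the line-based version will depend on the implementation.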
Of course this all seems like some hopeless obscurity that only people in countries with weird character sets need to worry about: 7-bit ASCII was good enough for my grandfather and it's good enough for me.
Well, no. This matters even in the US. If you ever use
DOS or Windows machines, you'll have discovered that the
convention there is to use a two character sequence - CRLF - to
end lines. But Common Lisp (and Unix) conventionally uses a
single character - LF or #\Newline. Decent CL
implementations therefore need to translate DOS files as they
are read into strings: replacing CRLF sequences by LF. This
means that, on DOS, the octet-length of a file is longer than
the character length by the number of lines in the file (-1 if
the last line does not end with CRLF). And DOS is a Unix-like
system - its files are just octet sequences, so
file-length will almost certainly return the octet
length.
A remarkable number of programs developed under Unix or Linux simply get this wrong, resulting in obscure and inexcusable bugs.
The CL property-list accessors get and
getf don't provide an obvious way of telling whether
a property is present or not. For code like this
(defvar *plist* '(:foo 1 :bar nil))
(getf *plist* ':bar)
you will get nil whether a property is present or
not. get provides an optional default value, but
this doesn't obviously help, because anything can be in a
property list. get-properties does provide a way of
telling whether something is present or not, but it is an
over-elaborate solution.
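For comparison, here is roughly what the get-properties approach looks like. This is a sketch; property-present-p is a hypothetical name.

```lisp
;; Hypothetical helper: true if INDICATOR is present in PLIST, even
;; when its value is NIL.
(defun property-present-p (plist indicator)
  ;; GET-PROPERTIES returns three values: the indicator found, its
  ;; value, and the tail of PLIST starting at that indicator.  The
  ;; tail is NIL only when nothing was found.
  (multiple-value-bind (found value tail)
      (get-properties plist (list indicator))
    (declare (ignore found value))
    (not (null tail))))
```

With the plist above, (property-present-p *plist* :bar) is true even though the value of :bar is nil, while (property-present-p *plist* :baz) is false. It works, but you have to cons a list of indicators and pick through three values to answer a yes/no question.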
Fortunately this is not a hard problem to solve. What needs to be done is to provide a default value which absolutely can not be in the property list - for instance a freshly-allocated cons. The following does this:
(defun getf/default (place indicator &optional (default nil))
  ;; first value is what was found, second is t if it was present.
  (let* ((magic (cons nil nil))
         (r (getf place indicator magic)))
    (if (eq r magic)
        (values default nil)
        (values r t))))
;;; SETF method left as an exercise
This version allocates a cons for each call. This can quite easily be avoided though:
(defun getf/default (place indicator &optional (default nil))
  ;; first value is what was found, second is t if it was present.
  ;; The magic cons is allocated once, at load time, not on each call.
  (let* ((magic (load-time-value (cons nil nil)))
         (r (getf place indicator magic)))
    (if (eq r magic)
        (values default nil)
        (values r t))))
Similar versions of get can be defined.
This technique works in any case where you need to distinguish `not present' from `present with default value' in any kind of accessor that provides a default value option.
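As an illustration, a version for get along the same lines might look like this. It is a sketch, and get/default is a hypothetical name following the pattern of getf/default.

```lisp
;; Hypothetical GET analogue of GETF/DEFAULT: the first value is what
;; was found (or DEFAULT), the second is T if the property was present.
(defun get/default (symbol indicator &optional (default nil))
  (let* ((magic (load-time-value (cons nil nil)))
         (r (get symbol indicator magic)))
    ;; MAGIC can only come back if the property was absent, because
    ;; nothing else can be EQ to this cons.
    (if (eq r magic)
        (values default nil)
        (values r t))))
```

After (setf (get 'thing :bar) nil), the call (get/default 'thing :bar 3) returns nil and t, while (get/default 'thing :missing 3) returns 3 and nil.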
read-delimited-list: what it will and won't do, and changing the reader syntax in CL
A fairly common trick when defining `special' syntaxes, while remaining in the Lisp style, is to define a pair of delimiter characters which behave like ( and ). This is generally done with code that looks something like this:
(defvar *my-readtable* (copy-readtable))
(set-syntax-from-char #\] #\) *my-readtable* *readtable*)
;; PROCESS-BRACKETED-FORM stands for whatever you want to do with the list
(set-macro-character
 #\[
 #'(lambda (stream char)
     (declare (ignore char))
     (process-bracketed-form (read-delimited-list #\] stream t)))
 nil
 *my-readtable*)
This doesn't quite do what most people expect.
read-delimited-list reads a list, and not a
general form. In particular, with the above syntax, the
string "[foo . bar]" is not legal read syntax,
because it contains a single dot.
As far as I know there is simply no easy way to get at the
general `form reader' which will handle consing dots.
Fortunately this isn't very often needed, but occasionally it is
a pain. It would be nice if there was a
read-delimited-form function which would handle
consing dots.
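The limitation is easy to see with a small experiment. This is a sketch: read-bracketed is a hypothetical name, it works on a private copy of the standard readtable, and it simply returns the list rather than processing it.

```lisp
;; Hypothetical helper: read one form from STRING with [ and ] set up
;; as list delimiters in the way described above.
(defun read-bracketed (string)
  (let ((*readtable* (copy-readtable nil)))   ; copy of the standard readtable
    (set-syntax-from-char #\] #\))
    (set-macro-character
     #\[
     #'(lambda (stream char)
         (declare (ignore char))
         (read-delimited-list #\] stream t)))
    (read-from-string string)))
```

(read-bracketed "[foo bar]") returns the list (foo bar), but (read-bracketed "[foo . bar]") signals an error instead of returning a dotted pair. I would expect the error to be a reader-error, but I have only assumed that here, so catch error if you need to be safe.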
Some people get magnificently confused about this. In particular they think that the problem in the preceding section can be overcome by a trick like this:
(set-syntax-from-char #\{ #\()
(set-syntax-from-char #\} #\))
However, this does not work. The readmacro for
( is defined to look for ), not a character
which merely has the same syntax. In effect, the readmacro
for ( somewhere calls
(read-delimited-form stream #\) ... ).
This is quite clear from the specification. The first paragraph
of section 2.4.1 states (in part):
The left-parenthesis initiates reading of a list. read is called recursively to read successive objects until a right parenthesis is found in the input stream. A list of the objects read is returned.
It's hard to be more explicit than that. The definition of
set-syntax-from-char also states:
A macro definition from a character such as " can be copied to another character; the standard definition for " looks for another character that is the same as the character that invoked it. The definition of ( can not be meaningfully copied to {, on the other hand. The result is that lists are of the form {a b c), not {a b c}, because the definition always looks for a closing parenthesis, not a closing brace.
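The behaviour the specification describes can be checked directly. This is a sketch using a private copy of the standard readtable; read-with-brace-syntax is a hypothetical name.

```lisp
;; Hypothetical helper: read one form from STRING after copying the
;; syntax of ( and ) to { and } in a fresh readtable.
(defun read-with-brace-syntax (string)
  (let ((*readtable* (copy-readtable nil)))
    (set-syntax-from-char #\{ #\()
    (set-syntax-from-char #\} #\))
    (read-from-string string)))
```

As the specification says, (read-with-brace-syntax "{a b c)") returns the list (a b c). What happens for "{a b c}" is less pleasant: the } is treated like a stray ), and implementations typically complain about it rather than closing the list.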
One reason why things are like this is that there isn't
enough information in the syntax of a character in CL to say
which character should match. If copying syntaxes like
this worked, there would be no mechanism to prevent braces
matching parens and vice versa. The function that
implements the readmacro is given the character that caused it
to be called, but there is no table of matching characters.
Emacs' syntax tables (used for buffer contents, not for Emacs
Lisp itself) do allow this - you can specify that a character is
a delimiter and what other character it matches - see the Emacs
Lisp function modify-syntax-entry if you want to
see how this is done.
To see why, suppose the syntax of two characters, a and b, has been set like this:
(set-syntax-from-char a #\()
(set-syntax-from-char b #\))
Now consider what happens when the reader sees a in an unquoted context. The reader function is called with two arguments: the stream being read, and the character a. It needs to somehow read a form delimited by b. But how can it know that it should read b and not, say #\)? It can't: there is no table of which pairs match anywhere in the system. Therefore this trick cannot work: there is simply no mechanism in the language which will allow correct matching to be enforced.
So one way of allowing this kind of `nonstandard paren' in CL would be to change the language to support the notion of matching characters, and to allow users to define which pairs matched.
However, a much simpler way is simply to have
read-delimited-form. With the addition of this
function, the current system is adequate. I think that
read-delimited-form is implementable in portable
CL, and I think I have such an implementation, but I'm not sure
it works. If anyone would like to test it, please let me
know.