[Openmcl-devel] Error on macro character after sharpsign colon
gb at clozure.com
Wed Feb 3 15:13:19 UTC 2010
This is very long and likely overly cruel to dead horses, but there may be
a simple solution at the bottom that'll let this thread die.
On Mon, 1 Feb 2010, Ron Garret wrote:
> On Feb 1, 2010, at 10:58 AM, Gary Byers wrote:
>> I don't know how often (if ever) it comes up in practice, but I can
>> easily imagine someone expecting
>> (read-from-string "#:1234")
>> to return an uninterned symbol assuming "standard syntax" and regardless
>> of the value of *READ-BASE*, and that anyone doing that would have
>> every right to scream bloody murder if it didn't work that way in a given
>> implementation. (I don't know of any implementation in which this
>> wouldn't behave that way; what we obtain by interpreting the sequence of
>> characters that follow the #\: as "something with the syntax of a symbol"
>> isn't the same as what calling READ would return (READ's behavior would
>> be sensitive to the value of *READ-BASE*, at the very least.)
> I think I got lost in all the double-negatives. In CCL the result of reading "#:1234" does depend on *READ-BASE*. But surely you knew that.
I'm mistaken on two counts here: In CCL, whether "#:1234" is accepted
or not does depend on *READ-BASE*, and it should. A token has "the
syntax of a symbol" if it doesn't have the syntax of a number as defined
in section 2.3.1.
I have no idea what I was thinking here, but whatever it was it was clearly
wrong. I blame it on unnaturally low caffeine levels.
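To make the corrected claim concrete, here's a sketch (assuming standard syntax; FACE is just an arbitrary token that is a symbol in base 10 and a number in base 16, and the exact error behavior is implementation-dependent):

```lisp
;; Whether the token after #: "has the syntax of a symbol" depends on
;; *READ-BASE*, since that variable determines which tokens have the
;; syntax of a number.
(let ((*read-base* 10.))
  (read-from-string "#:face"))  ; FACE is a symbol token; returns an uninterned #:FACE
(let ((*read-base* 16.))
  (read-from-string "#:face"))  ; FACE now has number syntax; may signal an error
```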
>> Again assuming standard syntax and that *READ-EVAL* is true, it seems
>> obvious that:
>> (read-from-string "#.(intern \"ABC\")")
>> will return a symbol. It seems equally obvious that the string "#.(intern \"ABC\")" doesn't have the syntax of a symbol; I hope
>> that I'm correct in characterizing that as obvious.
> It's not obvious to me. I think it's perfectly defensible to
> interpret the phrase "the syntax of a symbol" to mean "those strings
> which return symbols when passed as the first argument to
> READ-FROM-STRING". Mind you I'm not *advocating* this
> interpretation, I'm just saying it's defensible.
Phrases of the form "the syntax of ..." are used elsewhere in the spec
(such as in the definitions of #b, #o, and #x.) In at least these
cases, the phrase is used to describe the required syntax of a token;
this is also true of the use of the phrase in section 2.3.1. I
don't know of any use of "the syntax of ..." that clearly refers to
anything but a token. The description of #: (section 2.4.8.5) doesn't
explicitly state that the <symbol-name> must be a token, but it's very
hard to see how the qualification ("the syntax of a symbol with no
package prefix") could apply to anything but a token. If someone
wanted to defend an
interpretation that involved something other than a token following #:,
I think that they should probably plead insanity; trying to shift the
blame to poor wording in the spec would likely fail, because there's
just too much persuasive evidence to the contrary.
> The phrase "the syntax of a symbol" is inherently ambiguous in a
> language where the syntax can be changed by the user. And not just
> by munging the readtable. *READ-BASE* also affects which character
> sequences are and are not symbols. Does "123" have the syntax of a
> symbol? How about "CAFEBABE"?
I think that chapter 2 ("Syntax") of the spec is pretty rigorous (if
somewhat complicated) in defining what sequences of characters (with
syntax types defined by the current readtable) are tokens and which
tokens (influenced by *READ-BASE*) are symbols.
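Those two questions can be answered concretely with an otherwise-standard readtable (a sketch; 3405691582 is just #xCAFEBABE):

```lisp
;; In the default base, "123" has the syntax of a number, not a symbol;
;; "CAFEBABE" flips between symbol and number as *READ-BASE* changes:
(let ((*read-base* 10.))
  (read-from-string "CAFEBABE"))  ; => the symbol CAFEBABE
(let ((*read-base* 16.))
  (read-from-string "CAFEBABE"))  ; => the integer 3405691582
```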
In most programming languages, "syntax" concerns itself with issues
like "which IF clause an ELSE is matched with when braces don't make
that clear"; in comparison and in that use of the term, Lisp syntax is
trivial (everything's an S-expression, many things are expressed as
function calls); what Chapter 2 of the spec calls "syntax" might be
called "lexical structure" or "lexical grammar" in other languages.
I don't agree that there's ambiguity as such, but there's certainly
context-sensitivity. In order to understand how a sequence of
characters will be parsed into S-expressions (or if it can be ...), we
need to know quite a bit about the context in which that parsing takes
place: the values of *PACKAGE* and *READ-BASE* and other things, the
syntax types and macros defined in *READTABLE* ... The invariants
in all of this - the things that aren't context-sensitive - are the
rules described in sections 2.2 and 2.3 (mostly) of CLHS. I don't
think that those rules are particularly ambiguous (if there's actual
ambiguity there, I think that it's pretty minor and obscure.) If we
do know those contextual things, I think that we can reliably predict
the (correct, defined) behavior of READ. (In practice, we generally
assume that that context info is "standard syntax" or something very,
very close to it; if that assumption's invalid, we're back to needing
to understand how the contextual info differs from standard syntax.)
>> That particular
>> sequence of characters stopped having the syntax of a symbol as soon
>> as the first character was determined to be a macro character and not
>> a simple constituent.
> That is a defensible position. But it is also a defensible position
> that "the syntax of a symbol" in the context of reading uninterned
> symbols should be interpreted with respect to the standard readtable.
If I thought that the previous defensible interpretation had a small
chance of a "not guilty by reason of insanity" acquittal, I think
that I'd advocate that this position be taken out and shot ASAP.
Readtables other than the current one are not defined to implicitly
influence the behavior of the reader algorithm defined in the spec or
the behavior of any standard reader macros. (Neither is the day
of the week, the phase of the moon, or an infinite number of other things.)
>> The spec says that the symbol-name that follows #: must have the
>> syntax of a symbol; that's not the same as saying that it is any
>> sequence which would cause READ to return a symbol. (Fortunately, all
>> existing implementations agree on this and no implementations process
>> those initial macro characters; less fortunately, some implementations
>> will incorporate initial macro characters into the token they collect
>> and other implementations consider the presence of macro characters
>> in that context to be a violation of a "must have the syntax of a symbol" requirement.)
> Right. Some implementations interpret "the syntax of a symbol" to be with respect to the current readtable, and others interpret it with respect to the standard readtable. Both are defensible positions.
The relevant differing behavior between implementations has to do with
cases where the first character after #: is defined as a non-terminating
macro character in the current readtable. Some implementations accept
such an initial non-terminating macro character and incorporate it into
the symbol-name token being constructed; other implementations signal
an error in this case, because an initial non-terminating macro
character doesn't cause
a token to be created, much less a token with the specified syntax.
This description of differing behavior isn't a position that can or
can't be defended; it's an assertion that can be empirically verified.
In other words: if the standard readtable and current readtable
differ, the syntax types of the characters in the standard readtable
aren't relevant to #: (or the reader in general or any standard
character macro function ...); the syntax types of characters in the
current readtable are extremely relevant.
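That assertion can be verified with a few lines (a sketch; the keyword returned by the dummy macro function is arbitrary):

```lisp
;; Make #\a a non-terminating macro character in a copy of the
;; standard readtable, then try to read #:abc with it in effect.
(let ((*readtable* (copy-readtable nil)))
  (set-macro-character #\a
                       (lambda (stream char)
                         (declare (ignore stream char))
                         :a-was-read)
                       t)                ; t = non-terminating
  (read-from-string "#:abc"))
;; Some implementations return an uninterned #:ABC anyway; others
;; signal a reader error because no token starts at the #\a.
```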
If a macro character function is defined on #\a, then #\a is no longer
a constituent character. The spec's clear that #x (and #o and #b)
should be followed by tokens with certain constraints on their lexical
structure, and entirely clear on the fact that a token can't start with
a non-constituent character, so with that macro definition in effect
#xabc has "undefined results". No implementation that I checked noticed
this, and different implementations return different results and differ
in whether or not errors are signaled (depending on the behavior of the
irrelevant macro function and on whether the implementation called it).
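A minimal way to observe that divergence (again a sketch; the dummy macro function is arbitrary):

```lisp
;; With a macro function on #\a, a token can no longer start with
;; that character, so #xabc has undefined results per the spec:
(let ((*readtable* (copy-readtable nil)))
  (set-macro-character #\a
                       (lambda (stream char)
                         (declare (ignore stream char))
                         0)
                       t)
  (read-from-string "#xabc"))
;; Implementations differ: some read the hex number 2748 anyway,
;; some invoke the macro function, some signal an error.
```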
I disagreed with your earlier assertion that the fact that the reader
was extensible made Lisp syntax inherently ambiguous. I think that
the lexical structure of the language as specified is unambiguous, but
deviating (intentionally or otherwise) from that specification
introduces unpredictability and a kind of ambiguity that needn't exist.
>>> Whatever the value of having these character sequences produce
>>> errors might be, I think those are vastly outweighed by the ability
>>> to define reader macros on alphabetic characters without breaking
>>> uninterned symbols.
>> I think that you're greatly underestimating the value of tractable,
>> well-defined behavior.
> Not at all. All I'm saying is that when faced with two defensible
> ways to interpret an ambiguous requirement, one should choose the one
> that produces the more useful results.
It's certainly desirable that people be able to extend the reader to
do ambitious things (parse other programming languages with non-trivial
lexical structure, create parsers for arbitrarily complex domain- or
application-specific languages, etc) and obtain useful results. This
sort of thing can get real complicated real fast, but if the implementation
offers strict, anal-retentive adherence to obscure details - like whether
or not a character can legally start a token - that task might be easier
than it would be if the implementation was intentionally DWIMmy about that
sort of thing. (Or unintentionally sloppy, as the case may be.)
If I understand your specific issue correctly, your macro functions on
alphabetic characters already handle the case where an alphabetic macro
character would start a symbol (and maybe a number), but #: isn't DWIMmy/
sloppy enough and #b/#o/#x are in imminent danger of getting fixed, and
their insistence on proper tokenness prevents these reader macros from
working as they do in standard syntax.
It's generally hard to reimplement these macro functions in portable code,
since CL doesn't offer a portable way to determine the syntax type of a
character in a readtable.
Wouldn't a simple solution be to shadow these functions in your readtable
and let the original/standard versions do the heavy lifting with a standard
readtable in effect?
(defun shadowed-dispatch-macro-function (stream subchar numarg)
  ;; Rebind *READTABLE* to a copy of the standard readtable, then let
  ;; the standard dispatch function for #\# SUBCHAR do the work.
  ;; GET-DISPATCH-MACRO-CHARACTER consults *READTABLE* by default, so
  ;; it finds the standard definition here.
  (let* ((*readtable* *copy-of-standard-readtable*))
    (funcall (get-dispatch-macro-character #\# subchar) stream subchar numarg)))

(dolist (ch '(#\: #\b #\o #\x))
  (set-dispatch-macro-character #\# ch #'shadowed-dispatch-macro-function
                                *your-readtable*))
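With that installed (and assuming *your-readtable* and *copy-of-standard-readtable* are bound as their names suggest), these dispatch macros behave as in standard syntax even when alphabetic macro characters are defined elsewhere in the readtable:

```lisp
;; #\a may be a macro character in *your-readtable*, but #x, #o, #b,
;; and #: now tokenize with the standard readtable in effect:
(let ((*readtable* *your-readtable*))
  (read-from-string "#xcafe"))   ; => 51966
```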
If I understand correctly, this provides a solution for your issue and Terje's,
and doesn't involve debating whether wrong is right or not.