[Openmcl-devel] Error on macro character after sharpsign colon

Mon Feb 1 10:58:17 PST 2010

On Sun, 31 Jan 2010, Ron Garret wrote:

> I hate to flog a dead horse, but I have some new data.
>

> I have an experimental system called symbol-reader-macros.  This
> lets you define reader macros on symbols rather than characters,
> which lets you do things like turn your REPL into a decent emulation
> of a unix shell (among other things).  It works by defining a
> regular reader macro on all alphabetic characters which dispatches
> to a function that calls READ recursively, checks to see if the
> resulting form is a symbol with a symbol reader macro function
> defined on it, and if so calls that function.  It works, and it lets
> you do cool things, but when it's loaded it renders CCL incapable of
> reading any uninterned symbols.

> So my sympathies in this matter have shifted, and I now believe that
> Terje is right and that this is a problem that needs to be fixed.

> FWIW, the counter-argument that #:#foo and #:() will return possibly
> unintuitive results is IMHO a weak one because these character
> sequences are unlikely to appear in actual code, and are certainly
> unlikely to appear there by accident.

So, a case is pathological and irrelevant if it doesn't involve a
corner that someone's painted themselves into, and meaningful if it
does ?

I agree that plausible, real-world cases should have more weight
than ... whatever constitutes the opposite of those cases.

I don't know how often (if ever) it comes up in practice, but I can
easily imagine someone expecting

(read-from-string "#:1234")

to return an uninterned symbol assuming "standard syntax" and regardless
of the value of *READ-BASE*, and that anyone doing that would have
every right to scream bloody murder if it didn't work that way in a given
implementation.  (I don't know of any implementation in which this
wouldn't behave that way; what we obtain by interpreting the sequence of
characters that follow the #\: as "something with the syntax of a symbol"
isn't the same as what calling READ would return: READ's behavior would
be sensitive to the value of *READ-BASE*, at the very least.)
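As a minimal sketch of that expectation (the helper name
READ-UNDER-BASE-16 is made up for illustration, and the result for
"#:1234" is what I'd expect from the implementations discussed here,
not something the spec spells out in so many words):

```lisp
;; Hypothetical helper: read STRING with standard syntax, except that
;; *READ-BASE* is 16 instead of 10.
(defun read-under-base-16 (string)
  (with-standard-io-syntax
    (let ((*read-base* 16))
      (read-from-string string))))

;; READ is sensitive to *READ-BASE*: the token 1234 is a hex integer.
(read-under-base-16 "1234")     ; the integer #x1234, i.e. 4660

;; After #: the same characters are expected to name an uninterned
;; symbol, whatever *READ-BASE* happens to be.
(read-under-base-16 "#:1234")   ; expected: a symbol named "1234", no home package
```

The point of the contrast is only that the characters after #: are not
handed to READ as-is.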

Again assuming standard syntax and that *READ-EVAL* is true, it seems
obvious that:

(read-from-string "#.(intern \"ABC\")")

will return a symbol.  It seems equally obvious that the string 
"#.(intern \"ABC\")" doesn't have the syntax of a symbol; I hope
that I'm correct in characterizing that as obvious.  That particular
sequence of characters stopped having the syntax of a symbol as soon
as the first character was determined to be a macro character and not
a simple constituent.
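A small sketch of the distinction, assuming standard syntax (the helper
name READ-WITH-EVAL is invented for this example; GET-MACRO-CHARACTER is
used only to show how the first character is classified in the standard
readtable):

```lisp
;; Hypothetical helper: read STRING with standard syntax and
;; *READ-EVAL* true.
(defun read-with-eval (string)
  (with-standard-io-syntax
    (let ((*read-eval* t))
      (read-from-string string))))

;; READ returns a symbol here ...
(read-with-eval "#.(intern \"ABC\")")

;; ... but the text does not have the syntax of a symbol: its first
;; character is a macro character in the standard readtable, while an
;; ordinary letter is a plain constituent.
(get-macro-character #\# (copy-readtable nil))  ; a function: # is a macro character
(get-macro-character #\A (copy-readtable nil))  ; NIL: A is not a macro character
```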

The spec says that the symbol-name that follows #: must have the
syntax of a symbol; that's not the same as saying that it is any
sequence which would cause READ to return a symbol.  (Fortunately, all
existing implementations agree on this and no implementations process
those initial macro characters; less fortunately, some implementations
will incorporate initial macro characters into the token they collect
and other implementations consider the presence of macro characters
in that context to be a violation of a "must have the syntax of a symbol"
requirement.)

> Whatever the value of having these character sequences produce
> errors might be, I think those are vastly outweighed by the ability
> to define reader macros on alphabetic characters without breaking
> uninterned symbols.

I think that you're greatly underestimating the value of tractable,
well-defined behavior.

In the US postal system, states are denoted by a particular 2-character
abbreviation: Alaska is denoted by AK, New York by NY, etc.  A hypothetical,
hopefully plausible program that processed these codes might define a
reader macro to make them easier to recognize and validate.

(set-macro-character #\$ (lambda (stream char)
                            (declare (ignore char))
                            ;; Read exactly two characters after the $ and upcase them.
                            (let* ((name (make-string 2)))
                              (setf (schar name 0) (char-upcase (read-char stream))
                                    (schar name 1) (char-upcase (read-char stream)))
                              ;; Look the code up in STATE-NAMES; an unknown code is an error.
                              (or (find-symbol name "STATE-NAMES")
                                  (error ...)))))

So, $AK reads as STATE-NAMES:ALASKA, and both

$AK something-else

and

$AKsomething-else

are equivalent: the state name code is exactly 2 characters long and not delimited
by whitespace; whatever "something else" is, it's incredibly critical in some cases
and can be completely ignored in other cases.
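A runnable sketch of that equivalence, under some assumptions that
aren't in the example above: the STATE-NAMES package here simply exports
symbols named by the 2-character codes (AK, NY), the error message is a
placeholder, and READ-STATE-CODE and READ-ALL-WITH-STATE-SYNTAX are
names made up for illustration.  The macro character is installed in a
copy of the standard readtable so the global one is left alone.

```lisp
;; Assumed layout: one exported symbol per state code.
(defpackage "STATE-NAMES" (:use) (:export "AK" "NY"))

;; Reader macro function: consume exactly two characters after the $.
(defun read-state-code (stream char)
  (declare (ignore char))
  (let ((name (make-string 2)))
    (setf (schar name 0) (char-upcase (read-char stream t nil t))
          (schar name 1) (char-upcase (read-char stream t nil t)))
    (or (find-symbol name "STATE-NAMES")
        (error "Unknown state code: ~a" name))))

;; Read every object in STRING with $ defined as a macro character.
(defun read-all-with-state-syntax (string)
  (let ((*readtable* (copy-readtable nil)))
    (set-macro-character #\$ #'read-state-code)
    (with-input-from-string (in string)
      (loop for form = (read in nil in)   ; the stream doubles as EOF marker
            until (eq form in)
            collect form))))

;; Both inputs yield the state symbol followed by SOMETHING-ELSE.
(read-all-with-state-syntax "$AK something-else")
(read-all-with-state-syntax "$AKsomething-else")
```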

Suppose someone mistakenly decides that it'd be better if $AK read as an uninterned
symbol and that Alaska should be encoded as #:$AK.  Of course it doesn't work
like that, and in implementations that quietly incorporate initial macro characters
into the uninterned symbol returned by #: (rather than signaling an error) the trailing
something-else will be incorporated into that symbol rather than processed
separately as it ordinarily would have been.  Let's assume that this causes all
mail addressed to Alaska to be delivered to Arkansas and that it's difficult
to debug the problem: after all, no error was signaled.
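One way to see which of the two behaviors a given implementation has is
a probe along these lines.  This is only a sketch: PROBE-SHARPSIGN-COLON
is a made-up name, the macro function for $ is a stand-in that just
returns a keyword, and no particular result is claimed for any
particular implementation.

```lisp
;; Read STRING with $ defined as a macro character in a private copy
;; of the standard readtable.  Returns the name of the symbol that was
;; read, or :ERROR if the reader signaled an error.
(defun probe-sharpsign-colon (string)
  (let ((*readtable* (copy-readtable nil)))
    (set-macro-character #\$ (lambda (stream char)
                               (declare (ignore stream char))
                               :state))   ; stand-in macro function
    (handler-case
        (let ((object (read-from-string string)))
          (if (symbolp object) (symbol-name object) object))
      (error () :error))))

;; Either :ERROR, or a symbol name that may have swallowed the
;; trailing text, depending on the implementation.
(probe-sharpsign-colon "#:$AKsomething-else")
```

An implementation that answers with a name containing SOMETHING-ELSE is
the quiet case described above; one that answers :ERROR is the case
where the mistake is caught at read time.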

I don't know how contrived this example is, but I have a lot of difficulty
concluding that there's no value in signaling an error.

Things should certainly behave intuitively and match people's expectations.
Unfortunately, different people have different intuitions and expectations,
and it's more important to satisfy well-founded expectations, and in this
particular case the expectation that #:abc is meaningful (when #\a is a 
macro character) is not well-founded, and I don't think that quietly accepting
invalid syntax (so that person A's hack works by accident) is somehow desirable
because it might be harmless to do so in some cases.

>
> rg
>
>
> On Jan 27, 2010, at 5:18 AM, Gary Byers wrote:
>
>> So, you think that
>>
>> * '#:#abc
>>
>> should quietly return an uninterned symbol whose name begins with a #
>> character, and that
>>
>> * '#:()
>>
>> should be read as an uninterned symbol with a 0-length name, followed
>> by an empty list ?  (FWIW, SBCL has these behaviors; LW accepts the
>> first example and complains that the second involves a missing symbol
>> name after #:, allegro seems to treat both cases as SBCL does, CLISP
>> and CCL (and MCL) complain about both of these examples.  I wouldn't
>> be surprised if other cases or a sampling of other implementations
>> expose other differences.)
>>
>> That's indeed the behavior one would get if #: went directly to step
>> 8 of the reader algorithm via something like your interpretation of
>> 2.4.8.5.
>>
>> When you say that:
>>
>> Now I think that 2.4.8.5 -- when it talks about "must have the syntax of
>> a symbol" -- that this statement can only imply that the thing after #:
>> is interpreted as a token as an a priori decision.
>>
>> do you think that an a priori decision has also been made that the
>> alleged symbol name also contains no package prefix ?
>>
>> "Must" can be interpreted in at least 2 ways:
>>
>> (1) as "it is a requirement that" ("The arguments to the function
>>    + must be numbers, and it's desirable that #'+ verify that to
>>    be true").  CLHS usually tries to use more precise terminology
>>    to describe requirements like this.
>> (2) as "it is a logical consequence of previously established facts or
>>    prior knowledge" ("If the sum of integers X and Y is known to be
>>    odd and X is known to be even, then Y must be odd.")
>>
>> It is certainly not established that the (alleged) symbol name doesn't
>> contain a package prefix; every implementation that I've looked at
>> treats that as something that needs to be verified at runtime. ("must
>> [1]").  I find it odd that some implementations would treat the first
>> part as implying that something's been established (that the alleged
>> symbol name HAS the syntax of a symbol and we should enter a state
>> corresponding to step 8 of the reader algorithm based on some
>> nonexistent a priori knowledge of what characters actually follow #:).
>> I can't see how it's reasonable to simultaneously apply two disjoint
>> interpretations to a single use of the word "must".
>>
>> It follows that I don't think that we have any reason to enter anything
>> other than a state corresponding to step 1 of the reader algorithm: we
>> don't know anything about the syntax types of the characters that we're
>> about to read and certainly haven't received special dispensation to
>> start in step 8; if we exit from a state corresponding to step 10
>> (we have a token) we win and if we would exit in some other state we
>> lose (the characters following #: didn't have the syntax of a symbol);
>> if we won, we can check the additional requirement that the symbol name
>> token not have a package prefix.
>>
>> In order to believe that there's a basis for going directly to step 8
>> (and treating initial non-terminating macro characters as constituents,
>> among other things), I have to parse a single use of "must" as if it
>> simultaneously means two different things.  Every time that I try to
>> do that, I get a bad headache.  If the restriction on package prefixes
>> weren't present, I think that I'd lean pretty far towards interpreting
>> "must have the syntax of a symbol" as meaning "it is a requirement that ..."
>> rather than "there is some unspecified basis for assuming that ..."; the
>> additional package-prefix qualification seems pretty convincing to
>> me (possibly because trying to lean in two directions at the same time
>> gives me a REALLY bad headache.)
>>
>> On Wed, 27 Jan 2010, Tobias C. Rittweiler wrote:
>>
>>> Gary Byers <gb at clozure.com> writes:
>>>
>>>> There are likely other reasons why calling the macro function either
>>>> can't work or wouldn't be a good idea.  I don't think that any
>>>> implementations do that or that there's any reason to think that they
>>>> would.  The #: reader-macro in the implementations whose source I
>>>> looked at do essentially what CCL does: collect a "token" by reading a
>>>> sequence of characters from the current input stream and making an
>>>> uninterned symbol out of the sequence of characters that comprise that
>>>> token.  The "collect-token" process may involve calling some internal
>>>> function that's also called by the reader.
>>>>
>>>> There are at least a couple of approaches to this token-collection
>>>> process:
>>>>
>>>> a) read characters and process escape characters until a delimiter
>>>>    (whitespace, terminating macro, EOF) is encountered.
>>>>
>>>> b) essentially the same, but insist that the first character is
>>>>    a constituent or escape character (and not a non-terminating
>>>>    macro.)
>>>>
>>>> Some implementations follow (a); others (including CCL) follow (b).
>>>> I haven't heard any argument in favor of (a) that doesn't seem to
>>>> be based on a misunderstanding of what's happening here.
>>>
>>> SBCL behaves like a), and I'd defend that as the "better" choice -- of
>>> course, I cannot give an official position statement, it's my personal
>>> opinion:
>>>
>>>
>>>  2.4.8.5 Sharpsign Colon says
>>>
>>> "The symbol-name must have the syntax of a symbol with no package
>>>  prefix."
>>>
>>>
>>> The CLHS talks about "syntax of a symbol" only to differentiate tokens
>>> between numbers, potential numbers, and symbols. See for example:
>>>
>>>
>>> 2.3.1 Numbers as Tokens
>>>
>>> "When a token is read, it is interpreted as a number or symbol. The
>>>  token is interpreted as a number if it satisfies the syntax for
>>>  numbers specified in the next figure."
>>>
>>> or
>>>
>>> 2.2 Reader Algorithm, 2nd §.
>>>
>>> "When dealing with tokens, the reader's basic function is to
>>>  distinguish representations of symbols from those of numbers. When a
>>>  token is accumulated, it is assumed to represent a number if it
>>>  satisfies the syntax for numbers [...]. If it does not represent a
>>>  number, it is then assumed to be a potential number if it satisfies
>>>  the rules governing the syntax for a potential number. If a valid
>>>  token is neither a representation of a number nor a potential number,
>>>  it represents a symbol."
>>>
>>>
>>> Now I think that 2.4.8.5 -- when it talks about "must have the syntax of
>>> a symbol" -- that this statement can only imply that the thing after #:
>>> is interpreted as a token as an a priori decision.
>>>
>>> This means now, that #: must perform step 8 in the Reader Algorithm
>>> (2.2) which is the only place in the standard that specifies how tokens
>>> are actually read.
>>>
>>> Following step 8, a non-terminating macro character is just interpreted
>>> as constituent.
>>>
>>> I.e. taking the example in Terje's original posting, #:!foo must
>>> (following my argumentation) be read as an uninterned symbol with
>>> symbol-name "!FOO" (modulo readtable-case.)
>>>
>>> -T.
>>>
>>>
>>>
>>> _______________________________________________
>>> Openmcl-devel mailing list
>>> Openmcl-devel at clozure.com
>>> http://clozure.com/mailman/listinfo/openmcl-devel