[Openmcl-devel] [babel-devel] Changes

Sun Apr 12 11:42:55 PDT 2009

On Sun, Apr 12, 2009 at 9:10 AM, Gary Byers <gb at clozure.com> wrote:
>> Suppose (code-char 237) returned NIL instead of #\í. That's allowed by
>> the CL standard, but I'm positive some Babel test should fail because
>> of that.
>
> Assuming that the implementation in question used Unicode (or some
> subset of it) and that CHAR-CODE-LIMIT was > 237, it's hard to see how
> this case (where a character is associated with a code in Unicode) is
> analogous to the case that we're discussing (where Unicode says that no
> character is or ever can be associated with a particular code.)

It's analogous because, in both cases, Babel is expecting CODE-CHAR to
return non-NIL. In both cases, if CODE-CHAR returns NIL, code will
break (e.g. the UTF-8B decoder). And, to be clear, the code breaks not
because of the assumption per se, but because it really needs/wants to
use some of those character codes.
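
To make the dependency concrete, here is a minimal sketch of the escape
mapping that a UTF-8B decoder relies on. The function names are
hypothetical and this is not Babel's actual code; it only assumes that
an undecodable octet in #x80..#xFF is represented by the character
whose code is #xDC00 plus that octet, which is where the range
#xDC80..#xDCFF comes from.

```lisp
;; Hypothetical helpers, not Babel's real implementation.

(defun escape-code (byte)
  "Map an undecodable octet (#x80..#xFF) to its UTF-8B escape code."
  (check-type byte (integer #x80 #xFF))
  (+ #xDC00 byte))

(defun escape-char (byte)
  "Return the character standing in for BYTE, or signal an error if the
implementation has no character with that code."
  (let ((code (escape-code byte)))
    ;; Only call CODE-CHAR on a valid character code, then check for NIL.
    (or (and (< code char-code-limit) (code-char code))
        (error "No character with code #x~X in this implementation." code))))

(defun unescape-char (char)
  "Inverse of ESCAPE-CHAR: return the original octet, or NIL if CHAR is
not an escape character."
  (let ((code (char-code char)))
    (when (<= #xDC80 code #xDCFF)
      (- code #xDC00))))
```

In a Lisp where CODE-CHAR yields characters for these codes,
(unescape-char (escape-char #xED)) should give back 237. Where
CODE-CHAR returns NIL for them, ESCAPE-CHAR signals an error instead,
which is the breakage described above.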

> The spec does quite clearly say that CODE-CHAR is allowed to return
> NIL if no character with the specified code attribute exists or can
> be created.  CCL's implementation of CODE-CHAR returns NIL in many
> (unfortunately not all) cases where the Unicode standard says that
> no character corresponds to its code argument; other implementations
> currently do not return NIL in this case.  There are a variety of
> arguments in favor of and against either behavior, ANSI CL allows
> either behavior, and code can't portably assume either behavior.

Again, you might argue that Babel's expectation is wrong and you might
be right. But that's the current expectation and Babel's test suite
should reflect that. There are a couple of other non-portable
assumptions that Babel makes. E.g. it expects char codes to be Unicode
or a subset thereof.
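
As a rough illustration, a probe along the following lines could tell
whether a given Lisp satisfies the expectation about CODE-CHAR for the
codes in question. The function name is made up for this sketch and
nothing like it is claimed to exist in Babel's test suite.

```lisp
;; Hypothetical probe, not part of Babel.

(defun missing-escape-codes ()
  "Return the list of codes in #xDC80..#xDCFF for which this
implementation has no character object."
  (loop for code from #xDC80 to #xDCFF
        ;; Codes at or above CHAR-CODE-LIMIT are not valid arguments to
        ;; CODE-CHAR, so count them as missing without calling it.
        when (or (>= code char-code-limit)
                 (null (code-char code)))
          collect code))
```

An empty result means every code in the range has a character; a
non-empty result lists the codes a UTF-8B decoder could not use.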

> I believe that it's preferable for CODE-CHAR to return NIL in
> cases where it can reliably and efficiently detect that its argument
> doesn't denote a character, and CCL does this.  Other implementations
> behave differently, and there may be reasons that I can't think of
> for finding that behavior preferable.

The main advantage seems to be the ability to deal with mis-encoded
text non-destructively. (Through UTF-8B, UTF-32, or some other
encoding.) But perhaps that is a bad idea altogether?

> I'm not really sure that I
> understand the point of this email thread and I'm sure that I must
> have missed some context, but some part of it seems to be an attempt
> to convince me (or someone) that CODE-CHAR should never return NIL
> because of some combination of:
>
>  - in other implementations, it never returns NIL
>  - there is some otherwise useful code which fails (or its test suite
>    fails) because it assumes that CODE-CHAR always returns a non-NIL
>    value.

I'm sorry. The lack of context was entirely my fault. I should have
described what was going on when I added openmcl-devel to the Cc list.

Let me try to sum things up. Babel is a charset encoding/decoding
library. One of its main goals is to provide consistent behaviour
across the Lisps it supports, particularly with regard to error
handling. I believe it has largely succeeded in accomplishing that goal;
this problem is the first inconsistency that I know of.

That is why I thought I should present this issue to the
openmcl-devel list. I suppose I was indeed trying to get the CCL
developers to change CCL's behaviour (or accept patches in that
direction) in the hopes of providing consistent behaviour for Babel
users. I guess I'll have to instead add a note to Babel's
documentation saying something like "UTF-8B does not work on Clozure
CL". It's unfortunate, but not that big a deal, really.

> If I understand this much correctly, then I can only say that I didn't
> personally find these arguments persuasive when I was trying to decide
> how CODE-CHAR should behave in CCL a few years ago and don't find them
> persuasive now.

Fair enough. I don't have any more arguments. (Though, I might stress
again that the main problem is not that we assume that CODE-CHAR
always returns non-NIL, it's that we really do want to use some
character codes that CCL forbids.)

> If there were a lot of otherwise useful code out there that made the
> same non-portable assumption and if it was really hard to write
> character-encoding utilities without assuming that all codes between
> 0 and CHAR-CODE-LIMIT denote characters, then I'd be less dismissive
> of this than I'm being.  As it is, I'm sorry that I can't say anything
> more constructive than "I hope that you or someone will have the opportunity
> to change your code to remove non-portable assumptions
> that make it less useful with CCL than it would otherwise be."

Again, I'm curious how UTF-8B might be implemented when CODE-CHAR
returns NIL for #xDC80 through #xDCFF.

-- 
Luís Oliveira
http://student.dei.uc.pt/~lmoliv/