[Openmcl-devel] [babel-devel] Changes

Tue Apr 21 09:01:31 PDT 2009

On Tue, 21 Apr 2009, Luís Oliveira wrote:

> On Sun, Apr 12, 2009 at 9:10 AM, Gary Byers <gb at clozure.com> wrote:
>> If I understand this much correctly, then I can only say that I didn't
>> personally find these arguments persuasive when I was trying to decide
>> how CODE-CHAR should behave in CCL a few years ago and don't find them
>> persuasive now.
>
> It seems the discussion has run out of steam. Just to conclude it, I
> should ask: is it still the case that UTF-8B is not an argument
> compelling enough to make you consider a patch changing CODE-CHAR's
> behaviour, as well as the various encode- and decode-functions? (Such
> a patch would change CODE-CHAR to accept any code point, and deal with
> invalid code points explicitly in the UTF encoders and decoders.)
>

Yes, that is still the case.

Table 2-3 (in Section 2.4) in the Unicode spec describes how various
classes of code points do and do not map to abstract characters in
Unicode, and I think that it's undesirable for CODE-CHAR in a CL
implementation that purports to use Unicode as its internal encoding
to return a character object for codes that that table says do not
denote a Unicode character.  CCL's CODE-CHAR returns NIL for
surrogates and (in recent versions) a couple of permanent noncharacter
codes.  As I've said, I'd feel better about it if CCL's CODE-CHAR
returned NIL for all 66 permanent noncharacter codes, and if it
cost nothing (in terms of time or space), I think that it'd be
desirable for CODE-CHAR to return NIL for codes that're reserved as of
the current version of the Unicode standard (or whatever version the
lisp uses.)  In the latter case, you may be able to get away with
treating reserved codes as if they denoted defined characters - you
wouldn't have the same issues with UTF-encoding them as would exist
for surrogates, for instance - but you can't meaningfully treat a
"reserved character" as if it were a defined character:

? (upper-case-p #\A) => T (in Unicode 5.1 and all prior and future versions)

? (upper-case-p (code-char #xd0000)) => unknown; as of Unicode 5.1, there's no such character

I think that it'd be more consistent to say "AFAIK, there's no such
character" than it would be to claim that there is and that it is or
is not an upper-case character.  Since CODE-CHAR is sometimes on or
near a critical performance path, it's not clear that making it
100% accurate is worth whatever that would cost in terms of time/space.
It's clear to me that catching and rejecting surrogate code points
as non-characters is worth the extra effort.
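
As a rough illustration of the checks discussed above, here is a
portable sketch.  It is not CCL's actual implementation, and the
function names SURROGATE-CODE-P, NONCHARACTER-CODE-P and
STRICT-CODE-CHAR are made up for this example.  It assumes a lisp
whose character codes are Unicode code points:

(defun surrogate-code-p (code)
  "True if CODE is in the surrogate range U+D800..U+DFFF."
  (<= #xD800 code #xDFFF))

(defun noncharacter-code-p (code)
  "True if CODE (assumed to be in 0..#x10FFFF) is one of the 66
permanent noncharacters: U+FDD0..U+FDEF, plus the last two code
points of each of the 17 planes."
  (or (<= #xFDD0 code #xFDEF)
      (>= (logand code #xFFFF) #xFFFE)))

(defun strict-code-char (code)
  "Like CODE-CHAR, but return NIL for surrogates and noncharacters."
  (and (< -1 code (min char-code-limit #x110000))
       (not (surrogate-code-p code))
       (not (noncharacter-code-p code))
       (code-char code)))

? (strict-code-char #x41) => #\A

? (strict-code-char #xD800) => NIL

? (strict-code-char #xFFFE) => NIL

? (loop for code below #x110000 count (noncharacter-code-p code)) => 66

These tests are only a few integer comparisons.  Rejecting reserved
(unassigned) codes as well would presumably need a table derived from
some particular version of the Unicode data, which is where the
time/space cost mentioned above would come from.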

> -- 
> Luís Oliveira
> http://student.dei.uc.pt/~lmoliv/