[Openmcl-devel] [babel-devel] Changes

David Lichteblau david at lichteblau.com
Tue Apr 14 02:17:04 PDT 2009


Quoting Luis Oliveira (luismbo at gmail.com):
> On Fri, Apr 10, 2009 at 4:34 PM, Gary Byers <gb at clozure.com> wrote:
> > I don't find the argument that says "since other implementations
> > don't seem to check validity at all, CCL shouldn't either" too
> > compelling.
> 
> Not very compelling, true, but that's not quite the argument I was
> trying to make since I hadn't realized there were validity issues.
> (Such as trying to encode these code points in UTF-16, as you
> described.)
> 
> But still, FWIW, I checked five other implementations (Lispworks,
> Allegro CL, Python 3, GHC and Factor) and they all allow these
> characters AFAICT.

In Allegro CL and LispWorks, the situation is very different.  They use
UTF-16 to represent Lisp strings in memory, so surrogates are not
forbidden in Lisp strings at all; on the contrary, user code actually
needs to work with surrogates to be able to use all of Unicode.
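
For example, on such an implementation, putting a character outside the
BMP into a string means computing the surrogate pair by hand, roughly
like this (a minimal sketch; the function name is mine):

  ;; Split a code point into UTF-16 code units: a single unit within
  ;; the BMP, a high/low surrogate pair above it.
  (defun code-point-to-utf-16 (code-point)
    (if (< code-point #x10000)
        (list code-point)
        (let ((v (- code-point #x10000)))
          (list (+ #xD800 (ldb (byte 10 10) v))      ; high surrogate
                (+ #xDC00 (ldb (byte 10  0) v))))))  ; low surrogate

  ;; (code-point-to-utf-16 #x1D11E) => (#xD834 #xDD1E)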

SBCL has 21-bit characters like CCL and currently has characters for the
surrogate code points.  But I am not aware of any consensus that this is
the right thing to do.  Personally, I think it's a bug and SBCL should
be changed to do it like CCL.
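
Concretely, the difference under discussion is (as I understand the
current behaviour of both implementations):

  (code-char #xDCF0)
  ;; => a character object on SBCL (surrogate code points are accepted)
  ;; => NIL on CCL (surrogate code points are rejected)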

As far as I understand, the only Lisp with 21-bit characters whose
author thinks that SBCL's behaviour is correct is ECL, but I failed to
understand the reasoning behind that when it was being discussed
on comp.lang.lisp.

(As a side note, I find it a huge hassle to write code that is portable
between the Lisp implementations with Unicode support.  For CXML, I
needed read-time conditionals checking for UTF-16 Lisps.  And it still
doesn't actually work, because most of the other free libraries like
Babel, CL-Unicode, and in turn CL-PPCRE, expect 21-bit characters and
are effectively broken on Allegro and LispWorks.)
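
To illustrate, this is roughly the kind of read-time conditional I mean,
reusing the code-point-to-utf-16 sketch above (it checks only for
Allegro and LispWorks; a real portability layer would need more):

  (defun make-string-from-code-point (code-point)
    "Return a string containing CODE-POINT: one character on 21-bit
  Lisps, a surrogate pair on UTF-16 Lisps."
    #+(or allegro lispworks)
    (coerce (mapcar #'code-char (code-point-to-utf-16 code-point)) 'string)
    #-(or allegro lispworks)
    (string (code-char code-point)))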

[...]
> > I'm skeptical of the claim that all means of implementing UTF-8b
> > depend on (CODE-CHAR #xdcf0) returning non-nil.
> 
> How else could it be implemented?
> 
> [This has already been answered now, of course.]

While I have no ideas to offer regarding UTF-8b, I think it is worth
pointing out that for the important use case of file names, there is a
different way of achieving a round trip for names in "might be UTF-8"
format.

The idea is to interpret invalid UTF-8 bytes as Latin-1, but to prefix
each such character with the code point 0.

On encoding back to a file name, such null characters would be stripped
again.

This works because Unix does not allow zero bytes in file names, and
every Lisp implementation I am aware of has a character with code 0.

Mono does it this way, and has some more explanation:
http://www.go-mono.com/docs/index.aspx?link=T:Mono.Unix.UnixEncoding
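
For illustration, here is a minimal sketch of that round trip (the
function names are mine, and to keep it short it treats every non-ASCII
byte as invalid instead of attempting a real UTF-8 decode first):

  ;; Decode file name BYTES into a Lisp string.  ASCII passes through;
  ;; any other byte is escaped as code point 0 followed by the byte
  ;; interpreted as Latin-1.  A real decoder would take the escape path
  ;; only for bytes that do not form a valid UTF-8 sequence.
  (defun file-name-bytes-to-string (bytes)
    (with-output-to-string (out)
      (loop for byte across bytes
            if (< byte #x80)
              do (write-char (code-char byte) out)
            else
              do (write-char (code-char 0) out)        ; escape marker
                 (write-char (code-char byte) out))))  ; byte as Latin-1

  ;; Encode STRING back to file name bytes: null escape characters are
  ;; stripped, everything else is written as a single byte (again a
  ;; simplification: a real encoder would UTF-8-encode ordinary
  ;; characters).
  (defun string-to-file-name-bytes (string)
    (coerce (loop for char across string
                  unless (zerop (char-code char))
                    collect (ldb (byte 8 0) (char-code char)))
            '(vector (unsigned-byte 8))))

  ;; (string-to-file-name-bytes (file-name-bytes-to-string bytes))
  ;; returns the original byte sequence.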


d.


