[Openmcl-devel] *default-character-encoding* should be :utf-8

Tom Emerson tree at dreamersrealm.net
Tue Sep 25 14:53:05 UTC 2012

On Mon, Sep 24, 2012 at 2:38 PM, Ron Garret <ron at flownet.com> wrote:
> FILE is surprisingly good at figuring this out even on the basis of very little information:

UTF-8 is very easy to sniff if you have the text: this is by design
and the primary reason the Unicode Consortium recommends against
putting the BOM in a UTF-8 encoded file -- it's entirely redundant.
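
To illustrate the sniffing point, here is a minimal sketch in Python (not CCL; the function name looks_like_utf8 is made up for this example). It assumes that "sniffing" just means checking whether the octets form a well-formed UTF-8 sequence. Passing the check does not prove the author meant UTF-8 (pure ASCII always passes), but non-ASCII Latin-1 text almost never passes by accident.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 octet sequence."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# The same kind of text, encoded two different ways.
utf8_sample = "naïve λ".encode("utf-8")
latin1_sample = "naïve".encode("latin-1")  # contains the lone octet 0xEF

print(looks_like_utf8(utf8_sample))    # True
print(looks_like_utf8(latin1_sample))  # False: 0xEF starts a multi-octet
                                       # sequence, but 'v' is no continuation
```

No BOM is needed for either answer, which is the redundancy mentioned above.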

> [...] This is not so much a technical decision as it is a political one.  (Politics, like it or not, is a real-world consideration.)

So what could be done is to change the default to :UTF-8 and add a
note to the manual calling this out and giving the recipe for
getting the old behavior back in your local installation. I can
certainly understand the political side of this.

> There are essentially three arguments against UTF-8:
> 1.  Some common operations (elt, length) are less efficient than in other encodings

Which is a non-issue since strings in CCL are UTF-32... the encoding
of the transfer mechanism is irrelevant unless you are working with
individual bytes, at which point elt/length are not useful for
character selection/counting anyway.
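
A small sketch of the distinction, again in Python rather than CCL. Python strings are used here only as an analogy for strings indexed by character; nothing below says anything about CCL's internals. Indexing and length on the decoded string work per character, while the UTF-8 form only matters once you look at the encoded octets.

```python
s = "λx.x"                   # in-memory string, indexed by character
encoded = s.encode("utf-8")  # what would travel over a file or socket

print(len(s))        # 4 characters
print(len(encoded))  # 5 octets: λ takes two octets in UTF-8
print(s[0])          # λ
print(encoded[0])    # 206, i.e. 0xCE, the first octet of λ
```

The cost of the external encoding is paid once, at the boundary, not on every elt or length call.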

> 2.  It breaks some legacy code

Which is a non-trivial issue but which can be addressed with
appropriate documentation.

> 3.  It doesn't have complete coverage of the binary code space, so you can't send arbitrary data through a UTF-8 encoded channel and expect it to emerge unaltered

More to the point, the semantics of the encoded characters will change
when an arbitrary octet sequence is interpreted as UTF-8, while they
will not when it is treated as Latin-1 (but *not* CP1252!)
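
The difference can be shown with a short Python sketch (the helper name decodes is hypothetical, and the CP1252 behaviour shown is that of Python's own cp1252 codec, which leaves a few octets such as 0x81 unmapped; other implementations may differ).

```python
def decodes(data: bytes, encoding: str) -> bool:
    """Return True if data can be decoded strictly in the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


all_octets = bytes(range(256))

# Latin-1 maps every octet to the code point of the same value,
# so arbitrary data survives a decode/encode round trip.
latin1_round_trip = all_octets.decode("latin-1").encode("latin-1")
print(latin1_round_trip == all_octets)  # True

# Arbitrary octets are not well-formed UTF-8.
print(decodes(all_octets, "utf-8"))     # False

# CP1252 is not Latin-1: 0x81 is unmapped in Python's codec,
# and 0x80 means the euro sign rather than U+0080.
print(decodes(b"\x81", "cp1252"))       # False
print(hex(ord(b"\x80".decode("latin-1"))), hex(ord(b"\x80".decode("cp1252"))))
# 0x80 0x20ac
```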

> I believe that UTF-8 is the right choice, not *despite* these arguments, but *because* of them.

I'm in violent agreement, but (2) needs to be recognized: there is a
lot of legacy Lisp code, as you know.

> Oh, and it turns out that latin-1 cannot encode most Greek letters, including λ.  IMHO, this fact alone means that latin-1 should be in no way endorsed (I would go so far as to say actively shunned) by any self-respecting 21st-century implementation of Lisp.

Agreed, as long as the transition of old code remains possible. I
think that problem can be solved with appropriate documentation:
describing the issues, the reason for the change, and examples of the
kinds of mojibake you can get when the encodings get fubared.
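
As a sketch of the kind of example such documentation could carry (in Python, purely for illustration; the sample strings are my own):

```python
original = "λ"

# Latin-1 simply has no code for λ.
try:
    original.encode("latin-1")
    latin1_can_encode = True
except UnicodeEncodeError:
    latin1_can_encode = False
print(latin1_can_encode)  # False

# UTF-8 octets misread as Latin-1: each octet becomes its own character.
utf8_octets = original.encode("utf-8")    # b'\xce\xbb'
mojibake = utf8_octets.decode("latin-1")
print(mojibake)                           # Î»

# Latin-1 octets misread as UTF-8: the stray 0xE9 is not valid UTF-8,
# so a lenient decoder substitutes U+FFFD (a strict one signals an error).
latin1_octets = "café".encode("latin-1")  # b'caf\xe9'
garbled = latin1_octets.decode("utf-8", errors="replace")
print(garbled)                            # caf followed by U+FFFD
```

The first failure is silent and the second is loud, which is part of why the two directions of breakage look so different in legacy code.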


Tom Emerson
tree at dreamersrealm.net
