[Openmcl-devel] default-character-encoding should be :utf-8

Tue Sep 25 07:53:05 PDT 2012

On Mon, Sep 24, 2012 at 2:38 PM, Ron Garret <ron at flownet.com> wrote:
> FILE is surprisingly good at figuring this out even on the basis of very little information:
[...]

UTF-8 is very easy to sniff if you have the text. This is by design,
and it is the primary reason the Unicode Consortium recommends against
putting a BOM in a UTF-8 encoded file: it's entirely redundant.
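
As a rough illustration of why sniffing works (sketched in Python
rather than Lisp, and with looks_like_utf8 being a made-up helper
name, not anything FILE or CCL provides): text in a legacy 8-bit
encoding almost never happens to form valid UTF-8 multi-byte
sequences, so a strict decode is already a decent detector.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if `data` is a well-formed UTF-8 octet sequence."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


utf8_bytes = "naïve λ".encode("utf-8")
latin1_bytes = "naïve".encode("latin-1")  # 0xEF followed by ASCII 'v'

print(looks_like_utf8(utf8_bytes))    # True
print(looks_like_utf8(latin1_bytes))  # False
```

Note that pure ASCII passes the check too, which is fine: ASCII is a
subset of UTF-8.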

> [...] This is not so much a technical decision as it is a political one.  (Politics, like it or not, is a real-world consideration.)

So what could be done is to change the default to :UTF-8 and add a
note to the manual calling this out and giving the recipe for
getting the old behavior back in your local installation. I can
certainly understand the political side of this.

> There are essentially three arguments against UTF-8:
>
> 1.  Some common operations (elt, length) are less efficient than in other encodings

Which is a non-issue since strings in CCL are UTF-32... the encoding
of the transfer mechanism is irrelevant unless you are working with
individual bytes, at which point elt/length are not useful for
character selection/counting anyway.
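
The distinction can be seen in any language that keeps decoded
strings separate from their encoded octets. A small Python sketch
(an analogy only; Python's internal string representation is not the
same as CCL's, but it likewise indexes by code point):

```python
text = "λx"                     # decoded string: two characters
encoded = text.encode("utf-8")  # transfer form: three octets

# Length and indexing on the string work on characters,
# regardless of which encoding the text arrived in.
print(len(text))     # 2
print(text[0])       # λ

# Length and indexing on the octets work on bytes.
print(len(encoded))  # 3
print(encoded[0])    # 206 (0xCE, the first octet of λ)
```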

> 2.  It breaks some legacy code

Which is a non-trivial issue but which can be addressed with
appropriate documentation.

> 3.  It doesn't have complete coverage of the binary code space, so you can't send arbitrary data through a UTF-8 encoded channel and expect it to emerge unaltered

More to the point, interpreting an arbitrary octet sequence as UTF-8
can change its meaning when it is transcoded, while treating it as
Latin-1 will not, since every octet maps to exactly one character
(but this is *not* true of CP1252!)
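
A quick way to see the difference, sketched here in Python using its
standard codecs (the variable names and the cp1252_accepts helper
are just for illustration): push all 256 octet values through each
encoding and check what comes out.

```python
all_octets = bytes(range(256))

# Latin-1 maps every octet to exactly one character, so the
# round trip is lossless.
latin1_round_trip = all_octets.decode("latin-1").encode("latin-1")
print(latin1_round_trip == all_octets)  # True

# Lenient UTF-8 decoding replaces ill-formed sequences with U+FFFD,
# so the original octets cannot be recovered.
utf8_round_trip = all_octets.decode("utf-8", errors="replace").encode("utf-8")
print(utf8_round_trip == all_octets)    # False


def cp1252_accepts(data: bytes) -> bool:
    """Return True if Python's cp1252 codec can decode `data`."""
    try:
        data.decode("cp1252")
        return True
    except UnicodeDecodeError:
        return False


# CP1252 leaves a few octets (0x81 among them) undefined in
# Python's codec, and remaps others: 0x80 becomes the euro sign.
print(cp1252_accepts(b"\x81"))           # False
print(b"\x80".decode("cp1252") == "€")   # True
```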

> I believe that UTF-8 is the right choice, not *despite* these arguments, but *because* of them.

I'm in violent agreement, but (2) needs to be recognized: there is a
lot of legacy Lisp code, as you know.

> Oh, and it turns out that latin-1 cannot encode most Greek letters, including λ.  IMHO, this fact alone means that latin-1 should be in no way endorsed (I would go so far as to say actively shunned) by any self-respecting 21st-century implementation of Lisp.

Agreed, as long as the transition of old code remains possible. I
think that problem can be solved with appropriate documentation:
describing the issues, the reason for the change, and examples of the
kinds of mojibake you can get when the encodings get fubared.
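
For instance, something along these lines could go in the docs. This
is a Python sketch of the failure modes only, not CCL output, and the
variable names are arbitrary:

```python
original = "λ"

# UTF-8 octets misread as Latin-1: each octet becomes its own
# character, giving the classic two-character mojibake.
mojibake = original.encode("utf-8").decode("latin-1")
print(mojibake)  # Î»

# The damage is reversible as long as nothing else touched the text.
repaired = mojibake.encode("latin-1").decode("utf-8")
print(repaired == original)  # True

# Latin-1 octets misread as UTF-8: é (0xE9) is an ill-formed
# sequence, so a lenient decoder substitutes U+FFFD.
misread = "é".encode("latin-1").decode("utf-8", errors="replace")
print(misread == "\ufffd")  # True

# And Latin-1 has no way to encode λ in the first place.
try:
    original.encode("latin-1")
    latin1_can_encode_lambda = True
except UnicodeEncodeError:
    latin1_can_encode_lambda = False
print(latin1_can_encode_lambda)  # False
```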

    -tree

-- 
Tom Emerson
tree at dreamersrealm.net
http://www.dreamersrealm.net/tree
