[Openmcl-devel] *default-character-encoding* should be :utf-8
Tom Emerson
tree at dreamersrealm.net
Tue Sep 25 07:53:05 PDT 2012
On Mon, Sep 24, 2012 at 2:38 PM, Ron Garret <ron at flownet.com> wrote:
> FILE is surprisingly good at figuring this out even on the basis of very little information:
[...]
UTF-8 is very easy to sniff if you have the text: this is by design
and the primary reason the Unicode Consortium recommends against
putting the BOM in a UTF-8 encoded file -- it's entirely redundant.
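
A rough illustration of the sniffing idea, sketched in Python rather
than Lisp. The helper name looks_like_utf8 is made up for this
example, and this is not how FILE does it: the sketch only checks
whether the octets form a well-formed UTF-8 sequence. Pure ASCII
passes too, since ASCII is a subset of UTF-8.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the octets are a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return False
    return True


# Sample inputs, chosen only for illustration.
utf8_sample = "λ-calculus".encode("utf-8")
latin1_sample = "café".encode("latin-1")  # ends in a lone 0xE9 octet

print(looks_like_utf8(utf8_sample))
print(looks_like_utf8(latin1_sample))
```

The Latin-1 sample is rejected because 0xE9 is a UTF-8 lead byte that
must be followed by continuation bytes, and here nothing follows it.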
> [...] This is not so much a technical decision as it is a political one. (Politics, like it or not, is a real-world consideration.)
So what could be done is to change the default to :UTF-8 and add a
section to the manual that calls this out and gives the recipe for
getting the old behavior back in your local installation. I can
certainly understand the political side of this.
> There are essentially three arguments against UTF-8:
>
> 1. Some common operations (elt, length) are less efficient than in other encodings
Which is a non-issue since strings in CCL are UTF-32... the encoding
of the transfer mechanism is irrelevant unless you are working with
individual bytes, at which point elt/length are not useful for
character selection/counting anyway.
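
To make the distinction concrete, here is a small Python sketch (not
CCL code). It assumes Python's str, which is indexed by code point,
is a fair stand-in for the in-memory strings described above. The
character count stays the same whatever encoding is used on the wire;
only the octet count changes.

```python
text = "λx.x"

# Octets produced by two different external encodings of the same string.
utf8_octets = text.encode("utf-8")
utf32_octets = text.encode("utf-32-be")  # big-endian variant, no BOM added

char_count = len(text)           # counts characters, not octets
utf8_count = len(utf8_octets)    # λ takes two octets, the rest one each
utf32_count = len(utf32_octets)  # four octets per character

print(char_count, utf8_count, utf32_count)
```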
> 2. It breaks some legacy code
Which is a non-trivial issue but which can be addressed with
appropriate documentation.
> 3. It doesn't have complete coverage of the binary code space, so you can't send arbitrary data through a UTF-8 encoded channel and expect it to emerge unaltered
More to the point, the semantics of the encoded characters can change
in transcoding when an arbitrary octet sequence is interpreted as
UTF-8, while they will not when it is treated as Latin-1 (but *not*
CP1252!)
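
A quick way to see the difference, again sketched in Python and
relying on Python's codec tables rather than CCL's (the helper name
decodes_cleanly is invented for the example). Latin-1 maps every octet
to the code point with the same number, so any octet sequence survives
a round trip. UTF-8 rejects malformed sequences, and CP1252 both
leaves some octets undefined and reassigns the 0x80-0x9F range.

```python
def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Return True if every octet sequence in data is valid in encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


all_octets = bytes(range(256))

# Latin-1: decode then encode gives back exactly the same octets.
latin1_round_trip = all_octets.decode("latin-1").encode("latin-1")

# Octet 0x80 means different things in the two single-byte encodings.
as_latin1 = b"\x80".decode("latin-1")  # U+0080, a C1 control character
as_cp1252 = b"\x80".decode("cp1252")   # U+20AC, the euro sign

print(latin1_round_trip == all_octets)
print(decodes_cleanly(all_octets, "utf-8"))
print(decodes_cleanly(all_octets, "cp1252"))
```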
> I believe that UTF-8 is the right choice, not *despite* these arguments, but *because* of them.
I'm in violent agreement, but (2) needs to be recognized: there is a
lot of legacy Lisp code, as you know.
> Oh, and it turns out that latin-1 cannot encode most Greek letters, including λ. IMHO, this fact alone means that latin-1 should be in no way endorsed (I would go so far as to say actively shunned) by any self-respecting 21st-century implementation of Lisp.
Agreed, as long as the transition of old code remains possible. I
think that problem can be solved with appropriate documentation:
describing the issues, the reason for the change, and examples of the
kinds of mojibake you can get when the encodings get fubared.
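
One such example, sketched in Python for illustration only: a λ
written out as UTF-8 and then read back as Latin-1 turns into two
unrelated characters, and Latin-1 cannot encode λ in the first place.

```python
original = "λ"

# Write as UTF-8, then misread the same octets as Latin-1.
utf8_octets = original.encode("utf-8")
mojibake = utf8_octets.decode("latin-1")

# Going the other way fails outright: Latin-1 has no code for λ.
try:
    original.encode("latin-1")
    latin1_can_encode_lambda = True
except UnicodeEncodeError:
    latin1_can_encode_lambda = False

print(utf8_octets)
print(mojibake)
print(latin1_can_encode_lambda)
```

The two octets 0xCE 0xBB come back as the Latin-1 characters for those
values, which is the classic doubled-character pattern to watch for.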
-tree
--
Tom Emerson
tree at dreamersrealm.net
http://www.dreamersrealm.net/tree