[Openmcl-devel] *default-character-encoding* should be :utf-8

Gary Byers gb at clozure.com
Mon Sep 24 01:49:01 UTC 2012

Whatever the system blindly guesses a file's encoding to be, it'll 
guess wrong some percentage of the time, so there's no correct answer here.
I'm not sure that utf-8 is as pervasive as you seem to think it is, but I'm
willing to believe that there are more non-ASCII files out there that're
encoded in utf-8 than there are non-ASCII files encoded in ISO-8859-1.  (Probably.)

If you guess that a file's encoded in iso-8859-1 and you guess wrong, the result
will contain the wrong characters.

If you guess that a file's encoded in utf-8 and you guess wrong, the result
will contain the wrong characters.

If you naively copy a file a character at a time via a loop like:

(with-open-file (in ... :direction :input :external-format :default)
  (with-open-file (out ... :direction :output
                           :external-format (stream-external-format in))
    (do* ((ch (read-char in nil nil) (read-char in nil nil)))
         ((null ch))
      (write-char ch out))))

and you guess wrong about the files' encoding, then:
  1) if you treat the files as being encoded in iso-8859-1, the output file
     will be a verbatim copy of the input file (regardless of how the input
     file is actually encoded.)
  2) if you treat the files as being encoded in utf-8 (and guessed wrong),
     then it's likely that the input file contains octet sequences that aren't valid utf-8;
     READ-CHAR has to either error or return a #\replacement_character and
     in either case it won't be possible to create a verbatim copy of the
     input file this way.
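
The difference between the two cases can be seen without touching the file
system. What follows is only a minimal sketch in Common Lisp, assuming an
implementation whose first 256 character codes match iso-8859-1 (true of CCL,
but not guaranteed by the standard). The function names are invented for
illustration and are not CCL APIs, and the utf-8 check is deliberately
simplified (it doesn't reject overlong forms or surrogates).

```lisp
;; Hypothetical helpers for illustration; none of these are part of CCL.

(defun latin-1-decode (octets)
  "Decode OCTETS as iso-8859-1: each octet becomes the character with that code."
  (map 'string #'code-char octets))

(defun latin-1-encode (string)
  "Encode STRING as iso-8859-1.  Assumes every char-code is below 256."
  (map 'vector #'char-code string))

(defun utf-8-valid-p (octets)
  "True if OCTETS is structurally valid utf-8: each lead octet is followed
by the right number of continuation octets.  Simplified on purpose."
  (let ((i 0)
        (n (length octets)))
    (loop
      (when (>= i n) (return t))
      (let* ((b (aref octets i))
             ;; Number of continuation octets this lead octet requires.
             (extra (cond ((< b #x80) 0)
                          ((<= #xC2 b #xDF) 1)
                          ((<= #xE0 b #xEF) 2)
                          ((<= #xF0 b #xF4) 3)
                          (t (return nil)))))   ; not a legal lead octet
        ;; Sequence runs past the end of the input.
        (when (> (+ i extra) (1- n)) (return nil))
        (loop for j from 1 to extra
              unless (<= #x80 (aref octets (+ i j)) #xBF)
                do (return-from utf-8-valid-p nil))
        (incf i (1+ extra))))))

;; "cafe" with e-acute, as stored by an iso-8859-1 editor ...
(defparameter *latin-1-octets* (vector 99 97 102 #xE9))
;; ... and the same text as stored by a utf-8 editor.
(defparameter *utf-8-octets* (vector 99 97 102 #xC3 #xA9))
```

Under those assumptions, (latin-1-encode (latin-1-decode x)) is EQUALP to x
for either vector, even though decoding *utf-8-octets* that way yields the
wrong characters. (utf-8-valid-p *latin-1-octets*) is false, so a utf-8
reader has nothing faithful to return for the final octet.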

Being able to copy files this way isn't in and of itself a compelling argument
for guessing iso-8859-1, but it does illustrate one attractive property
of iso-8859-1 (namely, that less information is lost as a consequence of guessing
wrong; the same is true (IIRC) of some other encodings but is not true of utf-8.)
That may still not be compelling, but in my mind it's starting to lean that way.

In my mind, the tradeoff is:

  - if you guess utf-8, you probably have a greater chance of being right than
    if you guess some legacy encoding.
  - if you guess iso-8859-1, you're probably rarely going to be right (exactly)
    but the consequences of guessing wrong (information loss) are less severe.

Weighing those tradeoffs, I think that:

  - iso-8859-1 is still a better choice, simply because it's a less risky guess.
  - people who understand the benefits of using :utf-8 would almost certainly
    want to override that default in their init files (though it's possible
    to understand the benefits of using :utf-8 and still have to deal with
    other encodings.)
  - alternatives to guessing about how an existing file is encoded are likely
    going to be preferable to guessing.  The alternative that Robert Goldman
    proposed - parsing the file-attributes line - is attractive because that
    piece of metadata is stored explicitly in each file (and not stored in
    extended file attributes, resource forks, or other secret places that
    file transfer and archival utilities generally ignore).
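
To make the file-attributes idea concrete, here is a rough sketch of pulling a
coding attribute out of a line of text. It is not how CCL (or Emacs) does it:
ATTRIBUTE-LINE-CODING is a made-up name, the attribute is assumed to be spelled
"coding", and the value is returned as a keyword without checking that it names
an encoding the implementation knows about.

```lisp
;; Hypothetical helper for illustration; not a CCL function.
(defun attribute-line-coding (line)
  "Return the value of the coding attribute in LINE as a keyword, or NIL
if LINE has no -*- ... -*- section or no coding attribute."
  (let* ((start (search "-*-" line))
         (end (and start (search "-*-" line :start2 (+ start 3)))))
    (when end
      (let ((pos (+ start 3)))
        (loop
          ;; Attributes look like NAME: VALUE and are separated by semicolons.
          (let* ((semi (or (position #\; line :start pos :end end) end))
                 (colon (position #\: line :start pos :end semi)))
            (when (and colon
                       (string-equal "coding"
                                     (string-trim " " (subseq line pos colon))))
              (return (intern (string-upcase
                               (string-trim " " (subseq line (1+ colon) semi)))
                              :keyword)))
            ;; No more attributes before the closing -*-.
            (when (>= semi end) (return nil))
            (setf pos (1+ semi))))))))
```

Given ";;; -*- Mode: lisp; Coding: utf-8 -*-" this returns :UTF-8, and it
returns NIL for a line with no attributes section. A caller such as LOAD would
presumably consult it only when no :EXTERNAL-FORMAT was supplied, and fall back
to the default otherwise.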

The arguments that I've heard in favor of utf-8 seem to center on the fact
that it's capable of encoding any Unicode character in a relatively compact
way.  That's entirely true and is a strong argument for using utf-8, but it's
not clear that that ("a strong argument for using utf-8") is the same thing
as "a strong argument for guessing that a file whose encoding is unknown is
encoded in utf-8".  I think that those are separate things.

On Sun, 23 Sep 2012, Ron Garret wrote:

> More generally, if there were a universal way of designating the encoding of a unicode text file (not just a unicode .lisp file) that would make the world a Better Place too.  More^2 generally, if there were a universal way of encoding general metadata about a file (e.g. "This is a jpg image encoded in base64" or "This is an x86 executable for OSX version 10.5 and 10.6") that would make the world a Much Better Place.  Alas, such projects are historically fraught with peril.  The poster child for this is DER format, which tries to be a universal container for all things PKI.  DER is a colossal, horrible mess, which has real consequences besides making the writing of DER parsers an incredibly expensive and painful undertaking.  DER is actually a very serious security risk.  I refer anyone wishing to dive into this rabbit hole to this paper:
> http://www.cosic.esat.kuleuven.be/publications/article-1432.pdf
> Gary, why are you so resistant to adopting UTF8?  I really don't get it.
> rg
> On Sep 23, 2012, at 3:53 PM, Gary Byers wrote:
>> One good suggestion that Robert Goldman made (and that everyone - including me -
>> ignored) in the discussion last spring is to have LOAD and COMPILE-FILE (at least)
>> honor a coding: attribute [*] in the file attributes line (aka the modeline).  E.g.:
>> ;;; -*- Mode: lisp; Coding: utf-8 -*-
>> at the top of a .lisp source file makes it pretty clear that the file's author
>> intends for the file to be processed in utf-8 and makes that fact obvious to
>> a human reader as well.
>> Emacs (generally) supports this; other environments (the Cocoa IDE)
>> could be made to if they don't already, and LOAD and COMPILE-FILE
>> could do so in CCL (and may already do so in other implementations) at
>> least when their :EXTERNAL-FORMAT argument isn't explicitly specified.
>> (OPEN could also do so, but might not find an attribute line as often.)
>> Things like *DEFAULT-FILE-CHARACTER-ENCODING* would still have to exist
>> and we could continue to argue about what value it should take, but following
>> Robert's suggestion would mean that that wouldn't matter as often.
>> ---
>> [*] IIRC.  The point here is to use whatever attribute name Emacs uses.
>> On Sun, 23 Sep 2012, Ron Garret wrote:
>>> As the instigator of this thread, I think it's worth recapping the original argument, which had nothing to do with moral failings and everything to do with real-world considerations.
>>> The sad fact of the matter is that the Internet is lousy with texts that cannot be encoded in Latin-1.  Some people (notably those whose native language is not English) even write code that contains characters that cannot be encoded in latin-1 (the nerve!)  There are three -- and only three -- ways to deal with this situation:
>>> 1.  Use Latin-1 exclusively, and lock yourself out of being able to deal with code and texts that contain non-European glyphs.
>>> 2.  Use Latin-1 and some other encoding(s), and deal with the confusion that inevitably results.
>>> 3.  Use an encoding that covers all (or at least most) of the unicode code point space.
>>> I advocate #3 in general, and UTF-8 in particular, because I don't like provincialism and I don't like unnecessary complication.  But this is a value judgement, and reasonable people can disagree.  UTF8 is no panacea.  There are drawbacks, most notably that ELT is no longer O(1), and the length of a string is not a linear function of its size in memory.
>>> There is one aspect of Latin-1 that I find particularly annoying in the context of choosing an encoding for Lisp code, and that is that the encoding of lower-case lambda (λ) is incompatibly different between Latin-1 and UTF-8.  Since it is no longer 1978, I sometimes like to spell lambda as "λ".  Because I use the λ character, and because I don't want to close the door on non-European texts, and because I don't like unnecessary complication, I choose to use UTF8 exclusively, and I think the world would be a better place if everyone did likewise.
>>> Again, reasonable people can disagree, and clearly the fate of civilization does not hinge on this decision.  But if CCL is going to revert to latin-1 I would hope it would not be because the argument for UTF8 had been misunderstood.
>>> rg
>>> On Sep 23, 2012, at 12:49 PM, Gary Byers wrote:
>>>> changed (experimentally) in the trunk in r15236, largely as an attempt to silence
>>>> an apparently endless discussion started in:
>>>> <http://clozure.com/pipermail/openmcl-devel/2012-March/013401.html>
>>>> Both of those variables have historically been initialized to NIL (which
>>>> is equivalent to :ISO-8859-1.)
>>>> A careful reading of that thread will reveal that if you have files that
>>>> aren't encoded in :UTF-8 that's because of sloth, avarice, or some other
>>>> personal failing on your part (and would have nothing to do with real-world
>>>> issues.)
>>>> That change was intentionally not incorporated into 1.8; all other things
>>>> being equal, I think that I'd prefer that 1.9 revert back to using NIL/:ISO-8859-1
>>>> (but that might cause that discussion to start up again.)
>>>> In the bigger picture: as I understand how things currently stand, the trunk
>>>> contains workarounds for a couple of Mountain Lion issues (the mechanism used
>>>> by things like GUI:EXECUTE-IN-GUI and the problems with #_mach_port_allocate_name)
>>>> that have not yet been propagated to 1.8.  Assuming that the fixes/workarounds
>>>> are correct and have been smoke-tested to at least some degree, the changes
>>>> do need to be incorporated into 1.8 ASAP, and once they are I would think that
>>>> you'd want to base your application on 1.8.
>>>> I'm a little nervous about the hashing scheme that's being used to avoid
>>>> #_mach_port_allocate_name (I don't know how well it scales and don't know
>>>> whether or not there are non-obvious race conditions or other thread-safety
>>>> issues in the code) and I'd ordinarily be a little reluctant to push something
>>>> like that to the release: the consequences of using #_mach_port_allocate_name
>>>> are so horrible on Mountain Lion that the hashing scheme is clearly better (even
>>>> if it contains its own obscure/subtle problems); on OS releases where
>>>> #_mach_port_allocate_name still works, then propagating the change simply risks
>>>> introducing some obscure/subtle problems.
>>>> I'll try to decide what to do soon, but I don't think that it'd be a good idea
>>>> for you (or anyone) to base a shipping application on the CCL trunk, simply
>>>> because the trunk's volatility would make it harder for you or us to maintain.
>>>> On Sat, 22 Sep 2012, Alexander Repenning wrote:
>>>>> tracked down some bugs that are due to the new :utf-8 encoding. When actually did that happen? Anyway, is there a way to set the encoding back (:ascii)? The only *default-character-encoding* variable I can find is part of quicklisp/babel/encodings.
>>>>> Alex
>>>>> _______________________________________________
>>>>> Openmcl-devel mailing list
>>>>> Openmcl-devel at clozure.com
>>>>> http://clozure.com/mailman/listinfo/openmcl-devel
