[Openmcl-devel] *default-character-encoding* should be :utf-8

Gary Byers gb at clozure.com
Tue Mar 6 04:30:09 UTC 2012

On Mon, 5 Mar 2012, Ron Garret wrote:

> On Mar 5, 2012, at 5:14 PM, Gary Byers wrote:
>> On Mon, 5 Mar 2012, Ron Garret wrote:
>>> On Mar 4, 2012, at 5:53 PM, Gary Byers wrote:
>>>> If your sources are in some legacy encoding - MacRoman is an example
>>>> that still comes up from time to time - then you obviously need to
>>>> process them with that encoding in effect or you'll lose information.
>>> If you're using such legacy sources, your first step should be to
>>> convert them to UTF-8 and then never touch the original again.
>>> (The same goes for latin-1, except that latin-1 is not a legacy
>>> encoding.  It's in common use today, which is the main reason this
>>> is a real problem.)
>> I agree, but the people who have these legacy-encoded sources that really
>> should have been converted to utf-8 long ago have all kinds of flimsy excuses
>> for not wanting to do so.  "It costs time", "it costs money"
> Those really are flimsy excuses.  Converting character encodings on modern processors can be done at a rate of gigabytes per minute.  You could probably convert the entire corpus of all computer source code ever produced by humans for about $100.
>> "it requires expertise", "it breaks backward compatibility"
> Those are slightly less flimsy excuses.  But expertise can be hired or acquired.  Backwards compatibility can be a real concern in certain application domains, but I'd be surprised to learn that CCL is being used in any of them.

Live and learn then.

I wouldn't claim that it happens that often (probably no more than a few times
a year, and maybe less each passing year), but there are bug reports on this
list, in Trac, and sent to me personally from people who have never thought about
any of this before and seem genuinely surprised when they're told that they need
to start doing so.

>> ...  Sheesh.  It's almost as if these people live in the real world or something.
> Those don't sound like real-world concerns to me.  To the contrary, those sound more like the concerns of people who want to cling to the belief that it's still the 20th century, and OS 9 is still a viable operating system.

I assure you that these imaginary people still exist.

> Maybe there's something I'm missing here.  How does UTF-8 lose information?

In ISO-8859-1 (and some other systems) it's possible to do:

;;; This isn't claimed to be interesting in and of itself.
   (loop
     (let* ((ch (read-char in nil nil)))   ; IN is open in iso-8859-1
       (when (null ch) (return))           ; end of file: leave the loop
       (write-char ch out)))               ; as is OUT

and get a verbatim copy of the input file, regardless of its contents.
That's true as long as the encoding describes a 1:1 mapping between
octets and characters; in ISO-8859-1, the mapping is simply CODE-CHAR.
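
As a minimal sketch of that 1:1 property, the round trip can be written
with nothing but CODE-CHAR and CHAR-CODE.  The names LATIN1-DECODE and
LATIN1-ENCODE are hypothetical helpers made up for this illustration, not
CCL functions, and the sketch assumes an implementation whose character
codes 0-255 coincide with ISO-8859-1 (true of Unicode-based Lisps such as
CCL).

```lisp
;;; Hypothetical helpers for illustration only; not part of CCL.

(defun latin1-decode (octets)
  "Map each octet to the character with the same code."
  (map 'string #'code-char octets))

(defun latin1-encode (string)
  "Inverse of LATIN1-DECODE.  Assumes every char code in STRING is below 256."
  (map '(vector (unsigned-byte 8)) #'char-code string))

;; Every possible octet value, including those >= #x80, survives the trip.
(let ((octets (coerce (loop for i below 256 collect i)
                      '(vector (unsigned-byte 8)))))
  (equalp octets (latin1-encode (latin1-decode octets))))
```

Because no octet value is rejected or merged with another, decoding and
re-encoding cannot change the data, whatever the file actually contained.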

In UTF-8, there are constraints on what octets can appear in what sequence
when some of the octets in question have their most significant bits set;
a stream that's alleged to be in UTF-8 but whose contents violate these
constraints isn't well-formed.  On encountering ill-formed sequences,
a UTF-8 translator is supposed to either signal an error or generate
a replacement character; CCL generally tries to do the latter.

If a stream that isn't really UTF-8 is treated as UTF-8 (and if it
contains octets >= #x80 that don't accidentally form valid UTF-8
sequences) the result will contain replacement characters with no
indication of what octets they were derived from.

If a stream that isn't really ISO-8859-1 is treated as ISO-8859-1,
the result will be a sequence of characters that reflect the octet
structure of the source.
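
The difference between the two cases can be sketched in portable Common
Lisp.  UTF8-DECODE-LOSSY below is a hypothetical, hand-written decoder
used only for illustration; it is not CCL's decoder, and its policy of
emitting one U+FFFD per rejected octet is an assumption (real decoders
differ in how many replacement characters they emit).  The ISO-8859-1
reading is again just CODE-CHAR.

```lisp
;;; Hypothetical decoder for illustration; NOT CCL's implementation.
;;; Decodes OCTETS as UTF-8, writing U+FFFD for each octet that does not
;;; begin a well-formed sequence.
(defun utf8-decode-lossy (octets)
  (let ((n (length octets))
        (i 0))
    (with-output-to-string (out)
      (loop while (< i n) do
        (let* ((b (aref octets i))
               ;; Number of continuation octets implied by the lead octet,
               ;; or NIL if B cannot start a sequence.
               (extra (cond ((< b #x80) 0)
                            ((<= #xC2 b #xDF) 1)
                            ((<= #xE0 b #xEF) 2)
                            ((<= #xF0 b #xF4) 3)))
               ;; Assembled code point, or NIL if the sequence is truncated
               ;; or a continuation octet is missing.
               (code (when (and extra (< (+ i extra) n))
                       (let ((c (if (zerop extra)
                                    b
                                    (logand b (ash #x3F (- extra))))))
                         (loop for k from 1 to extra
                               do (let ((o (aref octets (+ i k))))
                                    (if (= (logand o #xC0) #x80)
                                        (setf c (logior (ash c 6)
                                                        (logand o #x3F)))
                                        (return nil)))
                               finally (return c))))))
          (if (and code
                   ;; Reject overlong forms, surrogates, and out-of-range values.
                   (>= code (aref #(0 #x80 #x800 #x10000) extra))
                   (not (<= #xD800 code #xDFFF))
                   (<= code #x10FFFF))
              (progn (write-char (code-char code) out)
                     (incf i (1+ extra)))
              (progn (write-char (code-char #xFFFD) out)
                     (incf i))))))))

;; Two different non-UTF-8 inputs...
(defparameter *input-a* #(#x61 #xAA #x62))
(defparameter *input-b* #(#x61 #xBB #x62))

;; ...become indistinguishable when read as UTF-8,
(string= (utf8-decode-lossy *input-a*) (utf8-decode-lossy *input-b*))

;; but remain distinguishable when read as ISO-8859-1.
(string/= (map 'string #'code-char *input-a*)
          (map 'string #'code-char *input-b*))
```

Under these assumptions both UTF-8 results contain the same replacement
character where #xAA or #xBB used to be, while the ISO-8859-1 results
still carry the original octet values as character codes.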

In both cases, the result is incorrect. If a bug report from a naive
(and apparently imaginary) user says "my file has a trademark
character in it but CCL thought that it was a Feminine Ordinal Indicator",
that's a strong clue that the file was encoded in MacRoman, and such
clues can be helpful when the user honestly doesn't have a clue as to
how the file is encoded and may not be sure what the question means.
If the default was UTF-8, there would be fewer such clues.
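
To make the trademark example concrete: octet #xAA is TRADE MARK SIGN
(U+2122) in MacRoman and FEMININE ORDINAL INDICATOR (U+00AA) in
ISO-8859-1.  The sketch below re-reads a string that was decoded as
ISO-8859-1 as if its character codes were MacRoman octets.  The table is
a deliberately tiny, partial sample and REREAD-AS-MACROMAN is a
hypothetical helper; a real conversion would need the full MacRoman
table (or one of the implementation's own external formats).

```lisp
;;; Partial, illustrative MacRoman sample: octet -> Unicode code point.
;;; Octets >= #x80 that are missing here are passed through unchanged,
;;; which is NOT correct MacRoman decoding; this is only a sketch.
(defparameter *macroman-sample*
  '((#xAA . #x2122)     ; TRADE MARK SIGN
    (#xA8 . #x00AE)     ; REGISTERED SIGN
    (#xD2 . #x201C)     ; LEFT DOUBLE QUOTATION MARK
    (#xD3 . #x201D)))   ; RIGHT DOUBLE QUOTATION MARK

(defun reread-as-macroman (string)
  "STRING was decoded as ISO-8859-1; reinterpret its char codes as
MacRoman octets using *MACROMAN-SAMPLE*."
  (map 'string
       (lambda (ch)
         (let ((entry (assoc (char-code ch) *macroman-sample*)))
           (if entry (code-char (cdr entry)) ch)))
       string))

;; What the user's file looked like after being read as ISO-8859-1:
;; "A" followed by U+00AA, the Feminine Ordinal Indicator.
(defparameter *as-read* (map 'string #'code-char #(#x41 #xAA)))

;; Re-reading recovers the trademark sign the user expected.
(char-code (char (reread-as-macroman *as-read*) 1))
```

This kind of recovery is only possible because the ISO-8859-1 reading
kept the octet value; had the octet been replaced during a UTF-8
decode, there would be nothing left to look up.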

This (having the default of ISO-8859-1 provide more information in
cases where the default shouldn't have been used) has some value.  I
may be overstating that value or you may be underappreciating it, but
the value isn't totally imaginary.  (And it's also true that having
the default be ISO-8859-1 has masked some problems, and it's also true
that having the default be UTF-8 could make some subtle errors easier
to spot because their effects - #\Replacement_characters all over the
place - would be less subtle-looking.)
