[Openmcl-devel] default-character-encoding should be :utf-8

Mon Mar 5 19:16:14 PST 2012

Here is a concise example of what motivated me to start this thread:

File test.lisp (encoded in utf-8):

(defvar *city* "Montréal")

Now, go to CCL:

CL-USER> (load "test.lisp")
#P"/home/viper/test.lisp"
CL-USER> *city*
"MontrÃ©al"

Latin-1 is absolutely the wrong choice for a default encoding for LOAD
because it results in unnoticeable "corruption" like this. I'm an
experienced Lisp programmer and I was confused. I'm not sure how this
behavior will benefit the hypothetical user who is "naive" about
encodings.
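
For anyone who wants to reproduce this without a test file, here is a minimal sketch of what happens to the bytes. It assumes CCL's encode-string-to-octets and decode-string-from-octets; the names *city-utf-8-demo* and misread-utf-8-as-latin-1 are made up for illustration and are not part of CCL.

```lisp
;; Build "Montreal" with e-acute from character codes so this snippet is
;; pure ASCII and reads the same under any default encoding.
(defvar *city-utf-8-demo*
  (concatenate 'string "Montr" (string (code-char #xE9)) "al"))

;; Hypothetical helper: encode STRING as UTF-8, then decode the same
;; octets as ISO-8859-1, which mimics loading a UTF-8 file as Latin-1.
(defun misread-utf-8-as-latin-1 (string)
  (let ((octets (ccl:encode-string-to-octets string :external-format :utf-8)))
    (values (ccl:decode-string-from-octets octets
                                           :external-format :iso-8859-1))))

;; The 8-character input comes back as 9 characters: the single
;; character #xE9 has become the pair #xC3 #xA9.
;; (misread-utf-8-as-latin-1 *city-utf-8-demo*)

;; Possible workarounds: name the encoding explicitly ...
;; (load "test.lisp" :external-format :utf-8)
;; ... or change the default for the session.
;; (setf ccl:*default-file-character-encoding* :utf-8)
```

No error is signalled at any point, which is why the damage goes unnoticed until the string is displayed.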

If utf-8 doesn't seem like a good idea, then default to ascii. Either
utf-8 or ascii will reliably indicate a problem on encountering
unknown 8- or 16-bit encodings in source files, so at least you will
know there is a problem and that it involves encodings. Note that
right now Clozure does not signal an error if
*default-file-character-encoding* is :ascii and 8-bit characters are
found; it just puts #\Replacement_Character in place of each offending
byte (so the above string comes out as "Montr��al"). This behavior is
also wrong, and an error needs to be signalled.
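
Until that happens, the check can be approximated by hand. The following is only a sketch in portable Common Lisp; first-non-ascii-octet and load-ascii-strictly are hypothetical names, not existing CCL functions.

```lisp
;; Hypothetical helper: scan PATH as raw octets and return the byte
;; offset of the first octet above #x7F, or NIL if the file is pure ASCII.
(defun first-non-ascii-octet (path)
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (loop for pos from 0
          for octet = (read-byte in nil nil)
          while octet
          when (> octet #x7F) return pos)))

;; Hypothetical wrapper: refuse to LOAD a file containing non-ASCII
;; octets instead of silently substituting replacement characters.
(defun load-ascii-strictly (path)
  (let ((pos (first-non-ascii-octet path)))
    (when pos
      (error "~A contains a non-ASCII octet at byte offset ~D" path pos))
    ;; A pure-ASCII file decodes identically as ASCII, Latin-1 or UTF-8,
    ;; so the default encoding no longer matters here.
    (load path)))
```

On the test.lisp above, first-non-ascii-octet would point at the first byte of the encoded "é", so the problem is reported instead of hidden.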

Leaving latin-1 as the default is as arbitrary as any other 8-bit
encoding and will result in unnoticed "corruption" as in the scenario
above.

The globally-optimal solution is to specify the file encodings in the
ASDF system definitions (ASDF needs to be extended to support this,
unless I missed a section of the manual). Relying on the local machine
locale won't work because the source files were probably authored on a
different machine, and enca-style encoding guessing is unreliable.
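
As a sketch of what such a declaration could look like: ASDF releases newer than the one current when this was written accept an :encoding option in defsystem. The system name "city-demo" below is hypothetical, and whether the option is honored depends on the installed ASDF version, so treat it as an assumption to check against the manual.

```lisp
;; Hypothetical system definition. The :encoding option is an assumption
;; about the installed ASDF version; older releases do not support it.
(asdf:defsystem "city-demo"
  :encoding :utf-8
  :components ((:file "test")))
```

Declaring the encoding next to the file list keeps it with the sources, so it travels with them to whatever machine loads the system.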

Vladimir

On Mon, Mar 5, 2012 at 8:14 PM, Gary Byers <gb at clozure.com> wrote:
>
>
> On Mon, 5 Mar 2012, Ron Garret wrote:
>
>>
>> On Mar 4, 2012, at 5:53 PM, Gary Byers wrote:
>>
>>> If your sources are in some legacy encoding - MacRoman is an example
>>> that still comes up from time to time - then you obviously need to
>>> process them with that encoding in effect or you'll lose information.
>>
>>
>> If you're using such legacy sources, your first step should be to
>> convert them to UTF-8 and then never touch the original again.
>> (The same goes for latin-1, except that latin-1 is not a legacy
>> encoding.  It's in common use today, which is the main reason this
>> is a real problem.)
>
>
> I agree, but the people who have these legacy-encoded sources that
> really should have been converted to utf-8 long ago have all kinds of
> flimsy excuses for not wanting to do so.  "It costs time", "it costs
> money", "it requires expertise", "it breaks backward compatibility"
> ...  Sheesh.  It's almost as if these people live in the real world or
> something.
>
> At some point, people with legacy code do need to invest in its
> viability (and in many cases that point was probably "years ago.")  It
> doesn't always happen, and this so-called "real world" thing that I
> keep hearing about seems to have something to do with that.  Given
> that situation (and the general lack of awareness of encoding issues
> that sometimes accompanies it), a default encoding that loses less
> information (ISO-8859-1) has more practical value than one that loses
> as much information as UTF-8 can.  That's one of those real-world
> considerations (debugging reported problems that often stem from that
> lack of awareness is part of the real world of anyone who has to do
> it); I don't know if I'm overstating the importance of that or if
> other people are unaware of just how intrusive this so-called "real
> world" can be.  (There isn't much else to like about ISO-8859-1.)
>
> So, let's see.  There doesn't seem to be as much of a performance hit
> for repeatedly doing READ-CHAR on utf-8 encoded files (whose contents
> are all STANDARD-CHAR/ASCII) as I'd remembered, so changing the
> default terminal and file encodings (in the trunk) seems like a
> worthwhile experiment.  It may be easier to evaluate some of these
> things with those changes in effect, and it's entirely possible that
> the change is neither a particularly good nor a particularly bad idea.
>
>
>
>
>>
>> rg
>>
>> _______________________________________________
>> Openmcl-devel mailing list
>> Openmcl-devel at clozure.com
>> http://clozure.com/mailman/listinfo/openmcl-devel
>>
>>
