[Openmcl-devel] *default-character-encoding* should be :utf-8

Gary Byers gb at clozure.com
Mon Mar 5 22:06:06 PST 2012


On Mon, 5 Mar 2012, Vladimir Sedach wrote:

> Here is a concise example of what motivated me to start this thread:

I may regret phrasing it this way, but I have to ask: what would it take
to get you to stop?  The message that you're responding to is one
where I agreed to change the defaults to utf-8.

There's a tendency in discussions like this for everyone (myself included)
to become infatuated with the sound of their own voice.  By now I'm a bit
past that point and am getting tired of saying and hearing the same things
repeatedly, and I'd have to assume that many other people are as well.

>
> File test.lisp (encoded in utf-8):
>
> (defvar *city* "Montréal")
>
> Now, go to CCL:
>
> CL-USER> (load "test.lisp")
> #P"/home/viper/test.lisp"
> CL-USER> *city*
> "MontrÃ©al"
>
> Latin-1 is absolutely the wrong choice for a default encoding for LOAD
> because it results in unnoticeable "corruption" like this. I'm an
> experienced Lisp programmer and I was confused. I'm not sure how this
> behavior will benefit the hypothetical user who is "naive" about
> encodings.
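
The mis-decoding quoted above can be reproduced without touching any
files.  What follows is only a minimal sketch in portable Common Lisp,
assuming an implementation whose characters cover Unicode (as CCL's do).
UTF-8-OCTETS and LATIN-1-DECODE are illustrative names invented here,
not part of CCL, and the encoder handles only code points below #x800.

```lisp
;; Portable sketch: no implementation-specific codec functions are used.
;; UTF-8-OCTETS and LATIN-1-DECODE are illustrative names, not CCL API.

(defun utf-8-octets (string)
  "Encode STRING as a list of UTF-8 octets (code points below #x800 only)."
  (loop for ch across string
        for code = (char-code ch)
        do (assert (< code #x800) () "Only 1- and 2-byte sequences handled.")
        if (< code #x80)
          collect code
        else
          collect (logior #xC0 (ash code -6))
          and collect (logior #x80 (logand code #x3F))))

(defun latin-1-decode (octets)
  "Decode OCTETS as ISO-8859-1: every octet is its own code point,
so decoding can never fail, whatever the bytes really meant."
  (map 'string #'code-char octets))

;; Built with CODE-CHAR so that this snippet itself stays pure ASCII.
(defvar *city* (format nil "Montr~Cal" (code-char #xE9)))

;; What a Latin-1 reader sees when handed the UTF-8 bytes.
(defvar *misread* (latin-1-decode (utf-8-octets *city*)))

(length *city*)     ; 8 characters
(length *misread*)  ; 9 characters: e-acute became #xC3 followed by #xA9
```

Because every octet is a valid ISO-8859-1 character, nothing is
signalled and the extra character is easy to overlook.
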
>
> If utf-8 doesn't seem like a good idea, then default to ascii. Either
> utf-8 or ascii will reliably indicate a problem on encountering
> unknown 8- or 16-bit encodings in source files, so at least you will
> know there is a problem and that it involves encodings. Note that
> right now Clozure does not throw an error if
> *default-file-character-encoding* is :ascii and 8-bit characters are
> found and just puts #\Replacement_Character in place of the offending
> byte (so the above string comes out as "Montr??????al"). This behavior is
> also wrong and an error needs to be signalled.
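
For comparison, a strict decoder of the kind being asked for might look
roughly like this.  It is a sketch only, not how CCL's :ascii encoding
is implemented; the condition NON-ASCII-OCTET, its readers, and
STRICT-ASCII-DECODE are hypothetical names made up for illustration.

```lisp
;; Hypothetical condition type; not part of CCL or the standard.
(define-condition non-ascii-octet (error)
  ((octet :initarg :octet :reader non-ascii-octet-value)
   (index :initarg :index :reader non-ascii-octet-index))
  (:report (lambda (condition stream)
             (format stream "Octet #x~2,'0X at index ~D is not ASCII."
                     (non-ascii-octet-value condition)
                     (non-ascii-octet-index condition)))))

(defun strict-ascii-decode (octets)
  "Decode the vector OCTETS as ASCII, signalling NON-ASCII-OCTET on any
byte above #x7F instead of substituting a replacement character."
  (let ((result (make-string (length octets))))
    (loop for octet across octets
          for i from 0
          do (when (> octet #x7F)
               (error 'non-ascii-octet :octet octet :index i))
             (setf (char result i) (code-char octet)))
    result))

;; "Montreal" in plain ASCII decodes normally.
(strict-ascii-decode #(77 111 110 116 114 101 97 108))

;; The UTF-8 octets of the accented spelling (#xC3 #xA9) do not:
(handler-case
    (strict-ascii-decode #(77 111 110 116 114 #xC3 #xA9 97 108))
  (non-ascii-octet (c) (non-ascii-octet-index c)))  ; index 5
```

Signalling at the first bad octet gives the position of the problem,
which a silently substituted character does not.
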
>
> Leaving latin-1 as the default is as arbitrary as any other 8-bit
> encoding and will result in unnoticed "corruption" as in the scenario
> above.
>
> The globally-optimal solution is to specify the file encodings in the
> ASDF system definitions (ASDF needs to be extended to support this,
> unless I missed a section of the manual). Relying on the local machine
> locale won't work because the source files were probably authored on a
> different machine, and enca-style encoding guessing is unreliable.
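
Stating the encoding explicitly could look something like the sketch
below.  The :EXTERNAL-FORMAT argument to LOAD is standard, though the
keyword naming the encoding is implementation-specific.  The system
definition is an assumption on two counts: "city-demo" is a made-up
system name, and the :ENCODING option is not in the ASDF being
discussed here; it only exists in ASDF releases newer than this thread.

```lisp
;; Standard Common Lisp: LOAD accepts an :EXTERNAL-FORMAT argument.
;; :UTF-8 is how CCL names the encoding; other implementations may
;; spell their external formats differently.
(load "test.lisp" :external-format :utf-8)

;; Hypothetical system definition: "city-demo" and its single component
;; are made-up names.  The :ENCODING option is assumed to be available,
;; which requires an ASDF release newer than this thread.
(asdf:defsystem "city-demo"
  :encoding :utf-8
  :components ((:file "test")))
```

Either form records the encoding next to the code that depends on it,
rather than leaving it to whatever default the loading machine has.
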
>
> Vladimir
>
> On Mon, Mar 5, 2012 at 8:14 PM, Gary Byers <gb at clozure.com> wrote:
>>
>>
>> On Mon, 5 Mar 2012, Ron Garret wrote:
>>
>>>
>>> On Mar 4, 2012, at 5:53 PM, Gary Byers wrote:
>>>
>>>> If your sources are in some legacy encoding - MacRoman is an example
>>>> that still comes up from time to time - then you obviously need to
>>>> process them with that encoding in effect or you'll lose information.
>>>
>>>
>>> If you're using such legacy sources, your first step should be to
>>> convert them to UTF-8 and then never touch the original again.
>>> (The same goes for latin-1, except that latin-1 is not a legacy
>>> encoding.  It's in common use today, which is the main reason this
>>> is a real problem.)
>>
>>
>> I agree, but the people who have these legacy-encoded sources that really
>> should have been converted to utf-8 long ago have all kinds of flimsy
>> excuses for not wanting to do so.  "It costs time", "it costs money",
>> "it requires expertise", "it breaks backward compatibility" ...  Sheesh.
>> It's almost as if these people live in the real world or something.
>>
>> At some point, people with legacy code do need to invest in its viability
>> (and in many cases that point was probably "years ago.")  It doesn't always
>> happen, and this so-called "real world" thing that I keep hearing about
>> seems to have something to do with that.  Given that situation (and the
>> general lack of awareness of encoding issues that sometimes accompanies
>> it), a default encoding that loses less information (ISO-8859-1) has more
>> practical value than one that loses as much information as UTF-8 can.
>> That's one of those real-world considerations (debugging reported problems
>> that often stem from that lack of awareness is part of the real world of
>> anyone who has to do it); I don't know if I'm overstating the importance
>> of that or if other people are unaware of just how intrusive this
>> so-called "real world" can be.  (There isn't much else to like about
>> ISO-8859-1.)
>>
>> So, let's see.  There doesn't seem to be as much of a performance hit
>> for repeatedly doing READ-CHAR on utf-8 encoded files (whose contents are
>> all STANDARD-CHAR/ASCII) as I'd remembered, so changing the default
>> terminal and file encodings (in the trunk) seems like a worthwhile
>> experiment.  It may be easier to evaluate some of these things with those
>> changes in effect, and it's entirely possible that the change is neither
>> a particularly good nor a particularly bad idea.
>>
>>
>>
>>
>>>
>>> rg
>>>
>>> _______________________________________________
>>> Openmcl-devel mailing list
>>> Openmcl-devel at clozure.com
>>> http://clozure.com/mailman/listinfo/openmcl-devel
>>>
>>>
>
>

