[Openmcl-devel] *default-character-encoding* should be :utf-8

Ron Garret ron at flownet.com
Mon Mar 5 18:36:25 PST 2012


On Mar 5, 2012, at 5:14 PM, Gary Byers wrote:

> 
> 
> On Mon, 5 Mar 2012, Ron Garret wrote:
> 
>> 
>> On Mar 4, 2012, at 5:53 PM, Gary Byers wrote:
>> 
>>> If your sources are in some legacy encoding - MacRoman is an example
>>> that still comes up from time to time - then you obviously need to
>>> process them with that encoding in effect or you'll lose information.
>> 
>> If you're using such legacy sources, your first step should be to
>> convert them to UTF-8 and then never touch the original again.
>> (The same goes for latin-1, except that latin-1 is not a legacy
>> encoding.  It's in common use today, which is the main reason this
>> is a real problem.)
> 
> I agree, but the people who have these legacy-encoded sources that really
> should have been converted to utf-8 long ago have all kinds of flimsy excuses
> for not wanting to do so.  "It costs time", "it costs money"

Those really are flimsy excuses.  Converting character encodings on modern processors can be done at a rate of gigabytes per minute.  You could probably convert the entire corpus of all computer source code ever produced by humans for about $100.

> "it requires expertise", "it breaks backward compatibility"

Those are slightly less flimsy excuses.  But expertise can be hired or acquired.  Backwards compatibility can be a real concern in certain application domains, but I'd be surprised to learn that CCL is being used in any of them.

> ...  Sheesh.  It's almost as if these people live in the real world or something.

Those don't sound like real-world concerns to me.  To the contrary, those sound more like the concerns of people who want to cling to the belief that it's still the 20th century, and OS 9 is still a viable operating system.

> At some point, people with legacy code do need to invest in its viability
> (and in many cases that point was probably "years ago.")  It doesn't always
> happen, and this so-called "real world" thing that I keep hearing about seems
> to have something to do with that.  Given that situation (and the general lack
> of awareness of encoding issues that sometimes accompanies it), a default
> encoding that loses less information (ISO-8859-1) has more practical value
> than one that loses as much information as UTF-8 can.

Maybe there's something I'm missing here.  How does UTF-8 lose information?

> So, let's see.  There doesn't seem to be as much of a performance hit
> for repeatedly doing READ-CHAR on utf-8 encoded files (whose contents are
> all STANDARD-CHAR/ASCII) as I'd remembered, so changing the default terminal
> and file encodings (in the trunk) seems like a worthwhile experiment.  It may
> be easier to evaluate some of these things with those changes in effect, and
> it's entirely possible that the change is neither a particularly good nor
> a particularly bad idea.

You don't actually have to change the CCL defaults if you think it would upset a significant constituency.  This is more of a social issue than a technical one.  It's enough to just encourage people to put the following in their CCL-INIT files:

(setf CCL:*DEFAULT-FILE-CHARACTER-ENCODING* :utf-8)

and then have zero tolerance for any source code that doesn't work as a result.  I've had that line in my ccl-init for so long that I no longer know what default encoding CCL ships with.
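
For the occasional file that hasn't been converted yet, the encoding can be named at the call site instead of changing the global default back.  A minimal sketch, assuming the file really is latin-1; the file names are hypothetical, and :iso-8859-1 is just the example encoding:

```lisp
;; In the init file: UTF-8 for any file opened without an explicit
;; :external-format.
(setf ccl:*default-file-character-encoding* :utf-8)

;; Hypothetical legacy source: state its encoding explicitly when loading.
(load "legacy-latin1.lisp" :external-format :iso-8859-1)

;; Same idea for a data file read through a stream.
(with-open-file (in "legacy-latin1.txt" :external-format :iso-8859-1)
  (read-line in nil))
```

That keeps the exceptions visible in the code that depends on them, so the default can stay UTF-8 everywhere else.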

rg



