[Openmcl-devel] *default-character-encoding* should be :utf-8

Ron Garret ron at flownet.com
Sun Sep 23 16:18:26 PDT 2012


More generally, if there were a universal way of designating the encoding of a Unicode text file (not just a Unicode .lisp file) that would make the world a Better Place too.  More^2 generally, if there were a universal way of encoding general metadata about a file (e.g. "This is a jpg image encoded in base64" or "This is an x86 executable for OSX version 10.5 and 10.6") that would make the world a Much Better Place.  Alas, such projects are historically fraught with peril.  The poster child for this is DER format, which tries to be a universal container for all things PKI.  DER is a colossal, horrible mess, which has real consequences besides making the writing of DER parsers an incredibly expensive and painful undertaking.  DER is actually a very serious security risk.  I refer anyone wishing to dive into this rabbit hole to this paper:

http://www.cosic.esat.kuleuven.be/publications/article-1432.pdf

Gary, why are you so resistant to adopting UTF8?  I really don't get it.

rg

On Sep 23, 2012, at 3:53 PM, Gary Byers wrote:

> One good suggestion that Robert Goldman made (and that everyone - including me -
> ignored) in the discussion last spring is to have LOAD and COMPILE-FILE (at least)
> honor a coding: attribute [*] in the file attributes line (aka the modeline).  E.g.:
> 
> ;;; -*- Mode: lisp; Coding: utf-8 -*-
> 
> at the top of a .lisp source file makes it pretty clear that the file's author
> intends for the file to be processed in utf-8 and makes that fact obvious to
> a human reader as well.
> 
> Emacs (generally) supports this; other environments (the Cocoa IDE)
> could be made to if they don't already, and LOAD and COMPILE-FILE
> could do so in CCL (and may already do so in other implementations) at
> least when their :EXTERNAL-FORMAT argument isn't explicitly specified.
> (OPEN could also do so, but might not find an attribute line as often.)
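
To make the suggestion above concrete, here is a minimal sketch of the attribute-line lookup. It is written in Python only because that is short to show; it is not CCL code, and the names attribute_line_coding and decode_source are hypothetical. It assumes the attribute line is the first line of the file, that the attribute name is matched case-insensitively, and that the attribute value is a codec name Python happens to recognise (an Emacs-specific name such as utf-8-unix would need a mapping step that is not shown).

```python
import re

# Matches the "-*- ... -*-" attribute list on a line (hypothetical helper).
_ATTR_LINE = re.compile(r"-\*-(.*?)-\*-")


def attribute_line_coding(first_line, default=None):
    """Return the Coding attribute from a file's first line, or `default`."""
    match = _ATTR_LINE.search(first_line)
    if match is None:
        return default
    # Attributes are separated by ";" and written as "Name: value".
    for field in match.group(1).split(";"):
        name, sep, value = field.partition(":")
        if sep and name.strip().lower() == "coding":
            return value.strip().lower()
    return default


def decode_source(data, default="latin-1"):
    """Decode raw file bytes, honoring a Coding attribute if one is present.

    The first line is read as Latin-1 because that never fails and the
    attribute line itself is expected to be plain ASCII.
    """
    first_line = data.split(b"\n", 1)[0].decode("latin-1")
    return data.decode(attribute_line_coding(first_line, default))


header = ";;; -*- Mode: lisp; Coding: utf-8 -*-"
print(attribute_line_coding(header))
```

The default only comes into play when no attribute is found, which is the sense in which the global default "wouldn't matter as often".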
> 
> Things like *DEFAULT-FILE-CHARACTER-ENCODING* would still have to exist
> and we could continue to argue about what value it should take, but following
> Robert's suggestion would mean that that wouldn't matter as often.
> 
> 
> ---
> [*] IIRC.  The point here is to use whatever attribute name Emacs uses.
> 
> On Sun, 23 Sep 2012, Ron Garret wrote:
> 
>> As the instigator of this thread, I think it's worth recapping the original argument, which had nothing to do with moral failings and everything to do with real-world considerations.
>> 
>> The sad fact of the matter is that the Internet is lousy with texts that cannot be encoded in Latin-1.  Some people (notably those whose native language is not English) even write code that contains characters that cannot be encoded in Latin-1 (the nerve!)  There are three -- and only three -- ways to deal with this situation:
>> 
>> 1.  Use Latin-1 exclusively, and lock yourself out of being able to deal with code and texts that contain non-European glyphs.
>> 2.  Use Latin-1 and some other encoding(s), and deal with the confusion that inevitably results.
>> 3.  Use an encoding that covers all (or at least most) of the Unicode code point space.
>> 
>> I advocate #3 in general, and UTF-8 in particular, because I don't like provincialism and I don't like unnecessary complication.  But this is a value judgement, and reasonable people can disagree.  UTF8 is no panacea.  There are drawbacks, most notably that ELT is no longer O(1), and the length of a string is not a linear function of its size in memory.
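
The two drawbacks mentioned above can be sketched as follows, assuming the string is held in memory as raw UTF-8 octets. Python's own str type already indexes in constant time, so the sketch works on bytes instead; utf8_elt is a hypothetical name chosen to echo ELT, not an existing function.

```python
def utf8_elt(data, index):
    """Return the character at code-point position `index` in UTF-8 bytes.

    The octets are scanned from the start, so the cost grows with `index`
    rather than being O(1) as it is for a fixed-width encoding.
    """
    count = -1
    for offset, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:  # not a continuation octet: a character starts here
            count += 1
            if count == index:
                end = offset + 1
                # Consume the continuation octets belonging to this character.
                while end < len(data) and (data[end] & 0xC0) == 0x80:
                    end += 1
                return data[offset:end].decode("utf-8")
    raise IndexError(index)


ascii_octets = "abcd".encode("utf-8")
greek_octets = "\u03bb\u03bb\u03bb\u03bb".encode("utf-8")

# Same number of characters, different number of octets.
print(len(ascii_octets), len(greek_octets))
print(utf8_elt("a\u03bbb".encode("utf-8"), 2))
```

Both test strings are four characters long, yet one occupies four octets and the other eight, so the octet count alone does not determine the character count.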
>> 
>> There is one aspect of Latin-1 that I find particularly annoying in the context of choosing an encoding for Lisp code, and that is that the encoding of lower-case lambda (λ) is incompatibly different between Latin-1 and UTF-8.  Since it is no longer 1978, I sometimes like to spell lambda as "λ".  Because I use the λ character, and because I don't want to close the door on non-European texts, and because I don't like unnecessary complication, I choose to use UTF8 exclusively, and I think the world would be a better place if everyone did likewise.
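
A small illustration of the lambda problem, in Python for brevity and using only its standard codecs. Strictly speaking Latin-1 has no code for lower-case lambda at all, so the incompatibility shows up in two ways: the character cannot be written out as Latin-1, and its UTF-8 octets read back as two unrelated Latin-1 characters. The variable names are made up for the example.

```python
LAMBDA = "\u03bb"  # GREEK SMALL LETTER LAMDA, code point U+03BB

# In UTF-8 the character takes two octets.
utf8_octets = LAMBDA.encode("utf-8")
print(utf8_octets)

# Latin-1 only covers U+0000 through U+00FF, so encoding fails.
try:
    LAMBDA.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)

# Reading the UTF-8 octets as Latin-1 never signals an error;
# it silently yields two different characters instead of one lambda.
misread = utf8_octets.decode("latin-1")
print(len(misread), misread)
```

The last step is the one that bites in practice: a reader defaulting to Latin-1 accepts a UTF-8 file without complaint and hands back the wrong characters.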
>> 
>> Again, reasonable people can disagree, and clearly the fate of civilization does not hinge on this decision.  But if CCL is going to revert to latin-1 I would hope it would not be because the argument for UTF8 had been misunderstood.
>> 
>> rg
>> 
>> 
>> On Sep 23, 2012, at 12:49 PM, Gary Byers wrote:
>> 
>>> The values of *TERMINAL-CHARACTER-ENCODING-NAME* and *DEFAULT-FILE-CHARACTER-ENCODING*
>>> changed (experimentally) in the trunk in r15236, largely as an attempt to silence
>>> an apparently endless discussion started in:
>>> 
>>> <http://clozure.com/pipermail/openmcl-devel/2012-March/013401.html>
>>> 
>>> Both of those variables have historically been initialized to NIL (which
>>> is equivalent to :ISO-8859-1.)
>>> 
>>> A careful reading of that thread will reveal that if you have files that
>>> aren't encoded in :UTF-8 that's because of sloth, avarice, or some other
>>> personal failing on your part (and would have nothing to do with real-world
>>> issues.)
>>> 
>>> That change was intentionally not incorporated into 1.8; all other things
>>> being equal, I think that I'd prefer that 1.9 revert back to using NIL/:ISO-8859-1
>>> (but that might cause that discussion to start up again.)
>>> 
>>> In the bigger picture: as I understand how things currently stand, the trunk
>>> contains workarounds for a couple of Mountain Lion issues (the mechanism used
>>> by things like GUI:EXECUTE-IN-GUI and the problems with #_mach_port_allocate_name)
>>> that have not yet been propagated to 1.8.  Assuming that the fixes/workarounds
>>> are correct and have been smoke-tested to at least some degree, the changes
>>> do need to be incorporated into 1.8 ASAP, and once they are I would think that
>>> you'd want to base your application on 1.8.
>>> 
>>> I'm a little nervous about the hashing scheme that's being used to avoid
>>> #_mach_port_allocate_name (I don't know how well it scales and don't know
>>> whether or not there are non-obvious race conditions or other thread-safety
>>> issues in the code) and I'd ordinarily be a little reluctant to push something
>>> like that to the release: the consequences of using #_mach_port_allocate_name
>>> are so horrible on Mountain Lion that the hashing scheme is clearly better (even
>>> if it contains its own obscure/subtle problems); on OS releases where
>>> #_mach_port_allocate_name still works, then propagating the change simply risks
>>> introducing some obscure/subtle problems.
>>> 
>>> I'll try to decide what to do soon, but I don't think that it'd be a good idea
>>> for you (or anyone) to base a shipping application on the CCL trunk, simply
>>> because the trunk's volatility would make it harder for you or us to maintain.
>>> 
>>> 
>>> On Sat, 22 Sep 2012, Alexander Repenning wrote:
>>> 
>>>> tracked down some bugs that are due to the new :utf-8 encoding. When actually did that happen? Anyway, is there a way to set the encoding back (:ascii)? The only *default-character-encoding* variable I can find is part of quicklisp/babel/encodings.
>>>> 
>>>> Alex
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> Openmcl-devel mailing list
>>>> Openmcl-devel at clozure.com
>>>> http://clozure.com/mailman/listinfo/openmcl-devel
>>>> 
>>>> 
>> 
>> 



