[Openmcl-devel] A plug for UTF-8

Daniel Weinreb dlw at itasoftware.com
Tue Oct 20 11:28:29 PDT 2009


I was just looking at this email, and wanted to tell you
something relevant that I thought you might be interested
in.  The cl-bench library has a benchmark called read-many-lines,
which opens /usr/share/dict/words and calls read-line on it until
end of file.  This turns out to signal a condition in SBCL,
because (on Ubuntu 8, at least) the file is encoded in LATIN-1,
whereas SBCL defaults to UTF-8, and there is a byte sequence
in the file that is not legal in the UTF-8 encoding.
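
To make the failure mode concrete, here is a minimal sketch in Python
rather than Lisp. The sample bytes and the helper name try_decode are
made up for illustration; they are not the actual contents of the words
file or part of cl-bench. The byte 0xE9 is "é" in LATIN-1, but in UTF-8
it announces a three-byte sequence, so an ASCII letter right after it
makes the input illegal.

```python
# Hypothetical sample: the word "étude" as LATIN-1 bytes (0xE9 is "é").
latin1_bytes = b"\xe9tude"


def try_decode(data: bytes, encoding: str):
    """Return (text, None) on success, or (None, error) if decoding fails."""
    try:
        return data.decode(encoding), None
    except UnicodeDecodeError as err:
        return None, err


# Decoding as UTF-8 fails: 0xE9 must be followed by two continuation bytes.
text_utf8, err_utf8 = try_decode(latin1_bytes, "utf-8")

# Decoding with the encoding the bytes were written in succeeds.
text_latin1, err_latin1 = try_decode(latin1_bytes, "latin-1")

print("utf-8:  ", text_utf8, "|", err_utf8)
print("latin-1:", text_latin1, "|", err_latin1)
```

The analogous remedy on the Lisp side would presumably be to pass an
explicit :external-format when opening the file instead of relying on
the implementation's default.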

I won't bother to try to do a root-blame analysis!  I agree
that everything should be UTF-8, including that file...

Anyway, just FYI.

-- Dan

Ron Garret wrote:
> I'm not advocating any change in CCL, I'm just urging people to as a 
> matter of common practice set their default encodings to UTF-8 and 
> publish their code using UTF-8.  That's all.
> The reason for this (and for nearly everything I'm advocating 
> nowadays) is that I want to make CL in general and CCL in particular 
> as attractive as possible to new users.  I believe one way to do this 
> is to ensure that, to the maximum extent possible, things "just work".  
> Nowadays, a big part of "just working" is to minimize the amount of 
> mental energy users have to spend fiddling with unicode encodings.  
> Unfortunately, the unicode standard is b0rken so it is not possible to 
> reduce this fiddling to zero, but until unicode is fixed I think just 
> having everyone use UTF-8 by convention is the next best thing.  The 
> situation with unicode today is analogous to that which plagued IBM PC 
> add-on cards before plug-and-play came along.  Users had to manually 
> fiddle with various hardware configurations.  Some day the unicode 
> community will fix the mess they've created and come up with a 
> standard way to embed the encoding in the byte stream.  But until that 
> happens the best we can do is just all follow some convention.  And 
> the simplest convention is to just pick an encoding and stick with it.
> rg
> On Sep 10, 2009, at 9:08 AM, Daniel Weinreb wrote:
>> Ron,
>> I'm not sure I understand what you are advocating.
>> What change would you like to see in CCL?
>> -- Dan
>> Ron Garret wrote:
>>> I would like to take a moment to lobby on behalf of UTF-8.  This is  
>>> not a huge big deal because it's easy enough to convert from one  
>>> encoding to another once you know how, but I think it would be a 
>>> nice selling point for CCL if things tended to Just Work, and one 
>>> way to  make them Just Work is to have an encoding convention that 
>>> is  universally followed so that newcomers can set it and forget 
>>> it.  The  reason I think UTF-8 is a better choice than, say, Latin-1 
>>> is that  UTF-8 gives you access to the entire unicode code space, 
>>> and in  particular the lower-case Greek lambda character (λ) and 
>>> European-style «quotation marks» which are self-balancing and hence 
>>> let you  build nested strings without the need for backslash escapes.
>>> Thank you for your indulgence during this commercial break.  You 
>>> may  now return to your regularly scheduled programming.
>>> rg
>>> _______________________________________________
>>> Openmcl-devel mailing list
>>> Openmcl-devel at clozure.com
>>> http://clozure.com/mailman/listinfo/openmcl-devel
