[Openmcl-devel] *default-character-encoding* should be :utf-8

Ron Garret ron at flownet.com
Mon Mar 5 11:14:22 PST 2012


On Mar 4, 2012, at 5:53 PM, Gary Byers wrote:

> 
> 
> On Sun, 4 Mar 2012, Ron Garret wrote:
> 
>> 
>> On Mar 4, 2012, at 10:55 AM, Raymond Wiker wrote:
>> 
>>> Another point here is that the encoding of a particular file is what
>>> it is, no matter what the user's environment has been set up for. It
>>> is no help if the user's environment has been set up for a
>>> particular encoding when at least some source files are in a
>>> completely different encoding.
>> 
>> Yes, this is another reason why it's important for everyone to use
>> the same encoding unless there's a COMPELLING reason not to.  If
>> you're writing code, UTF-8 is the One True Encoding.
> 
> If you're writing code in a vacuum, UTF-8 is a good choice.

It's a good choice in other circumstances as well.  In particular, it's a good choice if you're writing code in (a programming language derived from) an Indo-European language.  The vast majority of extant computer code on earth is written in such languages, AFAIK.  Lisp code in particular is strongly biased towards Indo-European languages because Lisp is specified in such a language (English) and its root symbols have names composed of characters whose code points are all below 128.
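
To make that concrete: in any Lisp whose CHAR-CODE returns Unicode code points (as CCL's does), you can check that the characters of a standard symbol name all fall below 128, and bytes below 128 mean the same thing in ASCII, latin-1, and UTF-8:

    ;; Code points of the characters in a standard symbol name.
    (map 'list #'char-code "DEFUN")
    ;; => (68 69 70 85 78)

    ;; All below 128, so the encoded bytes are identical in ASCII,
    ;; latin-1, and UTF-8.
    (every (lambda (c) (< (char-code c) 128)) "DEFUN")
    ;; => T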

> If your sources contain a significant number of CJK characters, some
> variant of UTF-16 can be a better choice.

Yes, and note the disclaimer: use UTF-8 "unless there is a compelling reason not to."  If you're a native Mandarin speaker interacting only with other native Mandarin speakers, then UTF-8 might not be the best choice for you.  But then you're also very unlikely to be reading this.

In practical terms, there is really only one encoding in common use today that causes problems, and that is latin-1.  The reason latin-1 is problematic is that it IS commonly used, it does NOT provide complete Unicode coverage, and it is incompatible with UTF-8.  (ASCII, for example, also fails to provide complete Unicode coverage, but that's not a problem because ASCII is a strict subset of UTF-8.)  So really my crusade is not so much to use UTF-8 as it is to NOT use latin-1 (or any other incomplete encoding that isn't a subset of UTF-8).
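
Here's a minimal sketch of the incompatibility.  It assumes CCL accepts :latin1 and :utf-8 as :external-format designators (which I believe it does); exactly what happens on the bad byte (an error or a substitution character) depends on how decoding errors are handled:

    ;; Write "café" as latin-1: #\é becomes the single byte #xE9,
    ;; which is not a legal byte sequence in UTF-8.
    (with-open-file (out "/tmp/latin1-example.txt"
                         :direction :output
                         :if-exists :supersede
                         :external-format :latin1)
      (write-string "café" out))

    ;; Reading those bytes back as UTF-8 cannot reproduce the original
    ;; string; the #xE9 byte is either rejected or replaced with a
    ;; substitution character.
    (with-open-file (in "/tmp/latin1-example.txt" :external-format :utf-8)
      (read-line in))

    ;; Pure ASCII text round-trips unchanged, because every ASCII byte
    ;; is also valid UTF-8 with the same meaning.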

> If your sources are in some legacy encoding - MacRoman is an example
> that still comes up from time to time - then you obviously need to
> process them with that encoding in effect or you'll lose information.

If you're using such legacy sources, your first step should be to convert them to UTF-8 and then never touch the original again.  (The same goes for latin-1, except that latin-1 is not a legacy encoding.  It's in common use today, which is the main reason this is a real problem.)
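
For what it's worth, here's roughly what that one-time conversion looks like in CCL.  File names are hypothetical, and it again assumes :latin1 and :utf-8 are accepted :external-format designators; for MacRoman you'd substitute whatever encoding name your CCL uses for it:

    ;; One-time conversion: read the legacy latin-1 file, write a UTF-8
    ;; copy, then edit only the UTF-8 copy from now on.
    (with-open-file (in "legacy-source.lisp" :external-format :latin1)
      (with-open-file (out "legacy-source-utf8.lisp"
                           :direction :output
                           :if-exists :supersede
                           :external-format :utf-8)
        (loop for line = (read-line in nil)
              while line
              do (write-line line out))))

    ;; And, per the subject of this thread, make UTF-8 the default for
    ;; file streams going forward (variable name as in current CCL):
    (setf ccl:*default-file-character-encoding* :utf-8)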

rg



