[Openmcl-devel] *default-character-encoding* should be :utf-8

Ron Garret ron at flownet.com
Tue Mar 6 00:29:04 PST 2012


On Mar 5, 2012, at 8:30 PM, Gary Byers wrote:

>> 
>> Maybe there's something I'm missing here.  How does UTF-8 lose information?

> On encountering ill-formed sequences,
> a UTF-8 translator is supposed to either signal an error or generate
> a replacement character; CCL generally tries to do the latter.

Well, there's your problem right there ;-)

(http://knowyourmeme.com/memes/well-theres-your-problem for those who didn't get the joke)

> If a stream that isn't really UTF-8 is treated as UTF-8 (and if it
> contains octets >= #x80 that don't accidentally form valid UTF-8
> sequences) the result will contain replacement characters with no
> indication of what octets they were derived from.

That's not UTF-8 losing information, that's an error handling strategy losing information.
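To make the distinction concrete, here is a minimal sketch in Python (not CCL, and not a description of how CCL's decoder is implemented). The byte strings are made-up sample data: Latin-1 text being misread as UTF-8. With the "replace" strategy, two different inputs collapse to the same output. With the "signal an error" strategy, the condition still carries the offending octet and its position.

```python
# Illustrative data: Latin-1 encoded bytes that are NOT valid UTF-8.
raw_a = b"na\xefve"   # Latin-1 for "naive" with i-diaeresis
raw_b = b"na\xeeve"   # same, but with i-circumflex

# Strategy 1: substitute U+FFFD for each ill-formed sequence.
replaced_a = raw_a.decode("utf-8", errors="replace")
replaced_b = raw_b.decode("utf-8", errors="replace")

# Both inputs now decode to the same string; the original octet is gone.
collapsed = replaced_a == replaced_b

# Strategy 2: signal an error. The exception keeps the original bytes
# and the position of the ill-formed sequence, so nothing is lost.
try:
    raw_a.decode("utf-8", errors="strict")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

offending = strict_error.object[strict_error.start:strict_error.end]
print(collapsed, strict_error.start, offending)
```

The codec is the same in both cases; only the `errors` argument differs, which is the point being made above.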

There are very few circumstances under which dealing with UTF-8 decoding errors by generating a replacement character is OK (e.g. if not doing so would make it impossible to produce an error message).  But generating replacement characters as a matter of course is a Really Bad Idea® IMHO.

Again I want to stress that this is more of a cultural issue than a technical one.  Unicode is a mess because there is no universally accepted standard for embedding encoding metadata in a byte stream (no, the BOM does not qualify).  So the easiest way to avoid the confusion that arises from multiple mutually incompatible encodings is to not use them and just get everyone to always use the same encoding, and that encoding should be UTF-8 because it's the only commonly used encoding that is both complete and backwards-compatible with ASCII.
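The two properties claimed for UTF-8 can be checked directly. This is a small Python sketch, with sample strings chosen purely for illustration: ASCII-only text has byte-for-byte the same representation in UTF-8, and a code point well outside Latin-1 still round-trips.

```python
text = "plain ASCII text"

# Backwards compatibility: ASCII-only text yields identical bytes in UTF-8.
same_bytes = text.encode("ascii") == text.encode("utf-8")

# By contrast, UTF-16 changes the bytes of ASCII-only text.
utf16_differs = text.encode("utf-16-le") != text.encode("ascii")

# Completeness: a code point outside the Basic Multilingual Plane
# encodes to four bytes in UTF-8 and decodes back unchanged.
emoji = "\U0001F600"
encoded = emoji.encode("utf-8")
round_trips = encoded.decode("utf-8") == emoji

# A legacy single-byte encoding such as Latin-1 cannot represent it at all.
try:
    emoji.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(same_bytes, utf16_differs, len(encoded), round_trips, latin1_ok)
```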

rg



