[Openmcl-devel] *default-character-encoding* should be :utf-8

Raymond Wiker rwiker at gmail.com
Tue Mar 6 02:47:25 PST 2012


On Tue, Mar 6, 2012 at 9:29 AM, Ron Garret <ron at flownet.com> wrote:
>
> On Mar 5, 2012, at 8:30 PM, Gary Byers wrote:
>
>> If a stream that isn't really UTF-8 is treated as UTF-8 (and if it
>> contains octets >= #x80 that don't accidentally form valid UTF-8
>> sequences) the result will contain replacement characters with no
>> indication of what octets they were derived from.
>
> That's not UTF-8 losing information, that's an error handling strategy losing information.
>
> There are a very few circumstances under which dealing with UTF-8 decoding errors by generating a replacement character is OK (e.g. if not doing so would make it impossible to produce an error message).  But generating replacement characters as a matter of course is a Really Bad Idea® IMHO.

You don't really have an option. If you're trying to read a file
under the assumption that it is UTF-8, there are certain octet values
that are simply invalid (#xC0, #xC1 and #xF5-#xFF never occur, and
#x80-#xBF are valid only as continuation octets). At that point, you
can either drop or replace those octets, or drop your
assumption that it is UTF-8.
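
To make the loss concrete, here is a minimal sketch in Python (not
CCL; the sample strings are made up for illustration). Two different
ISO-8859-1 inputs decode to the same text once invalid octets are
replaced, so the original octets cannot be recovered afterwards:

```python
# Hypothetical sample: "café" encoded as ISO-8859-1, where é is the
# single octet 0xE9. That octet is not a complete UTF-8 sequence here.
data = "café".encode("iso-8859-1")

# Strict decoding refuses the input outright.
try:
    data.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Replacement decoding succeeds, but substitutes U+FFFD for the octet.
replaced = data.decode("utf-8", errors="replace")

# A different input (ø, octet 0xF8) yields exactly the same result,
# so the replacement character says nothing about the original octet.
other = "caf\xf8".encode("iso-8859-1").decode("utf-8", errors="replace")

print(strict_ok, repr(replaced), replaced == other)
```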

On the other hand, if you assume that the file is in an 8-bit encoding
(like ISO-8859-X), then all octet values are valid, and you do not
lose any information. If your assumption of the encoding is wrong, and
you figure out what it should have been, you can then convert whatever
you have into what it should have been. This is not possible if you
started with the assumption that it was UTF-8.
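
A rough sketch of that recovery path, again in Python rather than CCL
and with a made-up sample string. It assumes ISO-8859-1 specifically,
since Python's codec for it maps every octet 0-255 to a character;
other 8-bit encodings may leave some octets undefined:

```python
# Octets that are really UTF-8, although we do not know that yet.
original = "naïve café".encode("utf-8")

# Wrong guess: read them as ISO-8859-1. The text looks garbled, but
# every octet was mapped to exactly one character, so nothing is lost.
guessed = original.decode("iso-8859-1")

# Once the real encoding is known, undo the guess and decode properly.
recovered_octets = guessed.encode("iso-8859-1")
text = recovered_octets.decode("utf-8")

print(recovered_octets == original, text)
```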

> Again I want to stress that this is more of a cultural issue than a technical one.  Unicode is a mess because there is no universally accepted standard of embedding encoding metadata in a byte stream (no, the BOM does not qualify).  So the easiest way to avoid the confusion that arises from multiple mutually incompatible encodings is to not use them and just get everyone to always use the same encoding, and that encoding should be UTF-8 because it's the only commonly used encoding that is both complete and backwards-compatible with ascii.

Unicode is not a mess. The problem is that there is no 100% guaranteed
way of detecting the file encoding. The BOM *does* solve this
particular problem for files that have been encoded as UTF-8, UTF-16BE
and UTF-16LE.
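
As an illustration only, BOM sniffing for those three encodings could
look like the following Python sketch. The function name sniff_bom is
made up; the BOM constants come from Python's standard codecs module.
A file without a BOM simply stays undetected:

```python
import codecs

# Known BOMs for the three encodings mentioned above. UTF-32 is left
# out on purpose; its little-endian BOM begins with the UTF-16LE one,
# so a fuller detector would have to check UTF-32 first.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)

def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

sample = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
encoding, skip = sniff_bom(sample)
print(encoding, sample[skip:].decode(encoding))
```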


