[Openmcl-devel] A plug for UTF-8

Thu Sep 3 14:49:47 PDT 2009

I generally agree that using UTF-8 would be a good thing.  The
counterargument (and I think that this is most of the reason for
defaulting to ISO-8859-1) has to do with the situation that you
ran into the other day.

A sequence of 8-bit bytes (octets) in a stream may or may not
be "well-formed UTF-8" (there are constraints on when octets can
have high bits set in a valid UTF-8 sequence and how many high
bits must be set, among other things.)  I believe that some web
browsers have been vulnerable to malformed UTF-8 in the past (e.g.,
visit a page containing malformed UTF-8 text one day and find that
someone on the other side of the world is using your credit card
the next.)
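
A minimal sketch of the well-formedness constraints, assuming Python's
built-in strict UTF-8 decoder as the judge. The helper name
is_well_formed_utf8 and the sample octets are hypothetical, chosen only
for illustration; nothing here is CCL code.

```python
def is_well_formed_utf8(octets: bytes) -> bool:
    """Return True if the octets decode under Python's strict UTF-8 codec."""
    try:
        octets.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical samples, each exercising one constraint.
two_octet_lambda = b"\xce\xbb"     # lead octet + continuation octet: fine
lead_then_quote = b"\xce\x22"      # lead octet followed by an ASCII quote
stray_continuation = b"\xbb"       # continuation octet with no lead octet
overlong_slash = b"\xc0\xaf"       # overlong form of "/", always invalid

for sample in (two_octet_lambda, lead_then_quote,
               stray_continuation, overlong_slash):
    print(sample, is_well_formed_utf8(sample))
```

Overlong forms like the last sample are the kind of input that has been
used to slip characters past checks that only looked at raw octets.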

Most decoding utilities do a better job of at least recognizing
that a stream isn't in valid UTF-8 format; there's probably a
lot of variance in behavior once that's determined.  The Cocoa
text system seems to just report that the file isn't valid, which
isn't entirely unreasonable.  CCL's own UTF-8 decoding code will
generate a #\Replacement_Character if it discovers an invalid
sequence (hopefully in all cases), but it does so after reading
a byte that isn't a valid part of a UTF-8 sequence.  In the case
of the file that Glen Foy committed a few days ago, an octet
that could have started a UTF-8 sequence was followed by a #x22
(a #\" character); we concluded that the double-quote made for
an invalid UTF-8 sequence but basically didn't read the #\" and
eventually got a premature EOF in what we thought was the middle
of a string.  That's a silly way to lose, but I think that it's
hard to guarantee that you don't lose somehow in cases like this.
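
The failure mode above can be sketched as a toy decoder with a switch
for what happens to the octet that reveals the error. This is not CCL's
decoder; decode_sketch, swallow_offender and the sample input are all
hypothetical, and the sketch skips checks a real decoder needs (for
example, overlong three- and four-octet forms).

```python
REPLACEMENT = "\ufffd"


def expected_length(lead: int) -> int:
    """Sequence length implied by a lead octet, or 0 if it cannot start one."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def decode_sketch(octets: bytes, swallow_offender: bool) -> str:
    """Decode octets, emitting U+FFFD for each invalid sequence.

    If swallow_offender is true, the octet that exposed the error is
    consumed along with the bad sequence; otherwise it is decoded again
    as the possible start of a new sequence.
    """
    out = []
    i = 0
    while i < len(octets):
        lead = octets[i]
        n = expected_length(lead)
        i += 1
        if n == 0:
            out.append(REPLACEMENT)
            continue
        if n == 1:
            out.append(chr(lead))
            continue
        code = lead & (0xFF >> (n + 1))   # payload bits of the lead octet
        ok = True
        for _ in range(n - 1):
            if i >= len(octets):          # input ended mid-sequence
                ok = False
                break
            b = octets[i]
            if b & 0xC0 != 0x80:          # not a continuation octet
                ok = False
                if swallow_offender:
                    i += 1                # the offending octet is lost
                break
            code = (code << 6) | (b & 0x3F)
            i += 1
        if ok and (code > 0x10FFFF or 0xD800 <= code <= 0xDFFF):
            ok = False
        out.append(chr(code) if ok else REPLACEMENT)
    return "".join(out)


# Hypothetical input: a quoted string whose last octet before the
# closing quote is #xC9, which UTF-8 treats as a two-octet lead.
sample = b'"Open\xc9" x'
swallowed = decode_sketch(sample, swallow_offender=True)
resynced = decode_sketch(sample, swallow_offender=False)
```

With swallow_offender true the closing #\" disappears, so a reader
downstream still thinks it is inside a string; with it false the quote
survives and only the bad octet is replaced.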

Glen's file seemed to actually be encoded in Macintosh (aka MacRoman).
If you mistakenly try to interpret MacRoman (or UTF-8, or lots of
other encodings) as ISO-8859-1, you get some of the characters wrong
(the ellipsis looks like an acute-accented #\E, etc.) but don't run
the risk of getting out-of-synch and confused.
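
A small illustration of the ellipsis case, assuming Python's "mac_roman"
and "iso-8859-1" codecs match the encodings discussed here. The variable
names are hypothetical.

```python
# The horizontal ellipsis (U+2026) is a single octet in MacRoman.
ellipsis_octets = "\u2026".encode("mac_roman")

# Read back with the right encoding, then with the wrong one.
as_macroman = ellipsis_octets.decode("mac_roman")
as_latin1 = ellipsis_octets.decode("iso-8859-1")

print(ellipsis_octets.hex(), repr(as_macroman), repr(as_latin1))
```

The wrong guess yields one wrong character per octet, and every octet
that follows is still decoded on its own.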

The reasons for having the default be ISO-8859-1 have more to do with
ISO-8859-1's neutrality (there's no such thing as malformed ISO-8859-1
and therefore aren't any recovery issues) than with a belief that it's
a particularly useful encoding, and having it as the default might
encourage people to actually -use- it to encode (some) non-ASCII characters
in files.  (If the only real argument against UTF-8 is that some files
use legacy encodings that can confuse UTF-8 parsing, it'd probably be
wise for CCL to stop defaulting to a legacy encoding that can confuse
UTF-8 parsing.)
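
The "no such thing as malformed ISO-8859-1" point can be checked
directly. This is a sketch assuming Python's "iso-8859-1" codec; the
names are hypothetical.

```python
# Every possible octet value decodes, each to the code point of the
# same number, and encodes back to the identical octets.
all_octets = bytes(range(256))
as_text = all_octets.decode("iso-8859-1")
round_tripped = as_text.encode("iso-8859-1")

# UTF-8 data misread as ISO-8859-1 looks wrong but loses nothing:
# re-encoding recovers the original octets, which then decode properly.
utf8_lambda = "\u03bb".encode("utf-8")
misread = utf8_lambda.decode("iso-8859-1")
recovered = misread.encode("iso-8859-1").decode("utf-8")
```

The cost of that neutrality is silence: the misread text is wrong, and
nothing ever signals it.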

I think that I just talked myself into agreeing that the advantages of
switching to UTF-8 outweigh the disadvantages.

On Tue, 1 Sep 2009, Ron Garret wrote:

> I would like to take a moment to lobby on behalf of UTF-8.  This is 
> not a huge big deal because it's easy enough to convert from one 
> encoding to another once you know how, but I think it would be a nice 
> selling point for CCL if things tended to Just Work, and one way to 
> make them Just Work is to have an encoding convention that is 
> universally followed so that newcomers can set it and forget it.  The 
> reason I think UTF-8 is a better choice than, say, Latin-1 is that 
> UTF-8 gives you access to the entire unicode code space, and in 
> particular the lower-case Greek lambda character (λ) and European- 
> style «quotation marks» which are self-balancing and hence let you 
> build nested strings without the need for backslash escapes.
>
> Thank you for your indulgence during this commercial break.  You may 
> now return to your regularly scheduled programming.
>
> rg