[Openmcl-devel] how many angels can dance on a unicode character?
Hamilton Link
hamlink at comcast.net
Fri Apr 20 08:59:43 PDT 2007
I think the arguments that Mac OS X and Windows both use UTF16, that
UTF16 is the "recommended" format, and that you'll see such documents
and library interfaces far more often than UTF32 ones, are all fairly
compelling reasons for UTF16 or Unicode (UTF16-LE, right?) to be the
native format of openmcl.
The question is, how much does openmcl do for you if UTF16 is the
default format?
1 - being able to process UTF16 and access it at the whole-character
level means... the system has to do the work to figure out what the
valid character sequences are and retrieve them, and it has to do more
computation to ensure that strings are valid sequences of code points.
This is a little easier in UTF32, where you only care about a few
special cases, but you can still end up with linear-time operations.
2 - being able to use schar on 16-bit code points without extra fuss in
constant time means... the burden is on the user to handle code
points, recognize when they aren't whole characters, and accept the
blame when malformed strings are passed around, and openmcl does zero
work to help you (and thus also zero work interfering with you, if you
really want a length-1 string with a single Unicode surrogate code
point in it, which I can imagine happening).
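
To make the trade-off between the two options concrete, here is a
minimal Python sketch (not openmcl code). It models a string as a plain
list of 16-bit values, which the Unicode standard calls code units. The
helper names utf16_units, unit_at, and code_point_at are made up for
illustration: unit_at stands in for what schar would do under option 2,
and code_point_at for the whole-character access of option 1.

```python
import struct

def utf16_units(text):
    """Encode text as a list of 16-bit UTF-16 code units.

    "surrogatepass" lets a lone surrogate through unchanged, so a
    length-1 string holding one surrogate is representable.
    """
    data = text.encode("utf-16-le", "surrogatepass")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def is_high_surrogate(u):
    return 0xD800 <= u <= 0xDBFF

def is_low_surrogate(u):
    return 0xDC00 <= u <= 0xDFFF

def unit_at(units, i):
    """Option 2: constant-time access to the i-th 16-bit unit, unchecked."""
    return units[i]

def code_point_at(units, n):
    """Option 1: return the n-th whole code point.

    Surrogate pairs make the position of the n-th code point depend on
    everything before it, so this walks from the start (linear time).
    """
    i = 0
    count = 0
    while i < len(units):
        u = units[i]
        if (is_high_surrogate(u) and i + 1 < len(units)
                and is_low_surrogate(units[i + 1])):
            # Combine a well-formed surrogate pair into one code point.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            width = 2
        else:
            # BMP value, or an unpaired surrogate passed through as-is.
            cp = u
            width = 1
        if count == n:
            return cp
        count += 1
        i += width
    raise IndexError(n)
```

For the three characters "a", U+1D11E, "b" the list holds four units.
unit_at(units, 1) hands back the lone high surrogate 0xD834 right away,
while code_point_at(units, 1) returns 0x1D11E after scanning from the
start of the list.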
I was originally a proponent of UTF32, but since then I've come around
to #2 (based on the rest of the world primarily using UTF16 when they
use UTF at all, and my general feeling that the system shouldn't do too
much hand-holding when I'm working with raw data). So I vote for #2:
let me allocate strings with element-types UTF8, UTF16, UTF16LE,
UTF16BE, etc. ... up to UTF32, give me character-related functions that
actually are code-point functions, make Unicode be the default, and
let's be done with it. Then I'll always know what is really in a
string, and I'll always know how long it takes to get a character (code
point). If I care about having a whole-character view of things, I can
write a couple of extra functions; if I try to print something, I should
probably take some care not to print garbage; and when I put values
into a string, I should know what I'm doing and not put in invalid code
points or code points in invalid orders.
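
As a rough idea of what that "couple of extra functions" might look
like, here is a hedged Python sketch operating on a list of 16-bit
values (code units, in the Unicode standard's terms). The names
is_well_formed_utf16 and code_points are hypothetical, not part of
openmcl or any library; the only rule assumed is the standard UTF16
one that a high surrogate (D800-DBFF) must be immediately followed by
a low surrogate (DC00-DFFF).

```python
def is_well_formed_utf16(units):
    """Return True if every surrogate in units is part of a proper pair."""
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= n or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            return False
        else:
            i += 1
    return True

def code_points(units):
    """Yield whole code points, raising ValueError on a stray surrogate."""
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            if i + 1 >= n or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                raise ValueError("unpaired high surrogate at index %d" % i)
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            raise ValueError("unpaired low surrogate at index %d" % i)
        else:
            yield u
            i += 1
```

A caller who cares about the whole-character view would run the check
(or iterate with code_points) before printing; a caller who only wants
raw 16-bit access never pays for either.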
ttyl,
hamilton