[Openmcl-devel] how many angels can dance on a unicode character?

Fri Apr 20 08:59:43 PDT 2007

I think the points that Mac OS X and Windows both use UTF16, that 
UTF16 is the "recommended" format, and that you'll see such documents 
and library interfaces far more often than UTF32 ones, are all fairly 
compelling arguments in favor of UTF16 or Unicode (UTF16-LE, right?) 
being the native format of openmcl.

The question is, how much does openmcl do for you if UTF16 is the 
default format?

1 - being able to process UTF16 and access it at the whole-character 
level means... the system has to do the work to figure out what the 
valid character sequences are and retrieve them, and it has to do more 
computation to ensure that strings are valid sequences of code points.  
This is a little easier in UTF32, where you only care about a few 
special cases, but it can still end up needing linear-time operations.

2 - being able to use schar on 16-bit code points without extra fuss in 
constant-access time means... the burden is on the user to handle code 
points, recognize when they aren't whole characters, and accept the 
blame when malformed strings are passed around, and openmcl does zero 
work to help you (and thus also zero work interfering with you, if you 
really want a length-1 string with a single Unicode surrogate code 
point in it, which I can imagine happening).
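
To make the two options concrete, here is a rough sketch in Python (not 
openmcl code; the function names are made up for illustration).  A list 
of 16-bit integers stands in for a UTF16 string.  What I've been calling 
16-bit code points are what the Unicode standard calls code units.  
Option 2 is a plain index, while option 1 has to walk the string from 
the start to step over surrogate pairs.

```python
import struct

def utf16_units(text):
    """Return the UTF16 code units of text as a list of 16-bit ints.

    Hypothetical helper: the list stands in for a 16-bit string.
    """
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def unit_at(units, i):
    """Option 2: constant-time access, schar-style.

    May hand back one half of a surrogate pair; the caller has to cope.
    """
    return units[i]

def code_point_at(units, n):
    """Option 1: return the n-th whole code point.

    Walks from the start, so the cost is linear in the string length.
    An unpaired surrogate is returned as-is rather than rejected.
    """
    i = 0
    while i < len(units):
        u = units[i]
        is_pair = (0xD800 <= u <= 0xDBFF
                   and i + 1 < len(units)
                   and 0xDC00 <= units[i + 1] <= 0xDFFF)
        if is_pair:
            # Combine high and low surrogate into one code point.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            step = 2
        else:
            cp = u
            step = 1
        if n == 0:
            return cp
        n -= 1
        i += step
    raise IndexError("code point index out of range")
```

For the string "a", U+1D11E, "b" this gives four 16-bit elements.  
unit_at at index 1 returns the high surrogate #xD834 in one step, while 
code_point_at at index 1 returns #x1D11E after scanning from the front.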

I was originally a proponent of UTF32, but since then I've come around 
to #2 (based on the rest of the world primarily using UTF16 when they 
use UTF at all, and my general feeling that the system shouldn't do too 
much hand-holding when I'm working with raw data).  So I vote for #2: 
let me allocate strings with element-types UTF8, UTF16, UTF16LE, 
UTF16BE, etc. ... up to UTF32, give me character-related functions that 
actually are code-point functions, make Unicode be the default, and 
let's be done with it.  Then I'll always know what is really in a 
string, and I'll always know how long it takes to get a character (code 
point).  If I care about having a whole-character view of things I can 
write a couple of extra functions, if I try to print something I should 
probably take some care to not print garbage, and when I put values 
into a string I should know what I'm doing and not put in invalid code 
points or code points in invalid orders.
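
The "couple of extra functions" could look roughly like this.  Again 
this is a Python sketch with made-up names, not anything openmcl 
provides, and it assumes a string is just a list of 16-bit integers.  
One function checks that surrogates only appear as a high/low pair in 
the right order, the other counts whole code points.

```python
def is_well_formed_utf16(units):
    """Return True if every surrogate is part of a high+low pair, in order."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            return False
        else:
            i += 1
    return True

def code_point_length(units):
    """Count whole code points; assumes the input is well formed.

    Every code point contributes exactly one element that is not a
    low surrogate, so counting those is enough.
    """
    return sum(1 for u in units if not (0xDC00 <= u <= 0xDFFF))
```

Both are linear scans, which is the cost I'd rather pay only when I ask 
for it.  The length-1 string holding a lone surrogate is still allowed 
to exist; is_well_formed_utf16 just reports it as malformed.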

ttyl,
hamilton