[Openmcl-devel] Plans for Unicode support within OpenMCL?

Andrew Lentvorski bsder at allcaps.org
Sat Apr 1 13:49:44 PST 2006


Takehiko Abe wrote:

 > If having multiple string types is not desirable, I think a UTF-16
 > string is a good compromise. The 4-fold increase in string size
 > is too much.

I disagree, personally.  UTF-32 doesn't really bother me.  It doesn't
completely solve the problem, though: indexing into a fully compliant
Unicode string (one that respects combining characters) is still O(n)
in the worst case.  That said, I have seen some pretty smart speed
optimizations in languages with boxed types (i.e. record the fact that
a string is actually ASCII-only, BMP-only, or free of combining
characters, and switch the internal encoding accordingly).  The catch
is that you need at least one O(n) scan over the string to set up the
optimization.
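
To make the boxed-type trick concrete, here is a rough sketch in C.
All the names are hypothetical and none of this reflects OpenMCL's
actual string representation: a one-time O(n) scan classifies the
string, ASCII-only strings then get O(1) indexing, and everything
else falls back to a linear walk.

#include <stddef.h>

typedef enum { ENC_UNKNOWN, ENC_ASCII, ENC_MULTIBYTE } str_class;

typedef struct {
    const unsigned char *bytes;  /* valid, NUL-terminated UTF-8 */
    size_t nbytes;               /* length in bytes, excluding NUL */
    str_class tag;               /* cached result of the O(n) scan */
} lstring;

/* One-time O(n) scan: is every byte plain ASCII? */
static void classify(lstring *s) {
    s->tag = ENC_ASCII;
    for (size_t i = 0; i < s->nbytes; i++)
        if (s->bytes[i] & 0x80) { s->tag = ENC_MULTIBYTE; return; }
}

/* Decode the code point starting at a UTF-8 lead byte. */
static unsigned long decode(const unsigned char *p) {
    if (p[0] < 0x80) return p[0];
    if ((p[0] & 0xE0) == 0xC0)
        return ((p[0] & 0x1FUL) << 6) | (p[1] & 0x3F);
    if ((p[0] & 0xF0) == 0xE0)
        return ((p[0] & 0x0FUL) << 12) | ((p[1] & 0x3FUL) << 6)
             | (p[2] & 0x3F);
    return ((p[0] & 0x07UL) << 18) | ((p[1] & 0x3FUL) << 12)
         | ((p[2] & 0x3FUL) << 6) | (p[3] & 0x3F);
}

/* Code point at index i: O(1) on the ASCII fast path, O(n)
   otherwise.  Bounds checking omitted for brevity. */
static unsigned long char_at(lstring *s, size_t i) {
    if (s->tag == ENC_UNKNOWN) classify(s);  /* pay the scan once */
    if (s->tag == ENC_ASCII) return s->bytes[i];
    size_t off = 0;
    while (i-- > 0) {                        /* skip i code points */
        off++;
        while ((s->bytes[off] & 0xC0) == 0x80) off++;
    }
    return decode(s->bytes + off);
}

The same tag could just as well carry BMP-only or
no-combining-characters bits; the structure doesn't change.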

The big advantage that everybody forgets about with UTF-8 is that all
the nice low-level, null-termination-expecting functions (strcpy(),
strncpy(), etc.) work just fine with UTF-8.  That's not true for any
other Unicode encoding, since UTF-16 and UTF-32 put zero bytes inside
ordinary characters.  UTF-8 also has no endianness confusion.
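
A tiny standalone illustration (plain C, nothing OpenMCL-specific):
the only byte that is 0x00 in UTF-8 is the encoding of U+0000 itself,
so the terminator can never land mid-character and strcpy() copies
multibyte text intact.

#include <stdio.h>
#include <string.h>

int main(void) {
    /* "a" + combining umlaut (CC 88) + greek alpha (CE B1), UTF-8 */
    const char *src = "a\xCC\x88\xCE\xB1";
    char dst[16];

    strcpy(dst, src);  /* byte-for-byte copy; still valid UTF-8 */
    printf("%zu bytes, intact: %s\n",
           strlen(dst), strcmp(src, dst) == 0 ? "yes" : "no");
    return 0;          /* prints "5 bytes, intact: yes" */
}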

UTF-8 might be the easiest route to making OpenMCL Unicode compliant,
since there is no need to upgrade the entire system all at once.
Declare the system UTF-8 compliant, touch some of the basic text
input/output functions to handle the encoding, create a Unicode reader
macro, and *bingo*, instant Unicode compliance.

This does, of course, gloss over things like the performance of
non-ASCII characters, indexing, combining characters, etc.  However,
everything runs at roughly the same speed as before, and you get 80%+
of the Unicode compliance you need.  Then you can chop off the rough
edges as people hit them, rather than trying to fix them all at once.

 > Unicode has combining characters and covers lots of scripts/writing
 > systems. Handling them is inherently hard, and having characters
 > with direct Unicode codepoints does not make it much easier, imo.

Yup.  There is really no way around the fact that some Unicode string 
operations are O(n).  Once Unicode decided that combining characters 
were a good idea, that die was cast.

Unicode strings are *not* vectors/arrays in spite of the fact that they 
"almost" are.  We all need to just suck it up and get over it.

-a

Quoting the Unicode FAQ on characters and combining marks:
http://www.unicode.org/faq/char_combmark.html

Q: How should characters (particularly composite characters) be counted, 
for the purposes of length, substrings, positions in a string, etc.

A: In general, there are 3 different ways to count characters. Each is 
illustrated with the following sample string.
"a" + umlaut + greek_alpha + \uE0000.
(the latter is a private use character)

1. Code Units: e.g. how many bytes are in the physical representation of 
the string. Example:
In UTF-8, the sample has 9 bytes. [61 CC 88 CE B1 F3 A0 80 80]
In UTF-16BE, it has 10 bytes. [00 61 03 08 03 B1 DB 40 DC 00]
In UTF-32BE, it has 16 bytes. [00 00 00 61 00 00 03 08 00 00 03 B1 00 0E 00 00]

2. Codepoints: how many code points are in the string.
The sample has 4 code points. This is equivalent to the UTF-32BE byte
count divided by 4.

3. Graphemes: what end-users consider as characters.
A default grapheme cluster is specified by the Unicode Standard 4.0, and 
is also in UTR #18 Regular Expressions at 
http://www.unicode.org/reports/tr18/.

The choice of which one to use depends on the tradeoffs between 
efficiency and comprehension. For example, Java, Windows and ICU use #1 
with UTF-16 for all low-level string operations, and then also supply 
layers above that provide for #2 and #3 boundaries when circumstances 
require them. This approach allows for efficient processing, with 
allowance for higher-level usage. However, for a very high level 
application, such as word-processing macros, graphemes alone will 
probably be sufficient. [MD]
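
For what it's worth, the first two counts are easy to verify
mechanically.  The following small C program (mine, not from the FAQ)
reproduces them for the sample string; graphemes would require the
full default grapheme-cluster rules, so I haven't attempted those.

#include <stdio.h>
#include <string.h>

int main(void) {
    /* "a" + U+0308 + U+03B1 + U+E0000, the FAQ's sample, in UTF-8 */
    const char *s = "a\xCC\x88\xCE\xB1\xF3\xA0\x80\x80";

    size_t units = strlen(s);   /* UTF-8 code units == bytes */
    size_t points = 0;
    for (const char *p = s; *p; p++)
        if (((unsigned char)*p & 0xC0) != 0x80)  /* lead bytes only */
            points++;

    printf("code units: %zu, code points: %zu\n", units, points);
    return 0;  /* prints "code units: 9, code points: 4" */
}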
