[Openmcl-devel] how many angels can dance on a unicode character?

Takehiko Abe keke at gol.com
Tue Apr 24 06:47:51 PDT 2007


Hamilton Link wrote:

> - given UTF32 support, it's largely unnecessary (strings Just Work
> and can contain whatever you like, characters are always
> semantically what you think of as a character,

That depends on who "you" are. Unicode characters differ from
what many of us think of as characters. For instance, what I
thought was a single Hangul (Korean) character turned out to be a
cluster of characters -- that kind of thing. The Unicode
standard has terms such as "grapheme cluster" and
"user-perceived character" for "what users think of as characters".

See UAX #29, Text Boundaries:
<http://www.unicode.org/reports/tr29/>

| One or more Unicode characters may make up what the user thinks
| of as a character or basic unit of the language.

I believe we all need to work at the level of "what users think of
as characters" if we are to modify Unicode strings.
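To make the Hangul case concrete, here is a small sketch in Python
(not Lisp, and not anything OpenMCL provides). It assumes the text
arrives in decomposed (NFD) form, uses only the standard
`unicodedata` module, and the variable names are mine. It shows one
user-perceived character occupying three code points, and what
happens when a string is cut at a code point boundary instead of a
cluster boundary.

```python
import unicodedata

# One precomposed Hangul syllable: a single code point (U+D55C).
syllable = "\ud55c"

# The same syllable in canonically decomposed form (NFD): three
# conjoining jamo, U+1112 U+1161 U+11AB, that users still read as
# one character.
decomposed = unicodedata.normalize("NFD", syllable)

code_point_count = len(decomposed)  # 3 code points, 1 user-perceived character

# Truncating at a code point boundary drops the final consonant and
# silently produces a different syllable (U+D558) instead of an error.
truncated = unicodedata.normalize("NFC", decomposed[:2])
```

Note that the standard library has no grapheme cluster iterator, so
this only demonstrates the mismatch; actual segmentation would need
an implementation of the UAX #29 rules.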

> and code for converting a string of base-characters to an array of
> unsigned 16-bit integers isn't bad when doing FFI stuff to libraries
> that expect UTF16)

Maybe it is not as bad as I thought. But there is overhead:
you need to scan the string twice, once to determine how many
bytes to allocate and once to copy the individual characters,
and you also need to be prepared for an FFI call returning a
malformed result. Neither is necessary if the formats are the
same -- you can allocate length (x 2 in the case of UTF-16) bytes
and blit the data without examining the contents.
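The overhead I mean can be sketched as follows. This is Python
rather than Lisp, and an illustration under my own assumptions, not
how OpenMCL does it: the function names are hypothetical, and I
assume the foreign side wants an array of unsigned 16-bit code
units. The first function needs two passes over the string; the
second has to validate what came back, because a foreign call may
return unpaired surrogates.

```python
from array import array

def utf16_units_needed(s):
    """Pass 1: count 16-bit code units; code points above U+FFFF need two."""
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in s)

def to_utf16_units(s):
    """Convert a string of 32-bit characters to UTF-16 code units."""
    out = array("H", [0]) * utf16_units_needed(s)  # allocate after pass 1
    i = 0
    for ch in s:  # pass 2: copy, splitting into surrogate pairs as needed
        cp = ord(ch)
        if cp > 0xFFFF:
            cp -= 0x10000
            out[i] = 0xD800 | (cp >> 10)
            out[i + 1] = 0xDC00 | (cp & 0x3FF)
            i += 2
        else:
            out[i] = cp
            i += 1
    return out

def from_utf16_units(units):
    """Decode code units from a foreign call, rejecting malformed input."""
    chars = []
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            if i + 1 >= n or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                raise ValueError(f"unpaired high surrogate at index {i}")
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            chars.append(chr(cp))
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            raise ValueError(f"unpaired low surrogate at index {i}")
        else:
            chars.append(chr(u))
            i += 1
    return "".join(chars)
```

When both sides use the same encoding, none of this is needed: the
buffer size follows directly from the string length and the data can
be copied as is.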

> - if someone wants it that badly, they can try implementing it
> themselves, GB's time is better spent elsewhere

You are right.

regards,
T.

--
"Abandon normal instruments."



