[Openmcl-devel] The case for UTF-16

Thu Apr 19 21:12:50 PDT 2007

About a year ago (2006-03-21) I advocated the use of UTF-16 for the
OpenMCL string implementation. I still believe that it is a better
choice than UTF-32.

The advantages of UTF-16 over UTF-32 are:

  A. It requires 50% less memory (for text within the BMP).
  B. Other libraries of interest use UTF-16,
     e.g. ICU and Mac OS X.
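
To make point A concrete, here is a small sketch in Python (used only
for illustration; it is not OpenMCL code, and the helper name
encoded_sizes is made up for this example). It compares the number of
bytes the two encoding forms need for the same text.

```python
# Illustration only: storage needed by UTF-16 and UTF-32 for the same text.
def encoded_sizes(text):
    """Return (utf16_bytes, utf32_bytes) for text, without a byte order mark."""
    return len(text.encode("utf-16-le")), len(text.encode("utf-32-le"))

# 8 characters, all inside the BMP: "Lisp " followed by three CJK characters.
bmp_text = "Lisp \u65e5\u672c\u8a9e"
# 'a' followed by U+1D11E, which lies outside the BMP and needs a surrogate pair.
mixed_text = "a\U0001D11E"

print(encoded_sizes(bmp_text))    # UTF-16 needs half the bytes of UTF-32
print(encoded_sizes(mixed_text))  # the saving shrinks once surrogate pairs appear
```

The 50% figure holds exactly as long as every character is in the BMP;
each character outside the BMP costs 4 bytes in either form.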

The drawback is that with UTF-16 we need to treat surrogates as
legitimate Lisp character objects (e.g. #\U+D800). This is not clean,
but I think the harm it can cause is small and can be avoided with
relative ease. (worse is better...?)
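
For readers who have not worked with surrogates: a code point above
U+FFFF is stored in UTF-16 as two 16-bit code units, a lead surrogate
followed by a trail surrogate. The sketch below (Python, for
illustration only; the function names are hypothetical) shows the
standard arithmetic for splitting and rejoining such a pair.

```python
# Illustration only: UTF-16 surrogate pair arithmetic.
def to_surrogates(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into (lead, trail)."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000             # 20 significant bits
    lead = 0xD800 + (offset >> 10)            # high 10 bits
    trail = 0xDC00 + (offset & 0x3FF)         # low 10 bits
    return lead, trail

def from_surrogates(lead, trail):
    """Combine a (lead, trail) surrogate pair back into one code point."""
    if not (0xD800 <= lead <= 0xDBFF and 0xDC00 <= trail <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)

lead, trail = to_surrogates(0x1D11E)
print(hex(lead), hex(trail))
print(hex(from_surrogates(lead, trail)))
```

With UTF-16 strings, each of the two code units would be visible as a
separate character object, which is the uncleanliness mentioned above.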

Now let me refer to sources with more authority than myself.

Unicode.org practically recommends UTF-16.

1. The Unicode Standard 4.0, Section 2.5 Encoding Forms:
  <http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf>

| The performance of UTF-32 as a processing code may actually be
| worse than UTF-16 for the same data, because the additional
| memory overhead means that cache limits will be exceeded more
| often and memory paging will occur more frequently. For systems
| with processor designs that have penalties for 16-bit aligned
| access, but with very large memories, this effect may be less.
| 
| In any event, Unicode code points do not necessarily match user
| expectations for "characters." For example, the following are
| not represented by a single code point: a combining character
| sequence such as <g, acute>; a conjoining jamo sequence for
| Korean; or the Devanagari conjunct "ksha." Because some Unicode
| text processing must be aware of and handle such sequences of
| characters as text elements, the fixed-width encoding form
| advantage of UTF-32 is somewhat offset by the inherently
| variable-width nature of processing text elements.

 [Q] Regarding "this effect may be less" above: does 64-bit
 hardware change the picture further in favor of UTF-32?
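
The second quoted paragraph can be checked directly. The sketch below
(Python, for illustration only; utf32_units is a made-up helper) counts
the 32-bit code units needed for two of the text elements named in the
quote, written as the code point sequences U+0067 U+0301 and
U+0915 U+094D U+0937.

```python
# Illustration only: one text element can span several UTF-32 code units.
def utf32_units(text):
    """Number of 32-bit code units text occupies in UTF-32 (no byte order mark)."""
    return len(text.encode("utf-32-le")) // 4

g_acute = "g\u0301"            # <g, combining acute accent>
ksha = "\u0915\u094d\u0937"    # Devanagari KA + VIRAMA + SSA ("ksha")

print(utf32_units(g_acute))
print(utf32_units(ksha))
```

Neither count is 1, so code that works on text elements has to handle
variable-width sequences even when the storage is UTF-32.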

2. Unicode Technical Note #12 UTF-16 for Processing
  <http://www.unicode.org/notes/tn12/>

| Summary
| 
| This document attempts to make the case that it is advantageous
| to use UTF-16 (or 16-bit Unicode strings) for text processing.
| It is most important to use Unicode rather than older
| approaches to text encoding, but beyond that it simplifies
| software development even further to use the same internal form
| for text representation everywhere. UTF-16 is already the
| dominant processing form and therefore provides advantages.

 This technical note lists software that uses UTF-16. Of these, I think
 ICU is of particular interest because it provides normalization
 support and other features that are hard to implement.

 If OpenMCL strings are in UTF-16, using ICU will be more efficient and
 easier, in my opinion.

| From a programming point of view it reduces the need for error
| handling that there are no invalid 16-bit words in 16-bit
| Unicode strings. By contrast, there are code unit values that
| are invalid in 8/32-bit Unicode strings. 

 UTF-32 is an encoding with holes in it (the surrogate range and
 every value above #x10FFFF are invalid). I doubt that we can
 afford to have such holes.

 OpenMCL currently does not allow surrogate code points as
 characters, e.g. #\U+D800 or (code-char #xD800). However, Mac OS X
 lets users type in surrogate values directly (through the Unicode Hex
 Input method).
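
The difference between the two forms can be stated as two predicates.
This is a sketch in Python for illustration only (the function names
are hypothetical). Note that "16-bit Unicode string" in the technical
note means any sequence of 16-bit values, which is looser than
well-formed UTF-16.

```python
# Illustration only: which code unit values are acceptable in each form.
def is_valid_utf32_unit(unit):
    """True if a 32-bit value is a Unicode scalar value (valid in UTF-32)."""
    in_range = 0 <= unit <= 0x10FFFF
    is_surrogate = 0xD800 <= unit <= 0xDFFF
    return in_range and not is_surrogate

def is_valid_16bit_unit(unit):
    """True for every 16-bit value: a 16-bit Unicode string has no holes."""
    return 0 <= unit <= 0xFFFF

for unit in (0x0041, 0xD800, 0xFFFF):
    print(hex(unit), is_valid_utf32_unit(unit), is_valid_16bit_unit(unit))
```

A UTF-32 string therefore has to reject or signal an error for #xD800,
while a 16-bit string can simply store it.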

3. Nicest UTF -- Good discussion on pros/cons among UTFs.
   <http://www.mail-archive.com/unicode@unicode.org/msg27077.html>

| Given this little model and some additional assumptions about
| your own project(s), you should be able to determine the
| 'nicest' UTF for your own performance-critical case.

4. Both Lispworks and ACL use UTF-16.

 - Lispworks has multiple string types: one for 8bit/char strings and
   another for 16bit/char strings, just like MCL does.

 - ACL has two versions. The international version uses 16bit/char
   strings.

 I have not used either, so I may be wrong.

regards,
T.

Omake:

* comp.lang.lisp article by Erik Naggum
  <http://groups.google.com/group/comp.lang.lisp/msg/cf4586ae1a2dc726>

| I think it is important to make sure that there is a single
| code for all character sequences in the stream when it is
| converted to a vector. The private use space should be used for
| these things, and a mapping to and from character sequences
| should be maintained such that if a private use character is
| queried for its properties, those of the character sequence
| would be returned.

* comp.lang.lisp article by Ray Dillinger
  <http://groups.google.com/group/comp.lang.lisp/msg/81eb8caa17fc1969>

* Unicode Myths by Mark Davis
  <http://macchiato.com/slides/UnicodeMyths.pdf>

| You will have to rewrite all your code for surrogates.
| 
| - surrogates don't overlap.
| - Most codes not sensitive to surrogates
| - Good code accounts for strings, not just code points

* The BitC programming language seems to use a 32-bit character type
  exclusively.
  <http://bitc-lang.org/>

  BitC is a system language for Coyotos <http://coyotos.org/>.
  Neither Coyotos nor BitC has been released yet.
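
On the "surrogates don't overlap" point from the Mark Davis slides
listed above: lead surrogates (#xD800-#xDBFF) and trail surrogates
(#xDC00-#xDFFF) occupy disjoint ranges, so the start of a code point
can be found from any index by looking at most one unit back. A sketch
in Python, for illustration only (the function names are hypothetical):

```python
# Illustration only: resynchronizing inside a sequence of UTF-16 code units.
def is_lead(unit):
    return 0xD800 <= unit <= 0xDBFF

def is_trail(unit):
    return 0xDC00 <= unit <= 0xDFFF

def code_point_start(units, i):
    """Return the index where the code point covering units[i] begins."""
    if is_trail(units[i]) and i > 0 and is_lead(units[i - 1]):
        return i - 1      # units[i] is the second half of a pair
    return i              # BMP unit, lead surrogate, or unpaired trail

# 'a', U+1D11E as a surrogate pair, 'b'
units = [0x0061, 0xD834, 0xDD1E, 0x0062]
print([code_point_start(units, i) for i in range(len(units))])
```

Code that only copies, concatenates or compares whole strings never
needs this check, which is the sense of "most code is not sensitive to
surrogates".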

--
"Humanize something free of error."