[Openmcl-devel] The case for UTF-16
Gary Byers
gb at clozure.com
Fri Apr 20 00:06:32 PDT 2007
On Fri, 20 Apr 2007, Takehiko Abe wrote:
> About a year ago (2006-03-21) I advocated the use of UTF-16 for
> openmcl string implementation. I still believe that it is a better
> choice than UTF-32.
>
> The advantages of UTF-16 over UTF-32 are:
>
> A. It requires 50% less memory
> B. Other libraries of interest use UTF-16,
> e.g. ICU and Mac OS X.
The obvious disadvantage is that there are characters whose codes
can't be represented in 16 bits. You can pretend that the elements
of a surrogate pair are "characters", but there's no way to refer
to the character whose code is encoded by a UTF-16 surrogate pair.
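For concreteness, here's the arithmetic involved (a minimal sketch in
plain CL, not anything in OpenMCL): a code point above #xFFFF gets
split into two 16-bit code units, neither of which names a real
character.

  (defun code-point-to-surrogates (code-point)
    ;; Standard UTF-16 encoding of a code point in #x10000-#x10FFFF:
    ;; subtract #x10000, then split the remaining 20 bits into two
    ;; 10-bit halves tagged #xD800 (high) and #xDC00 (low).
    (let ((v (- code-point #x10000)))
      (values (logior #xD800 (ldb (byte 10 10) v))
              (logior #xDC00 (ldb (byte 10 0) v)))))

  ;; (code-point-to-surrogates #x10400) => #xD801 and #xDC00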
I think that the thing that makes using UTF-16 awkward isn't
the fact that OpenMCL doesn't use it natively/internally so much
as the fact that there aren't things like WITH-UTF-16-STRINGS (analogous
to WITH-CSTRS). If it did use UTF-16 internally, those things would
still need to be there; there's low-level support for them
(CHARACTER-ENCODINGs know how to encode to/decode from foreign memory),
but there isn't yet a sane/reasonable interface.
libiconv (the GNU conversion library) assumes that the native/internal
representation of a character is 32 bits wide. To use libiconv
effectively and conveniently from OpenMCL, it'd be helpful to have
a bunch of things like WITH-UTF-32-STRINGs.
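Something like the following sketch is what I have in mind; the name
and the MCL-ish low-level accessors (%STACK-BLOCK, %GET-UNSIGNED-LONG)
are assumptions about the interface rather than existing OpenMCL API,
and a real version would go through the CHARACTER-ENCODING machinery:

  (defmacro with-utf-32-string ((ptr-var string-form) &body body)
    ;; Copy a lisp string into a stack-allocated, null-terminated
    ;; buffer of 32-bit code units, valid for the extent of BODY.
    (let ((s (gensym)) (len (gensym)) (i (gensym)))
      `(let* ((,s ,string-form)
              (,len (length ,s)))
         (%stack-block ((,ptr-var (* 4 (1+ ,len))))
           (dotimes (,i ,len)
             (setf (%get-unsigned-long ,ptr-var (* 4 ,i))
                   (char-code (schar ,s ,i))))
           (setf (%get-unsigned-long ,ptr-var (* 4 ,len)) 0)
           ,@body))))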
The fact that some libraries use a 16-bit encoding and other libraries
use 32-bit encodings doesn't seem to be a compelling reason for OpenMCL
to favor either a 16-bit or a 32-bit internal representation. I agree,
though, that there needs to be higher-level support for passing things
back and forth to foreign code.
>
> The drawback is that with UTF-16 we need to treat surrogates as legit
> lisp character objects (e.g. #\U+D800). This is not clean but I think
> the harm it can cause is small and avoidable with relative
> ease. (worse is better...?)
>
> Now let me refer to sources with more credibility than myself.
>
> Unicode.org practically recommends UTF-16.
>
> 1. The Unicode Standard 4.0 Chapter 2.5 Encoding Forms:
> <http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf>
>
> | The performance of UTF-32 as a processing code may actually be
> | worse than UTF-16 for the same data, because the additional
> | memory overhead means that cache limits will be exceeded more
> | often and memory paging will occur more frequently. For systems
> | with processor designs that have penalties for 16-bit aligned
> | access, but with very large memories, this effect may be less.
UTF-16 is (technically) a variable-width encoding. An interesting
subset of it - UCS-2 - covers what's known as the "Basic Multilingual
Plane" in a fixed-width format (surrogate pairs aren't needed or
used; IIRC, it defines around 48,000 characters - about half the
number defined in Unicode 5.0.) These ~48K characters are certainly
"interesting" (generally useful); the BMP is sometimes described
as containing characters for almost all modern languages.
I believe that it's the case that up until version 3.0 of the standard
(1999), Unicode didn't define anything outside the BMP (so a Unicode
code point was essentially a 16-bit entity.) That has changed in the
last few versions, but I believe that there are still some language
environments (including some lisp implementations) where CHAR-CODE-LIMIT
or the moral equivalent thereof is a 16-bit number.
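For comparison, here's what that looks like at an OpenMCL listener
(assuming the current 32-bit characters):

  ? char-code-limit
  1114112  ; = #x110000; a 16-bit implementation would report 65536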
If it's agreed (I certainly hope it is ...) that it's not really
practical to use a variable-width encoding internally (anyone who
believes otherwise is welcome to write a non-consing, unit-cost
version of (SETF SCHAR)), then the choices are seemingly between:
- UCS-2, with or without complaints about the use of surrogate
pairs, offering potential compliance with a 7-or-8-year-old
version of the standard at 16 bits per character
- UTF-32/UCS-4, offering potential compliance with current and
future versions of the standard at 32 bits per character.
I don't mean to sound like I'm totally unconcerned about the memory
issue, but I thought and still think that the second choice was
preferable. (Maybe more accurately, I didn't think that I'd likely
be comfortable with the first choice 5 years down the road.)
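To put the fixed-width point in concrete terms: with 32-bit strings, a
store like the following is a unit-cost array write no matter what the
character is; under UTF-16, the same store could force the rest of the
string to shift (and the string to grow).

  (let ((s (make-string 3 :initial-element #\a)))
    ;; A non-BMP character replaces a BMP one in place; under
    ;; UTF-16 it would need two code units where one sat.
    (setf (schar s 1) (code-char #x10400))
    (char-code (schar s 1)))  ; => #x10400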
> |
> | In any event, Unicode code points do not necessarily match user
> | expectations for "characters." For example, the following are
> | not represented by a single code point: a combining character
> | sequence such as <g, acute>; a conjoining jamo sequence for
> | Korean; or the Devanagari conjunct "ksha." Because some Unicode
> | text processing must be aware of and handle such sequences of
> | characters as text elements, the fixed-width encoding form
> | advantage of UTF-32 is somewhat offset by the inherently
> | variable-width nature of processing text elements.
It is true that code that displays characters/strings has to be
aware of these issues (as does some other code.) It is less
clear that SCHAR and EQUAL and INTERN and STRING= and ... need
to be aware of these issues.
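A quick illustration: STRING= compares code points, so a combining
sequence isn't equal to its precomposed form - and pushing
normalization down into STRING= seems like the wrong division of labor.

  (let ((decomposed (coerce (list #\g (code-char #x0301)) 'string))
        (precomposed (string (code-char #x01F5)))) ; U+01F5, g with acute
    (values (length decomposed)                ; => 2 code points
            (string= decomposed precomposed))) ; => NIL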
>
> [Q] Regarding the above "this effect may be less", does 64-bit
> hardware change the picture further in favor of UTF-32?
I don't know enough to say. It seems reasonable to assume that some
of the performance penalty that 64-bit applications currently pay
(consuming cache faster, etc.) will be addressed to some degree
(bigger, faster caches ...). I don't have a good sense of how the
cache behavior of a program that does "typical" processing of 32-bit
strings compares with that of a similar program dealing with 16- or
8-bit strings. (Apple's CHUD tools can access performance counters
that measure cache misses; it'd be interesting to try to compare two
versions of the same program that differed only in character width.)
>
>
> 2. Unicode Technical Note #12 UTF-16 for Processing
> <http://www.unicode.org/notes/tn12/>
>
> | Summary
> |
> | This document attempts to make the case that it is advantageous
> | to use UTF-16 (or 16-bit Unicode strings) for text processing.
> | It is most important to use Unicode rather than older
> | approaches to text encoding, but beyond that it simplifies
> | software development even further to use the same internal form
> | for text representation everywhere. UTF-16 is already the
> | dominant processing form and therefore provides advantages.
>
> This technical note lists software that uses UTF-16. Of these, I
> think ICU is of particular interest because it provides
> normalization support and other features that are hard to implement.
>
> If openmcl strings are in UTF-16, using ICU will be more efficient
> and easier in my opinion.
See above.
Passing an arbitrary lisp string directly to foreign code is something
that you don't want to do, unless you're willing to say something like
(a) "the GC can't run if any thread is running foreign code" or
(b) "there are severe constraints on the GC's ability to move things
around in memory because of the possibility that some foreign code
somewhere might be referencing some object." I really don't believe
that either of those things (or similar things ...) is attractive.
What OpenMCL has allowed you to do instead is to pass a "foreign copy"
of a string to foreign code (and copy foreign memory to lisp strings.)
The existing things that do that copying (WITH-CSTRS, %GET-CSTRING
...) are probably pretty reasonable ways of dealing with
#\Nul-terminated 8-bit strings ("C strings"), but we don't yet have
reasonable ways of passing other foreign encodings around
(WITH-UTF-8-STRING, WITH-UTF-16-C-STRING, %GET-UCS-4-STRING, ...)
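For comparison, the existing 8-bit case looks like this (WITH-CSTRS
copies the string to a #\Nul-terminated foreign buffer that's valid
for the extent of the body):

  (with-cstrs ((cpath "/tmp/foo"))
    (#_open cpath #$O_RDONLY))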
I think that I'd agree that if OpenMCL used UCS-2 internally,
WITH-UTF-16-C-STRING would be simple. WITH-UTF-8-STRING would
be a little more complicated (and filesystems often use some
form of UTF-8 to represent filenames internally; it'd certainly
be desirable to be able to say things like:
  (with-utf-8-string ((utf-8-string "some lisp namestring"))
    (#_open ...))
). I think that it's desirable that all of these things be reasonably
efficient and have simple/sane interfaces. Some will be inherently
more efficient than others. It's no longer as efficient - it
certainly shouldn't be - to copy between 32-bit lisp strings and 8-bit
"C strings" as it was to copy between 8-bit lisp strings and 8-bit C
strings. It might be possible to copy so many strings between lisp
and foreign memory that this difference is measurable, but I don't
think that that fact would support an argument in favor of reverting
to 8-bit strings.
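For what it's worth, here's roughly what WITH-UTF-16-C-STRING might
look like over the current 32-bit representation (again just a sketch
with assumed accessor names, reusing the surrogate arithmetic sketched
above); relative to a UCS-2 internal format, the only extra work is
counting and emitting surrogate pairs:

  (defun utf-16-length (string)
    ;; Count 16-bit code units, allowing two per non-BMP character.
    (let ((n 0))
      (dotimes (i (length string) n)
        (incf n (if (> (char-code (schar string i)) #xFFFF) 2 1)))))

  (defmacro with-utf-16-c-string ((ptr-var string-form) &body body)
    (let ((s (gensym)) (j (gensym)) (i (gensym)) (c (gensym)))
      `(let ((,s ,string-form))
         (%stack-block ((,ptr-var (* 2 (1+ (utf-16-length ,s)))))
           (let ((,j 0))
             (dotimes (,i (length ,s))
               (let ((,c (char-code (schar ,s ,i))))
                 (cond ((<= ,c #xFFFF)
                        ;; BMP character: one code unit, copied as-is.
                        (setf (%get-unsigned-word ,ptr-var (* 2 ,j)) ,c)
                        (incf ,j))
                       (t
                        ;; Non-BMP character: expand to a surrogate pair.
                        (multiple-value-bind (hi lo)
                            (code-point-to-surrogates ,c)
                          (setf (%get-unsigned-word ,ptr-var (* 2 ,j)) hi
                                (%get-unsigned-word ,ptr-var (* 2 (1+ ,j))) lo)
                          (incf ,j 2))))))
             (setf (%get-unsigned-word ,ptr-var (* 2 ,j)) 0))
           ,@body))))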
>
> | From a programming point of view it reduces the need for error
> | handling that there are no invalid 16-bit words in 16-bit
> | Unicode strings. By contrast, there are code unit values that
> | are invalid in 8/32-bit Unicode strings.
>
> UTF-32 is an encoding with holes in it. I doubt that we can
> afford to have such holes.
>
> OpenMCL currently does not allow us to have surrogate code points as
> characters, e.g. #\U+D800 or (code-char #xD800). However, Mac OS X
> lets users type in surrogate values directly (through Unicode hex
> input).
>
>
> 3. Nicest UTF -- Good discussion on pros/cons among UTFs.
> <http://www.mail-archive.com/unicode@unicode.org/msg27077.html>
>
> | Given this little model and some additional assumptions about
> | your own project(s), you should be able to determine the
> | 'nicest' UTF for your own performance-critical case.
>
>
> 4. Both Lispworks and ACL use UTF-16.
I think that in practice that means that they use UCS-2 (possibly
agreeing to look the other way if surrogate pairs are allowed
in strings.)
It's not a totally unreasonable decision, but I don't think that
(looking forward) it's the correct decision.
>
> - Lispworks has multiple string types: one for 8-bit/char strings and
>   another for 16-bit/char strings, just like MCL does.
>
> - ACL has two versions. The international version uses 16-bit/char
>   strings.
>
> I have not used either, so I may be wrong.
>
>
> regards,
> T.
>
Memory size (and disk size) aren't totally insignificant issues, but I
think that some/most of the points that you raise can be addressed by
fleshing things out more (providing WITH-XXX-STRING and related
constructs.) If I'm right about that and the tradeoff is just between
memory utilization and "completeness" ... well, both of those things
matter, but it's often easier and cheaper to add memory than it is to
add completeness.
The fact that it's still awkward to pass non-C-strings between lisp
and foreign code is another kind of incompleteness. If that's addressed,
let's see what the space vs full Unicode character support issue looks
like.