[Openmcl-devel] how many angels can dance on a unicode character?

Gary Byers gb at clozure.com
Sat Apr 21 02:40:08 PDT 2007

Supposedly, people who are conversant in several different languages
find it easier to discuss certain subjects in specific languages.  I
don't remember all of the details, but IIRC it's easier to discuss
technical and scientific things in German than in other languages,
easier to discuss commerce in English, and there are apparently
or allegedly at least a few other cases.

My knowledge of ancient Sumero/Akkadian is extremely limited, so I
hope that people will forgive me if I lapse into cuneiform here ...  I
think that what I want to say here can best be summed up by the
well-known expression "X, Y, and <CUNEIFORM SIGN A>", where by
<CUNEIFORM SIGN A> I'm really referring to the Unicode character with
code #x12000, aka #\U+12000.

If I were using a clay tablet and stylus, it'd probably be easier
to write that message than it is with a computer (I'm not familiar
with Sumero/Akkadian input methods ...).  It's not hard to construct
that string procedurally in OpenMCL:

? (concatenate 'string "X, Y, and " (string #\U+12000))

If I execute that code in "openmcl64 -K utf-8" running under
Terminal.app, I see some mixture of ASCII text and cuneiform.
(Well, I would have if I'd remembered to install a cuneiform font.)

If I wanted to exchange the first and last characters in that
string, I might use something (stupid) like:

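(defun exchange-first-and-last-characters (string)
   (let* ((len (length string)))
     ;; A string of length 0 or 1 has nothing to exchange.
     (when (> len 1)
       (let* ((temp (char string (1- len))))
         (setf (char string (1- len)) (char string 0)
               (char string 0) temp)))
     ;; Return the (destructively modified) string so it can be printed.
     string))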

and if I printed the result, the cuneiform character would
precede the others.  (For those who are beginning to suspect
as much: yes, this is a contrived example.)

The cuneiform character #\U+12000 isn't any different from
the other characters in that string; it's not a CL:STANDARD-CHAR,
but it is a CL:BASE-CHAR in OpenMCL.  We can treat it pretty
much the same way that we'd treat other CHARACTERs: we can store
it in strings, ask for its CHAR-CODE, and use CL character functions
on it.
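
A minimal sketch of that claim, assuming a Lisp whose CHAR-CODE-LIMIT
is above #x12000 (as it is with the #x110000 limit described in this
message). CUNEIFORM-DEMO is a made-up name for illustration, and
CODE-CHAR is used instead of the #\U+12000 reader syntax only to keep
the example portable:

```lisp
;; Hypothetical demo function; assumes (> char-code-limit #x12000).
(defun cuneiform-demo ()
  (let* ((sign (code-char #x12000))     ; CUNEIFORM SIGN A
         (s (concatenate 'string "X, Y, and " (string sign))))
    (list (length s)                            ; ten ASCII characters + one sign
          (char-code (char s (1- (length s))))  ; code of the last character
          (char= sign (char s 10)))))           ; ordinary CHAR= works on it

(cuneiform-demo)   ; => (11 73728 T), where 73728 = #x12000
```

The sign occupies exactly one string position, just like the ASCII
characters before it.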

[Aside: it is true that some of those CL character functions might give
meaningless results.  (CL:CHAR< a b) is defined to be true exactly
when (< (CHAR-CODE a) (CHAR-CODE b)) is true, and even if that's
well-defined, just about any assignment of character codes
to characters will give answers that are meaningless/useless for
some characters in some locale(s).  I don't know if the concept
of alphabetic case applies to cuneiform, but it does apply to
other characters that aren't STANDARD-CHARs.  Even if CHAR-UPCASE
and CHAR-DOWNCASE were extended to apply to all applicable Unicode
characters, there are ... cases ... where STRING-UPCASE/STRING-DOWNCASE
would need to change the number of characters in their arguments in
order to comply with local conventions, and it seems that we need
a set of character/string functions that are "useful" and not necessarily
equivalent to CL character/string functions.

It's also the case that some characters are intended to be "composed
with" adjacent characters (or are the results of such composition).
One may need to be aware of this in certain contexts, but I don't
think that it's any more meaningful to think of two adjacent combinable
characters in a string as being "one character" than it is
to think of a carriage return followed by a graphic character as being
"one character", even though the effect of rendering those characters
might be identical to the effect of rendering a single character.]
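
To illustrate the case-conversion point in the aside above: German
convention uppercases LATIN SMALL LETTER SHARP S (U+00DF) as "SS", so
the result is one character longer than the input. The sketch below
uses a hypothetical helper, UPCASE-EXPANDING, which is not part of CL
or OpenMCL and handles only this one special case:

```lisp
;; Hypothetical helper (not a CL or OpenMCL function): upcase STRING,
;; expanding U+00DF to "SS".  Every other character goes through
;; CHAR-UPCASE, which maps one character to one character.
(defun upcase-expanding (string)
  (with-output-to-string (out)
    (loop for ch across string
          do (if (char= ch (code-char #xDF))
                 (write-string "SS" out)
                 (write-char (char-upcase ch) out)))))

(let ((word (concatenate 'string "stra" (string (code-char #xDF)) "e")))
  (list (length word)
        (upcase-expanding word)))   ; => (6 "STRASSE")
```

A function like this cannot be destructive in general, since the
result may not fit in the original string.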

Back to Sumero/Akkadian: the fact that we can treat this character
(#\U+12000) as a first-class object is related to the decision to
make CHAR-CODE-LIMIT #x110000 and to use UTF-32/UCS-4 "encoding"
internally (e.g., 32-bit strings.)  [As another aside, it's not
completely out of the question to use a 24-bit string representation,
lessening the memory impact a bit.]

If OpenMCL were to use UCS-2 internally (recall that UCS-2 is basically
a subset of UTF-16 that doesn't use surrogate pairs), we would have no
way of communicating in cuneiform (or at least no way of being understood
by other programs if we did so.)  We would still have a well-defined
notion of what a CHARACTER was, and we could still access and modify
STRINGs in constant/unit time.  It'd be slightly easier to copy UCS-2
strings to external UTF-16-encoded memory than it is to copy 32-bit
strings, but I really, really think that this is fairly far down the
list of considerations that should affect the decision of how characters
and strings are represented internally.   This scheme would take about
half the memory for strings as the current scheme does, and I do think
that that's an important consideration.
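
The memory comparison is simple arithmetic. As a rough sketch (the
function name is invented here, and object headers and alignment
padding are ignored, so real figures would be slightly higher):

```lisp
;; Hypothetical helper: bytes needed for the character data alone,
;; ignoring any per-string header or alignment padding.
(defun string-payload-bytes (n-chars bits-per-char)
  (ceiling (* n-chars bits-per-char) 8))

;; One million characters at 32, 24 and 16 bits per character:
(mapcar (lambda (bits) (string-payload-bytes 1000000 bits))
        '(32 24 16))
;; => (4000000 3000000 2000000)
```

So a 24-bit representation would save a quarter of the space used by
32-bit strings, and a 16-bit one would save half.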

Suppose we were to instead say that - formally or not - these 16-bit
strings were really UTF-16-encoded; we could allow the use of
surrogate pairs inside 16-bit strings.  If we did this "informally",
functions like SCHAR would either return true CHARACTER objects or the
high or low half of a surrogate pair.  Since we aren't inventing a new
language, the values returned by CHAR and SCHAR would have to be
CHARACTERs, even though they aren't "real": we can't ask ICU or
anything else what the uppercase version of such a pseudo-character is
in some locale.  (This situation is pretty much exactly like what you
get with CFString/NSString's characterAtIndex: operations: sometimes,
those functions return characters and sometimes they return halves of
surrogate pairs.)  Does anyone really look at something like this and
not see a mess (where the best way to avoid the mess is to only use
the intersection of UCS-2 and Unicode)?  To be fair, a lot of
whatever mess one sees there is probably there for backward
compatibility; I'm sure that NextStep supported Unicode for a long
time in the days when all Unicode characters fit in 16 bits, and
changing things will break existing code.  In the modern
(Sumero/Akkadian) world, there are different constraints and issues.
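
For reference, here is how a code point outside the 16-bit range
splits into a surrogate pair under the standard UTF-16 rules. The
function name is made up for this sketch; it works on integer code
points, so it does not depend on the host Lisp's CHAR-CODE-LIMIT:

```lisp
;; Hypothetical helper: list of 16-bit code units encoding CODE in UTF-16.
(defun utf-16-code-units (code)
  (if (< code #x10000)
      (list code)                                   ; fits in one unit
      (let ((offset (- code #x10000)))              ; 20-bit offset
        (list (+ #xD800 (ash offset -10))           ; high surrogate: top 10 bits
              (+ #xDC00 (logand offset #x3FF))))))  ; low surrogate: bottom 10 bits

(utf-16-code-units (char-code #\X))   ; => (88)
(utf-16-code-units #x12000)           ; => (55304 56320), i.e. (#xD808 #xDC00)
```

Neither #xD808 nor #xDC00 means anything on its own, which is the
problem with handing either one back from SCHAR as a "character".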

A "formal" use of UTF-16 might recognize that a string is composed
of characters (not just 16-bit code elements).  Naively, this would
mean that things like SCHAR and AREF and LENGTH might need to scan
the string from the beginning, treating each non-surrogate-pair
element and each surrogate pair as a single (logical) character.
It's not hard to think of schemes that cache information about
a UTF-16 encoded string that would make these access operations
reasonably fast (e.g., cache the logical length in characters as
well as the physical length in elements, keep track of whether there
are in fact any surrogate pairs in the string, cache the location
of some element and character positions so that scans don't have
to start at the beginning of the string, probably other things.)
I'd assume that programming environments that use UTF-16 internally
and provide "sane" access to characters in strings do something
like this some of the time.
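
A naive version of that scan, with no caching at all, might look like
the sketch below. It represents a UTF-16 string as a plain vector of
integer code units, assumes the input is well-formed, and uses names
invented for this example:

```lisp
(defun high-surrogate-p (unit)
  (<= #xD800 unit #xDBFF))

;; Logical length in characters; linear in the number of code units.
(defun utf-16-logical-length (units)
  (let ((i 0) (count 0))
    (loop while (< i (length units))
          do (incf i (if (high-surrogate-p (aref units i)) 2 1))
             (incf count))
    count))

;; Code point of the INDEXth logical character; scans from the start.
(defun utf-16-char-code-at (units index)
  (let ((i 0))
    (loop repeat index
          do (incf i (if (high-surrogate-p (aref units i)) 2 1)))
    (let ((unit (aref units i)))
      (if (high-surrogate-p unit)
          (+ #x10000
             (ash (- unit #xD800) 10)
             (- (aref units (1+ i)) #xDC00))
          unit))))

;; <CUNEIFORM SIGN A> followed by "X": three code units, two characters.
(let ((units (vector #xD808 #xDC00 88)))
  (list (length units)
        (utf-16-logical-length units)
        (utf-16-char-code-at units 0)
        (utf-16-char-code-at units 1)))
;; => (3 2 73728 88)
```

Both functions walk the vector from element 0, which is the cost that
the caching schemes described above are meant to hide.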

One way in which CL differs from many other programming environments
is the fact that CL strings are mutable ((SETF CHAR), destructive
sequence operations, lots of things in the reader and INTERN and
elsewhere expect to be able to perform cheap destructive operations on
strings.)  A destructive operation on a string - changing an ASCII
character to a cuneiform character, for instance - might change the
number of code elements needed to represent the string's characters in
a variable-length encoding like UTF-16.  A "simple" (SETF SCHAR) - or
something like EXCHANGE-FIRST-AND-LAST-CHARACTERS above - could
involve significant memory allocation and copying and rebuilding of
some or all of the cached information that makes access viable, and I
don't know how to explain how undesirable this is to anyone who says
that they want this but to say "no, you don't."  (I tried to explain
this in the discussion last year; a few days after I did so, someone
proposed using UTF-8 internally.)  I Sometimes Feel Like I'm Just
Not Getting Through To These Kids.
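
To make the cost concrete, here is a sketch of what "set the first
character" has to do when the string is a vector of UTF-16 code
units. The names are invented for this example, the input is assumed
to be well-formed and non-empty, and a real implementation would also
have to rebuild whatever cached indexing information it keeps:

```lisp
;; Hypothetical helper: UTF-16 code units for the code point CODE.
(defun utf-16-units (code)
  (if (< code #x10000)
      (list code)
      (let ((offset (- code #x10000)))
        (list (+ #xD800 (ash offset -10))
              (+ #xDC00 (logand offset #x3FF))))))

;; Replace the first logical character of UNITS with NEW-CODE.
;; Same width: overwrite in place.  Different width: allocate and copy.
(defun utf-16-replace-first-char (units new-code)
  (let* ((old-width (if (<= #xD800 (aref units 0) #xDBFF) 2 1))
         (new-units (utf-16-units new-code))
         (new-width (length new-units)))
    (if (= old-width new-width)
        (replace units new-units)
        (concatenate 'vector new-units (subseq units old-width)))))

(let* ((units (vector 88 89))                       ; "XY"
       (same  (utf-16-replace-first-char units 90)) ; "ZY", updated in place
       (grown (utf-16-replace-first-char units #x12000)))
  (list (eq same units) (eq grown units) (length grown)))
;; => (T NIL 3)
```

With fixed-width strings, the second case never arises: (SETF SCHAR)
is always a single store.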

This all leads me to the conclusion that the only really viable options
for internal string representation are (a) the current scheme, in which
all Unicode 5.0 characters are representable, string operations are
cheap and sane, but there's significant memory overhead that could
be reduced somewhat by using a 24-bit string type and (b) a 16-bit
scheme that would allow the direct representation of "most characters
used in modern languages" - equivalent to Unicode 3.x - but which
would not allow the representation of other characters without creating
a lot of confusion and inconsistency.  The latter scheme would not
allow us to use cuneiform (unless we were willing to accept confusion
and inconsistency that I don't think we want to accept.)

You might be tempted to say "well, that's fine.  Personally, I only
use cuneiform in contrived examples; it'd be fine to stick to characters
in the Basic Multilingual Plane, and that would offer a significant
space saving relative to the current scheme and incidentally make
UTF-16 encoding and decoding simpler."  I'd agree with that (though
I think that the encoding/decoding issue is less significant than 
other people may believe), and I confess that it's been a long time
since I've even thought about printing cuneiform characters in
OpenMCL.  (Seems like forever, in fact.)

Let's agree that the percentage of possible users interested in doing
cuneiform I/O in OpenMCL ("best thing since a clay tablet!") is small.
Other characters that can't be represented in a 16-bit encoding
include around 40,000 "mostly historical, but some modern" Chinese
ideographs, musical symbols, characters from other historical or
obscure languages ...  I don't know exactly what the percentage of
possible users interested in using some subset of those relatively
new (to Unicode) characters is, but I suspect that it's large enough
that I don't feel comfortable dismissing potential needs of such
users as irrelevant.

On Fri, 20 Apr 2007, Hamilton Link wrote:

> I think the arguments that Mac OS X and Windows both use UTF 16, that
> UTF16 is the "recommended" format, and that you'll see such documents
> and library interfaces far more often than for UTF 32, are all fairly
> compelling arguments in favor of UTF16 or Unicode (UTF16-LE, right?)
> being the native format of openmcl.
> The question is, how much does openmcl do for you if UTF16 is the
> default format?
> 1 - being able to process UTF 16 and access it at the whole-character
> level means... the system has to do the work to figure out what the
> valid character sequences are and retrieve them, and it has to do more
> computation to ensure that strings are valid sequences of code points.
> This is a little easier in UTF32, where you only care about a few
> special cases, but it can still end up with linear operations.
> 2 - being able to use schar on 16-bit code points without extra fuss in
> constant-access time means... the burden is on the user to handle code
> points, recognize when they aren't whole characters, and accept the
> blame when malformed strings are passed around, and openmcl does zero
> work to help you (and thus also zero work interfering with you, if you
> really want a length-1 string with a single Unicode surrogate code
> point in it, which I can imagine happening).
> I was originally a proponent of UTF 32, but since then I've come around
> to #2 (based on the rest of the world primarily using UTF16 when they
> use UTF at all, and my general feeling that the system shouldn't do too
> much hand-holding when I'm working with raw data).  So I vote for #2:
> let me allocate strings with element-types UTF8, UTF8LE, UTF8BE, UTF16,
> UTF16LE, etc. ... up to UTF32, give me character-related functions that
> actually are code-point functions, make Unicode be the default, and
> let's be done with it.  Then I'll always know what is really in a
> string, and I always know how long it takes to get a character (code
> point).  If I care about having a whole-character view of things I can
> write a couple of extra functions, if I try to print something I should
> probably take some care to not print garbage, and when I put values
> into a string I should know what I'm doing and not put in invalid code
> points or code points in invalid orders.
> ttyl,
> hamilton
