[Openmcl-devel] how many angels can dance on a unicode character?

Gary Byers gb at clozure.com
Mon Apr 23 16:42:28 PDT 2007



On Mon, 23 Apr 2007, Jeremy Jones wrote:

> Isn't it possible to implement strings such that all of the following are 
> true?
> 1) All Common Lisp functions work correctly for the complete set of
> Unicode code points.
> 2) It uses UTF-16
> 3) It is efficient and constant time when strings don't contain any
> surrogate code points.
> 4) It exhibits slower, but correct behavior when strings do contain
> surrogate code points.
>
> Just use a bit in each string that indicates whether or not the string
> contains any surrogate code points.  If the bit is clear, use the fast
> implementations and if it is set, use the slow implementations.  The
> bit would be clear for the vast majority of strings, so they would be
> fast.
>
> Have (setf schar) check if the code point being stored requires a
> surrogate code point, and if so, set the bit.

And change the number of 16-bit code units in some cases, or defeat
some of the purpose of using a compact encoding by reserving slop
space.
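
To make that concrete: the standard UTF-16 encoding of a single code
point (nothing OpenMCL-specific here) takes one code unit for a BMP
character and a surrogate pair for anything above #xFFFF,

(defun utf-16-code-units (code-point)
  ;; one 16-bit code unit for a BMP character, two for anything bigger
  (if (< code-point #x10000)
      (list code-point)
      (let ((v (- code-point #x10000)))
        (list (logior #xD800 (ldb (byte 10 10) v))
              (logior #xDC00 (ldb (byte 10 0) v))))))

;; (utf-16-code-units (char-code #\a)) => (97)
;; (utf-16-code-units #x1D11E)         => (#xD834 #xDD1E)

so storing a character from outside the BMP over one from inside it
changes the length of the code-unit vector.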

>
> I think that it would be possible to make the slow implementations not
> too bad by keeping offsets to surrogate code points if they are
> sparse.  If they become too dense, perhaps switch to UTF-32.  Another
> bit could be used to indicate the string's encoding.
> In fact, it would be possible to use this approach for UTF-8, although
> this might not be worth it.

After I wrote a long message in the thread last year explaining why I
don't think that you'd want to use a variable-width encoding for
strings any more than you'd (seriously) want to do so for other array
or vector types, someone proposed using UTF-8 internally.

>
> The down side of this approach is that all of the string operations
> would need to check the bit and branch, but this would be more
> efficient than using UTF-32 everywhere wouldn't it?  Am I missing
> something?
>

It'd be more space-efficient in most cases, and less time-efficient in
all cases (either "somewhat so" or "grossly so".)  Using UTF-16
internally is probably not nearly as far down the Bad Idea scale as
... oh, Huffman-coded arrays would be.  ("Newly created arrays are
full of 0s!  Why waste 31 or 63 extra bits?  Of course, this makes
(SETF AREF) hard, but you could have a bit somewhere!")

Let's sketch out (very generally) what it would cost, space-wise, to
use UTF-16 internally.  I don't mean to suggest that the sketch below
is "real", but hope that it's "realistic", in the sense that it isn't
obviously overstating or understating the memory costs associated with
trying to make UTF-16 usable.

In the general case, (SETF SCHAR) on a variable-width-encoded string
is hard, because the character being replaced can take a different
number of code units than its replacement.  If the effect of (SETF SCHAR)
is to replace a thin character with a fat one, we may need to allocate
space for more code units (or have pre-allocated that space, defeating
the purpose of using UTF-16 to save space).  In both that case and
in the case of replacing a fat character with a thin one, we may have
to shuffle things around in memory and/or update any tables which tell
us where the fat characters are, and we have to adjust our notion of
how many code units are in the vector of code units that represent
the string's characters.
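
To see why, here's a toy version of the general case (not OpenMCL code;
just a bare vector of UTF-16 code units, ignoring all of the header and
table bookkeeping).  Note that it has to return a vector that may be
longer or shorter than the one it was given, which is exactly what
(SETF SCHAR) can't do:

(defun utf-16-replace-char (units char-index new-code-point)
  ;; UNITS is a vector of (unsigned-byte 16) UTF-16 code units.  Walk from
  ;; the front to find where character CHAR-INDEX starts (this walk is what
  ;; the table of fat-character positions would avoid).
  (let* ((start (loop with pos = 0
                      repeat char-index
                      do (incf pos (if (<= #xD800 (aref units pos) #xDBFF) 2 1))
                      finally (return pos)))
         (old-width (if (<= #xD800 (aref units start) #xDBFF) 2 1))
         (new-units (if (< new-code-point #x10000)
                        (vector new-code-point)
                        (let ((v (- new-code-point #x10000)))
                          (vector (logior #xD800 (ldb (byte 10 10) v))
                                  (logior #xDC00 (ldb (byte 10 0) v)))))))
    ;; Splice in the replacement; the result can contain a different number
    ;; of code units than the original, so it's a fresh vector.
    (concatenate '(vector (unsigned-byte 16))
                 (subseq units 0 start)
                 new-units
                 (subseq units (+ start old-width)))))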

Even if this case occurs rarely in practice ("since people will know how
expensive it is, they'll never do it!"), it has to be possible.  Since
(SETF (SCHAR S I) C) can't change what S is, I think that the only way
that we could allow it to happen is to say that a SIMPLE-STRING is
implicitly displaced to the vector of code-units.  We don't -have-
to use something as general as what OpenMCL would use for other types
of displaced vectors for a SIMPLE-STRING, but we have (so far) something
that looks like:


<string-header>
  pointer-to-underlying-vector  -> (simple-array (unsigned-byte 16) (*))

We probably want to be able to access the string's length in
characters quickly, so the SIMPLE-STRING object probably needs a
"length" field, somewhere where the LENGTH function can find it
quickly.  (It's often going to be exactly the same as the length of
the underlying vector of code units, but it might not be, and we
probably don't want LENGTH of a string to do anything very
complicated.  More accurately, "I don't want this and don't think
that anyone else does, either.")

People (including Jeremy) have proposed that making SCHAR/AREF
acceptably fast on a UTF-16 string that actually contained some 32-bit
characters (represented by surrogate pairs) could be done by having a
table which described where those fat characters are; we'd certainly
need something to keep SCHAR/AREF from having to start at the
beginning and partially decode things until it got to the I'th
character each time.  For the probably-very-common case where a string
doesn't contain any fat characters, the table could be empty.
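
Such a table could be as simple as a sorted vector of the character
positions that are encoded as surrogate pairs; the mapping from
character index to code-unit index is then roughly (a sketch of the
idea only, not anything OpenMCL actually does):

(defun code-unit-index (char-index fat-char-positions)
  ;; Every surrogate-pair character that precedes CHAR-INDEX pushes its
  ;; starting code unit one slot to the right, so the code-unit index is
  ;; CHAR-INDEX plus the number of fat characters before it.  (A binary
  ;; search would make this logarithmic; a linear count is fine for a
  ;; sparse table and for a sketch.)
  (+ char-index
     (count-if #'(lambda (p) (< p char-index)) fat-char-positions)))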

In tribute to everyone who's suggested that some special-case or other
could be recognized by having a bit somewhere, let's add a fixnum
field to SIMPLE-STRING that'd provide 29 or 61 bits.

(SCHAR S I) might still partly open-code as something like:

(if (null (string-header-table s))
    ;; fast path: no surrogate pairs anywhere, so character I is just
    ;; the I'th 16-bit code unit
    (%code-char-16 (%simple-16-bit-aref (string-underlying-vector s) i))
    ;; some character somewhere takes two code units; punt out of line
    (do-hairy-case-out-of-line))

and that's a lot more expensive than it is now, but the common
no-fat-characters case is still simple enough to inline at the cost
of a few extra memory references and a bit of code bloat; likewise,
the common case of (SETF SCHAR) just has to check to ensure that it
really is the common case and do an extra memory reference or two.
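
The corresponding (SETF SCHAR) fast path might open-code as something
like the following (same hypothetical accessors as above; the point is
just the extra checks and memory references):

(if (and (null (string-header-table s))
         (< (char-code new-char) #x10000))
    ;; still the common case: no fat characters so far, and the new
    ;; character fits in a single 16-bit code unit
    (setf (%simple-16-bit-aref (string-underlying-vector s) i)
          (char-code new-char))
    ;; anything else might change the number of code units; punt
    (setf-schar-out-of-line s i new-char))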

A SIMPLE-STRING object would look something like:

underlying-code-units
length
table-for-hairy-case
bits

and (again, I'm sure that we could bum this a little and am equally
willing to believe that I've forgotten something necessary) that'd
cost us about 4 words of memory, plus another word or two of
memory-management overhead (type information, alignment).
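
In defstruct terms (purely illustrative; the real object would
presumably be a primitive vector-like object with its own type header,
not a structure instance), that's something like:

(defstruct (utf-16-string (:conc-name %str-))
  ;; a (simple-array (unsigned-byte 16) (*)) holding the code units
  underlying-code-units
  ;; length in characters; may be smaller than the number of code units
  length
  ;; NIL, or a table locating the characters that take two code units
  table-for-hairy-case
  ;; a fixnum's worth of flag bits for whatever special cases want one
  bits)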

Looking at strings in memory via:

(let* ((nstrings 0)
       (nchars 0))
  ;; CCL::%MAP-AREAS walks memory, calling a function on every object
  ;; it finds there.  It does horrible things with other threads, and
  ;; it's not recommended for things other than this kind of informal
  ;; memory profiling.
  (ccl::%map-areas #'(lambda (thing)
                       (when (typep thing 'simple-string)
                         (incf nstrings)
                         (incf nchars (length thing)))))
  (format t "~&~d strings, average length = ~s." nstrings (float (/ nchars nstrings))))

claims that the average string is a little more than 16 characters in length.
That means that representing the "average" string (one where no
characters fall outside of the Basic Multilingual Plane) in the scheme
proposed above would require:

                                          64-bit impl.   32-bit impl.
     string object (4 native words)           32 bytes       16 bytes
     header, alignment of string object       12 bytes        6 bytes
     16 16-bit code units of char data        32 bytes       32 bytes
     header, alignment of char data           12 bytes        6 bytes
  --------------------------------------------------------------------
     total (UTF-16 scheme)                    88 bytes       60 bytes

vs:

     UTF-32 string data (16 characters)       64 bytes       64 bytes
     header, alignment of char data           12 bytes        6 bytes
  --------------------------------------------------------------------
     total (UTF-32)                           76 bytes       70 bytes
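
Those totals are just the sums of the per-component numbers above:

(list :utf-16-64-bit (+ 32 12 32 12)   ; => 88
      :utf-16-32-bit (+ 16  6 32  6)   ; => 60
      :utf-32-64-bit (+ 64 12)         ; => 76
      :utf-32-32-bit (+ 64  6))        ; => 70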

I think that these numbers are about right; the "header + alignment"
costs are more predictable for a fixed-size structure, and the
numbers I'm using above are just averages.  If I'm getting those
numbers wrong, I'm at least trying to do it consistently ...

If these numbers are roughly accurate and if the sketch of what
a displaced SIMPLE-STRING object would look like is realistic,
then I'd say that using UTF-16 to represent arbitrary Unicode
characters in a realistic way costs about as much memory-wise
as using UTF-32 does, is somewhat slower in the simplest cases
and much slower in general, has very complex boundary
cases once we step outside the BMP, and just generally doesn't
seem to have many socially-redeeming qualities that I can see.


