[Openmcl-devel] how many angels can dance on a unicode character?

Mon Apr 23 13:28:34 PDT 2007

Isn't it possible to implement strings such that all of the following are true?
1) All Common Lisp functions work correctly for the complete set of
Unicode code points.
2) Strings are stored as UTF-16.
3) It is efficient and constant time when strings don't contain any
surrogate pairs (that is, no characters above U+FFFF).
4) It exhibits slower, but correct behavior when strings do contain
surrogate pairs.

Just use a bit in each string that indicates whether or not the string
contains any surrogate pairs.  If the bit is clear, use the fast
implementations, and if it is set, use the slow implementations.  The
bit would be clear for the vast majority of strings, so they would be
fast.

Have (setf schar) check whether the code point being stored requires a
surrogate pair (i.e. is above U+FFFF), and if so, set the bit.
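
To make the idea concrete, here is a rough sketch in Python rather than
Lisp.  It is only an illustration under stated assumptions, not a
proposal for how OpenMCL should do it: the names FlaggedString,
encode_utf16, char_at, set_char and has_surrogates are all made up, and
the input is assumed to contain no unpaired surrogates.

```python
from array import array


def encode_utf16(code_point):
    """Return the one or two UTF-16 code units for a code point."""
    if code_point < 0x10000:
        return [code_point]
    v = code_point - 0x10000
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]


def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF


class FlaggedString:
    """Hypothetical UTF-16 string with a 'contains surrogate pairs' bit.

    Assumes well-formed text (no unpaired surrogates).
    """

    def __init__(self, text=""):
        self.units = array("H")       # 16-bit code units
        self.has_surrogates = False   # the bit described above
        for ch in text:
            encoded = encode_utf16(ord(ch))
            if len(encoded) == 2:
                self.has_surrogates = True
            self.units.extend(encoded)

    def _unit_index(self, char_index):
        """Slow path: walk from the start, counting characters."""
        i = 0
        for _ in range(char_index):
            i += 2 if is_high_surrogate(self.units[i]) else 1
        return i

    def char_at(self, char_index):
        if not self.has_surrogates:
            # Fast path: one code unit per character, O(1).
            return chr(self.units[char_index])
        i = self._unit_index(char_index)
        unit = self.units[i]
        if is_high_surrogate(unit):
            low = self.units[i + 1]
            return chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
        return chr(unit)

    def set_char(self, char_index, ch):
        """Rough analogue of (setf schar): store ch, setting the bit if needed."""
        encoded = encode_utf16(ord(ch))
        if self.has_surrogates:
            i = self._unit_index(char_index)
            old_width = 2 if is_high_surrogate(self.units[i]) else 1
        else:
            i, old_width = char_index, 1
        # The replacement may be wider or narrower than what it overwrites.
        self.units[i:i + old_width] = array("H", encoded)
        if len(encoded) == 2:
            self.has_surrogates = True
```

Note that the bit is only ever set here, never cleared; clearing it
when the last pair is overwritten would need a count or a rescan.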

I think that it would be possible to make the slow implementations not
too bad by keeping offsets to the surrogate pairs if they are
sparse.  If they become too dense, perhaps switch to UTF-32.  Another
bit could be used to indicate the string's encoding.
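
The offset idea might look something like the following sketch (again
Python, again only an illustration).  It keeps a sorted list of the
character indices at which a surrogate pair starts, so that a
character index can be mapped to a code-unit index with a binary
search instead of a linear scan.  The function names and the 25%
density threshold are assumptions picked for the example, not measured
values.

```python
from bisect import bisect_left


def build_pair_index(units):
    """Return the sorted character indices at which a surrogate pair starts.

    `units` is a sequence of UTF-16 code units, assumed well-formed.
    """
    positions = []
    char_index = 0
    i = 0
    while i < len(units):
        if 0xD800 <= units[i] <= 0xDBFF:   # high surrogate starts a pair
            positions.append(char_index)
            i += 2
        else:
            i += 1
        char_index += 1
    return positions


def unit_index(pair_positions, char_index):
    """Map a character index to a code-unit index in O(log m).

    Each pair that starts before char_index shifts the position by one
    extra code unit.
    """
    return char_index + bisect_left(pair_positions, char_index)


def too_dense(pair_positions, char_count, threshold=0.25):
    """Hypothetical policy: suggest UTF-32 once pairs exceed the threshold."""
    return char_count > 0 and len(pair_positions) / char_count > threshold
```

With m pairs in the string, lookup costs O(log m) instead of O(n), and
the table stays small as long as the pairs really are sparse.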

In fact, it would be possible to use this approach for UTF-8, although
this might not be worth it.

The downside of this approach is that all of the string operations
would need to check the bit and branch, but this would be more
efficient than using UTF-32 everywhere, wouldn't it?  Am I missing
something?