[Openmcl-devel] how many angels can dance on a unicode character?

Takehiko Abe keke at gol.com
Mon Apr 23 22:46:02 PDT 2007


Jeremy Jones wrote:

> Isn't it possible to implement strings such that all of the following are
> true?
> 1) All Common Lisp functions work correctly for the complete set of
> Unicode code points.
> 2) It uses UTF-16
> 3) It is efficient and constant time when strings don't contain any
> surrogate code points.
> 4) It exhibits slower, but correct behavior when strings do contain
> surrogate code points.
> 
> Just use a bit in each string that indicates whether or not the string
> contains any surrogate code points.  If the bit is clear, use the fast
> implementations and if it is set, use the slow implementations.  The
> bit would be clear for the vast majority of strings, so they would be
> fast.
> 
> Have (setf schar) check if the code point being stored requires a
> surrogate code point, and if so, set the bit.

Personally, I do not think that supplementary characters deserve
such close attention.

By treating them as second-class pseudo-characters, we get a 50%
space saving over 32-bit characters, plus FFI convenience. From
my point of view, the space saving alone outweighs the benefit
of efficient handling of supplementary characters (and justifies
the ugliness), because I don't expect I'll ever need the
latter.
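
To make the space figure concrete, here is a minimal sketch in Python (not OpenMCL code) comparing encoded sizes. The helper name encoded_sizes and the sample strings are made up for illustration. It assumes no byte-order mark is stored.

```python
def encoded_sizes(text):
    """Return (utf16_bytes, utf32_bytes) for text, without a byte-order mark."""
    return len(text.encode("utf-16-le")), len(text.encode("utf-32-le"))


# Every character here is in the Basic Multilingual Plane (one 16-bit unit each).
bmp_only = "Lisp \u03bb"

# U+1D11E (musical G clef) is a supplementary character: it takes a
# surrogate pair, i.e. two 16-bit units, in UTF-16.
with_supplementary = "clef \U0001D11E"

print(encoded_sizes(bmp_only))
print(encoded_sizes(with_supplementary))
```

For text with no supplementary characters the UTF-16 form is exactly half the size of the UTF-32 form. Each supplementary character costs four bytes in either encoding.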

> 
> I think that it would be possible to make the slow implementations not
> too bad by keeping offsets to surrogate code points if they are
> sparse.  If they become too dense, perhaps switch to UTF-32.  Another
> bit could be used to indicate the string's encoding.
> 
> In fact, it would be possible to use this approach for UTF-8, although
> this might not be worth it.
> 
> The down side of this approach is that all of the string operations
> would need to check the bit and branch, but this would be more
> efficient than using UTF-32 everywhere wouldn't it?  Am I missing
> something?

I think what you propose is in essence the same as having
multiple string types.
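
A rough sketch of why the two look alike, in Python rather than Lisp. The class FlaggedUTF16String and its methods are hypothetical, not taken from any implementation. The flag is tested on every access and selects one of two code paths, which is the same dispatch an implementation would do on two distinct string types.

```python
class FlaggedUTF16String:
    """Hypothetical string: UTF-16 code units plus a 'has surrogates' flag."""

    def __init__(self, text):
        data = text.encode("utf-16-le")
        # Split the encoded bytes into 16-bit code units.
        self.units = [int.from_bytes(data[i:i + 2], "little")
                      for i in range(0, len(data), 2)]
        # The per-string bit from the proposal.
        self.has_surrogates = any(0xD800 <= u <= 0xDFFF for u in self.units)

    def char_at(self, index):
        # Every access branches on the flag, much like dispatching on type.
        if not self.has_surrogates:
            return chr(self.units[index])      # fast path: constant time
        return self._char_at_slow(index)       # slow path: linear scan

    def _char_at_slow(self, index):
        position = 0   # index counted in characters
        i = 0          # index counted in code units
        while i < len(self.units):
            unit = self.units[i]
            is_pair = (0xD800 <= unit <= 0xDBFF
                       and i + 1 < len(self.units)
                       and 0xDC00 <= self.units[i + 1] <= 0xDFFF)
            if is_pair:
                low = self.units[i + 1]
                code = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
                width = 2
            else:
                code = unit
                width = 1
            if position == index:
                return chr(code)
            position += 1
            i += width
        raise IndexError(index)


plain = FlaggedUTF16String("abc")
mixed = FlaggedUTF16String("a\U0001D11Eb")
print(plain.has_surrogates, plain.char_at(2))
print(mixed.has_surrogates, mixed.char_at(2))
```

The sketch omits the offset table and the switch to UTF-32 mentioned in the proposal. Adding them would give each access more representations to choose between, which is again the multiple-string-types situation.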

Pekka Pirinen of LispWorks had this to say in 2002:
<http://groups.google.com/group/comp.lang.lisp/msg/4c5934ed093214d0>

| As an implementor, I can tell you that actually the step from
| one string type to two is the hardest bit.  Once you've figured
| out how you want to implement that, having more is not such a
| big deal. [...]

Interestingly, although LispWorks has two string types (8-bit
and 16-bit), it still does not seem to have a 32-bit/char string
type. And I doubt it ever will -- there wouldn't be much demand
for it.

regards,
T.

--
"Consider different fading systems."




