[Openmcl-devel] how many angels can dance on a unicode character?

Mon Apr 23 15:26:00 PDT 2007

Since GB reminded me of the argument for UTF32, let me try to save him 
some typing.

I think GB's points are...

- what you propose for UTF16 support is a hassle and is very disruptive
to the code (an alternative already exists, and changing the code would
require widespread, complicated changes to many very low-level, heavily
used things, etc.)

- given UTF32 support, it's largely unnecessary (strings Just Work and 
can contain whatever you like, characters are always semantically what 
you think of as a character, and code for converting a string of 
base-characters to an array of unsigned 16-bit integers isn't bad when 
doing FFI stuff to libraries that expect UTF16)

- if someone wants it that badly, they can try implementing it
themselves; GB's time is better spent elsewhere
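
To make the second point concrete, here is a rough sketch of the kind of conversion meant there: turning a string of full code points into an array of unsigned 16-bit integers before handing it to a library that expects UTF16.  It is written in Python purely for illustration, not in Lisp, and the function name to_utf16_units is made up for this example; it is not anything in openmcl.

```python
from array import array


def to_utf16_units(s):
    """Return the UTF-16 code units of s as an array of unsigned shorts.

    Hypothetical helper for illustration only. Assumes array type 'H'
    is 16 bits wide, which is the case on common platforms.
    """
    units = array("H")
    for ch in s:
        cp = ord(ch)
        if cp < 0x10000:
            # Code points in the Basic Multilingual Plane map to one unit.
            units.append(cp)
        else:
            # Everything above U+FFFF becomes a surrogate pair.
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))    # high surrogate
            units.append(0xDC00 | (cp & 0x3FF))  # low surrogate
    return units
```

The loop is a single pass with no lookahead, which is roughly why the conversion "isn't bad" as long as it only happens at the FFI boundary.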

Here's an analogy to proposing a move to UTF16 (hopefully people will 
realize that the analogous proposal is comparably unrealistic):
   At the moment, openmcl uses a complete, compacting, generational 
garbage collector.  The compacting and generational parts mean that data
moves around and some FFI things are a hassle; for example, you need to
copy data before passing it to C++ APIs.  Does it make sense to convert
openmcl to use a conservative GC so that things will stop moving 
around?  This would make FFI use easier in many cases, but would be 
extremely disruptive to the current code base and would sacrifice 
certain advantages of the current approach.  And certainly GB has other 
priorities than ripping out and replacing a working garbage collector.

In the GC world, you can trade mobility and completeness for static 
locations and interoperability, but the current state of and need for 
FFI don't justify making widespread, disruptive changes.  In the string
world, you can trade some memory and interoperability for efficiency
and consistency, and the benefits don't seem to justify the cost of the
change (in operational effects and in the cost of Gary's time).
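
For a rough sense of the memory side of that trade, the following Python sketch compares the payload size of the same text in the two encodings.  The function name storage_bytes is invented for this example, and the numbers ignore any per-string header an implementation would add.

```python
def storage_bytes(s):
    """Return (utf16_bytes, utf32_bytes) for the character data of s.

    Illustrative only: counts encoded payload, not object headers.
    """
    utf16 = len(s.encode("utf-16-le"))  # 2 bytes per unit, 1 or 2 units per char
    utf32 = len(s.encode("utf-32-le"))  # always 4 bytes per char
    return utf16, utf32


print(storage_bytes("hello"))        # (10, 20)
print(storage_bytes("a\U0001D11E"))  # (6, 8)
```

UTF32 costs up to twice the space, but every character sits at a fixed offset, which is the efficiency and consistency being paid for.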

h

Jeremy Jones wrote:
> Isn't it possible to implement strings such that all of the following are true?
> 1) All Common Lisp functions work correctly for the complete set of
> Unicode code points.
> 2) It uses UTF-16
> 3) It is efficient and constant time when strings don't contain any
> surrogate code points.
> 4) It exhibits slower, but correct behavior when strings do contain
> surrogate code points.
>
> Just use a bit in each string that indicates whether or not the string
> contains any surrogate code points.  If the bit is clear, use the fast
> implementations and if it is set, use the slow implementations.  The
> bit would be clear for the vast majority of strings, so they would be
> fast.
>
> Have (setf schar) check if the code point being stored requires a
> surrogate code point, and if so, set the bit.
>
> I think that it would be possible to make the slow implementations not
> too bad by keeping offsets to surrogate code points if they are
> sparse.  If they become too dense, perhaps switch to UTF-32.  Another
> bit could be used to indicate the string's encoding.
>
> In fact, it would be possible to use this approach for UTF-8, although
> this might not be worth it.
>
> The down side of this approach is that all of the string operations
> would need to check the bit and branch, but this would be more
> efficient than using UTF-32 everywhere wouldn't it?  Am I missing
> something?
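
For readers trying to picture the quoted proposal, here is a minimal sketch of a UTF-16 string with a "contains surrogates" bit, again in Python for illustration only.  The class and method names are hypothetical, it assumes well-formed input (no lone surrogates), and it leaves out the suggested offset table and the fallback to UTF-32.

```python
def _is_high(u):
    return 0xD800 <= u <= 0xDBFF


def _encode(cp):
    """Return the UTF-16 code units for one code point."""
    if cp < 0x10000:
        return [cp]
    v = cp - 0x10000
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]


class FlaggedUtf16String:
    """Hypothetical UTF-16 string with a has_surrogates flag."""

    def __init__(self, text=""):
        self.units = []
        self.has_surrogates = False
        for ch in text:
            new = _encode(ord(ch))
            self.has_surrogates |= len(new) == 2
            self.units.extend(new)

    def __len__(self):
        if not self.has_surrogates:
            return len(self.units)  # fast path: one unit per character
        # Slow path: every character starts with a non-low-surrogate unit.
        return sum(1 for u in self.units if not 0xDC00 <= u <= 0xDFFF)

    def _unit_index(self, i):
        """Slow path: walk from the start to find where character i begins."""
        pos = 0
        for _ in range(i):
            pos += 2 if _is_high(self.units[pos]) else 1
        return pos

    def char_at(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        if not self.has_surrogates:
            return chr(self.units[i])  # fast path: constant time
        pos = self._unit_index(i)
        u = self.units[pos]
        if _is_high(u):
            lo = self.units[pos + 1]
            return chr(0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00))
        return chr(u)

    def set_char(self, i, ch):
        if not 0 <= i < len(self):
            raise IndexError(i)
        pos = self._unit_index(i) if self.has_surrogates else i
        old_width = 2 if self.has_surrogates and _is_high(self.units[pos]) else 1
        new = _encode(ord(ch))
        if len(new) == 2:
            self.has_surrogates = True  # sticky: never cleared in this sketch
        # The stored width may change, so the backing vector may shift.
        self.units[pos:pos + old_width] = new
```

Even in this toy version, every accessor has to branch on the flag, and set_char may need to resize or shift the underlying storage.  That is the sort of change to low-level, heavily used code that the reply above is worried about.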