[Openmcl-devel] how many angels can dance on a unicode character?
helink at sandia.gov
Mon Apr 23 22:26:00 UTC 2007
Since GB reminded me of the argument for UTF32, let me try to save him
I think GB's points are...
- what you propose for UTF16 support is a hassle and is very disruptive
to the code (an alternative already exists, and changing the code would
require widespread, complicated changes to many very low-level, heavily
used things, etc.)
- given UTF32 support, it's largely unnecessary (strings Just Work and
can contain whatever you like, characters are always semantically what
you think of as a character, and code for converting a string of
base-characters to an array of unsigned 16-bit integers isn't bad when
doing FFI stuff to libraries that expect UTF16)
- if someone wants it that badly, they can try implementing it
themselves, GB's time is better spent elsewhere
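To make the second point concrete, here is a minimal sketch of the kind of conversion meant: turning a string of full code points into an array of unsigned 16-bit integers for a library that expects UTF16. It is written in Python rather than Lisp purely for illustration, and the name to_utf16_units is made up for this example; it is not an OpenMCL function.

```python
from array import array


def to_utf16_units(text: str) -> array:
    """Return the UTF-16 code units of `text` as unsigned 16-bit integers.

    Hypothetical helper for illustration only. Assumes every character
    in `text` is a Unicode scalar value (no lone surrogates).
    """
    units = array("H")  # "H" is an unsigned short, at least 16 bits wide
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # Basic Multilingual Plane: one code unit, stored as-is.
            units.append(cp)
        else:
            # Supplementary plane: split into a high/low surrogate pair.
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))
            units.append(0xDC00 | (cp & 0x3FF))
    return units


if __name__ == "__main__":
    # U+1D11E (musical G clef) lies outside the BMP and needs two units.
    print([hex(u) for u in to_utf16_units("a\U0001D11E")])
```

The whole job is one pass over the string with a single branch, which is why doing it at the FFI boundary is a modest cost compared with changing the internal string representation.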
Here's an analogy to proposing a move to UTF16 (hopefully people will
realize that the analogous proposal is comparably unrealistic):
At the moment, openmcl uses a complete, compacting, generational
garbage collector. The compacting and generational parts mean that data
moves around and some FFI things are a hassle, for example you need to
copy data before passing it to C++ APIs. Does it make sense to convert
openmcl to use a conservative GC so that things will stop moving
around? This would make FFI use easier in many cases, but would be
extremely disruptive to the current code base and would sacrifice
certain advantages of the current approach. And certainly GB has other
priorities than ripping out and replacing a working garbage collector.
In the GC world, you can trade mobility and completeness for static
locations and interoperability, but the current state of and need for
FFI don't justify making widespread, disruptive changes. In the string
world, you can trade some memory and interoperability for efficiency
and consistency, and the benefits don't seem to justify the cost of the
change (in operational effects and in the cost of Gary's time).
Jeremy Jones wrote:
> Isn't it possible to implement strings such that all of the following are true?
> 1) All Common Lisp functions work correctly for the complete set of
> Unicode code points.
> 2) It uses UTF-16
> 3) It is efficient and constant time when strings don't contain any
> surrogate code points.
> 4) It exhibits slower, but correct behavior when strings do contain
> surrogate code points.
> Just use a bit in each string that indicates whether or not the string
> contains any surrogate code points. If the bit is clear, use the fast
> implementations and if it is set, use the slow implementations. The
> bit would be clear for the vast majority of strings, so they would be
> Have (setf schar) check if the code point being stored requires a
> surrogate code point, and if so, set the bit.
> I think that it would be possible to make the slow implementations not
> too bad by keeping offsets to surrogate code points if they are
> sparse. If they become too dense, perhaps switch to UTF-32. Another
> bit could be used to indicate the string's encoding.
> In fact, it would be possible to use this approach for UTF-8, although
> this might not be worth it.
> The down side of this approach is that all of the string operations
> would need to check the bit and branch, but this would be more
> efficient than using UTF-32 everywhere wouldn't it? Am I missing
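For readers who want to see what the quoted proposal implies in practice, here is a rough sketch of the flag-bit scheme, in Python for illustration only. The class FlaggedUtf16String and its methods are hypothetical names, not anything in OpenMCL. It assumes inputs are Unicode scalar values, never clears the flag once set, and leaves out the suggested offset table and the switch to UTF-32.

```python
class FlaggedUtf16String:
    """Hypothetical UTF-16 string carrying a 'contains surrogates' flag."""

    def __init__(self, text=""):
        self.units = []              # UTF-16 code units
        self.has_surrogates = False  # the proposed per-string bit
        for ch in text:
            self.units.extend(self._encode(ord(ch)))

    def _encode(self, cp):
        """Return the code units for cp, setting the flag if a pair is needed."""
        if cp < 0x10000:
            return [cp]
        self.has_surrogates = True
        v = cp - 0x10000
        return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]

    def __len__(self):
        if not self.has_surrogates:
            return len(self.units)   # fast path: one unit per character
        # Slow path: count every unit that is not a low surrogate.
        return sum(1 for u in self.units if not 0xDC00 <= u <= 0xDFFF)

    def _unit_index(self, i):
        """Slow path: find the code-unit index of character i by linear scan."""
        pos = 0
        for _ in range(i):
            pos += 2 if 0xD800 <= self.units[pos] <= 0xDBFF else 1
        return pos

    def char_at(self, i):
        """Analogue of schar: return the code point at character index i."""
        if not 0 <= i < len(self):
            raise IndexError(i)
        if not self.has_surrogates:
            return self.units[i]     # fast path: constant time
        pos = self._unit_index(i)
        u = self.units[pos]
        if 0xD800 <= u <= 0xDBFF:
            low = self.units[pos + 1]
            return 0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00)
        return u

    def set_char(self, i, cp):
        """Analogue of (setf schar): store code point cp at character index i."""
        if not 0 <= i < len(self):
            raise IndexError(i)
        pos = self._unit_index(i) if self.has_surrogates else i
        old_width = 2 if 0xD800 <= self.units[pos] <= 0xDBFF else 1
        # Storing may change the width of the slot, so later units can shift.
        self.units[pos:pos + old_width] = self._encode(cp)
```

Even in this toy form, every operation branches on the flag, and a store can change the number of code units and shift everything after it. In an implementation where strings are fixed-size vectors, that last part is what makes (setf schar) awkward, which bears on how disruptive the change would be.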