[Openmcl-devel] how many angels can dance on a unicode character?

Jeremy Jones jaj at clozure.com
Mon Apr 23 15:47:37 PDT 2007


I buy that.  GB certainly has more important things to work on.

On 4/23/07, Hamilton Link <helink at sandia.gov> wrote:
> Since GB reminded me of the argument for UTF32, let me try to save him
> some typing.
>
> I think GB's points are...
>
> - what you propose for UTF16 support is a hassle and is very disruptive
> to the code (an alternative already exists, and changing the code would
> require widespread complicated changes for many very low level heavily
> used things, etc.)
>
> - given UTF32 support, it's largely unnecessary (strings Just Work and
> can contain whatever you like, characters are always semantically what
> you think of as a character, and code for converting a string of
> base-characters to an array of unsigned 16-bit integers isn't bad when
> doing FFI stuff to libraries that expect UTF16)
>
> - if someone wants it that badly, they can try implementing it
> themselves, GB's time is better spent elsewhere
>
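
For the FFI case in the second point above, the conversion from a string of full code points to unsigned 16-bit units can be sketched in a few lines. This is Python rather than Lisp, purely for illustration, and `to_utf16_units` is a hypothetical name, not an OpenMCL function.

```python
from array import array


def to_utf16_units(text: str) -> array:
    """Hypothetical helper: encode code points as unsigned 16-bit UTF-16 units."""
    units = array("H")  # "H" = unsigned short, 16 bits on common platforms
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # Code points in the Basic Multilingual Plane fit in one unit.
            units.append(cp)
        else:
            # Anything above U+FFFF is split into a high/low surrogate pair.
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))
            units.append(0xDC00 | (cp & 0x3FF))
    return units
```

The result is one unit per character unless the string contains characters above U+FFFF, each of which takes two units.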
> Here's an analogy to proposing a move to UTF16 (hopefully people will
> realize that the analogous proposal is comparably unrealistic):
>    At the moment, openmcl uses a complete, compacting, generational
> garbage collector.  The compacting and generational parts mean that data
> moves around and some FFI things are a hassle; for example, you need to
> copy data before passing it to C++ APIs.  Does it make sense to convert
> openmcl to use a conservative GC so that things will stop moving
> around?  This would make FFI use easier in many cases, but would be
> extremely disruptive to the current code base and would sacrifice
> certain advantages of the current approach.  And certainly GB has other
> priorities than ripping out and replacing a working garbage collector.
>
> In the GC world, you can trade mobility and completeness for static
> locations and interoperability, but the current state of and need for
> FFI don't justify making widespread, disruptive changes.  In the string
> world, you can trade some memory and interoperability for efficiency
> and consistency, and the benefits don't seem to justify the cost of the
> change (in operational effects and in the cost of Gary's time).
>
> h
>
> Jeremy Jones wrote:
> > Isn't it possible to implement strings such that all of the following are true?
> > 1) All Common Lisp functions work correctly for the complete set of
> > Unicode code points.
> > 2) It uses UTF-16
> > 3) It is efficient and constant time when strings don't contain any
> > surrogate code points.
> > 4) It exhibits slower, but correct behavior when strings do contain
> > surrogate code points.
> >
> > Just use a bit in each string that indicates whether or not the string
> > contains any surrogate code points.  If the bit is clear, use the fast
> > implementations and if it is set, use the slow implementations.  The
> > bit would be clear for the vast majority of strings, so they would be
> > fast.
> >
> > Have (setf schar) check if the code point being stored requires a
> > surrogate code point, and if so, set the bit.
> >
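
A minimal sketch of the flag-bit idea described above, in Python for illustration only. The class `FlaggedUtf16String` and its methods are hypothetical names; `set_char` stands in for `(setf schar)`. It assumes well-formed UTF-16 and never clears the flag once it is set.

```python
class FlaggedUtf16String:
    """Hypothetical UTF-16 string with a 'contains surrogates' flag."""

    def __init__(self, text=""):
        self.units = []              # 16-bit code units
        self.has_surrogates = False  # the per-string bit
        for ch in text:
            encoded = self._encode(ord(ch))
            self.has_surrogates |= len(encoded) == 2
            self.units.extend(encoded)

    @staticmethod
    def _encode(cp):
        if cp < 0x10000:
            return [cp]
        cp -= 0x10000
        return [0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)]

    @staticmethod
    def _is_high(unit):
        return 0xD800 <= unit <= 0xDBFF

    def _unit_index(self, index):
        if not self.has_surrogates:
            return index             # fast path: one unit per character
        pos = 0                      # slow path: linear scan over the units
        for _ in range(index):
            pos += 2 if self._is_high(self.units[pos]) else 1
        return pos

    def __len__(self):
        if not self.has_surrogates:
            return len(self.units)
        # Count every unit except the low half of each pair.
        return sum(1 for u in self.units if not 0xDC00 <= u <= 0xDFFF)

    def char_at(self, index):
        pos = self._unit_index(index)
        unit = self.units[pos]
        if self._is_high(unit):
            low = self.units[pos + 1]
            return chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
        return chr(unit)

    def set_char(self, index, ch):
        pos = self._unit_index(index)
        old_width = 2 if self._is_high(self.units[pos]) else 1
        encoded = self._encode(ord(ch))
        if len(encoded) == 2:
            self.has_surrogates = True   # storing a pair sets the bit
        # Replacing one unit with two (or two with one) shifts the tail.
        self.units[pos:pos + old_width] = encoded
```

Note that storing a character above U+FFFF changes the length of the underlying unit vector, so in this sketch the store is no longer a simple in-place write.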
> > I think that it would be possible to make the slow implementations not
> > too bad by keeping offsets to surrogate code points if they are
> > sparse.  If they become too dense, perhaps switch to UTF-32.  Another
> > bit could be used to indicate the string's encoding.
> >
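
One way the "keep offsets" idea might look, again as a Python sketch with hypothetical names (`build_pair_index`, `unit_offset`). It assumes well-formed UTF-16 and keeps a sorted list of the character indices that are stored as surrogate pairs, so that a lookup becomes a binary search instead of a scan.

```python
from bisect import bisect_left


def build_pair_index(units):
    """Return the sorted character indices that occupy two code units."""
    pair_chars = []
    char_index = 0
    pos = 0
    while pos < len(units):
        if 0xD800 <= units[pos] <= 0xDBFF:   # high surrogate starts a pair
            pair_chars.append(char_index)
            pos += 2
        else:
            pos += 1
        char_index += 1
    return pair_chars


def unit_offset(pair_chars, char_index):
    """Map a character index to its position in the code-unit vector."""
    # Every earlier character stored as a pair shifts the offset by one.
    return char_index + bisect_left(pair_chars, char_index)
```

With k pairs in the string, each lookup costs O(log k), and the index itself costs one entry per pair, which is why this only pays off while the pairs stay sparse.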
> > In fact, it would be possible to use this approach for UTF-8, although
> > this might not be worth it.
> >
> > The down side of this approach is that all of the string operations
> > would need to check the bit and branch, but this would be more
> > efficient than using UTF-32 everywhere, wouldn't it?  Am I missing
> > something?
> > _______________________________________________
> > Openmcl-devel mailing list
> > Openmcl-devel at clozure.com
> > http://clozure.com/mailman/listinfo/openmcl-devel