[Openmcl-devel] Plans for Unicode support within OpenMCL?

Tom Emerson tree at dreamersrealm.net
Wed Mar 22 13:12:15 UTC 2006


Replying to this whole thread:

It is important not to confuse code point values with the
various encoding forms that are available. Viewed this way,
surrogate pairs become an encoding issue. The view of characters that
the programmer sees is of whole code points: they should *never* see
a character value in the range #xD800 -- #xDFFF because these are
surrogate code points, which cannot appear as characters in a valid
Unicode stream. Unicode's character model is well documented: there
should be no confusion there.
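
To make the code point / encoding form distinction concrete, here is
a minimal sketch in Python (chosen only for illustration; nothing
here is OpenMCL code). The function names are hypothetical. It maps
code points to UTF-16 code units and back, and rejects surrogate
values wherever they would show up as characters:

```python
def to_utf16_units(code_point):
    """Return the UTF-16 code units (as ints) for a single code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not characters")
    if code_point < 0x10000:
        return [code_point]              # BMP: one 16-bit unit
    v = code_point - 0x10000             # 20 bits split across two units
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]


def from_utf16_units(units):
    """Decode a list of UTF-16 code units into a list of code points."""
    code_points = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:        # high surrogate: needs a partner
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError("unpaired high surrogate")
            low = units[i + 1]
            code_points.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        else:
            code_points.append(u)
            i += 1
    return code_points
```

The caller only ever handles whole code points; the surrogate pair
exists solely inside the list of 16-bit units.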

With regard to combining characters: these are a necessary part of
the standard, and you have to deal with them. UAX #29 provides
algorithms for doing appropriate text boundary detection, including
grapheme cluster ('glyph') boundaries. However, most people don't
need to deal with that. I've worked in Unicode for almost 8 years and
combining characters rarely show up in practice in my experience.
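
As a small illustration of what combining characters look like to a
program, here is a sketch using Python's standard unicodedata module.
The helper crude_clusters is hypothetical and deliberately
simplistic: it only attaches characters with a non-zero combining
class to the preceding character, which is far less than the UAX #29
rules cover.

```python
import unicodedata


def crude_clusters(text):
    """Group each base character with the combining marks that follow it.

    This is a rough approximation for illustration only, not an
    implementation of UAX #29 grapheme cluster boundaries.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch           # attach mark to previous cluster
        else:
            clusters.append(ch)          # start a new cluster
    return clusters


decomposed = "e\u0301"                   # 'e' followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)
```

Here decomposed is two code points and composed is the single code
point U+00E9, yet both display as one user-visible character.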

For internal representation: multiple internal representations make a
lot of sense. A string whose characters are all below #x80 can be
stored as ASCII. Once it contains a character at or above #x80 it can
be stored as UTF-16. Astral plane characters can either force a
change to UTF-32, or you can set a flag indicating that the 16-bit
representation contains surrogates. If the flag is not set, then you
can index directly into the 16-bit string, since every character
occupies exactly one 16-bit unit. Otherwise you search linearly
for character positions. You can extend this by storing the offset of
the first astral plane character instead of a simple flag, so that
any index below that offset can still be accessed directly.
Regardless, implementation tricks can allow efficient implementation
of Unicode strings in a language in a way that makes the actual
representation transparent to the programmer.
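
A rough sketch of that scheme, again in Python purely for
illustration. The class FlexString and its attribute names are
hypothetical, and a real implementation would live inside the Lisp
runtime rather than in user code. It stores ASCII-only text as bytes,
everything else as 16-bit units, and records the index of the first
astral plane character instead of a flag:

```python
from array import array


class FlexString:
    """Immutable string with an ASCII or UTF-16 backing store."""

    def __init__(self, text):
        cps = [ord(c) for c in text]
        if any(0xD800 <= cp <= 0xDFFF for cp in cps):
            raise ValueError("surrogates are not characters")
        self.length = len(cps)
        self.first_astral = None         # index of first astral character
        if all(cp < 0x80 for cp in cps):
            self.kind = "ascii"
            self.units = bytes(cps)
            return
        self.kind = "utf16"
        self.units = array("H")          # unsigned 16-bit units
        for i, cp in enumerate(cps):
            if cp >= 0x10000:
                if self.first_astral is None:
                    self.first_astral = i
                v = cp - 0x10000
                self.units.append(0xD800 + (v >> 10))
                self.units.append(0xDC00 + (v & 0x3FF))
            else:
                self.units.append(cp)

    def char_at(self, index):
        if not 0 <= index < self.length:
            raise IndexError(index)
        # Fast path: before the first astral character, unit offset
        # and character index coincide.
        if self.first_astral is None or index < self.first_astral:
            return chr(self.units[index])
        # Slow path: walk forward from the first astral character.
        pos = self.first_astral
        for _ in range(index - self.first_astral):
            pos += 2 if 0xD800 <= self.units[pos] <= 0xDBFF else 1
        u = self.units[pos]
        if 0xD800 <= u <= 0xDBFF:
            low = self.units[pos + 1]
            return chr(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
        return chr(u)
```

Callers only see char_at and length; whether the text is held as
bytes or as 16-bit units with surrogates is invisible to them, which
is the transparency argued for above.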

     -tree

---
Tom Emerson
tree at dreamersrealm.net
http://www.dreamersrealm.net/~tree




On Mar 22, 2006, at 3:24 AM, Gary Byers wrote:

>
>
> On Wed, 22 Mar 2006, Takehiko Abe wrote:
>
>> Pat Lasswell wrote:
>>
>>>> If having multiple string types is not desirable, I think UTF-16
>>>> string is a good compromise. The 4-fold increase of string size
>>>> is too much.
>>>>
>>>
>>> A 4-fold increase is too much?   While I don't advocate that the
>>> hardware requirements of OpenMCL keep up with Moore's Law, it seems
>>> that bits and flops are so cheap that a little extra string storage
>>> isn't going to be noticed in amongst the mp3s and jpegs.
>>
>> I don't think "a little extra string storage" is an accurate
>> description. It's 4x increase of what you have.
>
> FWIW, there seem to be around 500K bytes in around 33K strings
> in a roughly-current OpenMCL image.  Increasing that by another
> ~500K (16-bit characters) or ~1500K (32-bit characters) wouldn't
> have too negative an impact.
>
> The example that Christophe Rhodes mentioned - people doing
> bioinformatics - at least points out that there's existing code
> that works under the implicit assumption that characters are 8 bits
> wide and that would fail or suffer severe problems if that assumption
> was violated.  (That particular example might not be that compelling,
> since it sounds like they're treating the "characters" as
> (UNSIGNED-BYTE 8)'s in the first place, but it's certainly believable
> that there is code out there that really wants/needs to retain tens of
> millions of strings (or more) and hundreds of millions of characters
> (or more), and those programs would indeed be significantly and
> negatively affected if their assumptions were violated.)
>
> The assumption that "characters are 8 bits wide, at least by default
> and most of the time" is pretty pervasive.  (It's also likely to be
> something of a QWERTY phenomenon.)  There are similar,
> almost-as-pervasive assumptions about the size of a machine address
> (32-bit machines have been prevalent for decades), and moving to a
> 64-bit platform violates those assumptions (in general, the same
> program will need 2X as much memory for the same pointer-based data,
> cache lines will fill up 2X faster, etc.)  The point that I'm trying
> to make with this admittedly somewhat strained analogy is that it's no
> more valid to assume things about the width/cost of characters than it
> is to make similar assumptions about the width/cost of addresses.
>
> Just to strain things a bit further: it's hard to avoid the
> observation that a lot of 64-bit addresses are pointing to (or could
> be made to point to) the low 32-bits of memory; the upper 32-bits of
> such pointers are all 0.  I remember a semi-serious discussion a few
> years ago where someone (more-or-less thinking out loud) wondered
> whether it was possible to exploit this in some way - to have some
> data structures whose element type was something imaginary like (T 32)
> - and avoid wasting all of the memory otherwise needed to store those
> 0 bits.  (Whoever proposed this eventually sobered up, and would
> probably claim to have no memory of the unfortunate episode.)
>
> I don't believe that they went so far as to suggest that the imaginary
> type that I fuzzily referred to as (T 32) - "general Lisp objects that
> happen to fit in 32 bits" - be called BASE-T.  Had they had the
> foresight to do so, the point of this analogy might have been clearer
> earlier ...
>
> There were a number of reasons why this BASE-T scheme wasn't viable
> (including the pervasive assumption that lisp objects - pointers or
> immediate things like FIXNUMs - are at least the same size.)  The
> analogy between this imaginary partitioning of T into BASE-T /
> EXTENDED-T and a similar partitioning of CHARACTER into BASE-CHAR /
> EXTENDED-CHAR isn't very good for many reasons (software gets to
> decide what is and isn't a character, hardware has more say in what's
> an address) and there are existence proofs that the CHARACTER
> partitioning can be made at least somewhat usable in some cases.  It
> seems to me that it does share some things in common (both at the
> implementation level and in terms of whatever cognitive burden it
> puts on the user) and certainly shares similar well-intentioned
> motivations.
>
>
>> If input is 1000k, the buffer will be 4000k. and so on.
>>
>> I believe that having multiple string types is better.
>> (If Moore's Law also applies to processor speed, then we can
>> perhaps ignore a little bit of extra cpu cycles?)
>>
>> MCL has extended-string which is 16-bit and I like it. I like
>> it perhaps because I have never cared how it is implemented.
>> And crucially, Gary doesn't like it (or doesn't seem
>> enthusiastic about it.)
>>
>
> In MCL's case, there was a lot of DWIMming going on: the :ELEMENT-TYPE
> of a stream opened with OPEN is supposed to default to CHARACTER, but
> (IIRC) MCL defaulted this argument to BASE-CHAR, under the assumption
> that that's probably what the user intended (the :EXTERNAL-FORMAT
> argument either didn't exist or was already used for some other
> purpose), and I think that there were several other instances of
> this (some of which I may have been responsible/to blame for.)
> In other implementations, I've seen things like:
>
> (concatenate 'string a-base-string a-non-base-string)
>
> fail.  Of course that's just a bug, but it's a bug that likely
> wouldn't exist if there was only one internal character  
> representation.
>
> It's certainly possible to try to be less DWIMmy/sloppy about things
> and to try to implement things correctly, and it's possible that the
> runtime cost of multiple character representations could be kept
> acceptably small.  I suspect (and Christophe's anecdote seems to
> confirm this suspicion) that many users (and lots of existing code)
> are pretty casual about the whole issue, and that unless users/code
> are very rigorous about saying :ELEMENT-TYPE 'BASE-CHAR when it's
> desirable to do so they'd wind up getting EXTENDED-CHARs and might
> only notice/care if they were getting a lot of them.  People/programs
> that have reason to care about this would have a means of dealing
> with their assumptions; I'm not entirely convinced that that's a
> Good Thing, and I'm largely convinced that it's not a Good Enough
> Thing to justify the partitioning and all of the conceptual and
> runtime overhead and baggage that goes with it.
>
> I sort of look at this as a case where the only way to wean oneself
> of the assumption that characters are usually 8 bits wide in memory
> is to do so cold turkey; that's sometimes painful, but often healthier
> in the long term.
>
> I do think that the point that 16 bits is probably enough is a good
> one; the subset of UTF-16 that doesn't include surrogate pairs -
> which I guess is often referred to as UCS-2 - is probably a more
> reasonable step (if a 2X increase in string size is "quitting cold
> turkey", a 4X increase is probably "quitting cold turkey and then
> getting religion", which does seem a little drastic ...)
>
>
>
>
>


