[Openmcl-devel] Plans for Unicode support within OpenMCL?

Wed Mar 22 00:24:02 PST 2006

On Wed, 22 Mar 2006, Takehiko Abe wrote:

> Pat Lasswell wrote:
>
>>> If having multiple string types is not desirable, I think UTF-16
>>> string is a good compromise. The 4-fold increase of string size
>>> is too much.
>>>
>>
>> A 4-fold increase is too much?   While I don't advocate that the
>> hardware requirements of OpenMCL keep up with Moore's Law, it seems
>> that bits and flops are so cheap that a little extra string storage
>> isn't going to be noticed in amongst the mp3s and jpegs.
>
> I don't think "a little extra string storage" is an accurate
> description. It's a 4x increase of what you have.

FWIW, there seem to be around 500K bytes in around 33K strings
in a roughly-current OpenMCL image.  Increasing that by another
~500K (16-bit characters) or ~1500K (32-bit characters) wouldn't
have too negative an impact.
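
A rough sketch of that arithmetic, assuming the existing strings use
one byte per character.  The function name is hypothetical and the
500000 figure is just the round number quoted above, not a new
measurement:

```lisp
;; Hypothetical helper: extra bytes needed when every character of an
;; 8-bit string population is widened to BITS-PER-CHAR bits.
(defun extra-string-bytes (current-bytes bits-per-char)
  (* current-bytes (1- (/ bits-per-char 8))))

(extra-string-bytes 500000 16) ; => 500000
(extra-string-bytes 500000 32) ; => 1500000
```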

The example that Christophe Rhodes mentioned - people doing
bioinformatics - at least points out that there's existing code
that works under the implicit assumption that characters are 8 bits
wide and that would fail or suffer severe problems if that assumption
were violated.  (That particular example might not be that compelling,
since it sounds like they're treating the "characters" as
(UNSIGNED-BYTE 8)'s in the first place, but it's certainly believable
that there is code out there that really wants/needs to retain tens of
millions of strings (or more) and hundreds of millions of characters
(or more), and those programs would indeed be significantly and
negatively affected if their assumptions were violated.)

The assumption that "characters are 8 bits wide, at least by default
and most of the time" is pretty pervasive.  (It's also likely to be
something of a QWERTY phenomenon.)  There are similar,
almost-as-pervasive assumptions about the size of a machine address
(32-bit machines have been prevalent for decades), and moving to a
64-bit platform violates those assumptions (in general, the same
program will need 2X as much memory for the same pointer-based data,
cache lines will fill up 2X faster, etc.)  The point that I'm trying
to make with this admittedly somewhat strained analogy is that it's no
more valid to assume things about the width/cost of characters than it
is to make similar assumptions about the width/cost of addresses.

Just to strain things a bit further: it's hard to avoid the
observation that a lot of 64-bit addresses are pointing to (or could
be made to point to) the low 32 bits of memory; the upper 32 bits of
such pointers are all 0.  I remember a semi-serious discussion a few
years ago where someone (more-or-less thinking out loud) wondered
whether it was possible to exploit this in some way - to have some
data structures whose element type was something imaginary like (T 32)
- and avoid wasting all of the memory otherwise needed to store those
0 bits.  (Whoever proposed this eventually sobered up, and would
probably claim to have no memory of the unfortunate episode.)

I don't believe that they went so far as to suggest referring to the
imaginary type that I fuzzily referred to as (T 32) - "general Lisp
objects that happen to fit in 32 bits" - as BASE-T.  Had they had the
foresight to do so, the point of this analogy might have been clearer
earlier ...

There were a number of reasons why this BASE-T scheme wasn't viable
(including the pervasive assumption that lisp objects - pointers or
immediate things like FIXNUMs - are at least the same size.)  The
analogy between this imaginary partitioning of T into BASE-T /
EXTENDED-T and a similar partitioning of CHARACTER into BASE-CHAR /
EXTENDED-CHAR isn't very good for many reasons (software gets to
decide what is and isn't a character, hardware has more say in what's
an address) and there are existence proofs that the CHARACTER
partitioning can be made at least somewhat usable in some cases.  It
seems to me that it does share some things in common (both at the
implementation level and in terms of whatever cognitive burden it
puts on the user) and certainly shares similar well-intentioned
motivations.

> If input is 1000k, the buffer will be 4000k, and so on.
>
> I believe that having multiple string types is better.
> (If Moore's Law also applies to processor speed, then we can
> perhaps ignore a little bit of extra cpu cycles?)
>
> MCL has extended-string which is 16-bit and I like it. I like
> it perhaps because I have never cared how it is implemented.
> And crucially, Gary doesn't like it (or doesn't seem
> enthusiastic about it.)
>

In MCL's case, there was a lot of DWIMming going on: the :ELEMENT-TYPE
of a stream opened with OPEN is supposed to default to CHARACTER, but
(IIRC) MCL defaulted this argument to BASE-CHAR, under the assumption
that that's probably what the user intended (the :EXTERNAL-FORMAT
argument either didn't exist or was already used for some other
purpose), and I think that there were several other instances of
this (some of which I may have been responsible/to blame for).
In other implementations, I've seen things like:

(concatenate 'string a-base-string a-non-base-string)

fail.  Of course that's just a bug, but it's a bug that likely
wouldn't exist if there was only one internal character representation.
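
For reference, a minimal sketch of what that call should do in a
conforming implementation.  The variable names are hypothetical, and
whether the two strings really have different internal
representations depends on the implementation:

```lisp
;; A string whose element type is BASE-CHAR.
(defvar *a-base-string* (coerce "abc" 'base-string))

;; A string whose element type is CHARACTER.  Whether this is a
;; distinct representation from a base string is
;; implementation-dependent.
(defvar *a-non-base-string*
  (make-string 3 :element-type 'character :initial-element #\x))

;; Mixing the two should simply yield a fresh 6-character string.
(defvar *joined*
  (concatenate 'string *a-base-string* *a-non-base-string*))

(length *joined*)           ; => 6
(string= *joined* "abcxxx") ; => T
```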

It's certainly possible to try to be less DWIMmy/sloppy about things
and to try to implement things correctly, and it's possible that the
runtime cost of multiple character representations could be kept
acceptably small.  I suspect (and Christophe's anecdote seems to
confirm this suspicion) that many users (and lots of existing code)
are pretty casual about the whole issue, and that unless users/code
are very rigorous about saying :ELEMENT-TYPE 'BASE-CHAR when it's
desirable to do so they'd wind up getting EXTENDED-CHARs and might
only notice/care if they were getting a lot of them.  People/programs
that have reason to care about this would have a means of dealing
with their assumptions; I'm not entirely convinced that that's a
Good Thing, and I'm largely convinced that it's not a Good Enough
Thing to justify the partitioning and all of the conceptual and
runtime overhead and baggage that goes with it.
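
As an illustration of the rigor that would be required, here is a
small sketch using MAKE-STRING.  The variable names are hypothetical;
in an implementation where BASE-CHAR and CHARACTER are the same type,
the two strings below end up with the same representation:

```lisp
;; Explicitly requesting the narrow representation.
(defvar *narrow*
  (make-string 4 :element-type 'base-char :initial-element #\a))

;; Saying nothing: the element type defaults to CHARACTER, so in a
;; partitioned implementation this may be the wider representation.
(defvar *default*
  (make-string 4 :initial-element #\a))

(typep *narrow* 'base-string) ; => T
(string= *narrow* *default*)  ; => T, the contents compare equal
```

The casual version is the shorter one, which is why code that never
mentions an element type would tend to get the wider strings.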

I sort of look at this as a case where the only way to wean oneself
of the assumption that characters are usually 8 bits wide in memory
is to do so cold turkey; that's sometimes painful, but often healthier
in the long term.

I do think that the point that 16 bits is probably enough is a good
one; the subset of UTF-16 that doesn't include surrogate pairs -
which I guess is often referred to as UCS-2 - is probably a more
reasonable step (if a 2X increase in string size is "quitting cold
turkey", a 4X increase is probably "quitting cold turkey and then
getting religion", which does seem a little drastic ...)
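
To make the UCS-2 / UTF-16 distinction concrete, here is a small
sketch operating on integer code points.  The function names are
hypothetical and nothing here reflects how OpenMCL does or would
implement characters:

```lisp
;; Code points reserved for UTF-16 surrogates.
(defun surrogate-code-p (code)
  (<= #xD800 code #xDFFF))

;; True if CODE can be stored directly in one 16-bit unit, i.e. it is
;; in the surrogate-free 16-bit subset described above.
(defun fits-in-ucs-2-p (code)
  (and (<= 0 code #xFFFF)
       (not (surrogate-code-p code))))

;; UTF-16 encoding of a code point as a list of 16-bit code units:
;; one unit for 16-bit code points, a surrogate pair for larger ones.
(defun utf-16-code-units (code)
  (if (<= code #xFFFF)
      (list code)
      (let ((offset (- code #x10000)))
        (list (+ #xD800 (ldb (byte 10 10) offset))
              (+ #xDC00 (ldb (byte 10 0) offset))))))

(fits-in-ucs-2-p #x3042)    ; => T
(fits-in-ucs-2-p #x1D11E)   ; => NIL
(utf-16-code-units #x1D11E) ; => (55348 56606), i.e. (#xD834 #xDD1E)
```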