[Openmcl-devel] Plans for Unicode support within OpenMCL?

Tue Mar 21 05:02:59 PST 2006

On Tue, 21 Mar 2006, Takehiko Abe wrote:

> Gary Byers wrote:
>
>> "Plans" in the sense of "general agreement that it'd be good and
>> necessary and should really be done someday, and some thoughts about
>> how to do it" : yes.
>>
>> "Plans" in the sense of "concrete effort, funded or voluntary" : sadly no.
>>
>> There are certainly some technical and design issues to be resolved;
>> there was some discussion of those issues here a couple of years ago:
>
> I used to think that characters should be implemented with unicode
> direct codepoints. I've changed my mind since and now believe
> that having base-string with latin-1 and extended-string with
> UTF-16 is good enough for dealing with unicode. We (ok i am not sure
> who we are) can add UTF-32 string later in case UTF-16 string turns
> out to be inadequate.

I'd be more willing to agree as long as we (whoever we are ...) are
talking about a large and useful subset of UTF-16.

The argument (my argument, at least) against representing strings
internally in an encoded format has to do with cases where you're
not doing I/O with them.  Suppose (since it's easier to show in
an example) that strings were encoded in UTF-8, and you had a
string containing a mixture of Hebrew "aleph" characters and Latin
"a" characters UTF-8-encoded as a series of octets in a string S:

#xd7 #x90 #x61 #xd7 #x90 #x61 #x61 #xd7 #x90 #xd7 #x90

What's (SCHAR S 3) ?  (Certainly an answerable question, but it's
                        not generally answerable in O(1) time unless
                        we maintain some auxiliary information.)

What's (LENGTH S)  ?  (Answerable in O(1) time if we're willing to
                        maintain both a logical length in elements
                        and a physical length in octets/bytes; devolves
                        to O(n) otherwise.)
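
To make the cost concrete, here is a minimal sketch of what SCHAR and
LENGTH would have to do over raw UTF-8 octets with no auxiliary index.
It is written in Python purely for illustration; the helper names
`utf8_length` and `utf8_char_at` are hypothetical and not part of
OpenMCL. It assumes aleph (U+05D0) is the octet pair D7 90 and Latin
"a" is the single octet 61.

```python
# Aleph (U+05D0) is two octets in UTF-8 (D7 90); Latin "a" is one (61).
S = bytes([0xD7, 0x90, 0x61, 0xD7, 0x90, 0x61, 0x61,
           0xD7, 0x90, 0xD7, 0x90])

def is_continuation(octet):
    """True for UTF-8 continuation octets, which look like 10xxxxxx."""
    return (octet & 0xC0) == 0x80

def utf8_length(octets):
    """Logical length: count the octets that start a character. O(n)."""
    return sum(1 for o in octets if not is_continuation(o))

def utf8_char_at(octets, index):
    """Return character number `index` by scanning from the start. O(n)."""
    if index < 0:
        raise IndexError(index)
    seen = -1
    for pos, octet in enumerate(octets):
        if is_continuation(octet):
            continue
        seen += 1
        if seen == index:
            # Extend over the continuation octets of this character.
            end = pos + 1
            while end < len(octets) and is_continuation(octets[end]):
                end += 1
            return octets[pos:end].decode("utf-8")
    raise IndexError(index)
```

For this S, `len(S)` is 11 octets while `utf8_length(S)` is 7
characters, and `utf8_char_at(S, 3)` is "a". Both helpers walk the
octets from the front, which is the O(n) behavior described above.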

If we do something like (DOTIMES (I (LENGTH S))
                           (SETF (SCHAR S I) aleph))

we probably have to allocate more memory for the octets (we might
need to do this if we changed a single "a" to an aleph.)  Even if
we do something that reduces the number of octets (changing an
aleph to an "a"), we either have to shuffle things around or maintain
some data structure that tells us which octets start characters, and
we're generally finding that SIMPLE-STRINGs aren't so simple anymore.
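
The same point about (SETF SCHAR) can be sketched as follows, again in
Python for illustration only. `utf8_set_char` is a hypothetical helper,
not anything OpenMCL provides; it assumes the string is held as a
mutable buffer of UTF-8 octets with no side table of character starts.

```python
def utf8_set_char(octets, index, new_char):
    """Replace character `index` in a bytearray of UTF-8 octets.

    The buffer grows or shrinks when the old and new characters encode
    to different numbers of octets, and everything after the change
    has to be shifted.
    """
    start = -1
    seen = -1
    for pos, octet in enumerate(octets):
        if (octet & 0xC0) != 0x80:      # an octet that starts a character
            seen += 1
            if seen == index:
                start = pos
                break
    if start < 0:
        raise IndexError(index)
    end = start + 1
    while end < len(octets) and (octets[end] & 0xC0) == 0x80:
        end += 1
    # Slice assignment resizes the bytearray when lengths differ.
    octets[start:end] = new_char.encode("utf-8")

s = bytearray(b"\xd7\x90a")           # aleph followed by "a": 3 octets
utf8_set_char(s, 1, "\u05d0")         # "a" -> aleph: buffer grows to 4
grown = len(s)
utf8_set_char(s, 0, "a")              # aleph -> "a": buffer shrinks to 3
shrunk = len(s)
```

Neither store is a simple write into a fixed slot: the first has to
find room for an extra octet and the second has to close a gap.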

>
> If having multiple string types is not desirable, I think UTF-16
> string is a good compromise. The 4-fold increase of string size
> is too much.
>

Using a fixed (rather than variable) number of octets per character
therefore seems highly desirable.  Using one octet per character (as
OpenMCL currently does) is fine as far as it goes, but it doesn't
go very far.  Using two octets would allow most (almost all, IIRC)
characters in most (almost all, IIRC) widely-used languages to be
directly represented.  (As I understand it, some subset of the
16-bit space is used in UTF-16 encoding to represent variable-length
encodings, but the fixed-length subset of that space is still very,
very useful.)  I don't know whether the Unicode code points that
can't be encoded in 16 bits are "interesting" enough to justify
what'd probably be a 4x increase in string size, and I don't know
exactly how to evaluate that (the answer may depend on who you ask
and on what languages are important to them.)
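
A rough sketch of the trade-off, in Python for illustration only (the
function names are hypothetical). It relies on two facts about UTF-16:
the range #xD800 through #xDFFF is reserved for surrogate pairs, and
any code point above #xFFFF needs two 16-bit units.

```python
def fits_in_16_bits(ch):
    """True if `ch` can be stored directly as a single 16-bit code."""
    code = ord(ch)
    # D800..DFFF is the range UTF-16 reserves for surrogate pairs.
    return code <= 0xFFFF and not (0xD800 <= code <= 0xDFFF)

def utf16_units(ch):
    """Number of 16-bit code units UTF-16 uses for one character."""
    return len(ch.encode("utf-16-be")) // 2

def fixed_width_octets(text, octets_per_char):
    """Storage for `text` when every character has the same width."""
    return len(text) * octets_per_char

ALEPH = "\u05d0"
G_CLEF = "\U0001d11e"   # MUSICAL SYMBOL G CLEF, above the 16-bit range

text = "a" * 10
sizes = [fixed_width_octets(text, w) for w in (1, 2, 4)]
```

Here `sizes` is [10, 20, 40]: the 2x and 4x costs relative to one octet
per character. "a" and aleph each fit in one 16-bit unit, while the
G clef does not and would need a surrogate pair in real UTF-16.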

> Unicode has combining characters and covers lots of scripts/writing
> systems. Handling them is inherently hard and having characters
> with unicode direct codepoints does not make it easier much, imo.

Note that even ASCII has "control characters" which are often only
recognized by old (sometimes obsolete) serial communications equipment;
I'm not sure that saying that a Unicode combining character is a
character with a unique CHAR-CODE is any different from saying that
DC2 (CODE-CHAR 18) is.

I agree that supporting lots of different scripts and writing systems
is hard, but I think that's orthogonal to the question of how things
are represented internally. I believe that having a single type of
CHARACTER (not having such a thing as EXTENDED-CHAR, having all
SIMPLE-STRINGs be SIMPLE-BASE-STRINGs) also has advantages (I believe
this both from an implementor's and a user's point of view), but
that's also orthogonal to all of the issues that come up when dealing
with the outside world.

>
>
> T.