[Openmcl-devel] utf-8 support in openmcl?

Gary Byers gb at clozure.com
Wed Apr 16 11:50:36 PDT 2003



On Wed, 16 Apr 2003, Tim Moore wrote:

> I took a really quick look at this a few weeks ago.  It looks like there's
> some (possibly vestigial) support for 16 bit general characters and
> strings. I imagine this was to support the CLtL1 notion of character
> attributes.
>

It's left over from MCL's support for various 16-bit encodings
(SHIFT-JIS?).  What is now OpenMCL was originally intended to be
used in systems (spacecraft) that didn't do much character I/O (and
didn't need to do what little they did in non-ROMAN languages).

To be honest, I was glad to find an excuse to get rid of EXTENDED-CHAR
and EXTENDED-STRING: rightly or wrongly, people have a tendency to
say things like:

 (coerce foo 'simple-string)

when they (probably) often want a SIMPLE-BASE-STRING, and people say
things like:

 (open "file" :direction :output :element-type 'character ...)

and then are surprised to see a lot of ASCII NULs in the resulting
file (though one could argue that this is another job for
:EXTERNAL-FORMAT).
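
The forms people probably want in those cases mirror the snippets
above (illustrative only, under that two-string-type world):

 (coerce foo 'simple-base-string)
 (open "file" :direction :output :element-type 'base-char ...)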

It's also unfortunate if things like CHAR and SCHAR have to deal
with two different types of string, and the degenerate case (where
they dispatch at runtime in the middle of a loop) isn't as uncommon as
one might hope.
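
A contrived sketch of that degenerate case (nothing OpenMCL-specific;
the declaration just says "some kind of string"):

 (defun count-spaces (s)
   (declare (type string s))  ; S might be a base string or an extended string
   ;; With two string representations, each CHAR call here may have to
   ;; dispatch on S's actual representation at runtime, inside the loop.
   (loop for i below (length s)
         count (char= (char s i) #\Space)))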


> Would that be a viable path to unicode support? I believe that UTF-16
> covers most of the Unicode scripts that people are likely to use in
> practice.

My understanding is that UTF-16 is itself an encoding of 21-bit characters,
though it's often successful in encoding them in a single (UNSIGNED-BYTE 16).
My impression is that in some earlier versions of the Unicode standard
"all" characters were directly representable in 16 bits, but that that's
no longer true: some pairs of 16-bit values are used to denote larger
characters.
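
For concreteness, UTF-16 represents a character above #xFFFF as a
"surrogate pair" of two 16-bit values; the arithmetic looks roughly
like this (a sketch for illustration, not anything in OpenMCL):

 (defun code-point-to-surrogates (code)
   ;; Split a code point above #xFFFF into its two 16-bit surrogate halves.
   (let ((v (- code #x10000)))
     (values (+ #xD800 (ldb (byte 10 10) v))    ; high surrogate, #xD800-#xDBFF
             (+ #xDC00 (ldb (byte 10 0) v)))))  ; low surrogate,  #xDC00-#xDFFF

 ;; e.g. (code-point-to-surrogates #x1D11E) => #xD834 and #xDD1E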

I think that it's undesirable to use any type of encoding or escaping
to represent a CL string: it's desirable that things like LENGTH and
CHAR/SCHAR and AREF and ELT and ... be unit-cost operations.

I suppose that we might be able to do something like:

 - widen the character type to 16 bits
 - say that it is an error to use any 16-bit character code that's
   used to encode either half of a UTF-16 "surrogate pair" in a CL
   string, and try to enforce this restriction where possible (a
   sketch of such a check appears below).

This would make CL strings a useful subset of UTF-16 and would allow
primitive string operations to remain primitive.
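
The kind of check the second item refers to might look roughly like
this (a sketch only; the function names are made up for illustration,
not anything in OpenMCL):

 (defun surrogate-code-p (code)
   ;; #xD800-#xDFFF is the range UTF-16 reserves for surrogate halves.
   (<= #xD800 code #xDFFF))

 (defun check-storable-char (char)
   ;; Refuse to store a surrogate half as a string element.
   (if (surrogate-code-p (char-code char))
       (error "Code #x~X is a UTF-16 surrogate half; not a valid string element."
              (char-code char))
       char))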

I -think- that this is not too far from what the status quo has been
in the C world: in Apple's header files, the "unichar" type is currently
defined as "unsigned short" (16-bit), though I also believe that they've
announced plans to change this.

The other attractive alternative that I can see is to bite the bullet
and make CHARACTERs wide enough to support full Unicode; UTF-16 and
UTF-8 would then just be interesting external representations for
strings.

I don't have enough experience dealing with this sort of issue to
know how reasonable this opinion is, but something makes me think that
a decision to limit characters to 16 bits would appear shortsighted
in the near future, especially if the rationale for that decision was
to save a few hundred K bytes ...

>
> Tim
>





