[Openmcl-devel] utf-8 support in openmcl?
Gary Byers
gb at clozure.com
Wed Apr 16 11:50:36 PDT 2003
On Wed, 16 Apr 2003, Tim Moore wrote:
> I took a really quick look at this a few weeks ago. It looks like there's
> some (possibly vestigial) support for 16 bit general characters and
> strings. I imagine this was to support the CLtL1 notion of character
> attributes.
>
It's left over from MCL's support for various 16-bit encodings
(SHIFT-JIS ?). What is now OpenMCL was originally intended to be
used in systems (spacecraft) that didn't do much character I/O (and
didn't need to do what little they did in non-ROMAN languages).
To be honest, I was glad to find an excuse to get rid of EXTENDED-CHAR
and EXTENDED-STRING: rightly or wrongly, people have a tendency to
say things like:
(coerce foo 'simple-string)
when they (probably) often want a SIMPLE-BASE-STRING, and people say
things like:
(open "file" :direction :output :element-type 'character ...)
and then are surprised to see a lot of ASCII NULs in the resulting file
(though one could argue that this is another job for :EXTERNAL-FORMAT).
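For instance (just a sketch, nothing MCL- or OpenMCL-specific; which
concrete string type the first form returns is up to the implementation):

(let ((foo '(#\a #\b #\c)))
  (list (coerce foo 'simple-string)        ; SIMPLE-STRING is a union type; may be a 16-bit string
        (coerce foo 'simple-base-string))) ; element type pinned to BASE-CHAR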
It's also unfortunate if things like CHAR and SCHAR have to deal
with two different types of string, and the degenerate case (where
they dispatch at runtime in the middle of a loop) isn't as uncommon as
one might hope.
> Would that be a viable path to unicode support? I believe that UTF-16
> covers most of the Unicode scripts that people are likely to use in
> practice.
My understanding is that UTF-16 is itself an encoding of 21-bit characters,
though it's often successful in encoding them in a single (UNSIGNED-BYTE 16).
My impression is that in some earlier versions of the Unicode standard
"all" characters were directly representable in 16 bits, but that that's
no longer true: some pairs of 16-bit values are used to denote larger
characters.
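To make the surrogate-pair mechanism concrete, here's a small sketch
(the function name is made up, and this isn't anything from OpenMCL)
of how a code point above #xFFFF gets split into two 16-bit halves:

(defun code-point-to-utf-16 (code-point)
  ;; Code points below #x10000 are a single 16-bit unit; anything
  ;; larger is biased by #x10000 and split into a high surrogate
  ;; (#xD800-#xDBFF) and a low surrogate (#xDC00-#xDFFF).
  (if (< code-point #x10000)
      (list code-point)
      (let ((c (- code-point #x10000)))
        (list (+ #xD800 (ldb (byte 10 10) c))
              (+ #xDC00 (ldb (byte 10 0) c))))))

;; (code-point-to-utf-16 #x1D11E) => (#xD834 #xDD1E)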
I think that it's undesirable to use any type of encoding or escaping
to represent a CL string: it's desirable that things like LENGTH and
CHAR/SCHAR and AREF and ELT and ... be unit-cost operations.
I suppose that we might be able to do something like:
- widen the character type to 16 bits
- say that it is an error to use any 16-bit character code that's
used to encode either half of a UTF-16 "surrogate pair" in a CL
string, and try to enforce this restriction where possible.
This would make CL strings a useful subset of UTF-16 and would allow
primitive string operations to remain primitive.
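Concretely, the restriction would amount to something like this
(hypothetical name, just to illustrate the check):

(defun valid-16-bit-char-code-p (code)
  ;; A 16-bit value is acceptable as a CHAR-CODE as long as it
  ;; doesn't fall in the range that UTF-16 reserves for surrogates.
  (and (<= 0 code #xFFFF)
       (not (<= #xD800 code #xDFFF))))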
I -think- that this is not too far from what the status quo has been
in the C world: in Apple's header files, the "unichar" type is currently
defined as "unsigned short" (16-bit), though I also believe that they've
announced plans to change this.
The other attractive alternative that I can see is to bite the bullet
and make CHARACTERs wide enough to support full Unicode; UTF-16 and
UTF-8 would then just be interesting external representations for
strings.
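As a sketch of what "just an external representation" would mean (again,
not OpenMCL code, and the name is made up), any full-width code point up
to #x10FFFF flattens to at most four octets on the way out:

(defun code-point-to-utf-8 (code-point)
  ;; Standard UTF-8 layout: 1 octet below #x80, 2 below #x800,
  ;; 3 below #x10000, 4 for everything up to #x10FFFF.
  (cond ((< code-point #x80)
         (list code-point))
        ((< code-point #x800)
         (list (logior #xC0 (ldb (byte 5 6) code-point))
               (logior #x80 (ldb (byte 6 0) code-point))))
        ((< code-point #x10000)
         (list (logior #xE0 (ldb (byte 4 12) code-point))
               (logior #x80 (ldb (byte 6 6) code-point))
               (logior #x80 (ldb (byte 6 0) code-point))))
        (t
         (list (logior #xF0 (ldb (byte 3 18) code-point))
               (logior #x80 (ldb (byte 6 12) code-point))
               (logior #x80 (ldb (byte 6 6) code-point))
               (logior #x80 (ldb (byte 6 0) code-point))))))

;; (code-point-to-utf-8 #x1D11E) => (#xF0 #x9D #x84 #x9E)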
I don't have enough experience dealing with this sort of issue to
know how reasonable this opinion is, but something makes me think that
a decision to limit characters to 16 bits would appear shortsighted
in the near future, especially if the rationale for that decision was
to save a few hundred K bytes ...
>
> Tim
>
_______________________________________________
Openmcl-devel mailing list
Openmcl-devel at clozure.com
http://clozure.com/cgi-bin/mailman/listinfo/openmcl-devel