[Openmcl-devel] Unicode in OpenMCL

Wed Jun 23 13:00:43 PDT 2004

On Wed, 23 Jun 2004, Steve Jenson wrote:

> Hi,
>
> What is the status of Unicode in OpenMCL? A google search turned up a
> message to this list from april 2003 saying that 16-bit characters
> should be considered deprecated but that no significant work to support
> unicode has happened. How can I help?
>
> Here's a presentation about how Python dealt with this issue;
>   http://www.egenix.com/files/python/Unicode-EPC2002-Talk.pdf
>
> Tim Bray has also written some useful essays on this subject:
>   http://tbray.org/ongoing/When/200x/2003/04/06/Unicode
>   http://tbray.org/ongoing/When/200x/2003/04/26/UTF
>   http://www.tbray.org/ongoing/When/200x/2003/04/30/JavaStrings
>
> Java's String, Character, and char have a lot of great lessons to teach
> (Wow, what a nice way to put that).
>
> HTH,
> Steve
>

I think that it would be bad to have EXTENDED-CHAR (basically, bad to
have more than one type of CHARACTER); it makes more sense to me to
make all CHARACTERs BASE-CHARs and make CHAR-CODE-LIMIT be 2^24
or so.  (I think that Unicode 4 needs about 21 bits to natively encode
any character.)
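As a quick sanity check on the "about 21 bits" figure, here is a small sketch in Python (not Lisp; the variable names are made up for illustration). It assumes only that the highest code point Unicode defines is U+10FFFF.

```python
# Highest code point defined by Unicode (U+10FFFF).
MAX_CODE_POINT = 0x10FFFF

# Number of bits needed to hold that value as a plain unsigned integer.
bits_needed = MAX_CODE_POINT.bit_length()

# The CHAR-CODE-LIMIT suggested above; hypothetical, not an existing OpenMCL value.
proposed_char_code_limit = 2 ** 24

# Every Unicode code point fits below the proposed limit, with room to spare.
fits = MAX_CODE_POINT < proposed_char_code_limit

print(bits_needed, fits)
```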

A BASE-STRING would then be a vector of 24-bit immediate objects (in
practice, this would almost certainly mean "a vector of 32-bit
immediate objects with 8-11 unused bits"), and lisp strings would
be effectively UTF-32 encoded (if I'm using the correct term).

This does mean that all strings (including symbol pnames, etc.) would
take up about 4x as much space as they currently do; in a standard
OpenMCL image, I think that that adds up to a few hundred KB (much of
which would/could be readonly).  (I don't think that this would be
too onerous.)
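To make the 4x estimate concrete, here is a rough sketch in Python (the helper name utf32_units is hypothetical). It compares an 8-bit representation of a symbol name with a 32-bit-per-character one, and shows how many bits of each 32-bit element would go unused.

```python
def utf32_units(text):
    """Return the string as a list of 32-bit code units, one per character."""
    data = text.encode("utf-32-le")  # "-le" variant: no byte-order mark is added
    return [int.from_bytes(data[i:i + 4], "little") for i in range(0, len(data), 4)]

name = "CHAR-CODE-LIMIT"

eight_bit_size = len(name.encode("latin-1"))   # one octet per character
utf32_size = len(name.encode("utf-32-le"))     # four octets per character
growth = utf32_size // eight_bit_size

units = utf32_units(name)

# Unused bits per 32-bit element: 8 if characters are 24 bits wide,
# 11 if only the 21 bits Unicode needs are counted.
unused_if_24 = 32 - 24
unused_if_21 = 32 - 21

print(eight_bit_size, utf32_size, growth)
```

Each 32-bit unit is simply the character's code, which is what makes indexing into such a string a constant-time operation.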

Since (most of) the rest of the world doesn't use UTF-32, we'd
obviously need some way of translating these lisp strings to and from
other more useful encodings (ASCII, LATIN-1, UTF-8, UTF-16, ...).
We'd probably need to extend the :EXTERNAL-FORMAT argument (to
OPEN/LOAD/COMPILE-FILE) to allow the encoding to be specified, and
this would default to an 8-bit ASCII or ISO representation (e.g.,
something that corresponds to the low 8 bits of UTF-32).
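A sketch of what such a translation layer does, in Python rather than Lisp. The functions write_external and read_external and their external_format parameter are invented for illustration; they only mimic the idea of an :EXTERNAL-FORMAT argument with an 8-bit default and are not an OpenMCL API.

```python
def write_external(text, external_format="latin-1"):
    """Translate an in-memory string to octets in the requested encoding."""
    return text.encode(external_format)

def read_external(octets, external_format="latin-1"):
    """Translate octets in the given encoding back to an in-memory string."""
    return octets.decode(external_format)

sample = "caf\u00e9"  # four characters, the last one above the ASCII range

latin1 = write_external(sample)               # default: one octet per character
utf8 = write_external(sample, "utf-8")        # variable width
utf16 = write_external(sample, "utf-16-be")   # two octets per character here

# For characters below 256, the Latin-1 octet is just the low 8 bits of the code.
low_bits = [ord(c) & 0xFF for c in sample]

print(len(latin1), len(utf8), len(utf16))
```

Note that the 8-bit default can only represent characters whose codes are below 256; anything else has to signal an error or be replaced, which is one of the policy questions an external-format design would need to settle.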

We obviously need to be able to do I/O to and from other encodings
(and to be able to use various encodings with the FFI), and it's
fairly clear that we'd want to have lisp objects of type
UTF-8-ENCODED-STRING (etc.).  It's equally clear (to me, at least) that
we don't want things like UTF-8-ENCODED-STRING to be of type STRING:
we may want to support things like ENCODED-SCHAR or
ENCODED-STRING-LENGTH, but don't want things like SCHAR and LENGTH to
have to worry about them.  I'm not sure that an ENCODED-STRING should
be mutable; at the very least, it should be recognized that if (SETF
(ENCODED-SCHAR s i) c) exists, it might not be a simple operation.
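To illustrate why character access on an encoded string is not a simple operation, here is a minimal sketch in Python. The class Utf8EncodedString and the functions encoded_string_length and encoded_schar are hypothetical stand-ins for the operators named above; this is one possible shape for them, not a proposed implementation.

```python
def _is_continuation(octet):
    """True for UTF-8 continuation octets (bit pattern 10xxxxxx)."""
    return (octet & 0xC0) == 0x80

class Utf8EncodedString:
    """Read-only wrapper around UTF-8 octets; deliberately not a str subclass."""
    def __init__(self, octets):
        self._octets = bytes(octets)
        self._octets.decode("utf-8")  # raises UnicodeDecodeError if malformed

def encoded_string_length(s):
    """Count characters by scanning every octet: linear, not constant, time."""
    return sum(1 for octet in s._octets if not _is_continuation(octet))

def encoded_schar(s, index):
    """Return the character at INDEX by scanning from the start of the octets."""
    if index < 0:
        raise IndexError(index)
    count = -1
    start = None
    for pos, octet in enumerate(s._octets):
        if not _is_continuation(octet):
            if start is not None:      # reached the start of the next character
                return s._octets[start:pos].decode("utf-8")
            count += 1
            if count == index:
                start = pos
    if start is not None:              # requested character was the last one
        return s._octets[start:].decode("utf-8")
    raise IndexError(index)

s = Utf8EncodedString("na\u00efve \u03bb".encode("utf-8"))
print(encoded_string_length(s), len(s._octets))
```

A setter would be harder still: replacing a one-octet character with a two-octet one changes the length of the underlying vector and shifts every octet after it, which is a good argument for leaving such objects immutable.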

This is (I think ...) clean and simple and would offer many benefits.
It's also (to use the technical term) a bitch to bootstrap: there are
lots of places in OpenMCL that assume that CHARACTERs are 8 bits wide,
and these places include things like INTERN, the fasloader, the
compiler, the kernel, ...  My hunch is that it'd still be easier to do
this cold-turkey (and not try to support 8-bit and 32-bit lisp strings
at the same time.)