[Openmcl-devel] Unicode in OpenMCL

Wed Jun 23 21:21:52 PDT 2004

At 2:00 PM -0600 6/23/04, Gary Byers wrote:
>On Wed, 23 Jun 2004, Steve Jenson wrote:
>
> > Hi,
> > What is the status of Unicode in OpenMCL?
>
>I think that it would be bad to have EXTENDED-CHAR (basically, bad to
>have more than one type of CHARACTER); it makes more sense to me to
>make all CHARACTERs (BASE-CHARs) and make CHAR-CODE-LIMIT be 2^24
>or so (I think that Unicode 4 needs about 21 bits to natively encode
>any character.)

I've spent some time thinking about this recently in the more general sense for a project I want to do.

24 bits is the right size if you have only one type of char.
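
A quick check of the arithmetic, as a minimal sketch in Python rather than Lisp. It assumes the Unicode code space ends at U+10FFFF; the helper name fits_in_bits is made up for illustration.

```python
# Highest code point in the Unicode code space (assumption stated above).
MAX_CODE_POINT = 0x10FFFF

# Number of bits needed to represent the highest code point.
bits_needed = MAX_CODE_POINT.bit_length()

def fits_in_bits(code_point: int, width: int) -> bool:
    """Return True if code_point fits in an unsigned field of `width` bits."""
    return 0 <= code_point < (1 << width)

if __name__ == "__main__":
    print("bits needed:", bits_needed)
    print("fits in 16 bits:", fits_in_bits(MAX_CODE_POINT, 16))
    print("fits in 24 bits:", fits_in_bits(MAX_CODE_POINT, 24))
```

So a single 24-bit character type covers everything, with 3 bits to spare.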

I suspect your intuition may be right, but can you offer any insight as to why having an EXTENDED-CHAR type (or two) is bad? Numbers certainly have more than one type. If you support more than one type of char you almost certainly need to support three: 8-bit (ASCII, extended ASCII, various Latin encodings, UTF-8, and others), 16-bit (Unicode), and 21-bit, stored as 24-bit (extended Unicode and maybe others as well). There will be files in all three of these formats for the foreseeable future.
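
To make the three sizes concrete, here is a rough Python sketch that counts how many 8-, 16-, and 32-bit code units the same text takes. The function name code_units and the sample string are mine, and UTF-32 stands in for the 21 (24) bit case since that is what Python's codecs offer.

```python
# Size in bytes of one code unit for each encoding used in this sketch.
UNIT_SIZE = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}

def code_units(text: str, encoding: str) -> int:
    """Count the code units needed to store `text` in `encoding`."""
    return len(text.encode(encoding)) // UNIT_SIZE[encoding]

# 'a', e-acute, euro sign, and a musical symbol outside the 16-bit range.
sample = "a\u00e9\u20ac\U0001D11E"

if __name__ == "__main__":
    for name in UNIT_SIZE:
        print(name, code_units(sample, name))
```

Only the widest form gives one unit per character; the other two need multi-unit sequences for some characters.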

Of course one of the nice things about having three types is that bootstrapping could be easy and gradual.

>
>A BASE-STRING would then be a vector of 24-bit immediate objects (in
>practice, this would almost certainly mean "a vector of 32-bit
>immediate objects with 8-11 unused bits."), and that lisp strings would
>be effectively UTF-32 encoded (if I'm using the correct term.)

Yes, but see:
http://www.unicode.org/unicode/reports/tr19/tr19-9.html

>We obviously need to be able to do I/O to and from other encodings
>(and to be able to use various encodings with the FFI), and it's
>fairly clear that we'd want to have lisp objects of type
>UTF-8-ENCODED-STRING (etc.)  It's equally clear (to me, at least) that
>we don't want things like UTF-8-ENCODED string to be of type STRING
>(e.g., we may want to support things like ENCODED-SCHAR or
>ENCODED-STRING-LENGTH, but don't want things like SCHAR and LENGTH to
>have to worry about them.  I'm not sure that an ENCODED-STRING should
>be mutable; at the very least, it should be recognized that if (SETF
>(ENCODED-SCHAR s i) c) exists, it might not be a simple operation.)
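
Agreed on the mutation point. A minimal Python sketch of why: storing one character into UTF-8 data can change the byte length, so the whole buffer may have to be rebuilt. The name set_encoded_char is hypothetical and is not a proposal for the actual OpenMCL interface.

```python
def set_encoded_char(buf: bytes, index: int, ch: str) -> bytes:
    """Return a new UTF-8 buffer with the character at `index` replaced.

    The result is a fresh buffer because the replacement may occupy a
    different number of bytes than the character it replaces.
    """
    chars = list(buf.decode("utf-8"))
    chars[index] = ch
    return "".join(chars).encode("utf-8")

if __name__ == "__main__":
    before = b"cat"
    after = set_encoded_char(before, 1, "\u00e9")  # 'a' becomes e-acute
    print(len(before), len(after))
```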

To me a string should have at least 4 attributes:

1. The base size of a glyph. I think this could be 8, 16, and 24 bits.
2. The encoding. This could be ASCII, LATIN-1, UNICODE, UTF-8, UTF-32, ...
3. A script. This is the key for how to enter, properly display, and count the glyphs.
4. Length
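
As a rough sketch of what I mean, in Python for brevity: the class name TaggedString and its field names are made up for illustration, and the length here only counts code points. Counting glyphs properly would need script-aware logic that this sketch does not attempt.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaggedString:
    """Hypothetical string record carrying the four attributes listed above."""
    unit_bits: int   # base size: 8, 16, or 24
    encoding: str    # a Python codec name, e.g. "ascii", "latin-1", "utf-8"
    script: str      # four-letter script code, e.g. "Latn" or "Arab"
    data: bytes      # the encoded contents

    def __post_init__(self):
        if self.unit_bits not in (8, 16, 24):
            raise ValueError("unit_bits must be 8, 16, or 24")
        if len(self.script) != 4 or not self.script.isalpha():
            raise ValueError("script must be a four-letter code")

    @property
    def length(self) -> int:
        # Code-point count only; see the caveat in the text above.
        return len(self.data.decode(self.encoding))

if __name__ == "__main__":
    s = TaggedString(8, "utf-8", "Latn", "na\u00efve".encode("utf-8"))
    print(s.length, len(s.data))
```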

Most people forget the script because it often isn't needed for European languages, but it is critical for many others that use the same characters but interpret them differently. Without a script code it will be impossible to display those characters correctly. ISO 15924 defines scripts as four-letter codes (4 bytes in UTF-8), but I get the feeling it's missing some important scripts.

Best,

leb