[Openmcl-devel] Unicode in OpenMCL

Wed Jun 23 23:50:42 PDT 2004

On Wed, 23 Jun 2004, Lawrence E. Bakst wrote:

> At 2:00 PM -0600 6/23/04, Gary Byers wrote:
> >On Wed, 23 Jun 2004, Steve Jenson wrote:
> >
> > > Hi,
> > > What is the status of Unicode in OpenMCL?
> >
> >I think that it would be bad to have EXTENDED-CHAR (basically, bad to
> >have more than one type of CHARACTER); it makes more sense to me to
> >make all CHARACTERs be BASE-CHARs and make CHAR-CODE-LIMIT be 2^24
> >or so (I think that Unicode 4 needs about 21 bits to natively encode
> >any character.)
>
> I've spent some time thinking about this recently in the more general sense for a project I want to do.
>
> 24 bits is the right size if you have only one type of char.
>

> I suspect your intuition may be right, but can you offer any insight
> as to why having an EXTENDED-CHAR type (or 2) is bad?  Numbers
> certainly have more than one type.

The strongest reason that I can think of for partitioning CHARACTER
into EXTENDED-CHAR and BASE-CHAR is to enable STRING to be something other
than BASE-STRING, presumably because the latter can be represented
more compactly (e.g., as a vector of 8-bit values).

The strongest reason that I can think of against this partitioning is
that STRING becomes something other than BASE-STRING: primitive operations
for accessing/updating/creating/copying strings have to come in two
flavors.  (Consider something like STRING=, where either string could
be either a BASE-STRING or a (general) STRING; you either wind up
writing four "fast inner loops" - or hoping that the compiler will do
this for you - or doing a couple of extra comparisons on every iteration
of the not-so-fast-anymore inner loop.)
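
To make the two options concrete, here is a rough sketch in Python (not
Lisp, and not OpenMCL code).  The names make_base, make_general,
string_equal_dispatch_once and string_equal_check_each are made up for
illustration, as is the choice of bytes and array('L') as stand-ins for
BASE-STRING and general STRING.  In Python the four table entries share
one loop body; a native-code compiler would need four loops specialized
on element width.

```python
from array import array

# Hypothetical stand-ins (assumptions, not OpenMCL data structures):
#   bytes       ~ BASE-STRING, one 8-bit code per character
#   array('L')  ~ general STRING, one wide (>= 32-bit) code per character

def make_base(text):
    return text.encode("latin-1")

def make_general(text):
    return array("L", map(ord, text))

def _compare(a, b):
    # The "inner loop": compare character codes one by one.
    if len(a) != len(b):
        return False
    for i in range(len(a)):
        if a[i] != b[i]:
            return False
    return True

# Option 1: dispatch once on the pair of representations, then run a loop
# chosen for that pair.  Four combinations means four table entries.
INNER_LOOPS = {
    (bytes, bytes): _compare,
    (bytes, array): _compare,
    (array, bytes): _compare,
    (array, array): _compare,
}

def string_equal_dispatch_once(a, b):
    return INNER_LOOPS[(type(a), type(b))](a, b)

# Option 2: one loop, but every character access re-checks the
# representation (the "extra comparisons on every iteration").
def char_code_at(s, i):
    if isinstance(s, bytes):
        return s[i]
    if isinstance(s, array):
        return s[i]
    raise TypeError("not a string representation")

def string_equal_check_each(a, b):
    if len(a) != len(b):
        return False
    for i in range(len(a)):
        if char_code_at(a, i) != char_code_at(b, i):
            return False
    return True
```

With a single string representation, both the table and the per-access
check disappear and only the loop itself remains.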

It's -possible- that if the user were very careful about type
declarations and very rigorous about saying things like (MAKE-STRING n
:ELEMENT-TYPE 'BASE-CHAR), most of this overhead could be
avoided most of the time.  I'm skeptical that that
ideal situation would be achieved in practice; I think that in practice,
many strings that in fact could be BASE-STRINGs would not be, and the
runtime-dispatching overhead that -could- be optimized out of many
string operations would not be, and the reason for introducing this
potential (cognitive and runtime) overhead doesn't seem very compelling
anymore.

> If you support more than one type of char you almost certainly need to
> support three types: 8 bit (ascii, extended ascii, various latin,
> utf-8, and others), 16 bit (unicode), and 21 (24) bit (extended
> unicode and maybe others as well). There will be files with all
> three of these formats for the foreseeable future.

Sure: we need to be able to read characters from/write characters to
files in as many different formats/encoding systems as we can support.
In all of these cases, we're translating between an external format
(ASCII, UTF-16, ...) and an internal representation (a lisp object
that represents what I believe Unicode refers to as a "code point".)
In current OpenMCL, this "translation" is trivial (and there are all
sorts of ways in which this is exploited internally); in order to
support translation between even an ASCII external format and a
24-bit internal representation, the file system code in OpenMCL would
have to stop playing some tricks that it currently plays.  Some kinds
of translation add additional overhead (UNREAD-CHAR needs to know how
many bytes were in the most recently read character, FILE-POSITION and
FILE-LENGTH may be a bit more complicated for similar reasons), but in
all cases it's certainly possible to have one internal representation.
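
As a rough illustration of the UNREAD-CHAR point, here is a small Python
sketch (again not OpenMCL code; the class Utf8CharReader and its method
names are invented for this example).  It decodes UTF-8 from a seekable
byte stream one character at a time and remembers how many bytes the
last character occupied, so that "unreading" can back up by the right
amount.  Note that the stream position is in bytes, not characters.

```python
import io

class Utf8CharReader:
    """Hypothetical reader: external UTF-8 bytes -> one-character strings."""

    def __init__(self, stream):
        self._stream = stream      # assumed to be a seekable binary stream
        self._last_len = 0         # byte length of the most recently read char

    def read_char(self):
        first = self._stream.read(1)
        if not first:
            self._last_len = 0
            return None            # end of file
        b0 = first[0]
        # The lead byte says how many continuation bytes follow.
        if b0 < 0x80:
            extra = 0
        elif b0 >> 5 == 0b110:
            extra = 1
        elif b0 >> 4 == 0b1110:
            extra = 2
        elif b0 >> 3 == 0b11110:
            extra = 3
        else:
            raise ValueError("invalid UTF-8 lead byte")
        rest = self._stream.read(extra)
        if len(rest) != extra:
            raise ValueError("truncated UTF-8 sequence")
        self._last_len = 1 + extra
        # decode() also validates the continuation bytes.
        return (first + rest).decode("utf-8")

    def unread_char(self):
        if self._last_len == 0:
            raise ValueError("no character to unread")
        # Back up by the byte length of the last character, not by 1.
        self._stream.seek(-self._last_len, io.SEEK_CUR)
        self._last_len = 0

    def byte_position(self):
        return self._stream.tell()
```

For example, after reading "a" and a two-byte character from such a
stream, byte_position() is 3 even though only two characters have been
read; this is the sort of thing that makes FILE-POSITION less trivial.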

> Of course one of the nice things about having three types is that bootstrapping could be easy and gradual.
>
> >
> >A BASE-STRING would then be a vector of 24-bit immediate objects (in
> >practice, this would almost certainly mean "a vector of 32-bit
> >immediate objects with 8-11 unused bits"), and lisp strings would
> >be effectively UTF-32 encoded (if I'm using the correct term.)
>
> Yes, but see:
> http://www.unicode.org/unicode/reports/tr19/tr19-9.html
>
>
> >We obviously need to be able to do I/O to and from other encodings
> >(and to be able to use various encodings with the FFI), and it's
> >fairly clear that we'd want to have lisp objects of type
> >UTF-8-ENCODED-STRING (etc.)  It's equally clear (to me, at least) that
> >we don't want things like UTF-8-ENCODED-STRING to be of type STRING
> >(e.g., we may want to support things like ENCODED-SCHAR or
> >ENCODED-STRING-LENGTH, but don't want things like SCHAR and LENGTH to
> >have to worry about them.  I'm not sure that an ENCODED-STRING should
> >be mutable; at the very least, it should be recognized that if (SETF
> >(ENCODED-SCHAR s i) c) exists, it might not be a simple operation.)
>
> To me a string should have at least 4 attributes:
>
> 1. The base size of a glyph. I think this could be 8, 16, and 24 bits.
> 2. The encoding. This could be ASCII, LATIN-1, UNICODE, UTF-8, UTF-32, ...
> 3. A script. This is the key for how to enter, properly display, and count the glyphs.
> 4. Length
>

If objects like this exist internally in the lisp, I think that it's
important that these objects -not- be of type CL:STRING.
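
A sketch of what I mean, in Python rather than Lisp and with invented
names (Utf8EncodedString, encoded_length, encoded_char): the encoded
object is deliberately not a subclass of the ordinary string type, it is
immutable, and its length and indexing operations are separate from the
ordinary ones because they have to scan the bytes.

```python
class Utf8EncodedString:
    """Hypothetical encoded-string object; deliberately NOT a str subclass."""

    def __init__(self, data):
        data = bytes(data)
        data.decode("utf-8")       # raises UnicodeDecodeError if not valid UTF-8
        self._data = data          # immutable copy of the encoded bytes

    def byte_length(self):
        return len(self._data)

    def encoded_length(self):
        # Number of characters: count bytes that are not continuation
        # bytes (10xxxxxx).  This is a linear scan, unlike len() on a str.
        return sum(1 for b in self._data if (b & 0xC0) != 0x80)

    def encoded_char(self, index):
        # Linear scan to find where the index-th character starts and ends.
        if index < 0:
            raise IndexError("index out of range")
        count = -1
        start = None
        for pos, b in enumerate(self._data):
            if (b & 0xC0) != 0x80:         # start of a new character
                count += 1
                if count == index:
                    start = pos
                elif count == index + 1:
                    return self._data[start:pos].decode("utf-8")
        if start is None:
            raise IndexError("index out of range")
        return self._data[start:].decode("utf-8")
```

Code that expects an ordinary string never sees one of these by
accident, and the ordinary LENGTH/SCHAR equivalents never have to ask
whether their argument is variable-width.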

> Most people forget the script because it often isn't needed for
> European languages, but it is critical for many others which use the
> same characters but interpret them differently. Without a script
> code it will be impossible to display characters correctly. ISO
> 15924 defines scripts as 4 (UTF-8) character codes, but I get the
> feeling it's missing some important scripts.

>
> Best,
>
> leb