[Openmcl-devel] Plans for Unicode support within OpenMCL?

Tue Mar 21 06:42:16 PST 2006

Gary Byers <gb at clozure.com> writes:

> I agree that supporting lots of different scripts and writing systems
> is hard, but I think that's orthogonal to the question of how things
> are represented internally. I believe that having a single type of
> CHARACTER (not having such a thing as EXTENDED-CHAR, having all
> SIMPLE-STRINGs be SIMPLE-BASE-STRINGs) also has advantages (I believe
> this both from an implementor's and a user's point of view), but
> that's also orthogonal to all of the issues that come up when dealing
> with the outside world.

For what it's worth, my experience with this in SBCL is that while the
implementation burden of having more than one kind of string is not
that great, the user's experience of having all sorts of strings is
basically one of confusion.  The problem appears to be that the
upgrading rules are not terribly well internalised.

(Of course, given SBCL's somewhat overpedantic adherence to standards,
there is of necessity more than one kind of string even in a
single-octet character world: (array nil (*)) is a subtype of
string...)
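
To make the type relationships above concrete, here is a minimal
sketch in portable Common Lisp.  STRING-TYPE-REPORT is a hypothetical
helper name, not something SBCL or OpenMCL provides.  The first two
relationships follow from the standard; the answer for
(array nil (*)) is deliberately not shown, since it may differ between
implementations and versions.

```lisp
;; Sketch only: STRING-TYPE-REPORT is a hypothetical helper, not part of
;; any implementation.  SUBTYPEP returns two values; both are collected.
(defun string-type-report ()
  "Return a list of (TYPE1 TYPE2 SUBTYPE-P CERTAIN-P) entries."
  (loop for (type1 type2) in '((base-char character)
                               (simple-base-string simple-string)
                               ((array nil (*)) string))
        collect (multiple-value-bind (subtype-p certain-p)
                    (subtypep type1 type2)
                  (list type1 type2 subtype-p certain-p))))

;; Print one line per type pair.
(dolist (entry (string-type-report))
  (format t "~S <: ~S => ~S (certain: ~S)~%"
          (first entry) (second entry) (third entry) (fourth entry)))
```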

One argument against having only two-octet (or four-octet) strings
comes from bioinformatics: for some reason, bioinformaticians love to
represent two-bit quantities in eight bits, specifically sequences of
ACGT as strings.  Perhaps this is because much of bioinformatics is
based around grep(1), but in any case there was a recent complaint
that a full (simple-array character (*)) for one user's dataset
overflowed the heap; being able to recommend the use of
simple-base-string instead was a help, rather than a hindrance.
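
The recommendation above can be sketched as follows.  This assumes an
implementation in which BASE-CHAR is a proper subtype of CHARACTER and
base strings are stored more compactly than full character strings;
where the two types coincide there is nothing to gain.
MAKE-DNA-BASE-STRING is a hypothetical name used only for
illustration.

```lisp
;; Sketch only: MAKE-DNA-BASE-STRING is a hypothetical helper.
;; It copies BASES into a freshly allocated SIMPLE-BASE-STRING, so the
;; sequence is held in the implementation's narrowest string
;; representation rather than in a full (simple-array character (*)).
(defun make-dna-base-string (bases)
  "Return a fresh SIMPLE-BASE-STRING with the same characters as BASES."
  (let ((result (make-string (length bases) :element-type 'base-char)))
    ;; A, C, G and T are standard characters, hence always base-chars.
    (replace result bases)))

(defparameter *example-sequence* (make-dna-base-string "GATTACA"))
```

Any storage saving depends on how the implementation represents each
string type, so it is worth measuring on the actual dataset rather
than assuming a fixed ratio.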

I'd be happy to discuss other implementation details if there are any
questions.

Cheers,

Christophe