[Openmcl-devel] Plans for Unicode support within OpenMCL?

Tue Mar 21 18:41:35 PST 2006

Gary Byers wrote:

> > What I wanted to say is that we can pretend that UTF-16's code values
> > are real code points and treat surrogate values as legit char codes.
> > Then, UTF-16 would not be an encoded format any more.
> 
> Yes; I agree that this is reasonable.  I think that UTF-16 without
> surrogate pairs is referred to as the "Basic Multilingual Plane",
> and covers a very high percentage of the characters/languages that
> people would be likely to want to use.

And 16 bits cover lots of non-Unicode characters too.
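
To make the BMP vs. surrogate pair distinction concrete, here is a minimal
sketch in Python (used only for illustration; the name `utf16_units` is made
up for this example and is not an OpenMCL or ATSUI routine). It shows how one
code point maps to 16-bit code units under the standard UTF-16 rules:

```python
def utf16_units(code_point):
    """Return the list of 16-bit UTF-16 code units for one code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point < 0x10000:
        # BMP: a single code unit. Surrogate values (D800..DFFF) pass
        # through unchanged here, which matches the "treat surrogate
        # values as legit char codes" view discussed above.
        return [code_point]
    # Supplementary planes: split the 20-bit offset into a surrogate pair.
    offset = code_point - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]
```

For example, U+0041 stays a single unit, while U+1D11E (a musical symbol
outside the BMP) becomes the pair D834 DD1E.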

> 
> >
> > Make handling surrogate pairs properly a user's task.
> >
> 
> I think that it was true that in some earlier versions of the Unicode
> standard, all defined characters could be encoded in 16 bits, and that
> a lot of people/programs/libraries still work in this subset and never
> deal with surrogate pairs/variable-length encodings.  (It's not clear
> to me whether MacOS or Windows support code points that can't
> be encoded in 16 bits, or if there are plans to change that;
> people seem to use the term UTF-16 informally in at least some cases.)

Mac supports full Unicode. ATSUI uses UTF-16 and renders surrogate
pairs fine. There are also routines that count 'characters', find
various text boundaries, and so on in a Unicode string. I think these
routines treat a base char + combining sequence as a single
character (or they might give finer control. I am not sure; I don't
use them). These routines sit at a higher level than Lisp's LENGTH,
SUBSEQ, CHAR, etc.
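
The different levels of "length" can be sketched roughly as follows. This is
Python for illustration only, and `count_levels` is a hypothetical helper,
not one of the Mac routines. The last count is a deliberate simplification
that only folds combining marks into the preceding base character; it is not
the full set of rules a real text-boundary routine would apply.

```python
import unicodedata

def count_levels(text):
    """Return (utf16_code_units, code_points, approx_user_characters)."""
    # What a LENGTH on a 16-bit string would report.
    code_units = len(text.encode("utf-16-le")) // 2
    # A surrogate pair counts once here.
    code_points = len(text)
    # Approximation: a code point with a nonzero combining class attaches
    # to the preceding base character instead of starting a new one.
    approx_chars = sum(1 for ch in text if unicodedata.combining(ch) == 0)
    return code_units, code_points, approx_chars
```

For "e" + U+0301 (combining acute) + U+1D11E, this gives 4 code units,
3 code points, and 2 approximate user-level characters.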

(Btw, since ATSUI uses UTF-16, I can use ccl::%copy-ivector-to-ptr
and ccl::%copy-ptr-to-ivector to copy to and from MCL's extended-string.)

Lisp has characters. But does Unicode have them too? I think
it's questionable. The standard often refers to 'what a user
thinks of as a character' but seems careful not to say what
the standard itself considers a character.

I needed to find a 'character' boundary in a Unicode string.
(The next task is to find word boundaries, but that is currently
postponed indefinitely.) Dealing with surrogate pairs was the
easy part.
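
The surrogate pair part can be sketched like this (Python for illustration;
all function names are hypothetical, and the string is assumed to be a
sequence of 16-bit code units). It handles only surrogate pairs, not
combining sequences or the other rules in the document mentioned below:

```python
def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF

def is_code_point_boundary(units, index):
    """True unless `index` falls between the two halves of a surrogate pair."""
    if index <= 0 or index >= len(units):
        return True  # the start and end of the string are always boundaries
    return not (is_high_surrogate(units[index - 1])
                and is_low_surrogate(units[index]))

def next_boundary(units, index):
    """Return the smallest boundary position strictly greater than `index`."""
    index += 1
    while not is_code_point_boundary(units, index):
        index += 1
    return index
```

With the code units 0061 D834 DD1E 0062, position 2 sits inside the pair, so
`next_boundary(units, 1)` skips it and returns 3.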

There is a document for it: 

  "UAX29 Text Boundaries"
  <http://www.unicode.org/reports/tr29/>.

After going through the doc, subseq-ing a surrogate pair
into two does not sound so horrible anymore.

I think we can consider subseq and others as (sort of) low-level
functions. They may or may not do the right thing depending
on one's needs, UTF-16 or not.