[Openmcl-devel] The case for UTF-16

Sat Apr 21 04:35:57 PDT 2007

Gary Byers wrote:

> The obvious disadvantage is that there are characters whose codes
> can't be represented in 16 bits.  You can pretend that the elements
> of a surrogate pair are "characters", but there's no way to refer to
> the character whose code is encoded by a UTF-16 surrogate pair.

We can refer to it by a string (of 2 surrogate chars) or by a
codepoint.
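
To make the two ways of referring to such a character concrete, here
is a minimal sketch of the standard UTF-16 surrogate arithmetic. It is
written in Python purely for illustration, and the function names are
hypothetical, not part of any Lisp or OpenMCL API.

```python
def to_surrogates(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into (high, low)."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # 20 significant bits
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogates(high, low):
    """Combine a high/low surrogate pair back into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1D11E (MUSICAL SYMBOL G CLEF) as a pair of 16-bit code units.
high, low = to_surrogates(0x1D11E)
print(hex(high), hex(low))                 # 0xd834 0xdd1e
print(hex(from_surrogates(high, low)))     # 0x1d11e
```

Either form (the pair of code units or the integer) identifies the
same character.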

> [...]
> UTF-16 is (technically) a variable-width encoding.

I'd say UTF-32 is a variable-width encoding too.

> [...]
> 
> If it's agreed (I certainly hope it is ...) that it's not really
> practical to use a variable-width encoding internally (anyone who
> believes otherwise is welcome to write a non-consing, unit-cost
> version of (SETF SCHAR)), then the choices are seemingly between:

If a surrogate codepoint is treated as a character and
CHAR-CODE-LIMIT is set to 16 bits, (SETF SCHAR) will be non-consing
and unit-cost.
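
As a rough illustration of why this works, the sketch below models a
string as a flat array of 16-bit code units, so that storing one
element is a plain indexed write. It is Python rather than Lisp, and
utf16_units and set_unit are hypothetical names standing in for the
string representation and (SETF SCHAR).

```python
from array import array


def utf16_units(text):
    """Return the UTF-16 code units of text as a mutable array of 16-bit ints."""
    data = text.encode("utf-16-le")
    units = [data[i] | (data[i + 1] << 8) for i in range(0, len(data), 2)]
    return array("H", units)


def set_unit(units, index, code):
    """Overwrite one code unit in place: a single indexed store, nothing allocated."""
    if not 0 <= code < 0x10000:
        raise ValueError("code does not fit in 16 bits")
    units[index] = code


buf = utf16_units("a\U0001D11Eb")   # 'a', high surrogate, low surrogate, 'b'
set_unit(buf, 0, ord("z"))          # replaces 'a' without touching the rest
print([hex(u) for u in buf])        # ['0x7a', '0xd834', '0xdd1e', '0x62']
```

Nothing here stops a caller from overwriting half of a pair; in this
model, keeping pairs intact is the caller's job.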

>   - UCS-2, with or without complaints about the use of surrogate
>     pairs, offering potential compliance with a 7-or-8-year-old
>     version of the standard at 16 bits per character

UCS-2 has been superseded by UTF-16 and is obsolete.

Regarding compliance, I don't think a 16-bit CHAR-CODE-LIMIT would
mean that you have to say OpenMCL only supports the BMP and not full
Unicode. Java uses UTF-16 and it doesn't say so, nor does Lispworks or
ACL.

I believe it is fine to punt the responsibility (to handle surrogate
pairs properly) to us users. I suspect that the majority of users only
care about Latin-1 range characters anyway, let alone chars outside of
the BMP. If an application deals with Unicode text input, it may
occasionally receive rare chars -- but then most likely you do not
know what those chars really are, so you treat them as opaque data.
And if you do know what these chars are, you ought to know how to
handle them.

> 
>    - UTF-32/UCS-4, offering potential compliance with current and
>      future versions of the standard at 32 bits per character.
>
> I don't mean to sound like I'm totally unconcerned about the memory
> issue, but I thought and still think that the second choice was
> preferable.  (Maybe more accurately, I didn't think that I'd likely
> be comfortable with the first choice 5 years down the road.)

I don't think it is likely that the Unicode standard will abandon
UTF-16 in the future. The BMP already covers the characters of the
major scripts. And the supplementary characters are... well,
supplementary, and I believe they will remain so. (The growth rate
will slow down.)

> [...]
> 
> It is true that code that displays characters/strings has to be
> aware of these issues (as does some other code.)  It is less
> clear that SCHAR and EQUAL and INTERN and STRING= and ... need
> to be aware of these issues.

I believe the answer is no (though I am not sure why you included
INTERN). And if you agree, then why is having surrogate codepoints as
characters so bad? IMO it's not any different. (It's actually much
simpler than handling combining sequences.)

An obvious problem with surrogate pairs is SUBSEQ. But I believe you
do not split a string at an arbitrary position without checking
what's in there first.
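
The check in question is small. The sketch below (Python, for
illustration only; safe_split_index is a hypothetical helper, not an
existing Lisp function) moves a proposed split point back by one code
unit when it would otherwise land between the two halves of a
surrogate pair.

```python
def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF


def safe_split_index(units, index):
    """Return index, or index - 1 if index falls inside a surrogate pair."""
    inside_pair = (
        0 < index < len(units)
        and is_high_surrogate(units[index - 1])
        and is_low_surrogate(units[index])
    )
    return index - 1 if inside_pair else index


units = [0x61, 0xD834, 0xDD1E, 0x62]   # 'a', U+1D11E as a pair, 'b'
print(safe_split_index(units, 2))      # 1: index 2 would cut the pair in half
print(safe_split_index(units, 3))      # 3: already on a character boundary
```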

Another is LENGTH. (LENGTH string-of-a-surrogate-pair) will report 2
instead of 1. But this is fine. Someone may complain that it says 2
while there's only a single character displayed -- but combining
characters present exactly the same issue.
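
A small Python sketch of the comparison (utf16_length is a
hypothetical helper that counts 16-bit code units; Python's own len
counts codepoints): the surrogate pair is 2 code units for 1
codepoint, while "e" plus a combining acute accent is 2 by either
count, even though each is normally displayed as a single character.

```python
import unicodedata


def utf16_length(text):
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


clef = "\U0001D11E"    # one supplementary character (a surrogate pair in UTF-16)
e_acute = "e\u0301"    # "e" followed by U+0301 COMBINING ACUTE ACCENT

print(utf16_length(clef), len(clef))          # 2 1
print(utf16_length(e_acute), len(e_acute))    # 2 2
print(unicodedata.combining("\u0301") != 0)   # True: U+0301 is a combining mark
```

So even a 32-bit fixed-width representation reports 2 for the second
string.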

Furthermore, I think many programs will treat Unicode strings as
opaque data and pass them along without processing them. Even
when a program parses the strings, the significant characters are
likely to be in the ASCII range (e.g. HTML).  And other programs that
do process Unicode strings heavily need to handle combining sequences
and the like, which means that [1] the CL standard functions are
inadequate for them and [2] handling surrogate pairs is trivial
(relatively speaking).

> [...]
> What OpenMCL's allowed you to do instead is to pass a "foreign copy"
> of a string to foreign code (and copy foreign memory to lisp strings.)

I was thinking of %copy-ivector-to-ptr and %copy-ptr-to-ivector. I
used them in MCL to call the ATSUI API. Since MCL has a 16-bits/char
extended-string, I can blindly blit. Incidentally, that was when I
changed my mind and concluded that UTF-16 is better than UTF-32.
Having strings in a Fred buffer in the same encoding the OS
wants is a clear win.
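
The idea of a blind blit can be sketched as follows. This is Python
with ctypes standing in for the foreign side, not the MCL functions
named above; blit_to_foreign is a hypothetical helper, and the sketch
assumes the in-memory string is already native-endian UTF-16, so the
copy is a single memmove with no per-character transcoding.

```python
import ctypes
import sys

# Native-endian UTF-16, standing in for the 16-bits/char internal format.
CODEC = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"


def blit_to_foreign(unit_bytes):
    """Copy raw UTF-16 bytes into a freshly allocated foreign buffer."""
    buf = ctypes.create_string_buffer(len(unit_bytes))
    ctypes.memmove(buf, unit_bytes, len(unit_bytes))   # one flat copy
    return buf


internal = "a\U0001D11Eb".encode(CODEC)   # what the string already looks like
foreign = blit_to_foreign(internal)
print(foreign.raw == internal)            # True: bytes arrive unchanged
```

With a 32-bit internal format the same call would need a conversion
pass before or after the copy.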

Admittedly this is a special case. So I guess having a good set of
macros may suffice.

> [...]
> > 4. Both Lispworks and ACL use UTF-16.
> 
> I think that in practice that means that they use UCS-2 (possibly
> agreeing to look the other way if surrogate pairs are allowed
> in strings.)

It would be really odd if they didn't allow surrogate values as
lisp characters.

> 
> It's not a totally unreasonable decision, but I don't think that
> (looking forward) it's the correct decision.

I don't expect that any significant characters that demand
efficient processing will be added to the supplementary range
in the future. Cf. <http://www.unicode.org/roadmaps/>

> [...]
>
> The fact that it's still awkward to pass non-C-strings between lisp
> and foreign code is another kind of incompleteness.  If that's
> addressed, let's see what the space vs full Unicode character
> support issue looks like.

I think "let's see" is a fine idea. But pitting space against
full support is not right. Full Unicode character support
is possible with UTF-16.

ICU, Mac OS X, Java, and others use UTF-16 and they are fully
compliant. What makes OpenMCL different?

regards,
T.

--
"Is the information correct?"