[Openmcl-devel] Plans for Unicode support within OpenMCL?

Wed Mar 22 07:17:32 PST 2006

On Mar 22, 2006, at 10:03 AM, Takehiko Abe wrote:

> Tom Emerson wrote:
>
>> It is important not to get mixed up between code point values and the
>> various encoding forms that are available. When viewed this way,
>> surrogate pairs become an encoding issue. The view of characters that
>> the programmer sees is of whole code points: they should *never* see
>> a character value in the range #xD800 -- #xDFFF.
>
> But you cannot enforce it. It is possible that you see them.

You cannot enforce it, but the presence of a surrogate value in a
UTF-32 stream is an error: a surrogate code point has no defined
semantics as a character. So while you may encounter them, they never
denote a valid character.
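
To make the point concrete, here is a minimal sketch of a UTF-32
validity check, written in Python for illustration. The function names
are hypothetical and not part of OpenMCL; the only facts assumed are
the surrogate range #xD800 -- #xDFFF and the upper limit #x10FFFF.

```python
def is_surrogate(cp: int) -> bool:
    """True if cp lies in the surrogate range #xD800..#xDFFF."""
    return 0xD800 <= cp <= 0xDFFF


def validate_utf32(code_units):
    """Return the code points unchanged, or raise ValueError.

    A UTF-32 code unit is rejected if it is outside 0..#x10FFFF or if
    it is a surrogate, since neither denotes a valid character.
    """
    checked = []
    for index, cp in enumerate(code_units):
        if not 0 <= cp <= 0x10FFFF:
            raise ValueError(f"unit {index}: {cp:#x} is out of range")
        if is_surrogate(cp):
            raise ValueError(f"unit {index}: {cp:#x} is a surrogate")
        checked.append(cp)
    return checked
```

A reader can see the surrogate in the input, but the check reports it
as an error instead of passing it on as a character.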

> If they are rare, I think it is reasonable to assume that it is up
> to each user how to deal with them. I think the same is true for
> surrogate pairs in UTF-16. They are rare too.

You cannot put combining characters and surrogate pairs in the same
bin. A combining character *is* a valid character: the fact that it
combines with the preceding base character is a display issue.
Surrogates are an encoding artifact. Expecting the programmer to deal
with them is wrong.
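
The distinction can be sketched as follows, again in Python for
illustration. The helper names to_utf16_units and from_utf16_units are
hypothetical. The sketch shows that a combining mark is an ordinary
code point in the string, while a surrogate pair exists only in the
UTF-16 encoding form and disappears again on decoding.

```python
import unicodedata


def to_utf16_units(cp):
    """Encode one code point as a list of UTF-16 code units."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"{cp:#x} is not a valid character value")
    if cp < 0x10000:
        return [cp]
    offset = cp - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


def from_utf16_units(units):
    """Decode UTF-16 code units to code points; reject lone surrogates."""
    out, i = [], 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError(f"unpaired high surrogate at {i}")
            low = units[i + 1]
            out.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError(f"unpaired low surrogate at {i}")
        else:
            out.append(unit)
            i += 1
    return out


# A combining acute accent is a character in its own right.
accented = "e\u0301"
assert len(accented) == 2
assert unicodedata.combining("\u0301") != 0

# U+1D11E needs two UTF-16 code units, but it is one code point.
assert to_utf16_units(0x1D11E) == [0xD834, 0xDD1E]
assert from_utf16_units([0xD834, 0xDD1E]) == [0x1D11E]
```

The programmer working with decoded code points sees the combining
accent but never the pair #xD834 #xDD1E.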

> If we will have multiple string types, let's use Latin-1 for under
> #xFF.

I propose multiple internal string representations for Unicode
characters. Values up to #xFF are represented with a single byte, and
by definition this maps to Latin-1, since the first 256 Unicode code
points coincide with it.
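
One way such a scheme could look is sketched below in Python. This is
an illustration under stated assumptions, not a description of how
OpenMCL does or will store strings: the names element_width, pack and
char_at are hypothetical, the 2-byte tier and the little-endian layout
are my own choices, and the input is assumed to be already validated
code points.

```python
def element_width(code_points):
    """Pick the narrowest fixed width (in bytes) that fits every value."""
    widest = max(code_points, default=0)
    if widest <= 0xFF:
        return 1  # single byte per character, identical to Latin-1
    if widest <= 0xFFFF:
        return 2  # assumption: an intermediate 2-byte tier
    return 4      # full code point range


def pack(code_points):
    """Store the code points as (width, bytes) with fixed-width elements."""
    width = element_width(code_points)
    data = b"".join(cp.to_bytes(width, "little") for cp in code_points)
    return width, data


def char_at(packed, index):
    """Return the code point at index; constant time for any width."""
    width, data = packed
    start = index * width
    return int.from_bytes(data[start:start + width], "little")
```

For example, pack([0x63, 0x61, 0x66, 0xE9]) yields a width of 1 and
the same four bytes as the Latin-1 encoding of the string, while a
string containing #x1D11E falls back to 4 bytes per character. In
every case char_at returns a whole code point, so the choice of
representation stays invisible to the programmer.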