[Openmcl-devel] Plans for Unicode support within OpenMCL?
Tom Emerson
tree at dreamersrealm.net
Wed Mar 22 07:17:32 PST 2006
On Mar 22, 2006, at 10:03 AM, Takehiko Abe wrote:
> Tom Emerson wrote:
>
>> It is important not to get mixed up between code point values and the
>> various encoding forms that are available. When viewed this way,
>> surrogate pairs become an encoding issue. The view of characters that
> >> the programmer sees is of whole codepoints: they should *never* see
>> a character value in the range #xD800 -- #xDFFF
>
> But you cannot enforce it. It is possible that you see them.
You cannot enforce it, but the presence of a surrogate value in a
UTF-32 stream is an error: there are no defined semantics for that
character. So while you can see them, they have no valid character
value.
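
To make the point concrete, here is a minimal Python sketch of the check
I have in mind for a UTF-32 stream. The names is_scalar_value and
first_invalid_utf32 are made up for illustration; they are not part of
OpenMCL or any existing library.

```python
SURROGATE_MIN = 0xD800
SURROGATE_MAX = 0xDFFF
MAX_CODE_POINT = 0x10FFFF


def is_scalar_value(cp):
    """True if cp is a code point that may appear in UTF-32.

    Surrogates (#xD800 .. #xDFFF) and values above #x10FFFF are rejected.
    """
    if cp < 0 or cp > MAX_CODE_POINT:
        return False
    return not (SURROGATE_MIN <= cp <= SURROGATE_MAX)


def first_invalid_utf32(units):
    """Return the index of the first invalid 32-bit unit, or None."""
    for index, cp in enumerate(units):
        if not is_scalar_value(cp):
            return index
    return None
```

A reader built on something like this can signal an error (or substitute
a replacement character) at the offending index, so nothing above the
I/O layer ever sees a surrogate.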
> If they are rare, I think it is reasonable to assume that it is up
> to each user how to deal with them. I think the same is true for
> surrogate pairs in UTF-16. They are rare too.
You cannot put combining characters and surrogate pairs in the same
bin. A combining character *is* a valid character: the fact that it
combines with the following character(s) is a display issue.
Surrogates are an encoding artifact. Expecting the programmer to deal
with them is wrong.
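
As a rough illustration of the difference, the Python sketch below
decodes UTF-16 code units into whole code points. The function name
decode_utf16_units is hypothetical. Surrogate pairs disappear at this
layer, while a combining character comes through as an ordinary code
point of its own.

```python
def decode_utf16_units(units):
    """Turn a sequence of 16-bit code units into a list of code points."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: it must be followed by a low surrogate.
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                raise ValueError("unpaired high surrogate at index %d" % i)
            low = units[i + 1]
            code_points.append(
                0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            )
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate with no preceding high surrogate is malformed.
            raise ValueError("unpaired low surrogate at index %d" % i)
        else:
            code_points.append(unit)
            i += 1
    return code_points
```

For example, the units #xD834 #xDD1E decode to the single code point
#x1D11E, whereas #x65 #x301 (e followed by a combining acute accent)
stays as two valid code points.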
> If we are going to have multiple string types, let's use Latin-1 for
> values under #xFF.
I propose multiple internal string representations for Unicode
characters. Values up to #xFF are represented with a single byte, and
by definition this maps to Latin-1, since the first 256 Unicode code
points are the Latin-1 characters.
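
A minimal sketch of the idea in Python, not a proposal for the actual
OpenMCL layout: pick the narrowest storage that holds every code point
in the string. The name pack_string and the two representation tags are
assumptions made for this example.

```python
from array import array


def pack_string(code_points):
    """Choose a storage form for a sequence of code points.

    Returns a (tag, storage) pair. Strings whose code points all fit in
    one byte are stored as bytes, which is Latin-1 by construction.
    Anything else falls back to one wide element per code point.
    """
    widest = max(code_points, default=0)
    if widest <= 0xFF:
        return ("latin-1", bytes(code_points))
    # array type code "L" holds unsigned integers of at least 32 bits.
    return ("wide", array("L", code_points))


def char_at(packed, index):
    """Return the code point at index, whichever form is in use."""
    _tag, storage = packed
    return storage[index]
```

Indexing stays constant-time in both forms, and the caller only ever
sees whole code points, never the storage width.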