[Openmcl-devel] Plans for Unicode support within OpenMCL?

Wed Mar 22 07:03:32 PST 2006

Tom Emerson wrote:

> It is important not to get mixed up between code point values and the  
> various encoding forms that are available. When viewed this way,  
> surrogate pairs become an encoding issue. The view of characters that  
> the programmer sees is of whole code points: they should *never* see  
> a character value in the range #xD800 -- #xDFFF 

But you cannot enforce that. It is still possible for a program to see them.
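
To illustrate the point, here is a minimal sketch in Python rather than Lisp. The helper names are made up for this example, and the behaviour shown is that of Python's string type, not anything OpenMCL does. It only shows that a runtime which stores whole code points can still end up holding a value in the #xD800 -- #xDFFF range, and that the problem tends to surface later, at encoding time.

```python
def is_surrogate(code_point):
    """True if code_point lies in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= code_point <= 0xDFFF


def find_surrogates(text):
    """Return the indices of any surrogate code points found in text."""
    return [i for i, ch in enumerate(text) if is_surrogate(ord(ch))]


# A string built from raw code point values can hold a lone surrogate.
sample = "a" + chr(0xD800) + "b"
lone = find_surrogates(sample)

# A strict UTF-8 encoder refuses the lone surrogate.
try:
    sample.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

A check like find_surrogates is cheap, so an implementation could at least offer it at I/O boundaries even if it does not forbid such values in strings.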

> because these cannot  
> appear in a valid Unicode stream. 
>
> Unicode's character model is well  
> documented: there should be no confusion there.
> 
> With regard to combining characters: these are a necessary part of  
> the standard, and you have to deal with them. UAX #29 provides  
> algorithms for doing appropriate text boundary detection, including  
> 'glyph' boundaries. However, most people don't need to deal with  
> that. I've worked with Unicode for almost 8 years, and in my
> experience combining characters rarely show up in practice.

If they are rare, I think it is reasonable to leave it to each user
to decide how to deal with them. I think the same is true of
surrogate pairs in UTF-16. They are rare too.
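
For reference, this is roughly what a user would have to do by hand if surrogate pairs were left to them. It is a sketch in Python using the standard UTF-16 arithmetic; the function names are hypothetical and not part of any OpenMCL API.

```python
def to_utf16_units(code_point):
    """Split one code point into one or two 16-bit UTF-16 code units."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %#x" % code_point)
    if code_point < 0x10000:
        return [code_point]              # fits in a single code unit
    v = code_point - 0x10000             # 20 bits remain
    high = 0xD800 + (v >> 10)            # top 10 bits
    low = 0xDC00 + (v & 0x3FF)           # bottom 10 bits
    return [high, low]


def from_surrogate_pair(high, low):
    """Combine a high and a low surrogate back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the 16-bit range.
units = to_utf16_units(0x1D11E)
round_trip = from_surrogate_pair(*units)
```

Here units is [0xD834, 0xDD1E], and round_trip gives back 0x1D11E.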

> 
> For internal representation: multiple internal representations make a  
> lot of sense. You can use ASCII for characters under #x80. Characters  
> at #x80 and above can use UTF-16.

If we are going to have multiple string types, let's use Latin-1 for
code points up to #xFF.
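
A rough sketch of how the representation could be chosen per string, again in Python rather than Lisp. The function name and the three-way split are assumptions made for illustration, not a proposed OpenMCL design. The idea is simply to look at the widest code point in the string and pick the narrowest representation that holds it.

```python
def pick_representation(text):
    """Name the narrowest encoding that can hold every character in text."""
    widest = max((ord(ch) for ch in text), default=0)
    if widest < 0x80:
        return "ascii"       # 1 byte per character
    if widest <= 0xFF:
        return "latin-1"     # still 1 byte per character
    return "utf-16-le"       # 2 bytes, or 4 for a surrogate pair


def storage_size(text):
    """Bytes needed to store text in the representation picked for it."""
    return len(text.encode(pick_representation(text)))


plain = pick_representation("hello")
accented = pick_representation("caf\u00e9")
wide = pick_representation("\u65e5\u672c")
```

With Latin-1 as the middle tier, a string such as "caf\u00e9" stays at one byte per character (4 bytes) instead of doubling to 8 under UTF-16.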