[Openmcl-devel] Plans for Unicode support within OpenMCL?

Wed Mar 22 06:08:05 PST 2006

I agree that UTF-8 is convenient, but there is no reason you cannot  
hash UTF-16 values. Python, to cite one example, has a Unicode string  
type that hashes very well.

I strongly believe that *internal* representation must be separated  
from *external* representation. The external representation of a  
Unicode string, that which is displayed, should probably be UTF-8 for  
all the reasons you give. But internally UTF-8 is suboptimal.

Plan 9 is one example of a system that uses UTF-8 internally.
Sure. Mac OS X is a much more widely used, real-world system that
does not: it uses UTF-16. I don't see UTF-8 as a particularly
efficient representation for Unicode if your primary interest is
string processing. You have no hope of indexing directly into the
string without extra data structures. It only saves you space if you
work in ASCII; otherwise it is either no better than UTF-16 or worse.
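
To make the indexing and space points concrete, here is a minimal
sketch in Python. The helper names utf8_index and encoded_sizes are
made up for illustration, and the sample strings contain only
characters that fit in a single UTF-16 code unit.

```python
def utf8_index(data: bytes, n: int) -> str:
    """Return the n-th character (0-based) of UTF-8 encoded data.

    Characters take 1 to 4 bytes in UTF-8, so without an auxiliary
    index the only way to find the n-th one is to scan from the start.
    """
    count = -1
    start = None
    for i, byte in enumerate(data):
        # Continuation bytes look like 10xxxxxx; anything else starts
        # a new character.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == n:
                start = i
            elif count == n + 1:
                return data[start:i].decode("utf-8")
    if start is None:
        raise IndexError(n)
    return data[start:].decode("utf-8")


def encoded_sizes(text: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count) for text, no BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


if __name__ == "__main__":
    for sample in ("hello", "λόγος", "日本語"):
        print(sample, encoded_sizes(sample))
    print(utf8_index("aλ日b".encode("utf-8"), 2))
```

For the ASCII sample UTF-8 takes half the space of UTF-16, for the
Greek sample the two are equal, and for the Japanese sample UTF-8 is
half again as large.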

Programmers should not need to worry about character representation,
that is, about whether they are dealing with UTF-8 directly or must
transcode to a 'rune' to manipulate the characters.

CLISP's implementation strategy is worth looking at here, I think.

     -tree

---
Tom Emerson
tree at dreamersrealm.net
http://www.dreamersrealm.net/~tree

On Mar 22, 2006, at 8:51 AM, David Tolpin wrote:

>
> On 22/07/5766, at 17:42, Tom Emerson wrote:
>
>> "Real world" applications that were retrofitted to use Unicode use
>> UTF-8 as the internal encoding, because the C runtime string
>> functions "just work" (for various meanings of work) with it, i.e.,
>> strlen() of a UTF-8 string gives a valid number (the number of
>> bytes in the string). strlen() of a UTF-16 string usually gives 0
>> since the first byte of the UTF-16 character is often 0.
>
> UTF-8 is a convenient character encoding of UCS. It is usable for
> hashing, sorting (when sorting need not be lexicographical) and
> string manipulations. It is also handy for viewing the result in an
> editor/viewer -- terminals and editors handle UTF-8 natively.
>
>> Applications that are written with Unicode in mind rarely, if ever,
>> use UTF-8 as an internal encoding. While there is a space savings
>> in many cases, other manipulations are much more difficult because
>> of the multi-byte character representation.
>
> That's not true. Applications written with Unicode in mind do use
> UTF-8 as an internal encoding. The best example is Plan 9 itself,
> written from scratch with Unicode in mind. That's because it is a
> convenient and efficient representation, and memory footprint is
> unrelated to that.
>
> They just don't use UTF-8 representation when random access to
> individual characters is required, providing decoders and encoders.
> When you need strings as integral objects, you keep them in UTF-8;
> when you want to access individual characters, you do something like
>
> (with-runes (runes length) string
> 	....)
>
> or similar to that.
>
> David