[Openmcl-devel] Plans for Unicode support within OpenMCL?

Wed Mar 22 05:51:12 PST 2006

On 22/07/5766, at 17:42, Tom Emerson wrote:

> "Real world" applications that were retrofitted to use Unicode use  
> UTF-8 as the internal encoding, because the C runtime string  
> functions "just work" (for various meanings of work) with it, i.e.,  
> strlen() of a UTF-8 string gives a valid number (the number of  
> bytes in the string). strlen() of a UTF-16 string usually gives 0  
> since the first byte of the UTF-16 character is often 0.

UTF-8 is a convenient character encoding of UCS. It is usable for
hashing, for sorting (when plain code point order is enough and
language-aware collation is not needed) and for string manipulations.
It is also handy for viewing the result in an editor or viewer --
most terminals and editors handle UTF-8 natively.

> Applications that are written with Unicode in mind rarely, if ever,  
> use UTF-8 as an internal encoding. While there is a space savings  
> in many cases, other manipulations are much more difficult because  
> of the multi-byte character representation.

That's not true. Applications written with Unicode in mind do use
UTF-8 as an internal encoding. The best example is Plan 9, which was
written from scratch with Unicode in mind. It uses UTF-8 because it
is a convenient and efficient representation; memory footprint has
nothing to do with that choice.

They just don't use the UTF-8 representation where random access to
individual characters is required; for that they provide decoders and
encoders. When you need strings as integral objects, you keep them in
UTF-8; when you want to access individual characters, you do
something like

(with-runes (runes length) string
	....)

or similar to that.
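
To make the idea concrete, here is a minimal sketch of what such a
macro could look like in portable Common Lisp. It is only an
illustration under stated assumptions, not an existing OpenMCL or
Plan 9 API: the string is assumed to be held as a vector of UTF-8
octets, the input is assumed to be well-formed (nothing is validated),
and the names utf8-decode, runes-var and length-var are hypothetical.

```lisp
;; Hypothetical helper: decode a vector of UTF-8 octets into a vector
;; of code points ("runes"). Assumes well-formed UTF-8; no validation.
(defun utf8-decode (octets)
  (let ((runes (make-array (length octets) :fill-pointer 0))
        (n (length octets))
        (i 0))
    (loop while (< i n)
          do (let* ((b (aref octets i))
                    ;; Number of continuation octets implied by the lead octet.
                    (extra (cond ((< b #x80) 0)
                                 ((< b #xE0) 1)
                                 ((< b #xF0) 2)
                                 (t 3)))
                    ;; Payload bits carried by the lead octet itself.
                    (rune (if (zerop extra)
                              b
                              (logand b (ash #x7F (- (1+ extra)))))))
               (incf i)
               ;; Each continuation octet contributes its low 6 bits.
               (loop repeat extra
                     do (setf rune (logior (ash rune 6)
                                           (logand (aref octets i) #x3F)))
                        (incf i))
               (vector-push rune runes)))
    runes))

;; Hypothetical macro with the shape used above: bind RUNES-VAR to the
;; decoded code points and LENGTH-VAR to their count, then run BODY.
(defmacro with-runes ((runes-var length-var) utf8 &body body)
  `(let* ((,runes-var (utf8-decode ,utf8))
          (,length-var (length ,runes-var)))
     (declare (ignorable ,runes-var ,length-var))
     ,@body))

;; Usage: "héllo" is 6 octets in UTF-8 but 5 runes; rune 1 is U+00E9.
(let ((octets (vector 104 195 169 108 108 111)))
  (with-runes (runes n) octets
    (list (length octets) n (aref runes 1))))
;; => (6 5 233)
```

The string stays in UTF-8 everywhere else; the decoded vector exists
only inside the body, which is the one place random access is needed.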

David