[Openmcl-devel] utf-8 support in openmcl?

Frank Sonnemans openmcl at sonnemans.net
Wed Apr 16 15:41:02 PDT 2003


I was reading up on the Unicode support in CLISP. The implementation notes state:

Only one character set is understood: the platform's native (8-bit) character set. See Chapter 13.
Platform dependent: only in CLISP built with compile-time flag UNICODE.

The following character sets are supported, as values of the corresponding (constant) symbols in the "CHARSET" package:

UCS-2 = UNICODE-16 = UNICODE-16-BIG-ENDIAN, the 16-bit basic multilingual plane of the UNICODE character set. Every character is represented as two bytes.

UNICODE-16-LITTLE-ENDIAN  

UCS-4 = UNICODE-32 = UNICODE-32-BIG-ENDIAN, the 21-bit UNICODE character set. Every character is represented as four bytes.

UNICODE-32-LITTLE-ENDIAN  

UTF-8, the 21-bit UNICODE character set. Every character is represented as one to four bytes. ASCII characters represent themselves and need one byte per character. Most Latin/Greek/Cyrillic/Hebrew characters need two bytes per character. Most other characters need three bytes per character, and the rarely used remaining characters need four bytes per character. This is therefore, in general, the most space-efficient encoding of all of Unicode.

UTF-16, the 21-bit UNICODE character set. Every character in the 16-bit basic multilingual plane is represented as two bytes, and the rarely used remaining characters need four bytes per character. This character set is only available on platforms with GNU libc or GNU libiconv.

UTF-7, the 21-bit UNICODE character set. This is a stateful 7-bit encoding. Not all ASCII characters represent themselves. This character set is only available on platforms with GNU libc or GNU libiconv.


JAVA, the 21-bit UNICODE character set. ASCII characters represent themselves and need one byte per character. All other characters of the basic multilingual plane are represented by \unnnn sequences (nnnn a hexadecimal number) and need 6 bytes per character. The remaining characters are represented by \uxxxx\uyyyy and need 12 bytes per character. While this encoding is very comfortable for editing Unicode files using only ASCII-aware tools and editors, it cannot faithfully represent all UNICODE text. Only text which does not contain \u (backslash followed by lowercase Latin u) can be faithfully represented by this encoding.
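
A few minimal Common Lisp sketches (mine, not from the CLISP notes; all
function names are made up) to make the byte counts above concrete:

  ;; Number of octets UTF-8 needs for a given code point,
  ;; matching the ranges described above.
  (defun utf-8-octet-count (code-point)
    (cond ((< code-point #x80)    1)   ; ASCII
          ((< code-point #x800)   2)   ; most Latin/Greek/Cyrillic/Hebrew
          ((< code-point #x10000) 3)   ; rest of the basic multilingual plane
          (t                      4))) ; rare characters up to #x10FFFF

  ;; The four-byte case of UTF-16: a code point above #xFFFF is split
  ;; into a "surrogate pair" of two 16-bit units.
  (defun utf-16-surrogate-pair (code-point)
    (let ((v (- code-point #x10000)))              ; v fits in 20 bits
      (values (logior #xD800 (ash v -10))          ; high surrogate
              (logior #xDC00 (logand v #x3FF)))))  ; low surrogate

  ;; The JAVA encoding of a basic-multilingual-plane character.
  ;; e.g. (java-escape (code-char #xE9)) => the six characters \u00e9
  (defun java-escape (char)
    (format nil "\\u~(~4,'0x~)" (char-code char)))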



From this description it seems most useful to support UNICODE-32 internally and to convert the other representations to it. It seems to me that UTF-8 can be converted to UNICODE-32 while reading from a file, which to me is conceptually simpler, especially in terms of sorting and string manipulation, as the character size is fixed at 4 bytes.
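
As a rough sketch of the conversion step I have in mind (hypothetical
code, with no validation of malformed input):

  ;; Decode one UTF-8 sequence from a byte vector starting at POS,
  ;; returning the 21-bit code point and the next position.
  (defun decode-utf-8 (bytes pos)
    (flet ((cont (i) (logand (aref bytes (+ pos i)) #x3F)))
      (let ((b (aref bytes pos)))
        (cond ((< b #x80) (values b (+ pos 1)))
              ((< b #xE0) (values (logior (ash (logand b #x1F) 6)
                                          (cont 1))
                                  (+ pos 2)))
              ((< b #xF0) (values (logior (ash (logand b #x0F) 12)
                                          (ash (cont 1) 6)
                                          (cont 2))
                                  (+ pos 3)))
              (t (values (logior (ash (logand b #x07) 18)
                                 (ash (cont 1) 12)
                                 (ash (cont 2) 6)
                                 (cont 3))
                         (+ pos 4)))))))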




On Wednesday, Apr 16, 2003, at 20:50 Europe/Amsterdam, Gary Byers wrote: 




> On Wed, 16 Apr 2003, Tim Moore wrote: 
> 
> 
> 	I took a really quick look at this a few weeks ago.  It looks like there's 
> 	some (possibly vestigial) support for 16 bit general characters and 
> 	strings. I imagine this was to support the CLtL1 notion of character 
> 	attributes. 
> 
> 
> 
> It's left over from MCL's support for various 16-bit encodings 
> (SHIFT-JIS?).  What is now OpenMCL was originally intended to be 
> used in systems (spacecraft) that didn't do much character I/O (and 
> didn't need to do what little they did in non-ROMAN languages). 
> 
> To be honest, I was glad to find an excuse to get rid of EXTENDED-CHAR 
> and EXTENDED-STRING: rightly or wrongly, people have a tendency to 
> say things like: 
> 
>  (coerce foo 'simple-string) 
> 
> when they (probably) often want a SIMPLE-BASE-STRING, and people say 
> things like: 
> 
>  (open "file" :direction :output :element-type 'character ...) 
> 
> and then are surprised to see a lot of ASCII NULs in the resulting 
> file (though one could argue that this is another job for 
> :EXTERNAL-FORMAT). 
> 
> It's also unfortunate if things like CHAR and SCHAR have to deal 
> with two different types of string, and the degenerate case (where 
> they dispatch at runtime in the middle of a loop) isn't as uncommon as 
> one might hope. 
> 
> 
> 
> 	Would that be a viable path to unicode support? I believe that UTF-16 
> 	covers most of the Unicode scripts that people are likely to use in 
> 	practice. 
> 
> 
> My understanding is that UTF-16 is itself an encoding of 21-bit characters, 
> though it's often successful in encoding them in a single (UNSIGNED-BYTE 16). 
> My impression is that in some earlier versions of the Unicode standard 
> "all" characters were directly representable in 16 bits, but that that's 
> no longer true: some pairs of 16-bit values are used to denote larger 
> characters. 
> 
> I think that it's undesirable to use any type of encoding or escaping 
> to represent a CL string: it's desirable that things like LENGTH and 
> CHAR/SCHAR and AREF and ELT and ... be unit-cost operations. 
> 
> I suppose that we might be able to do something like: 
> 
>  - widen the character type to 16 bits 
>  - say that it is an error to use any 16-bit character code that's 
>    used to encode either half of a UTF-16 "surrogate pair" in a CL 
>    string, and try to enforce this restriction where possible. 
> 
> This would make CL strings a useful subset of UTF-16 and would allow 
> primitive string operations to remain primitive. 
> 
> I -think- that this is not too far from what the status quo has been 
> in the C world: in Apple's header files, the "unichar" type is currently 
> defined as "unsigned short" (16-bit), though I also believe that they've 
> announced plans to change this. 
> 
> The other attractive alternative that I can see is to bite the bullet 
> and make CHARACTERs wide enough to support full Unicode; UTF-16 and 
> UTF-8 would then just be interesting external representations for 
> strings. 
> 
> I don't have enough experience dealing with this sort of issue to 
> know how reasonable this opinion is, but something makes me think that 
> a decision to limit characters to 16 bits would appear shortsighted 
> in the near future, especially if the rationale for that decision was 
> to save a few hundred K bytes ... 
> 
> 
> 
> 	Tim 
> 
> 
> 
> 
> 
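
For reference, the restriction Gary describes above (treating either
half of a UTF-16 surrogate pair as an error in a CL string) comes down
to a check like this hypothetical one:

  ;; A 16-bit code would be usable directly in a string under that
  ;; scheme only if it is not in the UTF-16 surrogate range.
  (defun surrogate-code-p (code)
    (<= #xD800 code #xDFFF))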


