[Openmcl-devel] When does (= (length "hî") 3)?

Erik Pearson erik at digative.com
Thu Mar 24 19:36:29 PDT 2011


Hi,

I'm having trouble using the CCL external-format API. It started with malformed unicode output produced by a UTF-8 stream sending content from a web server to a browser. The unicode characters were incorrect and contained extra junk.

It boils down to this:
(length "hî") produces 3 and not 2

(In case that did not translate, the string is the characters "h" and "LATIN SMALL LETTER I WITH CIRCUMFLEX", which is produced on a Mac by pressing option-i and then i.)

This is obviously wrong, and most other lisps (cmucl being an exception) don't show this behavior.

If you are close to this code, this may be enough to go on. 

If not, below I describe what I went through this afternoon to get to this point:

The project for today was to add better unicode support to my web server. I know, I know. It is 2011 and the horse has left the gate, run down the field, retired, sired its offspring, retired again, and been buried. During that process, I added charset detection and proper(ish) setting of the external format for the connection stream. So I was not using the external format functions directly to convert to bytes, but when debugging I got into them.
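For reference, the charset handling amounts to roughly this (a sketch only; set-connection-charset is just a stand-in name for what my handler does, and it assumes the connection stream lets you change its external format on the fly):

(defun set-connection-charset (stream charset)
  ;; CHARSET is a keyword such as :utf-8, as detected for the request.
  ;; Assumes an open character stream that supports
  ;; (setf stream-external-format).
  (setf (stream-external-format stream)
        (ccl:make-external-format :character-encoding charset
                                  :line-termination :unix)))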

So, regarding the above test string "hî", the function encode-string-to-octets produces

#(104 195 131 194 174)

and string-size-in-octets gives

5

The exact code is:

(setf f (make-external-format :character-encoding :utf-8 :line-termination :unix))
(ccl:encode-string-to-octets "hî" :external-format f)
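For completeness, the size figure above came from the same external format (assuming I have the string-size-in-octets argument list right):

(ccl:string-size-in-octets "hî" :external-format f)
;; => 5 here, where I'd expect 3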

Well, my knowledge of unicode and utf-8 was quite low before this adventure, but my cursory reading led me to believe that a unicode character could take between 1 and 4 bytes in utf-8, so at first I had no reason to believe this was incorrect.

Oh, ye of little faith!

So I explored the same thing in a few different contexts:

1. AllegroCL 

CL-USER(3): (string-to-octets "hî" :external-format :utf-8 :null-terminate nil)
#(104 195 174)
3

2. cmucl

(string-to-octets "hî" :external-format :utf-8)
#(104 195 131 194 174)
5

3. DrRacket (scheme)

> (bytes->list (string->bytes/utf-8 "hî"))
'(104 195 174)


4. Online utf-8 tool: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%EE&mode=char

just does one character at a time, so I did the î and got:

hex C3 AE

which is decimal 195 174 (I also double-checked that arithmetic by hand; see the snippet after this list)

5. Another online conversion tool http://rishida.net/tools/conversion/

gives

68 C3 AE

which is decimal 104 195 174
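
And as one more sanity check on item 4, the arithmetic works out by hand: î is U+00EE (decimal 238), which lands in the two-byte utf-8 range 110xxxxx 10xxxxxx, filled in from the code point's bits:

(format nil "~X ~X"
        (logior #xC0 (ash #xEE -6))        ; top bits of U+00EE -> C3
        (logior #x80 (logand #xEE #x3F)))  ; low six bits       -> AE
;; => "C3 AE"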

-- 

So now I know that something is wrong in CCL-land and, interestingly, in cmucl-land too.

Next I rummaged through the CCL source code, finding the relevant stuff in level-1/l1-unicode.lisp.

I played around with the core functions until I found a test case:

(%string-size-in-octets string start end encoding line-termination use-byte-order-mark)

which worked okay when I tested it with dummy values like this:

(setf g (ccl::get-character-encoding (ccl:external-format-character-encoding f)))
(ccl::%string-size-in-octets "hî" 0 2 g :unix nil)

and, wow, it produced 3, not 5. 
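Presumably the difference is that end argument; the same call, but with the end taken from the string itself, gives the bad number back:

(ccl::%string-size-in-octets "hî" 0 (length "hî") g :unix nil)
;; => 5 here -- and the reason why is just below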

When I had been testing with encode-string-to-octets (which is where this code was pulled from), I had NOT been supplying the string end, leaving it as nil. It turns out that encode-string-to-octets determines the string end with
(setq end (check-sequence-bounds string start end)), and that check-sequence-bounds in turn calls (length seq) to find the length ... and then we find the amazing fact that

(length "hî") is 3, not 2,

and that (char "hî" 2) not only exists, but is #\Registered_Sign (decimal 174)

and that (char "hî" 1) is #\Latin_Capital_Letter_A_With_Tilde (decimal 195).
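
A compact way to see all of that at once is to map char-code over the literal:

(map 'list #'char-code "hî")
;; here: (104 195 174)
;; what I would have expected: (104 238), since î is U+00EE = 238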

Now, is it a coincidence that 195 and 174 are the two bytes that make up the utf-8 representation of î? I think not! It looks like utf-8 is being used for the internal string representation (hey, I thought it was utf-32), and that CCL is not providing opacity at the level of char, length, aref, etc., so higher unicode characters are expanding out into multiple array elements.

(You can try this out by copying and pasting characters from http://csbruce.com/~csbruce/software/utf-8.html -- characters with even higher code points will expand to 3 or 4 array elements.)
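
For instance, with a three-byte character like € (U+20AC), if the same expansion is happening I would expect something like:

(length "€")                 ; => 3 instead of 1
(map 'list #'char-code "€")  ; => (226 130 172) instead of (8364)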

Interestingly, cmucl gives (length "hî") as 3, and also provides similar results for char.

acl, clisp, and sbcl give a length of 2, and signal an error on things like (char "hî" 2).

I started digging into CCL to find out how to fix this, but really, honestly, I don't have the time or brain cells left today to do that.

I'm hoping someone close to this code can explain and fix it!

Thanks,
Erik.



