[Openmcl-devel] When does (= (length "hî") 3)?

Fri Mar 25 00:15:21 PDT 2011

Thanks, Gary, for the solution and the pointers. I think I begin to see. See that I'm an idiot.

- only for other idiots and fools - 

The answer is "when your ccl is in iso-8859-1 default character encoding but your terminal is in UTF-8."

If only I had used the "reopen using encoding" feature of BBEdit or TextMate to check the output to a file created with supposedly unicode text, I would have seen that in utf-8 mode, the editor text looked normal, but that in 8859-1 mode it looked just like ccl "said" it should look by inspecting the string array - three characters, and the same ones reported by char.

And now I understand why I was so confused.

I WAS using OS X terminal in UTF-8 mode, and CCL in 8859-1 mode. I type in the character "LATIN SMALL LETTER I WITH CIRCUMFLEX", which happens to be two bytes in utf-8, and the terminal program generates the two bytes and sends them on the terminal io stream. CCL, and more specifically the terminal io stream, is in default ISO-8859-1 mode, so for "hî" it receives 3 bytes (one for the "h", two for the "î"), which it dutifully stores as three characters in the string after I press the Return key. When it echoes the string back to the terminal, it shows up again as two characters, because the terminal displays the last two bytes as a single character according to utf-8. So to appearances, CCL is understanding and echoing utf-8. But under the hood, it is really 8859-1 (or perhaps utf-32 translated from and to 8859-1).
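
The mismatch can be reproduced at the REPL without involving a terminal at all. What follows is only a minimal sketch, assuming a CCL listener: encode-string-to-octets is the CCL function already used in this thread, while the variable names are made up for illustration. It decodes the UTF-8 octets of "hî" the way an ISO-8859-1 stream would, then re-encodes the result as UTF-8.

```lisp
;; Sketch only: simulates the terminal/CCL mismatch without real terminal I/O.
;; These are the UTF-8 octets for "hî" (104, 195, 174) quoted in Gary's reply.
(defparameter *utf-8-octets* #(104 195 174))

;; ISO-8859-1 maps each octet to the character with the same code,
;; so "decoding" is just CODE-CHAR applied to every octet.
(defparameter *misread*
  (map 'string #'code-char *utf-8-octets*))

(length *misread*)                  ; 3, not 2
(map 'list #'char-code *misread*)   ; (104 195 174)

;; Encoding the misread string as UTF-8 encodes the two high characters
;; a second time, so the octet count grows from 3 to 5.
(encode-string-to-octets *misread* :external-format :utf-8)
```

The last form should yield the five octets 104 195 131 194 174 from the original report: the non-ASCII character has been UTF-8 encoded twice.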

I think somewhere in my riddled brain I must have assumed some magical connection between the terminal and ccl -- so that ccl would "know" that the terminal meant "latin small letter i with circumflex" when it was typed.  (Well, I suppose one could enhance the startup script to parse the LANG env variable and set things up through -K and -e ...)
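
As a rough illustration of the LANG idea, here is a sketch of just the locale-to-encoding mapping. The function guess-encoding-from-lang is hypothetical and not part of CCL, and falling back to :iso-8859-1 is an assumption that mirrors the default described in this thread. Since -K is read when ccl starts, a real wrapper would presumably have to do the equivalent before launching the lisp; this only shows the parsing.

```lisp
;; Hypothetical helper, not part of CCL: guess an encoding keyword from a
;; POSIX-style locale string such as "en_US.UTF-8" or "de_DE.utf8@euro".
(defun guess-encoding-from-lang (lang)
  "Return :UTF-8 if LANG names a UTF-8 codeset, otherwise :ISO-8859-1."
  (let* ((dot (and lang (position #\. lang)))
         ;; An optional "@modifier" may follow the codeset.
         (at (and dot (position #\@ lang :start dot)))
         (codeset (and dot (subseq lang (1+ dot) at))))
    (if (and codeset
             ;; STRING-EQUAL ignores case, so "utf8" and "UTF-8" both match.
             (member codeset '("UTF-8" "UTF8") :test #'string-equal))
        :utf-8
        :iso-8859-1)))

(guess-encoding-from-lang "en_US.UTF-8")   ; :UTF-8
(guess-encoding-from-lang "C")             ; :ISO-8859-1
```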

Really diabolical, if you ask me, but it makes perfect sense, of course, as everything in the lisp world does eventually.

I should have remembered what I tell my kids -- if you think the entire world is wrong, it is probably you, honey. Or, if things don't make sense to you, it is probably you, and not the things.

Cheers to Gary for patiently enduring my offense to ccl, and taking the time to explain.

Now to grab a beer and break something else!

Erik.

On Mar 24, 2011, at 10:03 PM, Gary Byers wrote:

> 
> 
> On Thu, 24 Mar 2011, Erik Pearson wrote:
> 
>> Hi,
>> 
>> I'm having trouble using the ccl external format api. It started with malformed unicode output, produced using a UTF-8 stream to output content from a web server to a browser. The unicode characters were incorrect, and contained extra junk.
>> 
>> it boils down to this:
>> (length "hî") produces 3 and not 2
>> 
>> (in case that did not translate, that is the characters "h" and "LATIN SMALL LETTER I WITH CIRCUMFLEX" which is produced on a Mac with pressing option-i then i.)
> 
> 
> ? (coerce (list #\h #\latin_small_letter_i_with_circumflex) 'string)
> "hî"
> ? (length *)
> 2
> 
> 
>> 
>> This is obviously wrong, and most other lisps (cmucl being an exception) don't show this behavior.
> 
> CCL doesn't exhibit the behavior that you attribute to it.
>> 
>> If you are close to this code, this may be enough to go on.
>> 
>> If not, below I describe what I went through this afternoon to get to this point:
>> 
>> The project for today was to add better unicode support to my web server. I know, I know. It is 2011 and the horse has left the gate, run down the field, retired, sired its offspring, retired again, and been buried. During that process, I added charset detection and proper(ish) setting of the external format for the connection stream. So I was not using the external format functions directly to convert to bytes, but when debugging I got into them.
>> 
>> So, regarding the above test string "hî", the function encode-string-to-octets produces
>> 
>> #(104 195 131 194 174)
>> 
>> and string-size-in-octets gives
>> 
>> 5
> 
> ? (defvar *example* (coerce (list #\h #\latin_small_letter_i_with_circumflex) 'string))
> *EXAMPLE*
> ?  (encode-string-to-octets *example* :external-format :utf-8)
> #(104 195 174)
> 3
> ?
> ? (string-size-in-octets *example* :external-format :utf-8)
> 3
> 
> 
> Rather than go on with this any further ...  I suspect that whatever
> program you're using to display output and provide input to CCL (e.g.,
> a terminal program, Emacs, ...) is configured to use UTF-8, but CCL
> doesn't know this and is just sending and receiving ISO-8859-1.  When
> CCL reads the constant string, the "terminal" (Emacs) encodes that string
> in UTF-8 and sends the 3 octets 104, 195, 174; CCL interprets these octets
> as ISO-8859-1 character codes and constructs a 3-character string whose
> characters happen to have those codes.  If you print this string to the
> "terminal", CCL will (trivially) encode the 3-character string in ISO-8859-1
> and send those octets.  The terminal application will interpret those 3 octets
> as the UTF-8 encoding of the original 2-character string, and it won't be
> obvious that information's being lost in (inappropriate) translation.
> 
> Nothing else is going on here besides the fact that CCL isn't using
> the same character encoding for I/O to the terminal as the terminal's
> using for I/O to it.  There isn't generally a good way for CCL to
> guess the correct encoding (though you could make the argument that
> UTF-8 would be a better default), so you have to tell it what
> character encoding it should use for terminal I/O.  The -K command-line
> argument is the best way to do that; you need to start CCL via a command like:
> 
> $ ccl -K utf-8
> 
> If you do so, you'll probably find that things work a lot better than you
> thought that they did, and you probably won't try to convince yourself that
> fairly simple, fundamental stuff is broken and that you're somehow the only
> person to notice that.
>> 
>> I started digging into CCL to find out how to fix this, but really, honestly, I don't have the time or brain cells left today to do that.
> 
> I'm not going to touch that with a 10 foot pole.  It'd be like shooting fish
> in a barrel, just not challenging at all.
> 
>> 
>> I'm hoping someone close to this code can explain and fix it!
> 
> I hope that you find the explanation above clear.
> 
> If the default terminal encoding in CCL was :utf-8, the bad news would be
> that when that was wrong (if the terminal was actually using :iso-8859-1
> or some other legacy 8-bit encoding) there'd be lots of cases where what
> was received wasn't valid utf-8 and it'd be more obvious that information
> was getting lost.  That might also be the good news, in that it might
> make the real issue more obvious.
> 
>> 
>> Thanks,
>> Erik.
>> 
>> _______________________________________________
>> Openmcl-devel mailing list
>> Openmcl-devel at clozure.com
>> http://clozure.com/mailman/listinfo/openmcl-devel
>> 
>>