[Openmcl-devel] When does (= (length "hî") 3)?

Thu Mar 24 22:03:56 PDT 2011

On Thu, 24 Mar 2011, Erik Pearson wrote:

> Hi,
>
> I'm having trouble using the ccl external format api. It started with malformed unicode output, produced using a UTF-8 stream to output content from a web server to a browser. The unicode characters were incorrect, and contained extra junk.
>
> it boils down to this:
> (length "hî") produces 3 and not 2
>
> (in case that did not translate, that is the characters "h" and "LATIN SMALL LETTER I WITH CIRCUMFLEX", which is produced on a Mac by pressing option-i then i.)

? (coerce (list #\h #\latin_small_letter_i_with_circumflex) 'string)
"hî"
? (length *)
2

>
> This is obviously wrong, and most other lisps (cmucl being an exception) don't show this behavior.

CCL doesn't exhibit the behavior that you attribute to it.
>
> If you are close to this code, this may be enough to go on.
>
> If not, below I describe what I went through this afternoon to get to this point:
>
> The project for today was to add better unicode support to my web server. I know, I know. It is 2011 and the horse has left the gate, run down the field, retired, sired its offspring, retired again, and been buried. During that process, I added charset detection and proper(ish) setting of the external format for the connection stream. So I was not using the external format functions directly to convert to bytes, but when debugging I got into them.
>
> So, regarding the above test string "hî", the function encode-string-to-octets produces
>
> #(104 195 131 194 174)
>
> and string-size-in-octets gives
>
> 5

? (defvar *example* (coerce (list #\h #\latin_small_letter_i_with_circumflex) 'string))
*EXAMPLE*
?  (encode-string-to-octets *example* :external-format :utf-8)
#(104 195 174)
3
?
? (string-size-in-octets *example* :external-format :utf-8)
3

Rather than go on with this any further ...  I suspect that whatever
program you're using to display output and provide input to CCL (e.g.,
a terminal program, Emacs, ...) is configured to use UTF-8, but CCL
doesn't know this and is just sending and receiving ISO-8859-1.  When
CCL reads the constant string, the "terminal" (Emacs) encodes that string
in UTF-8 and sends the 3 octets 104, 195, 174; CCL interprets these octets
as ISO-8859-1 character codes and constructs a 3-character string whose
characters happen to have those codes.  If you print this string to the
"terminal", CCL will (trivially) encode the 3-character string in ISO-8859-1
and send those octets.  The terminal application will interpret those 3 octets
as the UTF-8 encoding of the original 2-character string, and it won't be
obvious that information's being lost in (inappropriate) translation.
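
The round trip described above can be reproduced outside of CCL.  What follows
is a minimal sketch in Python (not Lisp, and not CCL's actual I/O code), using
only the standard str.encode and bytes.decode methods; the variable names are
made up for illustration.  It also shows that encoding the misread 3-character
string as UTF-8 yields the 5 octets reported earlier in this thread.

```python
# Illustration only: a Python stand-in for the terminal/CCL encoding mismatch.
original = "h\u00ee"  # "h" + LATIN SMALL LETTER I WITH CIRCUMFLEX

# The terminal sends the 2-character string as UTF-8.
sent = original.encode("utf-8")
print(list(sent))                          # [104, 195, 174]

# The reader treats each octet as one ISO-8859-1 character code.
misread = sent.decode("iso-8859-1")
print(len(misread))                        # 3

# Printing back through ISO-8859-1 reproduces the same octets, so a
# UTF-8 terminal displays the original text and hides the problem.
echoed = misread.encode("iso-8859-1")
print(echoed.decode("utf-8") == original)  # True

# Encoding the misread string as UTF-8 double-encodes it.
double = misread.encode("utf-8")
print(list(double))                        # [104, 195, 131, 194, 174]
```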

Nothing else is going on here besides the fact that CCL isn't using
the same character encoding for I/O to the terminal as the terminal's
using for I/O to it.  There isn't generally a good way for CCL to
guess the correct encoding (though you could make the argument that
UTF-8 would be a better default), so you have to tell it what
character encoding it should use for terminal I/O.  The -K command-line
argument is the best way to do that; you need to start CCL via a command like:

$ ccl -K utf-8

If you do so, you'll probably find that things work a lot better than you
thought that they did, and you probably won't try to convince yourself that
fairly simple, fundamental stuff is broken and that you're somehow the only
person to notice that.
>
> I started digging into CCL to find out how to fix this, but really, honestly, I don't have the time or brain cells left today to do that.

I'm not going to touch that with a 10 foot pole.  It'd be like shooting fish
in a barrel, just not challenging at all.

>
> I'm hoping someone close to this code can explain and fix it!

I hope that you find the explanation above clear.

If the default terminal encoding in CCL was :utf-8, the bad news would be
that when that was wrong (if the terminal was actually using :iso-8859-1
or some other legacy 8-bit encoding) there'd be lots of cases where what
was received wasn't valid utf-8 and it'd be more obvious that information
was getting lost.  That might also be the good news, in that it might
make the real issue more obvious.
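
This difference in failure modes can be illustrated outside of CCL.  The
following is a small Python sketch, not CCL code, and it assumes a strict
decoder; how CCL itself reports or replaces malformed UTF-8 input is not
shown here.  The variable names are hypothetical.

```python
# Illustration only: octets from an ISO-8859-1 terminal read as UTF-8.
latin1_octets = "h\u00ee".encode("iso-8859-1")
print(list(latin1_octets))   # [104, 238]

try:
    latin1_octets.decode("utf-8")   # strict decoding rejects the lone 238
    outcome = "decoded"
except UnicodeDecodeError:
    outcome = "invalid UTF-8"
print(outcome)               # invalid UTF-8

# The opposite mismatch never fails: every octet is a valid
# ISO-8859-1 character code, so the error passes silently.
silently_wrong = "h\u00ee".encode("utf-8").decode("iso-8859-1")
print(len(silently_wrong))   # 3
```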

>
> Thanks,
> Erik.