[Openmcl-devel] Character Encoding Problem?
Gary Byers
gb at clozure.com
Thu Dec 23 21:02:17 PST 2010
I don't know if you're doing anything wrong. I don't even know
in any detail what you're doing, or how the behavior that you
got differed from what you expected.
The following code uses (MAKE-ARRAY ... :INITIAL-CONTENTS ...) to
initialize a couple of strings to (hopefully) the same values as
in your example (without letting anyone's email client chew things
up for them), writes those strings (with trailing newlines) to a
file in UTF-8, reads them back from the same file (in UTF-8), and
complains if the values read differ from those written. I wouldn't
expect to see any complaint in this or more realistic cases, and in
fact I don't. Do you?
A raw hex dump of the bytes written to the file looks like:
[~] gb at antinomial> hexdump test.txt
0000000 41 42 00 44 0a 4d 61 72 c3 ad 61 0a
000000c
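If you don't have a hexdump utility handy, something like the following
(plain CL, reading the file written by the code at the end of this
message as octets) should print the same bytes:

(with-open-file (f "home:test.txt"
                   :direction :input
                   :element-type '(unsigned-byte 8))
  (loop for byte = (read-byte f nil nil)
        while byte
        ;; prints 41 42 00 44 0A 4D 61 72 C3 AD 61 0A
        do (format t "~2,'0X " byte)))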
All characters whose codes are < 128 are encoded in UTF-8 as a single
byte containing that code; characters with larger codes are encoded
as two or more bytes (and there are constraints on the values these
bytes can have.) The "i with acute" is encoded as #xc3 #xad, as
it should be.
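You can check that two-byte encoding by hand; here's a sketch of the
arithmetic (ordinary CL bit-twiddling, assuming the standard
110xxxxx/10xxxxxx layout for two-byte sequences):

;; Characters with codes in the range 128-2047 use the two-byte form
;; 110xxxxx 10xxxxxx, with the character code split into 5 + 6 bits.
(let ((code (char-code #\Latin_Small_Letter_I_With_Acute)))   ; 237
  (list (logior #b11000000 (ash code -6))                     ; => 195 = #xc3
        (logior #b10000000 (logand code #b111111))))          ; => 173 = #xad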
When we read the file back (in UTF-8), we should recognize the bytes
with values < 128 as representing characters with small codes and the
sequence starting with #xc3 as representing a character with a larger
code. (The value #xc3 tells us, among other things, how many of
the following bytes are part of the encoding of this character.)
<http://en.wikipedia.org/wiki/UTF-8> provides a slightly more
detailed explanation of how UTF-8 works.
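If I remember the API correctly, CCL's in-memory octet conversions
(CCL:ENCODE-STRING-TO-OCTETS and CCL:DECODE-STRING-FROM-OCTETS) let you
look at an encoding without going through a file; something like:

(let* ((s (string #\Latin_Small_Letter_I_With_Acute))
       ;; assumes these exported names; check the documentation if your
       ;; CCL version differs
       (octets (ccl:encode-string-to-octets s :external-format :utf-8)))
  (values octets                                    ; => #(195 173)
          (string= s (ccl:decode-string-from-octets
                      octets :external-format :utf-8))))   ; => T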
When opening the file for input in the code below, we again say
that the file is in an external format which uses UTF-8 encoding.
If we hadn't said that (or had said :EXTERNAL-FORMAT :DEFAULT),
the value of CCL:*DEFAULT-FILE-CHARACTER-ENCODING* would have
been used; the default value of this variable is NIL, which is
equivalent to ISO-8859-1. ISO-8859-1 can only represent characters
whose codes are < 256, but the encoding/decoding are trivial (there's
a 1:1 mapping between bytes/octets in the file and character codes
in the range 0-255.)
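If you'd rather have UTF-8 be the default whenever no :EXTERNAL-FORMAT
(or :EXTERNAL-FORMAT :DEFAULT) is given, something like this (evaluated
before opening the files in question) should do it:

;; Make UTF-8 the process-wide default instead of NIL/ISO-8859-1.
(setq ccl:*default-file-character-encoding* :utf-8)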
If the input file had been opened in ISO-8859-1, then the octets
#xc3 and #xad would each have been interpreted as encodings of
characters with those codes (#\Latin_Capital_Letter_A_With_Tilde
and #\Soft_Hyphen.) If this is what you saw, it's the result of
trying to treat UTF-8-encoded data as if it were encoded in ISO-8859-1,
and this kind of result is typical of encoding mismatches.
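You can reproduce that mismatch deliberately by re-reading the UTF-8
file with the wrong external format (after running the code at the end
of this message); something like:

(with-open-file (f "home:test.txt"
                   :direction :input
                   :external-format :iso-8859-1)
  (read-line f)    ; "AB", #\nul, "D" read back unchanged (codes < 128)
  (read-line f))   ; 6 characters: #xc3 #xad come back as
                   ; #\Latin_Capital_Letter_A_With_Tilde and #\Soft_Hyphen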
If you got some other sort of incorrect result, it's hard to see
how anyone could know why without more information. I can say that
I can't think of any way in which #\nul characters are different from
other characters, and would be very surprised if you've found a case
where #\nul characters affect anything.
(let* ((s1 (make-array 4 :element-type 'character :initial-contents
                       '(#\A #\B #\nul #\D)))
       (s2 (make-array 5 :element-type 'character :initial-contents
                       '(#\M #\a #\r #\Latin_Small_Letter_I_With_Acute #\a))))
  ;; Write both strings (with trailing newlines) to a UTF-8-encoded file.
  (with-open-file (f "home:test.txt"
                     :direction :output
                     :if-exists :supersede
                     :if-does-not-exist :create
                     :external-format :utf-8)
    (write-line s1 f)
    (write-line s2 f))
  ;; Read them back in UTF-8 and complain if anything changed.
  (with-open-file (f "home:test.txt"
                     :direction :input
                     :external-format :utf-8)
    (let* ((r1 (read-line f))
           (r2 (read-line f)))
      (unless (string= s1 r1)
        (format t "~& botched encoding of ~s, got ~s" s1 r1))
      (unless (string= s2 r2)
        (format t "~& botched encoding of ~s, got ~s" s2 r2)))))
On Thu, 23 Dec 2010, Philippe Sismondi wrote:
> In the past day or so I posted a question regarding file :external-format usage before I had learned everything I should have.
>
> However, in attempting to sort out my character encoding problems I have observed some behaviour in ccl which seems problematic to me. This problem relates to the presence of a null character, i.e. #\Null, in a string.
>
> When my function outputs the following two strings (in this order) to a file using external-format :utf-8, the character encoding of the second string gets messed up:
>
> (format out "AB^@D~%")
> (format out "Mar?~%")
>
> In the first string above ^@ represents the null character. Notice that the second string contains an accented i, which is char-code 237. If the null character is not present the second string is encoded properly on output. When the null is there I am getting something or other that is wrong, but I don't really know what
> ccl is trying to do with it.
>
> The nulls are getting into the strings from external binary files that I am parsing. Either the input data is corrupt, or my parser is buggy. In any case, the string containing the null was output a thousand lines or so before the messed up string, so it took me a long time to find the connection.
>
> However the nulls got in my strings, it does not seem right to me that the character encoding on output should be affected by this. At least, I tried the same thing in sbcl and did not observe this behaviour.
>
> Is this a bug? Or am I doin' it wrong?
>
> Best,
>
> - Phil -
>