[Openmcl-devel] On file encoding

Thu Dec 23 08:45:41 PST 2010

Gary's response is pretty spot on: generally you need to rely on out-of-band
data to know which encoding to use when reading the bytes in the file.

If you know the file contains Unicode and just need to identify the encoding
form, it can be done pretty easily:

1. Look for a BOM in the first four bytes: UTF-16 and UTF-32 will have
unique sequences, and some editors (especially on Windows) include the UTF-8
encoded BOM even though it isn't necessary.

2. If there is no BOM, you can sniff the bytes and look for likely
sequences. UTF-8 is easily identified because of the bit-patterns on the
individual bytes. UTF-16/UTF-32 can be ambiguous: the sequence #x00 #x4E
could be big-endian 'N' or little-endian #\U+4E00 (the Chinese character for
"one"). It is conceivable that you could have multiple ideographs whose
low-order byte is #x00 in succession, but this would be pretty rare: if
every other byte is #x00 (or virtually every other byte) then you can guess
endianness that way.

More difficult is interpreting files in other encodings: without external
info you're in trouble. Consider the ISO 8859-x series (which are actually
character sets, not encodings, but for all intents and purposes that is
irrelevant here) --- these share the same encoding space: the character with
byte value #xD6 is capital O + umlaut in 8859-1, but the Arabic letter 'dad'
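in 8859-6. It gets worse if you add the Windows code pages: ISO 8859-1 and
CP1252 (the Latin-1 character sets) are not interchangeable: CP1252 is
effectively a superset of 8859-1 that assigns printable characters (curly
quotes, the euro sign, and so on) to the #x80-#x9F range, which 8859-1
leaves to control codes.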
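
You can see the shared encoding space for yourself with a few lines of
Python. This is only an illustration using Python's built-in codecs; the
helper name describe is made up for the example:

```python
import unicodedata


def describe(byte_value, encoding):
    """Decode one byte under `encoding` and name the resulting character."""
    ch = bytes([byte_value]).decode(encoding)
    # Control characters have no Unicode name, so supply a placeholder.
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed control>"))


# The same byte value means different characters in different 8859 parts.
for enc in ("iso-8859-1", "iso-8859-6"):
    print("%-10s #xD6 -> %s" % (enc, describe(0xD6, enc)))

# 8859-1 and CP1252 agree on #xD6 but disagree in the #x80-#x9F range.
for enc in ("iso-8859-1", "cp1252"):
    print("%-10s #x80 -> %s" % (enc, describe(0x80, enc)))
```

Every one of these decodes succeeds, which is exactly the problem: nothing
in the bytes themselves tells you that you picked the wrong table.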

So what can you do? There are approaches to identifying
encodings programmatically: essentially you build a profile for each
encoding you want to be able to identify and then, for a given unknown
sequence of bytes, generate a profile for it and compare it with your known
profiles. When I was at Basis Technology we built a system that could do
this, but the basic idea is straightforward. Cavnar and Trenkle published a
paper in 1994 ("N-Gram-Based Text Categorization") that outlines a simple
but effective approach utilizing n-grams.
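
Here is a toy sketch of the profile-and-compare idea in Python, working on
byte n-grams and using a rank-based "out-of-place" distance in the spirit of
Cavnar and Trenkle. It is not the Basis Technology system, and the function
names, the profile size of 300, and the tiny training text are all
assumptions made for this example; a usable identifier needs far more
training data per encoding:

```python
from collections import Counter


def profile(data, max_n=3, size=300):
    """Map the `size` most frequent byte n-grams (n = 1..max_n) to ranks."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    # Most frequent first; ties broken by the bytes so ranking is stable.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:size]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(unknown, known, max_penalty=300):
    """Sum of rank differences; n-grams absent from `known` cost the max."""
    total = 0
    for gram, rank in unknown.items():
        if gram in known:
            total += abs(rank - known[gram])
        else:
            total += max_penalty
    return total


def identify(data, known_profiles):
    """Return the name of the known profile closest to `data`."""
    unknown = profile(data)
    return min(known_profiles,
               key=lambda name: out_of_place(unknown, known_profiles[name]))


# Hypothetical training text: the same German words in two encodings.
training = "Größe, Müller, Öl, über, schön, Straße, für, Käse. " * 20
known = {enc: profile(training.encode(enc))
         for enc in ("iso-8859-1", "utf-8")}

sample = "Schöne Grüße für Müller".encode("utf-8")
print(identify(sample, known))
```

In practice you build one profile per (language, encoding) pair, since the
byte statistics depend on both.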

Of course, once you've identified the encoding, you need to transcode it into
Unicode. CCL's coverage for this is pretty small, though one of my pet
projects over the holidays is to expand the coverage a bit, especially for
Asian languages (my personal itch).

Feel free to ask questions: I deal with these all the time.

    -tree

-- 
Tom Emerson
tremerson at gmail.com
http://treerex.blogspot.com/