[Openmcl-devel] On file encoding

Gary Byers gb at clozure.com
Wed Dec 22 22:27:35 PST 2010


In general it's not possible to know with absolute certainty how a
file's encoded by looking at its contents.  If you're willing to
forgo "absolute certainty", there are sometimes heuristics that'd
provide a pretty strong hint.

If a file contains this sequence of 8-bit bytes:

#x68 #x00 #x65 #x00 #x6c #x00 #x6c #x00 #x6f #x00

then it's quite likely that those bytes encode the string "hello"
in either UCS-2LE or UTF-16LE (we can't be sure which of the two,
though it doesn't matter in this case.)  It's possible -
though unlikely - that the file contains a string where every other
character is a #\nul; that's only unlikely because most text files
don't contain #\nul characters.  It's also possible that the file
contains a big-endian 16-bit representation of the (CJK) characters
with codes #x6800, #x6500, etc.; how likely that is probably depends
on whether that sequence of characters "makes sense" in some context.
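
To make that heuristic concrete, here's a small sketch in Common Lisp
(mine, not anything CCL provides; LOOKS-LIKE-UTF-16LE-P is a made-up
name): read an initial sample of the file's bytes and guess
"UTF-16LE-ish" when every odd-indexed byte is zero, which is exactly
the pattern in the "hello" bytes above.

(defun looks-like-utf-16le-p (pathname &key (sample-size 512))
  "Heuristically guess whether PATHNAME contains UTF-16LE text, by
checking whether every odd-indexed byte in an initial sample is zero
(true of ASCII text encoded as UTF-16LE).  A hint, not a guarantee."
  (with-open-file (in pathname :element-type '(unsigned-byte 8))
    (let* ((len (min sample-size (file-length in)))
           (bytes (make-array len :element-type '(unsigned-byte 8))))
      (read-sequence bytes in)
      (and (>= len 2)
           (evenp len)
           (loop for i from 1 below len by 2
                 always (zerop (aref bytes i)))))))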

Byte-order marks can sometimes provide stronger hints: if a file
begins with the bytes #xfe #xff, then it's certainly possible and
perhaps likely that the file is encoded in big-endian UCS-2 or UTF-16
(and the #xfe #xff is a byte-order mark that serves to indicate
that.)  Even a BOM can be ambiguous, though: a file that begins with
#xff #xfe #x00 #x00 might carry the little-endian UCS-4/UTF-32
byte-order mark, or might be UTF-16LE text whose first character is a
#\nul, and the bytes #xfe and #xff are perfectly valid in a lot of
8-bit encodings (though they can never appear in UTF-8).  I think
that people have probably written utilities that try to sniff a
file's encoding from its contents; such utilities are probably pretty
useful in practice, but they can't be foolproof.
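
Here's a sketch of BOM sniffing along those lines (again mine, not a
CCL facility; SNIFF-BOM is a made-up name).  Note that the UTF-32LE
test has to precede the UTF-16LE test, precisely because of the
ambiguity just described:

(defun sniff-bom (pathname)
  "Look at the first four bytes of PATHNAME and return a keyword
naming the encoding suggested by a byte-order mark, or NIL if none
is recognized.  As discussed above, even a BOM is only a hint."
  (with-open-file (in pathname :element-type '(unsigned-byte 8))
    (let* ((buf (make-array 4 :element-type '(unsigned-byte 8)
                              :initial-element 0))
           (n (read-sequence buf in)))
      (flet ((b (i) (when (> n i) (aref buf i))))
        (cond ;; #xef #xbb #xbf: the UTF-8 "BOM" (really a signature)
              ((and (eql (b 0) #xef) (eql (b 1) #xbb) (eql (b 2) #xbf))
               :utf-8)
              ;; #x00 #x00 #xfe #xff: big-endian UCS-4/UTF-32
              ((and (eql (b 0) #x00) (eql (b 1) #x00)
                    (eql (b 2) #xfe) (eql (b 3) #xff))
               :utf-32be)
              ;; #xff #xfe #x00 #x00: little-endian UCS-4/UTF-32 -
              ;; must be tested before the UTF-16LE case, and could
              ;; still be UTF-16LE text starting with a #\nul
              ((and (eql (b 0) #xff) (eql (b 1) #xfe)
                    (eql (b 2) #x00) (eql (b 3) #x00))
               :utf-32le)
              ((and (eql (b 0) #xfe) (eql (b 1) #xff)) :utf-16be)
              ((and (eql (b 0) #xff) (eql (b 1) #xfe)) :utf-16le)
              (t nil))))))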

The only ways to reliably know how a file/stream is encoded depend on
out-of-band information.  Network protocols often provide ways of
conveying this, and some filesystems provide ways of remembering a
file's encoding as an "extended attribute"; a file saved from the
Cocoa IDE will have its encoding saved as an extended attribute (and
this will generally be remembered when the file is next opened in
the editor.)

[To see this: open a new buffer in the Cocoa IDE, type some text into
it, and save it as "foo.lisp" in your home directory; when saving, choose
"utf-16" from the popup menu.  Close the window; open the file again, and
note that the editor recognized the file's encoding.  In the shell, doing:

$ xattr -vl ~/foo.lisp

will produce output like:

/Users/gb/foo.lisp: com.apple.TextEncoding: utf-16;256

.]
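
For what it's worth, the same attribute can be read from Lisp by
shelling out to xattr(1) with its -p ("print one attribute") option;
here's a sketch using CCL's RUN-PROGRAM (FILE-TEXT-ENCODING-XATTR is
a made-up name):

(defun file-text-encoding-xattr (pathname)
  "Return the value of PATHNAME's com.apple.TextEncoding extended
attribute as a string (e.g. \"utf-16;256\"), or NIL if the attribute
is absent.  Shells out to xattr(1), so OS X only."
  (let* ((out (make-string-output-stream))
         (proc (ccl:run-program "xattr"
                                (list "-p" "com.apple.TextEncoding"
                                      (namestring pathname))
                                :output out :error nil)))
    (when (eql 0 (nth-value 1 (ccl:external-process-status proc)))
      (string-trim '(#\newline #\space)
                   (get-output-stream-string out)))))

Given the file saved above, (file-text-encoding-xattr
"/Users/gb/foo.lisp") would return "utf-16;256".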


OS X has supported extended attributes since (IIRC) 10.4; at least some
other Unix variants also provide them (though I don't know what conventions
- if any - other systems use to record a file's character encoding, or what
applications follow those conventions.)

Utilities that copy files and directories around (around the filesystem,
around the network) may or may not copy extended attributes as well (or
may require non-default options to do so.)

The use of extended attributes described above is part of the Cocoa 
text/document implementation.  It might be nice if things below the Cocoa
layer (LOAD, OPEN, COMPILE-FILE) also consulted extended attributes when
the file's external-format isn't explicitly specified.
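
Nothing below the Cocoa layer does that today; purely as an
illustration of what user code could do in the meantime, here's a
sketch that builds on the made-up FILE-TEXT-ENCODING-XATTR above and
maps just two common attribute values:

(defun open-consulting-xattr (pathname &rest open-args)
  "Like OPEN, but when the caller doesn't supply an :EXTERNAL-FORMAT,
consult the file's com.apple.TextEncoding attribute (if any) to pick
one.  Only two common attribute values are mapped here; a real
version would cover the full set of CFStringEncoding names."
  (if (getf open-args :external-format)
      (apply #'open pathname open-args)
      (let* ((attr (file-text-encoding-xattr pathname))
             ;; the attribute looks like "utf-16;256"; keep the name
             (name (and attr (subseq attr 0 (position #\; attr))))
             (encoding (cond ((null name) nil)
                             ((string-equal name "utf-8") :utf-8)
                             ((string-equal name "utf-16") :utf-16)
                             (t nil))))
        (if encoding
            (apply #'open pathname :external-format encoding open-args)
            (apply #'open pathname open-args)))))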

The general observation - that you can't be sure of a file's encoding by
looking at its contents and the only ways to be sure depend on out-of-band
information - isn't at all specific to CCL.


On Thu, 23 Dec 2010, peter wrote:

> <muddy comprehension alert>
>
> From http://ccl.clozure.com/manual/chapter4.5.html and the sources, I cannot 
> divine how, given a file, to assess what character encoding it uses.
>
> In "4.5.4.2. Byte Order Marks" it says "If a byte order mark is missing from 
> input data, that data is assumed to be in big-endian order.".  So presumably 
> any file prefix characters that might give encoding clues are optional.
>
> I'm coming at this from the perspective of receiving a file of unknown
> origin and needing to know what encoding it uses.  Assume then that the
> CCL way is to use UTF-32 internally.
>
> I assume that incoming file character encoding could be anything (the file 
> being made by unknown other apps), as could line ending characters.
>
> Separately, there is the matter of intent, both of the file producer and
> the file user.
> In make-external-format there's the arg: "domain---This is used to indicate 
> where the external format is to be used. Its value can be almost anything.". 
> Which makes me wonder if what I'm after is fantasy.
>
> I assume my app user needs to be put in (or to choose) some encoding format
> for the natural language interface he expects, but from there on the user
> has no prior knowledge of incoming file formats.  These may have been
> re-purposed.
>
> I'm hoping there is some sort of inverse of describe-character-encodings, a 
> function that will take a file as argument and return an analysis of 
> character encoding, line termination characters, and default intents (maybe 
> that's pure fantasy on my part, but isn't every file created with inferred 
> uses anticipated/intended).
>
> I can't rely on OS-level clues like the file type suffix, as these could be
> changed or elided.  I assume I need to analyze the file contents, even if
> these may have been mixed by file concatenation or by undisciplined editors'
> handling of pastes.  And most non-English encodings leave me clueless; I
> generally don't know what I'm looking at.
>
> The stream encoding functions can only be used after I've already opened a
> file, so I assume I need to read raw bytes and pattern-match.  But as many
> others must need the same, I'm hoping not to re-invent wheels.
>
> Any clarifications or pointers would be most welcome.