[Openmcl-devel] On file encoding

peter p2.edoc at googlemail.com
Thu Dec 23 05:00:06 UTC 2010


<muddy comprehension alert>

From http://ccl.clozure.com/manual/chapter4.5.html and the sources, I 
cannot divine how, given a file, to assess what character encoding it 
uses.

In "4.5.4.2. Byte Order Marks" it says "If a byte order mark is 
missing from input data, that data is assumed to be in big-endian 
order."  So presumably any file prefix characters that might give 
encoding clues are optional.
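
To make the byte-order-mark point concrete, here is a minimal sketch of 
BOM sniffing on raw bytes. It is written in Python rather than Lisp, and 
the names sniff_bom and BOM_TABLE are hypothetical, not anything CCL 
provides. It only reports a BOM that is actually present; since the mark 
is optional, a result of "none" says nothing about the real encoding.

```python
import codecs

# Longest signatures first: the UTF-32-LE BOM (FF FE 00 00) begins with
# the UTF-16-LE BOM (FF FE), so the order of this table matters.
BOM_TABLE = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding_name, bom_length), or (None, 0) if no BOM is found."""
    for bom, name in BOM_TABLE:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

One known ambiguity: UTF-16-LE data whose first character after the BOM 
is U+0000 looks the same as a UTF-32-LE BOM, so even this check is a 
heuristic.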

I'm coming at this from the perspective of receiving a file of 
unknown origin and needing to know what encoding it uses. I then 
assume the CCL way is to use UTF-32 internally.

I assume that incoming file character encoding could be anything (the 
file being made by unknown other apps), as could line ending 
characters.

Separately, there is the matter of intent, both of the file producer 
and the file user.
In make-external-format there's the arg: "domain---This is used to 
indicate where the external format is to be used. Its value can be 
almost anything."  Which makes me wonder if what I'm after is fantasy.

I assume my app user needs to be put in (or choose) some encoding 
format for the natural language interface he expects, but from there 
on the user has no prior knowledge of incoming file formats.  These 
files may be getting re-purposed.

I'm hoping there is some sort of inverse of 
describe-character-encodings, a function that will take a file as 
argument and return an analysis of character encoding, line 
termination characters, and default intents (maybe that's pure 
fantasy on my part, but isn't every file created with inferred uses 
anticipated/intended?).
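
The line-termination part of such an analysis can be sketched on raw 
bytes. This is a minimal Python illustration, not a CCL facility; the 
name count_line_endings is hypothetical, and it assumes an 
ASCII-compatible encoding (ASCII, UTF-8, Latin-1 and the like), since in 
UTF-16 or UTF-32 the CR and LF code units are wider than one byte.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF pairs, lone LF and lone CR in ASCII-compatible data."""
    crlf = data.count(b"\r\n")
    # Every CRLF pair contributes one CR and one LF to the raw counts,
    # so subtract the pairs to get the lone occurrences.
    lone_lf = data.count(b"\n") - crlf
    lone_cr = data.count(b"\r") - crlf
    return {"crlf": crlf, "lf": lone_lf, "cr": lone_cr}
```

A file with more than one non-zero count has mixed line endings, which 
is what concatenated or carelessly pasted files would be expected to 
show.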

I can't rely on OS level clues like file type suffix as these could 
be changed or elided.  I assume I need to analyze the file contents, 
even if these may have been mixed by file concatenation or 
undisciplined editors' handling of pastes. And most non-English 
encodings leave me clueless; I generally don't know what I'm looking 
at.

The stream encoding functions can only be used after I've already 
opened a file, so I assume I need to read raw bytes and pattern 
match.  But as many must need the same, I'm hoping not to need to 
re-invent wheels.
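
One simple form of "read raw bytes and pattern match" is trial 
decoding: attempt a strict decode with each candidate encoding and keep 
the first that succeeds. The sketch below is Python, not CCL; the name 
guess_encoding and the default candidate list are assumptions made for 
illustration.

```python
def guess_encoding(data: bytes, candidates=("ascii", "utf-8")):
    """Return the first candidate that decodes data strictly, else None."""
    for name in candidates:
        try:
            data.decode(name)  # strict error handling is the default
        except UnicodeDecodeError:
            continue
        return name
    return None
```

This only rules encodings out. Single-byte encodings such as Latin-1 
accept every byte sequence, so they can never fail a trial decode, and 
telling them apart needs statistical or language-based heuristics.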

Any clarifications or pointers would be most welcome.


