[Openmcl-devel] On file encoding
peter
p2.edoc at googlemail.com
Wed Dec 22 21:00:06 PST 2010
<muddy comprehension alert>
From http://ccl.clozure.com/manual/chapter4.5.html and the sources, I
cannot divine how, given a file, to assess what character encoding it
uses.
In "4.5.4.2. Byte Order Marks" it says "If a byte order mark is
missing from input data, that data is assumed to be in big-endian
order." So presumably any file prefix characters that might give
encoding clues are optional.
I'm coming at this from the perspective of receiving a file of
unknown origin and needing to know what encoding it uses. I then
assume the CCL way is to use UTF-32 internally.
I assume that incoming file character encoding could be anything (the
file being made by unknown other apps), as could line ending
characters.
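As a minimal sketch of the line-ending half of the problem, the raw bytes can simply be tallied for CRLF, lone LF, and lone CR. The code is Python purely for illustration (it is not CCL code), the function name count_line_endings is hypothetical, and it assumes an ASCII-compatible byte encoding; in UTF-16 or UTF-32 the bytes 0x0A and 0x0D can occur inside other code units, so the counts would not be reliable there.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF, lone LF and lone CR terminators in raw bytes.

    Assumes an ASCII-compatible encoding (ASCII, UTF-8, Latin-1, ...).
    """
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,
        # Every CRLF pair contains one LF and one CR, so subtract the pairs.
        "lf": data.count(b"\n") - crlf,
        "cr": data.count(b"\r") - crlf,
    }
```

A file showing more than one nonzero count has probably been concatenated or pasted together from sources with different conventions.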
Separately is the matter of intent, both of the file producer and the
file user.
In make-external-format there's the arg: "domain---This is used to
indicate where the external format is to be used. Its value can be
almost anything." That makes me wonder if what I'm after is fantasy.
I assume my app's user needs to be put in (or choose) some encoding
for the natural-language interface he expects, but from there
on the user has no prior knowledge of incoming file formats. The
incoming files may be being re-purposed.
I'm hoping there is some sort of inverse of
describe-character-encodings: a function that will take a file as
argument and return an analysis of its character encoding, line
termination characters, and default intents (maybe that's pure
fantasy on my part, but isn't every file created with inferred uses
anticipated/intended?).
I can't rely on OS-level clues like the file type suffix, as these
could be changed or elided. I assume I need to analyze the file
contents, even if these may have been mixed by file concatenation or
by undisciplined editors' handling of pastes. And most non-English
encodings leave me clueless; I generally don't know what I'm looking
at.
The stream encoding functions can only be used after I've already
opened a file, so I assume I need to read raw bytes and pattern
match. But as many others must need the same, I'm hoping not to need
to re-invent wheels.
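A minimal sketch of that kind of raw-byte pattern matching follows. It checks for a byte order mark first, then falls back to testing whether the bytes are pure ASCII or valid UTF-8. The code is Python purely for illustration (it is not CCL code), and the names guess_encoding and _BOMS as well as the returned labels are hypothetical. It is a heuristic under those assumptions, not a definitive detector: a missing BOM proves nothing, and legacy 8-bit encodings cannot be told apart this way.

```python
import codecs

# Longer marks first: the UTF-32 LE mark begins with the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def guess_encoding(data: bytes) -> str:
    """Return a best-guess encoding label for the given raw bytes."""
    # 1. A byte order mark, when present, is the strongest clue.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # 2. No BOM: see whether the bytes decode as strict UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Some other 8-bit or multi-byte encoding; contents alone
        # are not enough to say which one.
        return "unknown-8bit"
    # 3. Valid UTF-8 with no byte above 0x7F is plain ASCII.
    return "ascii" if all(b < 0x80 for b in data) else "utf-8"
```

To use it on a file, read the contents (or a leading chunk) in binary mode and pass the bytes in. Note that cutting a chunk in the middle of a multi-byte character would make the UTF-8 check fail, and that a UTF-16 LE file whose first character after the mark is U+0000 would be reported as UTF-32 LE.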
Any clarifications or pointers would be most welcome.