[Openmcl-devel] Unicode issues, esp security
stelian.ionescu-zeus at poste.it
Mon Apr 13 14:18:41 PDT 2009
On Mon, 2009-04-13 at 22:24 +0200, james anderson wrote:
> [ironic in this discussion, is that utf-8b is non-conformant - by
I don't think so. See http://www.unicode.org/versions/Unicode5.1.0/
paragraph E: "in processing the UTF-8 code unit sequence <F0 80 80 41>,
the only requirement on a converter is that the <41> be processed and
correctly interpreted as <U+0041>."
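To make that requirement concrete, here is a minimal sketch in Python (not part of the original thread; Python is used only because its built-in codec error handlers make the behaviour easy to show). It decodes the quoted sequence <F0 80 80 41> with two different error handlers. Both treat the leading octets differently, but both still decode the trailing <41> as U+0041. The `surrogateescape` handler is Python's counterpart of the UTF-8b idea, not the babel implementation discussed here.

```python
# Ill-formed UTF-8 taken from the Unicode text quoted above: <F0 80 80 41>.
data = bytes([0xF0, 0x80, 0x80, 0x41])

# "replace" substitutes U+FFFD for the octets it cannot decode.
replaced = data.decode("utf-8", errors="replace")

# "surrogateescape" maps each undecodable octet 0xNN to the lone
# surrogate U+DCNN, so the original octets stay recoverable.
escaped = data.decode("utf-8", errors="surrogateescape")

# Print code points rather than the strings themselves, because lone
# surrogates cannot be written to a UTF-8 terminal.
print([hex(ord(c)) for c in replaced])
print([hex(ord(c)) for c in escaped])

# Encoding with the same handler restores the original octets.
restored = escaped.encode("utf-8", errors="surrogateescape")
```

In both results the last element is 0x41, which is the only thing the quoted paragraph requires of a converter.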
> On 2009-04-13, at 20:37 , Dan Weinreb wrote:
> > Luis,
> > From two Unicode experts I have consulted come
> > the following comments:
> > See:
> > http://www.unicode.org/reports/tr36/
> > Cases like this, in which an illegal sequence is explicitly
> > transformed into another illegal sequence, would meet with a lot of
> > resistance from folks who care about security.
> > It's important not to do anything outside the definition. Your
> > objection to CODE-CHAR returning NIL is incompatible with the Unicode
> > concept of "Noncharacters". See the Unicode report section 16.7.
> is not 16.7 concerned with unicode interchange? kuhn's proposal, from
> which oliveira's 8b efforts follow, is not.
> it concerns an unambiguous internal representation. in any case,
> kuhn's proposal would also appear to adhere to tr36's
> recommendations, in that it neither deletes the initial invalid byte,
> nor consumes successors.
> one may argue, that the result is not a vector with element type
Perhaps it would be more correct to say that the result is a vector of
characters whose character set is a superset of Unicode.
> one may also argue, that the result should be permissible as input to
> an utf-8b encoding only and any other attempted encoding would be an
> the question remains, should a runtime support efficient decoding of
> this class of data and, if so, how should it do that with convenient,
> efficient operations on the respective internal representation? if
> the answer is "no lisp implementation should," then babel should
> eliminate utf-8b. if the answer is "there should be some way," then -
> particularly in light of the security issues, all implementations
> _should_ behave the same.
There should be some way, and the reason is that not all applications
need to *interpret* the data that they receive. Some need to work with
the data as-is, for example:
*) On most *nix variants, a pathname is just a vector of octets with no
I'd like to be able to list the contents of any directory and be sure
that I can get all the filenames in it without any decoding error,
because I may not know the encoding of the files in it (assuming that
there is one - some people have been known to use the filesystem as a
generic datastore using binary blobs as filenames).
I'd also like to be able to decode such filenames into strings instead
of instances of (simple-array (unsigned-byte 8) (*)).
**) Ideally, an editor should be able to open a file with mixed encoding
and preserve, as-is, any content that the user doesn't explicitly modify.
For example, if a file contains mostly UTF-8 with some EUC-JP
inside and the user modifies only some of the UTF-8 parts, then upon saving
the file the EUC-JP parts should be written back as they were.
All decoders I've seen thus far in CL implementations either signal an
error, which would block the editor from even displaying the file, or
replace non-UTF-8 contents with U+FFFD or #\?, causing loss of data.
UTF-8b works as expected because it deals transparently with malformed
UTF-8 octet sequences and because it outputs strings, which are
preferable to bare (unsigned-byte 32) vectors.
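The two use cases above can be sketched as follows. This is an illustration in Python, not code from the thread or from babel: it assumes Python's `surrogateescape` error handler as a stand-in for UTF-8b, and the sample filename and file contents are made up. It contrasts the three decoder behaviours mentioned (signal an error, replace with U+FFFD, escape the octets) and shows that only the last one writes the untouched EUC-JP octets back unchanged.

```python
# Hypothetical mixed-encoding content: UTF-8 text followed by EUC-JP text.
utf8_part = "naïve ".encode("utf-8")
eucjp_part = "日本語".encode("euc_jp")
raw = utf8_part + eucjp_part

# 1. A strict decoder signals an error, so nothing can be displayed.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 2. A replacing decoder yields a string, but the original octets are lost.
lossy = raw.decode("utf-8", errors="replace")
lossy_roundtrip = lossy.encode("utf-8")

# 3. An escaping decoder yields a string and keeps the octets recoverable.
text = raw.decode("utf-8", errors="surrogateescape")
edited = text.replace("naïve", "simple")  # the user edits only the UTF-8 part
saved = edited.encode("utf-8", errors="surrogateescape")

# The same pair of calls round-trips a filename that is not valid UTF-8.
name_octets = b"report-\xff.txt"  # made-up filename
name = name_octets.decode("utf-8", errors="surrogateescape")
name_roundtrip = name.encode("utf-8", errors="surrogateescape")

print(strict_failed, lossy_roundtrip == raw, saved.endswith(eucjp_part))
```

The design point is that the escaped string is an ordinary string for editing and searching, while encoding it with the same handler reproduces every octet that was not deliberately changed.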
Stelian Ionescu a.k.a. fe[nl]ix
Quidquid latine dictum sit, altum videtur.