[Openmcl-devel] Unicode issues, esp security
Luis Oliveira
luismbo at gmail.com
Mon Apr 13 14:55:46 PDT 2009
Dan Weinreb <dlw at itasoftware.com> writes:
> http://www.unicode.org/reports/tr36/
Thanks for that link.
> Cases like this, in which an illegal sequence is explicitly
> transformed into another illegal sequence, would meet with a lot of
> resistance from folks who care about security.
Assuming you're referring to UTF-8B, it should be pointed out (as James
already did) that it's not specified by Unicode, and I would add that it
certainly isn't meant to be a general-purpose encoding.
James also points out that UTF-8B in fact follows the guidelines put
forward by TR36. That is not very surprising, since UTF-8B was, after
all, proposed by a Unicode expert.
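
For anyone who hasn't seen the scheme in action, here is a rough sketch
using Python's "surrogateescape" error handler, which as far as I know
is modelled on the same idea. This is only an illustration, not CCL's
code or anything from the UTF-8B proposal itself; the variable names
are made up for the example.

```python
# Illustration only: Python's "surrogateescape" handler maps each
# undecodable byte 0x80..0xFF to the code point U+DC80..U+DCFF.
raw = b"abc\xff"  # 0xFF can never appear in well-formed UTF-8

# Decoding keeps the bad byte as a lone low surrogate instead of failing.
decoded = raw.decode("utf-8", errors="surrogateescape")
escaped = [hex(ord(ch)) for ch in decoded if 0xDC80 <= ord(ch) <= 0xDCFF]

# Encoding with the same handler restores the original bytes exactly.
round_tripped = decoded.encode("utf-8", errors="surrogateescape")

# A strict UTF-8 encoder refuses the escaped string, so the invalid
# input cannot silently leak out as if it were valid UTF-8.
try:
    decoded.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

The point is that the invalid byte stays distinguishable from any valid
character and is rejected by a strict encoder, rather than being
quietly turned into something that looks legitimate.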
> It's important not to do anything outside the definition. Your
> objection to CODE-CHAR returning NIL is incompatible with the Unicode
> concept of "Noncharacters". See the Unicode report section 16.7.
Well, that section says that the "Unicode Standard sets aside 66
noncharacter code points", and proceeds to specify them. CCL's CODE-CHAR
returns *non-NIL* for all of those codes -- at least in the oldish
version I have installed. A few comments about that:
1. Though Gary has hinted that he would like CCL to return NIL for
these codes, it's probably a good thing that CODE-CHAR currently
returns non-NIL for noncharacters. In the next paragraph from
that section, the standard says that "applications are free to
use any of these noncharacter code points internally".
2. Surrogate code points are not "noncharacters". The extra code
points used by UTF-8B to represent invalid bytes are a subset of
the surrogate code points. This distinction is probably not very
useful, though.
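
To make the distinction in the two points above concrete, here is a
small Python sketch of the code point ranges involved. The function
names are made up for the example, and the U+DC80..U+DCFF range for
UTF-8B is my reading of the proposal, so treat that part as an
assumption.

```python
def is_noncharacter(cp):
    """True for the 66 noncharacter code points set aside by Unicode."""
    in_block = 0xFDD0 <= cp <= 0xFDEF        # 32 contiguous code points
    plane_end = (cp & 0xFFFE) == 0xFFFE      # U+nFFFE and U+nFFFF, 17 planes
    return 0 <= cp <= 0x10FFFF and (in_block or plane_end)


def is_surrogate(cp):
    """True for the surrogate code points U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def is_utf8b_escape(cp):
    """True for the code points assumed to stand for invalid bytes
    0x80..0xFF under UTF-8B (assumption: U+DC80..U+DCFF)."""
    return 0xDC80 <= cp <= 0xDCFF


noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
escapes = [cp for cp in range(0x110000) if is_utf8b_escape(cp)]
```

Under these definitions the noncharacters number 66, every UTF-8B
escape is a surrogate, and no surrogate is a noncharacter.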
--
Luís Oliveira
http://student.dei.uc.pt/~lmoliv/