[Openmcl-devel] code-char from #xD800 to #xDFFF

Tue Jul 31 10:59:19 PDT 2012

On Jul 31, 2012, at 9:48 AM, peter wrote:

> It seems that code-char returns nil from #xD800 to #xDFFF, otherwise it returns characters from 0 to (- (lsh 1 16) 3). I take it as defined in ccl::fixnum->char.
> 
> <http://www.unicode.org/charts/PDF/UDC00.pdf> and <http://www.unicode.org/charts/PDF/UD800.pdf> say
> "Isolated surrogate code points have no interpretation; consequently, no character code charts or names lists are provided for this range."
> 
> <http://ccl.clozure.com/manual/chapter4.5.html#Unicode> says these codes: "will never be valid character codes and will return NIL for arguments in that range".
> 
> When using CCL to run a dynamic web service, this can be inconvenient when passing material from external sources through CCL to remote browsers (for instance, Japanese Emoji icon characters occupy this code area; sources use them and web browsers render them).
> 
> I cannot understand why CCL should behave as it does in this, but assume there is good reason.

Of course there's a good reason, and you cite it yourself: the Unicode standard says that these are not valid code points.

> I.e., would it not make sense to return a character with the appropriate code value even if CCL has no use for it?

No, it wouldn't.  An invalid code point is as likely to indicate an error as it is to be valid data under some non-standard semantics.

> Is there any efficient strategy which side-steps this issue?

That depends on what you want to do.  If all you want is to pass a bunch of bits through unaltered, you can read the stream as binary, or you can use an encoding without any invalid bytes (e.g. Latin-1).
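
As a minimal sketch of the binary pass-through option, assuming the data lives in files: the following copies octets without ever decoding them into characters, so no code point can be rejected along the way. The names copy-octets and copy-file-octets are illustrative only (they are not CCL functions); everything else is standard Common Lisp.

```lisp
;; Hypothetical helpers, not part of CCL.
(defun copy-octets (in out &key (buffer-size 4096))
  "Copy every octet from binary stream IN to binary stream OUT."
  (let ((buffer (make-array buffer-size :element-type '(unsigned-byte 8))))
    ;; READ-SEQUENCE returns the index of the first element it did
    ;; not fill, so 0 means end of file.
    (loop for end = (read-sequence buffer in)
          while (plusp end)
          do (write-sequence buffer out :end end))))

(defun copy-file-octets (from to)
  "Copy file FROM to file TO as raw octets, with no character decoding."
  (with-open-file (in from :element-type '(unsigned-byte 8))
    (with-open-file (out to :direction :output
                            :element-type '(unsigned-byte 8)
                            :if-exists :supersede)
      (copy-octets in out))))
```

The same copy-octets loop should work on any pair of streams opened with element type (unsigned-byte 8), not just files.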

If you want to pass through character streams with Emoji, the Right Way to do that would be to define a non-standard encoding that knows about Emoji.
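
One thing worth checking first, offered as an assumption about the source data rather than a diagnosis: in UTF-16, any code point above #xFFFF is transmitted as a high surrogate (#xD800 to #xDBFF) followed by a low surrogate (#xDC00 to #xDFFF). If the incoming codes arrive in such pairs, each pair combines into a single valid code point. A sketch of that arithmetic, with the hypothetical name combine-surrogates:

```lisp
;; Hypothetical helper, not part of CCL.
(defun combine-surrogates (high low)
  "Combine a UTF-16 surrogate pair into the code point it encodes."
  (assert (<= #xD800 high #xDBFF) (high) "~X is not a high surrogate" high)
  (assert (<= #xDC00 low #xDFFF) (low) "~X is not a low surrogate" low)
  ;; Each surrogate carries 10 bits; the pair is offset by #x10000.
  (+ #x10000
     (ash (- high #xD800) 10)
     (- low #xDC00)))

;; (combine-surrogates #xD83D #xDE00) => #x1F600
```

An isolated surrogate, one without a partner of the other kind, has no such interpretation and still needs to be treated as invalid.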

> At the moment I am intercepting character codes in this area and replacing them with #\Replacement_Character or #\null, but in so doing I lose the character code.  I would therefore be passing material through CCL such that some characters were eliminated in transit, changing the original meaning/intent of the material.

That's right.  Under many circumstances, this is the Right Thing.  Note that if CCL did not behave as it does, then intercepting invalid code points and treating them as errors would be much harder.
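
A minimal sketch of such an interception, assuming a Unicode-capable implementation (so that code #xFFFD names a character) and using the hypothetical names surrogate-code-p and code->char. The second return value tells the caller that a substitution happened, so the caller can log it or signal an error instead of silently dropping data:

```lisp
;; Hypothetical helpers, not part of CCL.
(defun surrogate-code-p (code)
  "True if CODE lies in the surrogate range #xD800 to #xDFFF."
  (<= #xD800 code #xDFFF))

(defun code->char (code &optional (replacement (code-char #xFFFD)))
  "Return the character for CODE, or REPLACEMENT if CODE is unusable.
The second value is true when the replacement was substituted."
  (let ((char (and (not (surrogate-code-p code))
                   (< -1 code char-code-limit)
                   ;; CODE-CHAR may itself return NIL for other codes
                   ;; the implementation does not map to a character.
                   (code-char code))))
    (if char
        (values char nil)
        (values replacement t))))
```

The explicit range check keeps the behavior the same on implementations whose code-char would return a character object for a surrogate code.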

rg