[Openmcl-devel] code-char from #xD800 to #xDFFF

Tue Jul 31 10:15:17 PDT 2012

peter <p2.edoc at googlemail.com> writes:

> It seems that code-char returns nil from #xD800 to #xDFFF, otherwise
> it returns characters from 0 to (- (lsh 1 16) 3). I take it as defined
> in ccl::fixnum->char.
>
> <http://www.unicode.org/charts/PDF/UDC00.pdf> and
> <http://www.unicode.org/charts/PDF/UD800.pdf> say
> "Isolated surrogate code points have no interpretation; consequently,
> no character code charts or names lists are provided for this range."
>
> <http://ccl.clozure.com/manual/chapter4.5.html#Unicode> says these
> codes: "will never be valid character codes and will return NIL for
> arguments in that range".
>
> When using CCL to run a dynamic web service, this can be inconvenient
> when passing material from external sources through CCL to remote
> browsers (for instance, Japanese Emoji icon characters occupy this
> code area, sources use them and web browsers render them).

Why do they use this code area?  The Unicode standard says that 

   "Isolated surrogate code points have no interpretation; consequently,
    no character code charts or names lists are provided for this
    range."

Therefore those codes should not be used, should not be accepted and
should not be received.

This message:
https://mail.mozilla.org/pipermail/es-discuss/2011-May/014336.html
explains it clearly.
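
One possible reading of the emoji case, offered as an assumption since
the actual data is not shown: the source may be sending UTF-16 code
units, where a character above #xFFFF arrives as a high surrogate
(#xD800 to #xDBFF) followed by a low surrogate (#xDC00 to #xDFFF).
Neither half is a character by itself, but the pair is.  A minimal
sketch of recombining such a pair (combine-surrogates is a hypothetical
helper name, not part of CCL):

```lisp
(defun combine-surrogates (high low)
  "Return the code point encoded by the UTF-16 surrogate pair HIGH, LOW.
Both arguments are integers.  Hypothetical helper, not part of CCL."
  ;; Reject anything that is not a well-formed high/low pair.
  (assert (<= #xD800 high #xDBFF))
  (assert (<= #xDC00 low #xDFFF))
  ;; Each surrogate carries 10 bits; the pair encodes (code - #x10000).
  (+ #x10000
     (ash (- high #xD800) 10)
     (- low #xDC00)))

;; U+1F600 is written in UTF-16 as the pair #xD83D #xDE00.
(format t "~X~%" (combine-surrogates #xD83D #xDE00))
```

The resulting integer lies outside #xD800 to #xDFFF, so it can be
passed to code-char in an implementation whose char-code-limit covers
the full Unicode range.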

There also seem to be a few security concerns:
http://unicode.org/reports/tr36/tr36-8.html

Moreover, section "3.7.4 Interoperability" of that report indicates that
it would be better to use private-use codes to encode those invalid bytes.

So when you receive a code from #xD800 to #xDFFF, you can encode it as
a private-use Unicode character with:

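    ;; Map a surrogate code (#xD800 to #xDFFF) onto a private-use
    ;; character (#xF0800 to #xF0FFF); other codes map as usual.
    (defun code-to-char (code)
      (if (<= #xd800 code #xdfff)
          (code-char (+ code #.(- #xf0800 #xD800)))
          (code-char code)))

    ;; Inverse mapping.  Genuine private-use characters in
    ;; #xF0800 to #xF0FFF also come back as surrogate codes.
    (defun char-to-code (char)
      (let ((code (char-code char)))
        (if (<= #xf0800 code #xf0fff)
            (- code #.(- #xf0800 #xD800))
            code)))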

    (format t "~X~%" (char-to-code (code-to-char #xd912)))
    D912
    --> NIL

> Is there any efficient strategy which side-steps this issue?

HTTP transfers message bodies as octets, so you should not need to use
characters there anyway.  You could just keep the data in binary form.
Do you really need to convert to characters?  If you insist on sending
invalid codes, you can always do that in a binary stream.
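
As a rough sketch of what keeping the data in binary form could look
like, the function below copies octets from one binary stream to
another without ever decoding them into characters.  The name
copy-octets and the buffer size are assumptions made for illustration;
both streams are assumed to have been opened with element type
(unsigned-byte 8).

```lisp
(defun copy-octets (in out &key (buffer-size 4096))
  "Copy all remaining octets from binary stream IN to binary stream OUT.
Returns the number of octets copied.  Hypothetical helper."
  (let ((buffer (make-array buffer-size :element-type '(unsigned-byte 8)))
        (total 0))
    (loop
      ;; READ-SEQUENCE returns the index of the first element it did
      ;; not fill, so zero means end of file.
      (let ((end (read-sequence buffer in)))
        (when (zerop end)
          (return total))
        (write-sequence buffer out :end end)
        (incf total end)))))
```

Since no character decoding takes place, octet sequences that would
decode to codes in the #xD800 to #xDFFF range pass through unchanged.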

> At the moment I am intercepting character codes in this area and
> replacing them with #\Replacement_Character or #\null, but in so doing
> losing the character code.  Hence I would be passing material through
> CCL such that some characters were eliminated in transit, hence
> changing the original meaning/intent of the material.

I think CCL does well to follow the standard strictly.

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
A bad day in () is better than a good day in {}.