[Openmcl-devel] code-char from #xD800 to #xDFFF

Tue Jul 31 10:45:00 PDT 2012

On Jul 31, 2012, at 12:48 PM, peter <p2.edoc at googlemail.com> wrote:

> It seems that code-char returns nil from #xD800 to #xDFFF, otherwise it returns characters from 0 to (- (lsh 1 16) 3). I take it as defined in ccl::fixnum->char.
> 
> <http://www.unicode.org/charts/PDF/UDC00.pdf> and <http://www.unicode.org/charts/PDF/UD800.pdf> say
> "Isolated surrogate code points have no interpretation; consequently, no character code charts or names lists are provided for this range."
> 
> <http://ccl.clozure.com/manual/chapter4.5.html#Unicode> says these codes: "will never be valid character codes and will return NIL for arguments in that range".
> 
> When using CCL to run a dynamic web service, this can be inconvenient when passing material from external sources through CCL to remote browsers (for instance, Japanese Emoji icon characters occupy this code area, sources use them and web browsers render them).

According to Mac OS X's Character Viewer, the emoji symbol "smiling face with smiling eyes" (😊) has the code point u+1f60a.  To encode this character in UTF-16, we must use surrogates, and would end up with (u+d83d u+de0a).
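To make the arithmetic concrete, here is a minimal sketch of how a code point above u+ffff is split into a surrogate pair.  It is written in Python rather than Lisp only so that it doesn't depend on any implementation's character type; the function name to_surrogate_pair is made up for illustration.

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF are encoded with surrogates")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogate_pair(0x1F60A)
print(f"u+{high:04x} u+{low:04x}")     # prints: u+d83d u+de0a
```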

UTF-16 is actually a variable-length encoding, so maybe this is what you're seeing with the emoji characters.  You have some UTF-16 encoded data.  The emoji characters don't actually occupy the surrogate region, but perhaps you're processing the UTF-16 data as if it were a fixed-length 16-bit encoding.  Many characters fit into 16 bits, but some don't.  You encounter a 16-bit unit that is half of a surrogate pair, try to treat it as a character code, and code-char returns NIL.
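A rough illustration of that failure mode, again as a Python sketch and not anything CCL-specific: the same six bytes are read once as fixed-width 16-bit units and once through a real UTF-16 decoder.  The sample data is invented for the example.

```python
import struct

# 'a' followed by the smiling-face emoji, encoded as little-endian UTF-16 (6 bytes).
data = "a\U0001F60A".encode("utf-16-le")

# Naive reading: assume every 16-bit unit is a complete code point.
units = struct.unpack("<%dH" % (len(data) // 2), data)
isolated = [u for u in units if 0xD800 <= u <= 0xDFFF]

# Proper reading: a UTF-16 decoder combines each surrogate pair into one code point.
code_points = [ord(c) for c in data.decode("utf-16-le")]

print([hex(u) for u in units])         # three 16-bit units
print([hex(u) for u in isolated])      # two of them are surrogates
print([hex(c) for c in code_points])   # but there are only two characters
```

The naive reading finds two "codes" in the surrogate range; the decoder never produces them, because the pair is only an encoding detail of one character.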

CCL's character type corresponds to a code point.  Other systems, like Cocoa, have a type called unichar, which is a 16-bit code unit.  So, if we have that smiling face emoji above in an NSString s and call [s characterAtIndex:0], what we get is u+d83d.  That's half of a surrogate pair, not a character, of course.
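If all you have is a sequence of 16-bit units like that, the two halves have to be recombined before you have anything that names a character.  A small Python sketch of the recombination follows; from_surrogate_pair is a hypothetical helper, not a function in any library.

```python
def from_surrogate_pair(high, low):
    """Combine a high and a low UTF-16 surrogate into the code point they encode."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("expected a high surrogate followed by a low surrogate")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = from_surrogate_pair(0xD83D, 0xDE0A)
print(f"u+{code_point:x}")             # prints: u+1f60a
```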

> I cannot understand why CCL should behave as it does in this, but assume there is good reason. I.e., would it not make sense to return a character with the appropriate code value even if CCL has no use for it?

Again, a character corresponds to a Unicode code point.  If the data you're trying to interpret as characters is just binary junk, then there's a good chance that there's not going to be a one-to-one mapping between the junk and lisp characters.

> Is there any efficient strategy which side-steps this issue?
> At the moment I am intercepting character codes in this area and replacing them with #\Replacement_Character or #\null, but in so doing losing the character code.  Hence I would be passing material through CCL such that some characters were eliminated in transit, hence changing the original meaning/intent of the material.

If you're dealing with binary data, you should operate on bytes. If you are dealing with characters, make sure that you're using the correct external format.  For instance, if you know you've got utf-16 data, open your stream with :external-format :utf-16, and you'll get appropriately decoded characters (like #\u+1f60a) instead of isolated surrogates on which CCL will look with disdain.  If you know that a certain range of your binary data contains a string, use decode-string-from-octets (with an appropriate external-format argument) to get it.
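The same approach, sketched in Python for illustration: keep the data as bytes and decode only the range known to hold text.  The record layout here (a 2-byte big-endian length followed by that many bytes of UTF-16BE text) and the name read_string are assumptions invented for the example.

```python
import struct


def read_string(buf, offset=0):
    """Decode one length-prefixed UTF-16BE string from buf.

    Returns the decoded string and the offset just past it.
    """
    (nbytes,) = struct.unpack_from(">H", buf, offset)
    start = offset + 2
    text = buf[start:start + nbytes].decode("utf-16-be")
    return text, start + nbytes


# One record holding the emoji (4 bytes of text), followed by unrelated binary data.
record = struct.pack(">H", 4) + bytes([0xD8, 0x3D, 0xDE, 0x0A]) + b"\xff\x00"
text, next_offset = read_string(record)
print(hex(ord(text)), next_offset)     # prints: 0x1f60a 6
```

A strict decoder used this way yields whole characters or signals an error; it never hands back half of a surrogate pair.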