[Openmcl-devel] %ioblock-read-u16-encoded-char

Wed Apr 25 01:33:00 PDT 2007

On Wed, 25 Apr 2007, Takehiko Abe wrote:

> Gary Byers wrote:
>
>> My first reaction is that IOBLOCK-LITERAL-CHAR-CODE-LIMIT - which is
>> inherited from a similarly-named field in the CHARACTER-ENCODING -
>> should be #xd800 for all variants of UTF-16 (e.g.,
>> CHARACTER-ENCODING-LITERAL-CHAR-CODE-LIMIT should be set to #xd800 in
>> the UTF-16 CHARACTER-ENCODINGs.)
>>
>> What I called the "literal-char-code-limit" is supposed to be the
>> exclusive upper bound on char-codes that can be passed through without
>> translation.  (The result of calling the decode/encode function should
>> be the same as the result of directly using the argument in those cases.)
>
> ioblock-literal-char-code-limit is also used by the writer -
> %ioblock-write-u16-encoded-char, and the current value of #x10000
> is correct for it.
>
> Dropping the test (in %ioblock-read-u16-encoded-char) should be ok
> if we can assume that there is no other encoding that may use
> %ioblock-read-u16-encoded-char (I cannot think of any).
>
> regards,
> T.
>
>>
>>
>> (I haven't looked at this yet and might be misremembering something;
>> it's clearly the case that we want to do translation on some code
>> units >= #xd800 when decoding UTF-16, but I'm not 100% sure that the
>> parenthesised assertion above is correct.)
>>
>

I looked at the code a bit, and I believe that I can be a little more
confident in saying the following:

- if those tests against the "literal-code-limit" (perhaps a better
   name would be "verbatim-code-limit") were all correct, the fast
   path taken when the test succeeds should have exactly the same
   effect as the slow path taken when the test fails.

   (In other words, the whole purpose of the test is to save a FUNCALL
   to the encode or decode function.  In at least some cases, saving
   that FUNCALL at that point sped up repeated calls to READ-CHAR and/or
   WRITE-CHAR measurably; how much depends on the data being read/written,
   but it's desirable for READ-CHAR and WRITE-CHAR to have as few moving
   parts as possible.)

- You're absolutely correct in noting that %IOBLOCK-READ-U16-ENCODED-CHAR
   should not be using a literal/verbatim code limit of #x10000 when
   reading UTF-16-encoded text.  Since the test is optional, working
   around the bug by removing the test from that function would not
   affect the correctness of any character-encoding that uses it (and
   it's certainly the case that the only -implemented- encodings that
   use that function in OpenMCL are the various flavors of UTF-16 and
   UCS-2.  I'm not aware of any other widely used encoding schemes that
   use 16-bit code units.)

   Removing the test might have a (probably small, possibly noticeable)
   effect on performance, and the real problem is that the value being
   tested against is incorrect for UTF-16 input.

- I think that your observation that that value (#x10000) is OK for
   encoding UTF-16 output is also correct.  That sort of suggests
   that character-encodings need to define distinct "verbatim" limits for
   the input and output cases and that streams which use these encodings
   should inherit distinct limit values.  (The "distinct" values would
   be the same for every case that I can think of except for UTF-16.)
   The "right" fix might therefore involve changing some fields in
   the structures that define CHARACTER-ENCODINGs and stream internals
   (IOBLOCKs and methods/functions that operate on them.)  That probably
   involves some bootstrapping issues and should probably wait for the
   next set of snapshots.

- The problem with the #x10000 UTF-16 limit is that it's too lenient
   for input (causes us to miss surrogate pairs.  I swear that this
   stuff worked at some point, but it was also a constantly moving
   target.)  If we change UTF-16 encodings to specify the stricter
   limit of #xd800, input would behave correctly.  Until separate
   limits are introduced, UTF-16 output functions would also use
   that shared limit; a limit of #xd800 would be stricter than
   necessary and would mean that an extra funcall was done on each
   WRITE-CHAR of a character whose code was in the range #xe000-#xffff.
   Whether this would affect performance depends on how often characters
   with codes in that range are written; all other things being
   equal, it shouldn't affect correctness.

- I'm tempted to prefer the short-term fix of changing the limits
   for the UTF-16 encodings to #xd800 over the short-term fix of
   removing the test in %IOBLOCK-READ-U16-ENCODED-CHAR, since I
   think that the test would probably want to be put back once
   the issues of separate encode/decode limits are unscrambled.

   If you've already removed the test, then that should also work
   around the bug and shouldn't affect correctness.
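
To make the input-side problem concrete, here is a minimal sketch of
the kind of test being discussed.  It is not the OpenMCL code: the
function names DECODE-SURROGATES and DECODE-U16-UNITS are made up for
illustration, the input is a plain vector of 16-bit code units rather
than an IOBLOCK, results are integer code points rather than
characters, and an unpaired surrogate is simply passed through (a real
decoder would probably do something else with it).

```lisp
;;; Hypothetical sketch; none of these names are OpenMCL internals.

(defun decode-surrogates (unit units next-index)
  "Slow path.  Return the code point that starts at UNIT and, as a
second value, the number of additional code units consumed (0 or 1)."
  (if (and (<= #xd800 unit #xdbff)
           (< next-index (length units))
           (<= #xdc00 (aref units next-index) #xdfff))
      ;; High surrogate followed by a low surrogate: combine them.
      (values (+ #x10000
                 (ash (- unit #xd800) 10)
                 (- (aref units next-index) #xdc00))
              1)
      ;; Anything else is passed through unchanged (a simplification).
      (values unit 0)))

(defun decode-u16-units (units literal-limit)
  "Decode the vector UNITS of 16-bit code units into a list of code
points.  Units below LITERAL-LIMIT are returned verbatim; all others
go through DECODE-SURROGATES."
  (let ((i 0)
        (n (length units))
        (result '()))
    (loop while (< i n)
          do (let ((unit (aref units i)))
               (incf i)
               (push (if (< unit literal-limit)
                         unit           ; verbatim, no call needed
                         (multiple-value-bind (code consumed)
                             (decode-surrogates unit units i)
                           (incf i consumed)
                           code))
                     result)))
    (nreverse result)))
```

With the units for "A", U+1D11E and U+00E9, i.e. #(#x41 #xd834 #xdd1e
#xe9), a limit of #x10000 returns the two surrogate halves untouched,
while a limit of #xd800 sends them through the slow path and yields
#x1d11e.  Units in #xe000-#xffff decode to the same value either way;
they just cost the extra call when the limit is #xd800.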

[I still haven't tried/tested this and may still be missing something,
but I'm basing this on a reading of the code and I -think- that the
above is essentially correct.]
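
For what it's worth, the "separate limits" idea could look roughly
like the sketch below.  Everything here is hypothetical: the
ENCODING-SKETCH structure and its slot names are not the fields of the
real CHARACTER-ENCODING, and ENCODE-CODE-POINT works on integer code
points and returns a list of code units instead of writing to a
stream.  It also assumes that no character has a code in the surrogate
range, which is what makes #x10000 safe on output.

```lisp
;;; Hypothetical sketch; slot and function names are illustrative only.

(defstruct (encoding-sketch (:conc-name es-))
  name
  (decode-literal-limit 0 :type fixnum)  ; verbatim limit for input
  (encode-literal-limit 0 :type fixnum)) ; verbatim limit for output

;; Distinct limits: strict on input, lenient on output.
(defparameter *utf-16-separate*
  (make-encoding-sketch :name :utf-16
                        :decode-literal-limit #xd800
                        :encode-literal-limit #x10000))

;; The short-term fix: one shared value of #xd800 for both directions.
(defparameter *utf-16-shared*
  (make-encoding-sketch :name :utf-16
                        :decode-literal-limit #xd800
                        :encode-literal-limit #xd800))

(defun encode-code-point (code encoding)
  "Return the list of 16-bit code units for CODE and, as a second
value, T if the slow path was taken."
  (if (< code (es-encode-literal-limit encoding))
      (values (list code) nil)          ; verbatim
      (values (if (< code #x10000)
                  (list code)           ; slow path, same result
                  (let ((offset (- code #x10000)))
                    (list (+ #xd800 (ash offset -10))
                          (+ #xdc00 (logand offset #x3ff)))))
              t)))
```

Encoding #xe000 produces the single unit #xe000 with either structure;
the only difference is that *UTF-16-SHARED* takes the slow path to get
there, which is the extra funcall mentioned above.  Code points at or
above #x10000 take the slow path in both cases and come out as a
surrogate pair.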