[Openmcl-devel] %ioblock-read-u16-encoded-char
Gary Byers
gb at clozure.com
Wed Apr 25 01:33:00 PDT 2007
On Wed, 25 Apr 2007, Takehiko Abe wrote:
> Gary Byers wrote:
>
>> My first reaction is that IOBLOCK-LITERAL-CHAR-CODE-LIMIT - which is
>> inherited from a similarly-named field in the CHARACTER-ENCODING -
>> should be #xd800 for all variants of UTF-16 (e.g.,
>> CHARACTER-ENCODING-LITERAL-CHAR-CODE-LIMIT should be set to #xd800 in
>> the UTF-16 CHARACTER-ENCODINGs.)
>>
>> What I called the "literal-char-code-limit" is supposed to be the
>> exclusive upper bound on char-codes that can be passed through without
>> translation. (The result of calling the decode/encode function should
>> be the same as the result of directly using the argument in those cases.)
>
> ioblock-literal-char-code-limit is also used by the writer -
> %ioblock-write-u16-encoded-char, and the current value of #x10000
> is correct for it.
>
> Dropping the test (in %ioblock-read-u16-encoded-char) should be ok
> if we can assume that there is no other encoding that may use
> %ioblock-read-u16-encoded-char (I cannot think of any).
>
> regards,
> T.
>
>>
>>
>> (I haven't looked at this yet and might be misremembering something;
>> it's clearly the case that we want to do translation on some code
>> units >= #xd800 when decoding UTF-16, but I'm not 100% sure that the
>> parenthesised assertion above is correct.)
>>
>
I looked at the code a bit, and I believe that I can be a little more
confident in saying the following:
- if those tests against the "literal-code-limit" (perhaps a better
  name would be "verbatim-code-limit") were all correct, the effect
  of the fast path taken when the test succeeds should be exactly
  the same as the effect of the slow path taken when it fails.
(In other words, the whole purpose of the test is to save a FUNCALL
to the encode or decode function. In at least some cases, saving
that FUNCALL at that point sped up repeated calls to READ-CHAR and/or
WRITE-CHAR measurably; how much depends on the data being read/written,
but it's desirable for READ-CHAR and WRITE-CHAR to have as few moving
parts as possible.)
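A minimal sketch (in Python rather than OpenMCL's Lisp) of the fast-path
test being described: READ-CHAR can skip the FUNCALL to the decode
function whenever the incoming code unit is below the "literal"/"verbatim"
limit. The names here (read_u16_encoded_char, literal_limit, utf16_decode)
mirror the discussion but are illustrative, not OpenMCL's actual interface.

```python
def utf16_decode(unit, next_unit):
    """Slow path: combine a high/low surrogate pair into one character."""
    low = next_unit()
    return chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))

def read_u16_encoded_char(next_unit, decode_fn, literal_limit):
    """Read one character from a stream of 16-bit code units."""
    unit = next_unit()
    if unit < literal_limit:
        # Fast path: the code unit IS the char code; no funcall needed.
        return chr(unit)
    # Slow path: let the encoding-specific decode function do the work.
    return decode_fn(unit, next_unit)
```

With a limit of #xd800, 'A' (#x41) takes the fast path while the
surrogate pair #xd835/#xdd38 is correctly combined into U+1D538.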
- You're absolutely correct in noting that %IOBLOCK-READ-U16-ENCODED-CHAR
should not be using a literal/verbatim code limit of #x10000 when
reading UTF-16-encoded text. Since the test is optional, working
around the bug by removing the test from that function would not
affect the correctness of any character-encoding that uses it (and
it's certainly the case that the only -implemented- encodings that
use that function in OpenMCL are the various flavors of UTF-16 and
UCS-2. I'm not aware of any other widely used encoding schemes that
use 16-bit code units.)
Removing the test might have a (probably small, possibly noticeable)
effect on performance, but the real problem is that the value being
tested against is incorrect for UTF-16 input.
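A little arithmetic (in Python, as a hedged illustration of the general
UTF-16 rules rather than of OpenMCL's code) shows why #x10000 is too
lenient on the input side: a high surrogate is itself below #x10000, so
the fast path would pass it through as a bogus character instead of
consuming its low-surrogate partner.

```python
# Surrogate pair for U+1D538 (MATHEMATICAL DOUBLE-STRUCK CAPITAL A).
hi, lo = 0xD835, 0xDD38

assert hi < 0x10000        # passes the too-lenient test: taken verbatim
assert not (hi < 0xD800)   # fails the correct test: decode fn is called

# The decode step the too-lenient fast path would have skipped:
code = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
assert code == 0x1D538
```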
- I think that your observation that that value (#x10000) is OK for
encoding UTF-16 output is also correct.  That sort of suggests
that character-encodings need to define distinct "verbatim" limits for
the input and output cases and that streams which use these encodings
should inherit distinct limit values. (The "distinct" values would
be the same for every case that I can think of except for UTF-16.)
The "right" fix might therefore involve changing some fields in
the structures that define CHARACTER-ENCODINGs and stream internals
(IOBLOCKs and methods/functions that operate on them.) That probably
involves some bootstrapping issues and should probably wait for the
next set of snapshots.
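The "distinct limits" idea could be sketched (again in Python, with
hypothetical field names that are not OpenMCL's) as a character-encoding
record carrying separate verbatim bounds for the decode and encode
directions; UTF-16 is the one case where the two would differ.

```python
from dataclasses import dataclass

@dataclass
class CharacterEncoding:
    """Hypothetical encoding record with per-direction verbatim limits."""
    name: str
    decode_literal_limit: int  # exclusive bound for verbatim reads
    encode_literal_limit: int  # exclusive bound for verbatim writes

# UTF-16: reads must stop at the surrogate range, writes need not.
utf16 = CharacterEncoding("utf-16",
                          decode_literal_limit=0xD800,
                          encode_literal_limit=0x10000)
```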
- The problem with the #x10000 UTF-16 limit is that it's too lenient
for input (causes us to miss surrogate pairs. I swear that this
stuff worked at some point, but it was also a constantly moving
target.) If we change UTF-16 encodings to specify the stricter
limit of #xd800, input would behave correctly. Until separate
limits are introduced, UTF-16 output functions would also use
that shared limit; a limit of #xd800 would be stricter than
necessary and would mean that an extra funcall was done on each
WRITE-CHAR of a character whose code is in the #xe000-#xffff range.
Whether this would affect performance depends on how often characters
with codes in that range are written; all other things being
equal, it shouldn't affect correctness.
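To see why the shared #xd800 limit is merely slower (not wrong) on the
output side, here is a sketch of the UTF-16 encode step in Python; any
BMP character outside the surrogate range encodes as a single verbatim
code unit, so forcing #xe000-#xffff through the encode function just
recomputes what the fast path would have produced. The function name is
illustrative, not OpenMCL's.

```python
def utf16_encode(code):
    """Encode one char code as a list of 16-bit UTF-16 code units."""
    if code < 0x10000:
        # One verbatim unit -- including the #xe000-#xffff range.
        return [code]
    code -= 0x10000
    return [0xD800 + (code >> 10), 0xDC00 + (code & 0x3FF)]
```

So utf16_encode(#xfffd) is just [#xfffd]: the same result the verbatim
path would give, at the cost of the funcall.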
- I'm tempted to prefer the short-term fix of changing the limits
for the UTF-16 encodings to #xd800 over the short-term fix of
removing the test in %IOBLOCK-READ-U16-ENCODED-CHAR, since I
think that the test would probably want to be put back once
the issues of separate encode/decode limits are unscrambled.
If you've already removed the test, then that should also work
around the bug and shouldn't affect correctness.
[I still haven't tried/tested this and may still be missing something,
but I'm basing this on a reading of the code and I -think- that the
above is essentially correct.]