[Openmcl-devel] how many angels can dance on a unicode character?
Gary Byers
gb at clozure.com
Sat Apr 21 12:48:22 PDT 2007
On Sat, 21 Apr 2007, Takehiko Abe wrote:
> Agh. Gary, you are too fast...
No; I spent most of yesterday writing replies to these messages.
I need to learn to write more quickly.
>
> A quick response..
>
>> If I wanted to exchange the first and last characters in that
>> string, I might use something (stupid) like:
>>
>> (defun exchange-first-and-last-characters (string)
>>   (let* ((len (length string)))
>>     (when (> len 1)
>>       (let* ((temp (char string (1- len))))
>>         (setf (char string (1- len)) (char string 0)
>>               (char string 0) temp)))
>>     string))
>
> You win. UTF-16 version would be hairy.
> But this isn't fair because you don't do this in practice.
(defun copy-string (source &optional (len (length source)))
  (let* ((dest (make-string len)))
    (dotimes (i len dest)
      (setf (schar dest i) (char source i)))))
How many code-units should (MAKE-STRING len) allocate ? If it
didn't allocate enough, should (SETF SCHAR) allocate more ?
(BTW, something like this happens all the time in the reader/INTERN:
we READ-CHAR constituent characters into a temporary string, call
a low-level variant of FIND-SYMBOL on the characters in the temporary
string, and call INTERN on a copy if FIND-SYMBOL failed.)
You might be right in observing that destructive operations on strings
- other than in cases involving initialization - are uncommon. As a
rule of thumb, assignment (of anything) tends to happen about 1/3 as
often as reference in most programs, and since many strings are in
fact immutable (pnames, program literals) destructive modification of
strings is probably a less common operation than destructive
operations on other sequence types are. It's a little hard to measure
how common it is, since (SETF SCHAR) is a trivially open-coded operation.
>
> But if I really need it I'm sure I'll write it. And once I
> have exchange-first-and-last-characters I'll never have
> to look back.
>
Please send me the code for (SETF CHAR) that works on variable-width
encodings like UTF-16 and UTF-8 when you get that written. I keep
thinking that it's hard to handle the general case without
consing/copying, but I might be missing something.
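To make the difficulty concrete, here is a minimal sketch of replacing one character in a UTF-16 code-unit vector. It is not OpenMCL code: the names UNIT-WIDTH-AT, ENCODE-UTF-16 and UTF-16-REPLACE-CHAR are hypothetical, and code points are plain integers so the sketch does not depend on any implementation's CHAR-CODE-LIMIT. It needs a linear scan to find the character and conses a fresh vector whenever it is called, which is the cost being discussed.

```lisp
;; Hypothetical helpers for illustration only; not part of OpenMCL.

(defun unit-width-at (units pos)
  "Number of code units (1 or 2) used by the character starting at POS."
  (let ((u (aref units pos)))
    (if (and (<= #xD800 u #xDBFF)
             (< (1+ pos) (length units))
             (<= #xDC00 (aref units (1+ pos)) #xDFFF))
        2
        1)))

(defun encode-utf-16 (code)
  "Return the UTF-16 code units for the integer code point CODE as a list."
  (if (< code #x10000)
      (list code)
      (let ((offset (- code #x10000)))
        (list (+ #xD800 (ash offset -10))
              (+ #xDC00 (logand offset #x3FF))))))

(defun utf-16-replace-char (units index code)
  "Return a fresh code-unit vector like UNITS with character INDEX
replaced by CODE.  No bounds checking beyond what AREF provides."
  (let ((pos 0))
    ;; Step 1: linear scan; character INDEX is not at a fixed offset.
    (dotimes (i index)
      (incf pos (unit-width-at units pos)))
    ;; Step 2: the old and new encodings may differ in width, so
    ;; splice the new code units into a newly allocated vector.
    (let ((old-width (unit-width-at units pos)))
      (concatenate '(vector (unsigned-byte 16))
                   (subseq units 0 pos)
                   (encode-utf-16 code)
                   (subseq units (+ pos old-width))))))
```

Replacing the second character of the three code units #x41 #x42 #x43 with code point #x12345 yields four code units, so the result cannot share storage with the original unless the original was over-allocated.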
>> Suppose we were to instead say that - formally or not - these 16-bit
>> strings were really UTF-16-encoded; we could allow the use of
>> surrogate pairs inside 16-bit strings. If we did this "informally",
>> functions like SCHAR would either return true CHARACTER objects or the
>> high or low half of a surrogate pair. Since we aren't inventing a new
>> language, the values returned by CHAR and SCHAR would have to be
>> CHARACTERs,
>
> Yes, but the CL standard does not say what CHARACTERS are
> other than the standard characters.
Is #\u+12345 a character ? Unless we restrict ourselves to the BMP,
I'd say "yes."
About the only real definition of what a CHARACTER is is "an object
that you can put in a STRING and subsequently access." A STRING is a
VECTOR whose elements are guaranteed to be CHARACTERs. STANDARD-CHARs
are CHARACTERs, and there are tens of thousands of other things out
there in the world that we'd like to be able to treat as CHARACTERs.
If a STRING is a vector specialized to hold any CHARACTER, then
(SETF (CHAR S I) C) should work for any legal values of S, I, and C;
a subsequent (CHAR S I) should return C.
A UTF-16 encoded STRING containing the character #\u+12345 would
contain the code units:
#xd808 #xdf45
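Those two code units follow from the standard UTF-16 surrogate arithmetic. A small sketch (the function name UTF-16-ENCODE-CODE-POINT is made up for this illustration, and it works on integer code points rather than CHARACTER objects):

```lisp
(defun utf-16-encode-code-point (code)
  "Return the UTF-16 code units for CODE as a list of one or two integers."
  (check-type code (integer 0 #x10FFFF))
  (if (< code #x10000)
      ;; BMP code points are a single code unit.
      (list code)
      ;; Otherwise subtract #x10000 and split the remaining 20 bits:
      ;; the top 10 go in the high surrogate, the low 10 in the low one.
      (let ((offset (- code #x10000)))
        (list (+ #xD800 (ash offset -10))
              (+ #xDC00 (logand offset #x3FF))))))

;; (utf-16-encode-code-point #x12345) returns the list (#xD808 #xDF45).
```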
There are two ways of looking at this that I can think of:
1) The length of that string is 1; calling (AREF/ELT/CHAR/SCHAR s 0)
returns #\u+12345.
2) The length of that string is 2; calling (AREF/ELT/CHAR/SCHAR s 0)
returns #\u+d808 and accessing the second element returns #\u+df45.
(1) has the property that STRINGs are objects that can contain any
CHARACTER supported by the implementation. (2) does not have
this property.
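The difference between the two views shows up as soon as you ask for a length. A rough sketch, assuming the string is held as a vector of 16-bit code units (UTF-16-CHARACTER-COUNT and the two predicates are hypothetical names, not existing functions):

```lisp
(defun high-surrogate-p (unit) (<= #xD800 unit #xDBFF))
(defun low-surrogate-p (unit) (<= #xDC00 unit #xDFFF))

(defun utf-16-character-count (units)
  "View (1): count characters in the code-unit vector UNITS, treating a
well-formed surrogate pair as one character.  View (2) is just LENGTH."
  (let ((count 0)
        (i 0)
        (n (length units)))
    (loop while (< i n)
          do (if (and (high-surrogate-p (aref units i))
                      (< (1+ i) n)
                      (low-surrogate-p (aref units (1+ i))))
                 (incf i 2)             ; pair: skip both code units
                 (incf i 1))            ; single code unit
             (incf count))
    count))
```

For the two code units #xd808 #xdf45 this returns 1, while LENGTH on the same vector returns 2. Under view (1) the count is a linear scan rather than a header lookup.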
If you're advocating (2), I don't think that you're allowing
#\u+12345 to be a CHARACTER, and you're effectively saying that
CHAR-CODE-LIMIT is no greater than #x10000. (Yes, of course you
can put the sequence of "characters" #\u+d808 and #\u+df45 in
a "string" yourself, BLT that string to somewhere where some
flavor of #_DrawUTF16String can see it, and if you have the
right font installed you might see the (cuneiform, as it happens)
glyph for #\u+12345 on the screen.)
You can't (under (2)) do things like:
(defun cuneiform-p (c)
  (and (>= (char-code c) #x12000)
       (< (char-code c) #x12474)))
(defun string-contains-cuneiform-p (s)
  (not (null (position-if #'cuneiform-p s))))
but of course that's a moot point, because under (2) you can't really
allow anything with a CHAR-CODE that doesn't fit in 16 bits.
(1) would allow arbitrary Unicode characters to be encoded in UTF-16
strings (I think that we all agree that UTF-16 can encode arbitrary
Unicode characters). Relative to the current implementation, it
means that WITH-UTF-16-STRING could be a fairly cheap BLT operation
(rather than the "time/space tradeoff" involved in encode/decode),
but that the complexity of encode/decode would be passed to MAKE-ARRAY and
MAKE-STRING and AREF and SCHAR and REPLACE and LENGTH and dozens of
other CL functions. That seems completely backwards to me.
Paying more in space (32-bit internal representation) to save time
(unit-cost operations) isn't free either. You could pay less of
a space cost (24-bit internal representation) and more of a time
cost (a few extra loads and/or shifts per SCHAR), and that might have been
somewhat less drastic than introducing a 4x increase in string memory
size has been.
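For a sense of what "a few extra loads and/or shifts" would look like, here is a sketch of a 24-bit packed layout built on an octet vector. It is purely illustrative and not how OpenMCL stores strings; MAKE-PACKED-STRING and PACKED-CHAR-CODE are hypothetical names, and code points are stored as integers, low byte first.

```lisp
(defun make-packed-string (length)
  "Allocate storage for LENGTH characters at 3 octets per character."
  (make-array (* 3 length) :element-type '(unsigned-byte 8)
                           :initial-element 0))

(defun packed-char-code (packed index)
  "Read character INDEX: three loads, two shifts, one LOGIOR."
  (let ((base (* 3 index)))
    (logior (aref packed base)
            (ash (aref packed (+ base 1)) 8)
            (ash (aref packed (+ base 2)) 16))))

(defun (setf packed-char-code) (code packed index)
  "Store the 24-bit integer CODE as character INDEX."
  (let ((base (* 3 index)))
    (setf (aref packed base)       (ldb (byte 8 0) code)
          (aref packed (+ base 1)) (ldb (byte 8 8) code)
          (aref packed (+ base 2)) (ldb (byte 8 16) code))
    code))
```

Access stays fixed-width, so (SETF PACKED-CHAR-CODE) never has to move or reallocate anything; the price is three loads per reference instead of one, against a 3x rather than 4x space cost relative to 8-bit strings.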