[Openmcl-devel] how many angels can dance on a unicode character?

Gary Byers gb at clozure.com
Sat Apr 21 12:48:22 PDT 2007



On Sat, 21 Apr 2007, Takehiko Abe wrote:

> Agh. Gary, you are too fast...

No; I spent most of yesterday writing replies to these messages.
I need to learn to write more quickly.


>
> A quick response..
>
>> If I wanted to exchange the first and last characters in that
>> string, I might use something (stupid) like:
>>
>> (defun exchange-first-and-last-characters (string)
>>    (let* ((len (length string)))
>>      (when (> len 1)
>>        (let* ((temp (char string (1- len))))
>>          (setf (char string (1- len)) (char string 0)
>>                (char string 0) temp)))
>>       string))
>
> You win. UTF-16 version would be hairy.
> But this isn't fair because you don't do this in practice.

(defun copy-string (source &optional (len (length source)))
  (let* ((dest (make-string len)))
    (dotimes (i len dest)
      (setf (schar dest i) (char source i)))))

How many code-units should (MAKE-STRING len) allocate ?  If it
didn't allocate enough, should (SETF SCHAR) allocate more ?

(BTW, something like this happens all the time in the reader/INTERN:
we READ-CHAR constituent characters into a temporary string, call
a low-level variant of FIND-SYMBOL on the characters in the temporary
string, and call INTERN on a copy if FIND-SYMBOL failed.)

You might be right in observing that destructive operations on strings
- other than in cases involving initialization - are uncommon.  As a
rule of thumb, assignment (of anything) tends to happen about 1/3 as
often as reference in most programs, and since many strings are in
fact immutable (pnames, program literals) destructive modification of
strings is probably a less common operation than destructive
operations on other sequence types is.  It's a little hard to measure
how common it is, since (SETF SCHAR) is a trivially open-coded operation.


>
> But if I really need it I'm sure I'll write it. And once I
> have exchange-first-and-last-characters I'll never have
> to look back.
>

Please send me the code for (SETF CHAR) that works on variable-width
encodings like UTF-16 and UTF-8 when you get that written.  I keep
thinking that it's hard to handle the general case without
consing/copying, but I might be missing something.
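
To make the difficulty concrete, here is a minimal sketch, assuming a
"string" is just a vector of well-formed UTF-8 octets.  All of the names
below (UTF-8-WIDTH, UTF-8-SET-CHAR, and so on) are hypothetical and are
not part of OpenMCL.  The setter can only update in place when the old
and new characters happen to encode to the same number of octets;
otherwise it has to cons a fresh vector.

```lisp
;; Hypothetical sketch: "strings" are vectors of well-formed UTF-8 octets.

(defun utf-8-width (code)
  "Number of octets needed to encode the code point CODE."
  (cond ((< code #x80) 1)
        ((< code #x800) 2)
        ((< code #x10000) 3)
        (t 4)))

(defun utf-8-lead-width (octet)
  "Width of the encoded character whose first octet is OCTET."
  (cond ((< octet #x80) 1)
        ((< octet #xE0) 2)
        ((< octet #xF0) 3)
        (t 4)))

(defun utf-8-char-offset (octets index)
  "Octet offset of character INDEX; an O(n) walk, not unit cost."
  (let ((pos 0))
    (dotimes (i index pos)
      (incf pos (utf-8-lead-width (aref octets pos))))))

(defun utf-8-encode (code)
  "Return a fresh octet vector holding the UTF-8 encoding of CODE."
  (let* ((width (utf-8-width code))
         (out (make-array width :element-type '(unsigned-byte 8))))
    ;; Fill the continuation octets from last to first.
    (loop for i from (1- width) downto 1
          do (setf (aref out i) (logior #x80 (logand code #x3F))
                   code (ash code -6)))
    (setf (aref out 0)
          (logior (ecase width (1 #x00) (2 #xC0) (3 #xE0) (4 #xF0)) code))
    out))

(defun utf-8-set-char (octets index code)
  "Replace character INDEX with CODE.  Returns OCTETS itself when the
widths match, and a freshly consed vector otherwise."
  (let* ((start (utf-8-char-offset octets index))
         (old-width (utf-8-lead-width (aref octets start)))
         (new (utf-8-encode code)))
    (if (= old-width (length new))
        (replace octets new :start1 start)          ; in place
        (concatenate '(vector (unsigned-byte 8))    ; must copy
                     (subseq octets 0 start)
                     new
                     (subseq octets (+ start old-width))))))
```

Because the result is sometimes a new object, a caller would have to use
the return value, which is not how (SETF CHAR) is expected to behave.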


>> Suppose we were to instead say that - formally or not - these 16-bit
>> strings were really UTF-16-encoded; we could allow the use of
>> surrogate pairs inside 16-bit strings.  If we did this "informally",
>> functions like SCHAR would either return true CHARACTER objects or the
>> high or low half of a surrogate pair.  Since we aren't inventing a new
>> language, the values returned by CHAR and SCHAR would have to be
>> CHARACTERs,
>
> Yes, but the CL standard does not say what CHARACTERS are
> other than the standard characters.

Is #\u+12345 a character ?  Unless we restrict ourselves to the BMP,
I'd say "yes."

About the only real definition of what a CHARACTER is is "an object
that you can put in a STRING and subsequently access."  A STRING is a
VECTOR whose elements are guaranteed to be CHARACTERs.  STANDARD-CHARs
are CHARACTERs, and there are tens of thousands of other things out
there in the world that we'd like to be able to treat as CHARACTERs.


If a STRING is a vector specialized to hold any CHARACTER, then
(SETF (CHAR S I) C) should work for any legal values of S, I, and C;
a subsequent (CHAR S I) should return C.

A UTF-16 encoded STRING containing the character #\u+12345 would
contain the code units:

#xd808 #xdf45
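
Those two code units follow from the standard UTF-16 surrogate
arithmetic.  A minimal sketch, using a hypothetical helper named
UTF-16-SURROGATES (not an OpenMCL function):

```lisp
;; Hypothetical helper: split a code point above #xFFFF into its
;; UTF-16 high and low surrogate code units.
(defun utf-16-surrogates (code)
  (assert (<= #x10000 code #x10FFFF))
  (let ((offset (- code #x10000)))
    (values (logior #xD800 (ash offset -10))          ; top 10 bits
            (logior #xDC00 (logand offset #x3FF)))))  ; low 10 bits

(multiple-value-bind (high low) (utf-16-surrogates #x12345)
  (format t "#x~(~x~) #x~(~x~)~%" high low))  ; prints #xd808 #xdf45
```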

There are two ways of looking at this that I can think of:

1) The length of that string is 1; calling (AREF/ELT/CHAR/SCHAR s 0)
    returns #\u+12345.

2) The length of that string is 2; calling (AREF/ELT/CHAR/SCHAR s 0)
    returns #\u+d808 and accessing the second element returns #\u+df45.

(1) has the property that STRINGs are objects that can contain any
     CHARACTER supported by the implementation.  (2) does not have
     this property.

If you're advocating (2), I don't think that you're allowing 
#\u+12345 to be a CHARACTER, and you're effectively saying that
CHAR-CODE-LIMIT is no greater than #x10000.  (Yes, of course you
can put the sequence of "characters" #\u+d808 and #\u+df45 in
a "string" yourself, BLT that string to somewhere where some
flavor of #_DrawUTF16String can see it, and if you have the
right font installed you might see the (cuneiform, as it happens)
glyph for #\u+12345 on the screen.)

You can't (under (2)) do things like:

(defun cuneiform-p (c)
   (and (>= (char-code c) #x12000)
        (< (char-code c) #x12474)))

(defun string-contains-cuneiform-p (s)
   (not (null (position-if #'cuneiform-p s))))

but of course that's a moot point, because under (2) you can't really
allow anything with a CHAR-CODE that doesn't fit in 16 bits.

(1) would allow arbitrary Unicode characters to be encoded in UTF-16
strings (I think that we all agree that UTF-16 can encode arbitrary
Unicode characters).  Relative to the current implementation, it
means that WITH-UTF-16-STRING could be a fairly cheap BLT operation
(rather than the "time/space tradeoff" involved in encode/decode),
but that the complexity of encode/decode would be passed to MAKE-ARRAY and
MAKE-STRING and AREF and SCHAR and REPLACE and LENGTH and dozens of
other CL functions. That seems completely backwards to me.

Paying more in space (32-bit internal representation) to save time
(unit-cost operations) isn't free either.  You could pay less of
a space cost (24-bit internal representation) and more of a time
cost (a few extra loads and/or shifts per SCHAR), and that might have been
somewhat less drastic than introducing a 4x increase in string memory
size has been.
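
As a rough illustration of what "a few extra loads and/or shifts" could
look like, here is a sketch of a 24-bit layout: three octets per
character, most significant octet first.  The accessor name SCHAR-24 and
the layout are assumptions made for this example only; this is not how
OpenMCL stores strings.

```lisp
;; Hypothetical 24-bit representation: character INDEX occupies octets
;; 3*INDEX through 3*INDEX+2, most significant octet first.
(defun schar-24 (octets index)
  (let ((base (* 3 index)))
    ;; Three loads and two shifts per access, still constant time.
    (code-char (logior (ash (aref octets base) 16)
                       (ash (aref octets (+ base 1)) 8)
                       (aref octets (+ base 2))))))

(defun (setf schar-24) (char octets index)
  (let ((base (* 3 index))
        (code (char-code char)))
    (setf (aref octets base)       (ldb (byte 8 16) code)
          (aref octets (+ base 1)) (ldb (byte 8 8) code)
          (aref octets (+ base 2)) (ldb (byte 8 0) code))
    char))
```

Indexing stays unit-cost here, unlike the variable-width encodings
discussed earlier; the price is the extra work inside each access.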


