[Openmcl-devel] how many angels can dance on a unicode character?

Sun Apr 22 06:07:30 PDT 2007

Gary Byers wrote:

> No; I spent most of yesterday writing replies to these messages.
> I need to learn to write more quickly.

Thank you. I hope this is not exactly like discussing angels
for you.

> [...]
>
> (defun copy-string (source &optional (len (length source)))
>   (let* ((dest (make-string len)))
>     (dotimes (i len dest)
>       (setf (schar dest i) (char source i)))))
> 
> How many code-units should (MAKE-STRING len) allocate ? If it
> didn't allocate enough, should (SETF SCHAR) allocate more ?

len code-units. And it is exactly the necessary size. What I meant by
using UTF-16 is that CHARACTER uses UTF-16 code-units as its
char-code and CHAR-CODE-LIMIT is set at #xFFFF. There would be no
CHARACTER for supplementary characters.

> [...]
>
> Is #\u+12345 a character ?  Unless we restrict ourselves to the BMP,
> I'd say "yes."

If CHAR-CODE-LIMIT is #xFFFF, the answer is no. But that doesn't
mean we are limited to the BMP. The characters outside of the BMP will
be represented by a surrogate pair -- by two CHARACTERs or a
string.
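
To make the surrogate-pair representation concrete, here is a minimal
sketch that splits a code point above #xFFFF into its two UTF-16
code-units. The name code-point-to-surrogates is hypothetical (it is
not part of OpenMCL or MCL), and it works on integers only, so it does
not depend on whether the implementation accepts surrogate values as
CHARACTERs.

```lisp
;;; Hypothetical helper, not part of OpenMCL or MCL.
;;; Returns two values: the high and low surrogate code-units for
;;; CODE-POINT, which must be in the range #x10000..#x10FFFF.
(defun code-point-to-surrogates (code-point)
  (assert (<= #x10000 code-point #x10FFFF))
  (let ((offset (- code-point #x10000)))             ; a 20-bit value
    (values (logior #xD800 (ash offset -10))         ; top 10 bits
            (logior #xDC00 (logand offset #x3FF))))) ; low 10 bits
```

For #x12345 this returns #xD808 and #xDF45, so that one supplementary
character occupies two elements of the string.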

Back to the exchange-first-and-last-characters example, these
may not be what one wants:

(exchange-first-and-last-characters
 (coerce '(#\a #\u #\Combining_Diaeresis) 'string))

(let ((string (coerce '(#\a #\U+1111 #\U+1162 #\U+11B7) 'string)))
  (print string)
  (exchange-first-and-last-characters string))

More likely one wants:

;;; sorry this doesn't run on OpenMCL. (I tested it on MCL 5.1)
;;;
;;; It assumes there is a function boundary-p that
;;; takes 2 characters and returns T if it is safe to
;;; separate the two.
;;;
;;; And it may have a bug. I assembled it in haste.

(defun exchange-first-and-last-characters (string)
  (let ((len (length string))
        first
        last)
    (loop for i from 0 below len
          do
          (let ((ch1 (schar string i))
                (ch2 (ignore-errors (schar string (1+ i)))))
            (when (ats:boundary-p ch1 ch2)
              (setq first (1+ i))
              (return))))
    (loop for i from (1- len) downto 0
          do
          (let ((ch1 (ignore-errors (schar string (1- i))))
                (ch2 (schar string i)))
            (when (ats:boundary-p ch1 ch2)
              (setq last i)
              (return))))
    (if (and first last (<= first last))
      (concatenate 'string
                   (subseq string last)
                   (subseq string first last)
                   (subseq string 0 first))
      string)))

The point I want to make here is that it is often not right
to manipulate a Unicode string at the character level. You want
to do it at the text-unit level instead.

Also it is trivial to extend the boundary-p function to check
surrogate values.
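
For instance, the surrogate part of such a check might look like the
sketch below. It is written on char-codes rather than CHARACTERs so
that it can be tried on an implementation that does not accept
surrogates as CHARACTERs. The name surrogate-pair-boundary-p is
hypothetical, NIL stands for "no neighbour" at either end of the
string, and a real boundary-p would also have to handle combining
marks and Hangul jamo.

```lisp
;;; Hypothetical sketch; only the surrogate rule, not a full boundary-p.
;;; CODE1 and CODE2 are char-codes of adjacent elements, or NIL at the
;;; start or end of the string.  Returns NIL when splitting between
;;; them would cut a surrogate pair in half, T otherwise.
(defun surrogate-pair-boundary-p (code1 code2)
  (not (and code1
            code2
            (<= #xD800 code1 #xDBFF)     ; high (leading) surrogate
            (<= #xDC00 code2 #xDFFF))))  ; low (trailing) surrogate
```

Only a high surrogate followed directly by a low surrogate is treated
as inseparable; an unpaired surrogate is left as a text unit of its own.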

Mark Davis of the Unicode Consortium wrote in his "Unicode Myths":
<http://macchiato.com/slides/UnicodeMyths.pdf>

| Myth: You will have to rewrite all your code for surrogates.
| 
| - surrogates don't overlap.
| - Most codes not sensitive to surrogates
| - Good code accounts for strings, not just code points

> [...]
>
> About the only real definition of what a CHARACTER is is "an object
> that you can put in a STRING and subsequently access."  A STRING is a
> VECTOR whose elements are guaranteed to be CHARACTERs.  STANDARD-CHARs
> are CHARACTERs, and there are tens of thousands of other things out
> there in the world that we'd like to be able to treat as CHARACTERs.
> 
> 
> If a STRING is a vector specialized to hold any CHARACTER, then
> (SETF (CHAR S I) C) should work for any legal values of S, I, and C;
> a subsequent (CHAR S I) should return C.
> 
> A UTF-16 encoded STRING containing the character #\u+12345 would
> contain the code units:
> 
> #xd808 #xdf45
> 
> There are two ways of looking at this that I can think of:
> 
> 1) The length of that string is 1; calling (AREF/ELT/CHAR/SCHAR s 0)
>     returns #\u+12345.
> 
> 2) The length of that string is 2; calling (AREF/ELT/CHAR/SCHAR s 0)
>     returns #\u+d808 and accessing the second element returns #\u+df45.
> 
> (1) has the property that STRINGs are objects that can contain any
>      CHARACTER supported by the implementation.  (2) does not have
>      this property.
> 
> If you're advocating (2), I don't think that you're allowing 
> #\u+12345 to be a CHARACTER, and you're effectively saying that
> CHAR-CODE-LIMIT is no greater than #x10000. 

Yes. (2) is my position. UTF-16 and CHAR-CODE-LIMIT greater than
#xFFFF don't mix.
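
Under (2), LENGTH counts code-units, so the number of code points has
to be computed when it is wanted. A minimal sketch, taking the
code-units as a vector of integers so that it can be tried without
surrogate CHARACTERs; count-code-points is a hypothetical name, and
unpaired surrogates are counted as one code point each (that choice is
an assumption, not something the standard dictates here).

```lisp
;;; Hypothetical sketch: count the code points encoded by CODE-UNITS,
;;; a vector of UTF-16 code-units given as integers.
;;; A high surrogate immediately followed by a low surrogate counts
;;; once; every other code-unit, including an unpaired surrogate,
;;; counts once by itself.
(defun count-code-points (code-units)
  (let ((count 0)
        (i 0)
        (n (length code-units)))
    (loop while (< i n)
          do (let ((unit (aref code-units i)))
               (if (and (<= #xD800 unit #xDBFF)
                        (< (1+ i) n)
                        (<= #xDC00 (aref code-units (1+ i)) #xDFFF))
                   (incf i 2)            ; consume the whole pair
                   (incf i 1))
               (incf count)))
    count))
```

For the code-units #xd808 #xdf45, LENGTH of the vector is 2 while
count-code-points returns 1.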

> (Yes, of course you
> can put the sequence of "characters" #\u+d808 and #\u+df45 in
> a "string" yourself, BLT that string to somewhere where some
> flavor of #_DrawUTF16String can see it, and if you have the
> right font installed you might see the (cuneiform, as it happens)
> glyph for #\u+12345 on the screen.)
> 
> You can't (under (2)) do things like:
> 
> (defun cuneiform-p (c)
>    (and (>= (char-code c) #x12000)
>         (< (char-code c) #x12474)))

No. You need to use a code point.

(defun cuneiform-p (code-point)
    (and (>= code-point #x12000)
         (< code-point #x12474)))

> 
> (defun string-contains-cuneiform-p (s)
>    (not (null (position-if #'cuneiform-p s))))
> 
> but of course that's a moot point, because under (2) you can't really
> allow anything with a CHAR-CODE that doesn't fit in 16 bits.

string-contains-cuneiform-p will be uglier but is implementable.

(defun %surrogate-p (code)
  (when (<= #xD800 code #xDFFF)
    (if (< code #xDC00) :high :low)))

(defun surrogate-p (char)
  (%surrogate-p (char-code char)))

(defun do-codepoint (f string)
  (let ((length (length string)))
    (do ((i 0 (1+ i)))
        ((= i length))
      (let ((char-1 (char-code (schar string i))))
        (if (eq :high (%surrogate-p char-1))
          (if (= i (- length 1))
            (progn (funcall f char-1)
                   (return))
            (let ((char-2 (char-code (schar string (incf i)))))
              (case (%surrogate-p char-2)
                ((:low)
                 (funcall
                  f
                  (logior
                   (+ (ash (logand #x3FF char-1) 10) #x10000)
                   (logand #x3FF char-2))))
                ((:high)
                 (funcall f char-1)
                 (decf i))
                (t
                 (funcall f char-1)
                 (funcall f char-2)))))
          (funcall f char-1))))))

(defun string-contains-cuneiform-p (s)
   (when (not (null (position-if #'surrogate-p s)))
     (do-codepoint
      #'(lambda (code)
          (when (cuneiform-p code)
            (return-from string-contains-cuneiform-p t)))
      s)
     nil))

;; won't run on OpenMCL because of surrogate values.
(string-contains-cuneiform-p
 (concatenate 'string
              "U+1207E"
              (string (code-char #xD808))
              (string (code-char #xDC7E))))

--> T

> 
> (1) would allow arbitrary Unicode characters to be encoded in UTF-16
> strings (I think that we all agree that UTF-16 can encode arbitrary
> Unicode characters).  Relative to the current implementation, it
> means that WITH-UTF-16-STRING could be a fairly cheap BLT operation
> (rather than the "time/space tradeoff" involved in encode/decode),
> but that the complexity of encode/decode be passed to MAKE-ARRAY and
> MAKE-STRING and AREF and SCHAR and REPLACE and LENGTH and dozens of
> other CL functions. That seems completely backwards to me.
> 
> Paying more in space (32-bit internal representation) to save time
> (unit-cost operations) isn't free either.  You could pay less of
> a space cost (24-bit internal representation) and more of a time
> cost (a few extra loads and/or shifts per SCHAR), and that might have been
> somewhat less drastic than introducing a 4x increase in string memory
> size has been.

I was told the other day that worrying about FFI performance is
"premature optimization" and that "memory is cheap".

I am not convinced. (I feel that UTF-32 is the premature
optimization.) But if most people do not care about the space issue (and
that seems to be the case -- I am surprised), I guess I should go
along. 

But no, I'm not convinced. So I may whine again.

regards,
T.

--
"A very small object      Its center."