[Openmcl-devel] ICU sketch

Fri Apr 20 21:00:27 PDT 2007

On Fri, 20 Apr 2007, Takehiko Abe wrote:

> ;;;
> ;;; I played with ICU. <http://icu-project.org/>
> ;;;
> ;;; The codes below define:
> ;;;
> ;;;   unorm-normalize
> ;;;   unorm-compare
> ;;;
> ;;; for normalization and
> ;;;
> ;;;    ucol-strcoll
> ;;;
> ;;; for collation.
> ;;;
> ;;;
> ;;; Written and (only very casually) tested on Mac OSX 3.9 + OpenMCL
> ;;; Version 1.1-pre-070214 (DarwinPPC32)
>
>
> ;;; About libicucore.dylib:
> ;;;
> ;;; ICU is preinstalled on Mac OSX Panther as
> ;;; /usr/lib/libicucore.dylib (I don't know about Tiger or Leopard)

It seems to be there in in later systems as well; I haven't looked
extensively, but haven't been able to find any header files that
describe what's there.

(defun icu-version ()
   (rlet ((version (:array :uint8_t 4)))
     (external-call "_u_getVersion" :address version :void)
     (list (paref version (:array :uint8_t) 0)
           (paref version (:array :uint8_t) 1)
           (paref version (:array :uint8_t) 2)
           (paref version (:array :uint8_t) 3))))

The one machine I had that ran Panther got zapped by static electricity
a few months ago.  Could you please let me know what (ICU-VERSION) returns
on Panther ?

On OSX, CoreFoundation also offers a lot of utilities for dealing
with Unicode.

> ;;;
> ;;; It is probably compiled as 32-bit library. So it won't work for
> ;;; OpenMCL 64bit version.
> ;;;
> ;;; The source of this apple version is available at
> ;;; <http://www.opensource.apple.com/darwinsource/>
>
> (open-shared-library "/usr/lib/libicucore.dylib")
>
> ;;; The standard ICU is available at icu-project.org.  The straight
> ;;; build produces several shared libs and they should be placed in a
> ;;; standard location (e.g. /usr/local/lib). There are some
> ;;; dependencies among them.
> ;;;
>
> ;; ccl::%copy-ivector-to-ptr would be much simpler and more
> ;; efficient.
> (defun %init-utf-16-ptr (ptr string)
>  (loop for char across string
> 	with ptr = (%inc-ptr ptr 0)
> 	do
> 	(let ((code (char-code char)))
> 	  (if (< #xFFFF code)
> 	    (progn
> 	      (setf (%get-word ptr)
> 		    (logior #xD800
> 			    (ash (- code #x10000) -10))
> 		    (%get-word ptr 2)
> 		    (logior #xDC00
> 			    (logand #x3FF code)))
> 	      (%incf-ptr ptr 2))
> 	    (setf (%get-word ptr) code)))
> 	(%incf-ptr ptr 2)))

It's true: a lot of stream code (for instance) that used to use
ccl::%copy-ivector-to-ptr to copy octets between 8-bit strings and I/O
buffers associated with 8-bit streams now has to do something else.
Doing "something else" in the iso-8859-1 case doesn't seem to have had
a measurable effect on stream I/O performance (I was concerned about
that.)  Dealing with variable-length encodings (UTF-8, UTF-16)
introduces some overhead; I think that that's often referred to as
a time/space tradeoff.

>
> (defmacro with-utf-16-ptrs ((&rest specs) &body body)
>  (let ((ptr-vars ())
> 	(strings ())
> 	(size-vars ())
> 	(inits ())
> 	(i 0))
>    (dolist (spec specs)
>      (destructuring-bind (ptr-var size-var string)
> 			  spec
> 	(let ((string-var (make-symbol (format nil "STRING-~D" i))))
> 	  (push `(,ptr-var ,size-var) ptr-vars)
> 	  (push `(,string-var ,string) strings)
> 	  (push `(,size-var (1- (ccl::cstring-encoded-length-in-bytes
> 				   ,(get-character-encoding :utf-16)
> 				   ,string-var nil nil)))
> 		size-vars)
> 	  (push `(%init-utf-16-ptr ,ptr-var ,string-var) inits))
> 	(incf i)))
>    `(let ,strings
>      (let ,size-vars
> 	(%stack-block ,ptr-vars
> 	  , at inits
> 	  , at body)))))

CCL::CSTRING-ENCODED-LENGTH-IN-BYTES has to traverse the string,
determining how many (8-bit) bytes are needed to encode its characters.
Something (your function or a dressed-up version of something like
CCL::ENCODE-STRING-TO-MEMORY) then has to traverse the string again,
actually doing the encoding.

Neither of these things is ready for prime time; among other things,
neither of them deals reasonably with non-SIMPLE-STRINGs, and there
isn't a higher layer that does.  As its name suggests,
CCL::CSTRING-ENCODED-LENGTH-IN-BYTES tries to account for the length
of a trailing encoded #\Nul character; that length should be 2 for
UTF-16.

At least in the case of decoding, there are cases where data encoded
in UTF-16 has a leading byte-order-mark.

Handling all of these cases sanely is tricky (in the sense that I
think that coming up with sane/useful/general interfaces is tricky.)
The implementation of the lowest level part of encode-to-memory/
decode-to-memory isn't rocket science: it's the sort of thing that
needs to be done well once, so that you don't have to write your
own %STRING-FROM-UTF-16-PTR and WITH-UTF-16-PTRs every time you
want to access fuctionality in a shared library.

>
> ;; %copy-ptr-to-ivector would be much simpler and more
> ;; effcient.
> (defun %string-from-utf-16-ptr (ptr byte-size)
>  (let ((len 0))
>    ;; check length
>    (do* ((i 0 (+ i 1))
> 	  (c 0 (+ c 2))
> 	  (code (%get-unsigned-word ptr c)
> 		(%get-unsigned-word ptr c))
> 	  (word-size (ash byte-size -1)))
> 	 ((= i word-size))
>      (cond
>       ;; malformed inputs:
>       ;; What to do when we get a malformed sequence?
>       ;; Should provide restarts:
>       ;; -- put the standard replacement char #\U+FFFD
>       ;; -- let user to choose replacement char.
>       ;; -- ??
>       ((<= #xD800 code #xDBFF)
>        (incf c 2)
>        (decf len)
>        (unless (<= #xDC00 (%get-unsigned-word ptr c)  #xDFFF)
>          (error "malformed utf-16")))
>       ((<= #xDC00 code #xDFFF)
>        (error "malformed utf-16")))
>      (incf len))
>    ;; fill string
>    (let ((string (make-string len)))
>      (do ((i 0 (1+ i))
> 	   (c 0 (+ c 2)))
> 	  ((= i len) string)
> 	(setf (schar string i)
> 	      (let ((code (%get-unsigned-word ptr c)))
> 		(if (<= #xD800 code #xDBFF)
> 		  (code-char (logior
> 			      (+ (ash (logand #x3FF code) 10) #x10000)
> 			      (logand #x3FF
> 				      (%get-unsigned-word
>                                       ptr
>                                       (incf c 2)))))
> 		  (code-char code))))))))
>
> ;;;
> ;;; normalization -- unorm.h
> ;;; <http://www.icu-project.org/apiref/icu4c/unorm_8h.html>
> ;;;
>
> (defun unorm-normalize (source form)
>  (with-utf-16-ptrs ((in-str in-length source))
>    (let ((dst-length 0)
> 	  (nf (ecase form
> 		((:nfd) 2)
> 		((:nfkd) 3)
> 		((:nfc) 4)
> 		((:nfkc) 5))))
>      (rlet ((status :signed-long 0))
> 	(setq in-length (ash in-length -1))
> 	(setq dst-length
> 	      (external-call "_unorm_normalize"
> 			     :address in-str
> 			     :long in-length
> 			     :long nf
> 			     :long 0 ; #x20
> 			     :address (%null-ptr)
> 			     :long dst-length
> 			     :address status
> 			     :long))
> 	(setf (%get-long status) 0)
> 	(%stack-block ((out-str (* dst-length 2)))
> 	  (external-call "_unorm_normalize"
> 			 :address in-str
> 			 :long in-length
> 			 :long nf
> 			 :long 0 ; #x20
> 			 :address out-str
> 			 :long dst-length
> 			 :address status
> 			 :long)
> 	  (let ((error (%get-signed-long status)))
> 	    (if (or (zerop error)
> 		    (= error -124))
> 	      (%string-from-utf-16-ptr out-str (+ dst-length dst-length))
> 	      (error "icu error : ~D" error))))))))
>
> (defun unorm-compare (string1 string2 &optional ignore-case)
>  (with-utf-16-ptrs ((str1 len1 string1)
> 		     (str2 len2 string2))
>    (rlet ((error :signed-long 0))
>      (external-call "_unorm_compare"
> 		     :Address str1
> 		     :long (ash len1 -1)
> 		     :address str2
> 		     :long (ash len2 -1)
> 		     :long (if ignore-case #x10000 0)
> 		     :address error
> 		     :long))))
>
> ;;;
> ;;; collation - ucol.h
> ;;; <http://www.icu-project.org/apiref/icu4c/ucol_8h.html>
> ;;;
>
> ;;
> ;; locale - uloc.h
> ;; <http://www.icu-project.org/apiref/icu4c/uloc_8h.html>
>
> (defun uloc-set-default (locale-string)
>  (with-cstrs ((locale locale-string))
>    (rlet ((status :signed-long 0))
>      (external-call "_uloc_setDefault" :address locale
> 		     :address status
> 		     nil)
>      (zerop (%get-long status)))))
>
> (defun uloc-get-default ()
>  (%get-cstring (external-call "_uloc_getDefault" :address)))
>
> ;;
> ;; ucol.h
> ;;
>
> (defun ucol-open (locale)
>  (with-cstrs ((loc locale))
>    (rlet ((error :signed-long 0))
>      (let ((collator (external-call "_ucol_open"
> 				     :address loc
> 				     :address error
> 				     :address)))
> 	(let ((code (%get-long error)))
> 	  (if (or (zerop code)
> 		  (= -128 code)
> 		  (= -127 code))
> 	    (values collator
> 		    code)
> 	    (progn
> 	      (ucol-close collator)
> 	      (error "icu-error: ~D" code))))))))
>
> (defun ucol-close (collator)
>  (external-call "_ucol_close"
> 		 :address collator
> 		 nil))
>
> (defmacro with-ucollator ((var &optional (locale (uloc-get-default)))
>                          &body body)
>  `(let ((,var (ucol-open ,locale)))
>    (unwind-protect
>      (progn , at body)
>      (ucol-close ,var))))
>
> (defun ucol-strcoll (string1 string2 collator)
>  (with-utf-16-ptrs ((str1 size1 string1)
> 		     (str2 size2 string2))
>    (external-call "_ucol_strcoll"
> 		   :address collator
> 		   :address str1
> 		   :long (ash size1 -1)
> 		   :address str2
> 		   :long (ash size2 -1)
> 		   :long)))
>
> ;;;
> ;;; uchar.h
> ;;; <http://www.icu-project.org/apiref/icu4c/uchar_8h.html>
> ;;;
>
> (defun u-char-name (codepoint)
>  (%stack-block ((buffer 128))
>    (rlet ((error :signed-long 0))
>      (external-call "_u_charName"
> 		     :long codepoint
> 		     :long 2
> 		     :address buffer
> 		     :long 256
> 		     :address error)
>      (if (zerop (%get-long error))
> 	(%get-cstring buffer)))))
>
>
>
> #|
> ;;; tests
>
> ;; unorm-normalize
>
> (unorm-normalize (concatenate 'string "abc" (string #\u+FA)) :nfd)
>
> (unorm-normalize (string #\U+D338) :nfd)
>
> (char (unorm-normalize (unorm-normalize (string #\U+D338) :nfd)
> 		       :nfc)
>      0)
>
> (length (unorm-normalize (string #\Latin_Capital_Letter_A_With_Grave)
> 			 :nfd))
>
>
> ;; unorm-compare
>
> (unorm-compare "ABC" "abc" t)
>
> (let ((string (string (code-char #xac7a))))
>  (unorm-compare (unorm-normalize string :nfd)
> 	         string))
>
> (unorm-compare
> (unorm-normalize (concatenate 'string
>                               (string #\U+2000b)
>                               "ABc"
>                               (string #\U+ac7a))
>                  :nfd)
> (concatenate 'string
>              (string #\U+2000b)
>              "abc"
>              (string #\U+ac7a))
> t)
>
> ;; collation with ucol-strcoll
>
> ;; not very interesting
> (let ((names (list "DIABATE" "balke" "Bartok" "N'Doye")))
>  (with-ucollator (coll "en_US")
>    (sort names #'(lambda (str1 str2)
> 		    (minusp (ucol-strcoll str1 str2 coll))))))
>
>
> ;;; character name
>
> (u-char-name #xFBF9)
> (u-char-name #xD800)
>
> |#
>
> --
> "Always first steps."
>

Great, thanks.  It'd be nice if Apple supplied the ICU headers.