[Openmcl-devel] Extracting unicode from an external source via FFI

Sat Feb 21 18:09:05 PST 2009

On Sun, 22 Feb 2009, John McAleely wrote:

> Hi,
>
> I'm attempting to get some unicode strings from an external source (a
> MySQL database) into a form I can use within CCL (This would be most
> convenient if the c data 'became' a native lisp string). I am having
> problems with reading them in, and want to ask what the options are
> within the CCL FFI. If anyone's been down this route before, I'd be
> grateful for pointers.
>
> I'm using:
>
> Welcome to Clozure Common Lisp Version 1.2-r72:73M-ccl  (DarwinX8664)!

Some of this stuff has changed and/or had bugs fixed since then.

>
> (Note that the revision number reflects storage in my own subversion
> repository. I'm using an unmodified, locally built, version synced
> about a month ago.)
>
> My investigations to date (I'm also using clsql 4.0.3/uffi 1.6.0)
> suggest that data can make it from a lisp string into the SQL database
> (how I've not looked into yet - but the mysql command line sees the
> data correctly). When strings come back across the connection, they
> arrive garbled. A two character Chinese string in the SQL table
> becomes a six character lisp string.
>
> Rummaging into CLSQL/UFFI, I think that ultimately this bit of code
> reads strings from the mysql c interface:
>
>   #+openmcl ,@(if length
> 	   `((ccl:%str-from-ptr ,stored-obj ,length))
> 	   `((ccl:%get-cstring ,stored-obj)))
>
> Having looked at the ccl code, there is a function near %get-cstring
> called:
>
> (defun %get-utf-8-cstring (pointer) ....)
>
> This seems interesting. I speculate:
>
> + The mysql_c interface is sending over c-style strings, in a
> character set of its choice.
> + The uffi code chooses to read this with %get-cstring, which chops
> the string into 8 bit bytes, and assumes each one is a character in
> some 256 element character set.
> + The CLSQL code then takes this and passes it back to me as a lisp
> string.
> + I wonder if I could convince mysql to use utf-8 within its c
> strings, and the uffi code to use %get-utf-8-cstring, then I could
> successfully read unicode from the database into lisp strings?
>
> So, if you've been down a similar path, does my speculation sound
> correct?
>

Yes.

> Does my gentle use of grep and google appear to have stumbled on the
> 'right' CCL functions for this work? Is there a 'better' CCL API for
> reading a foreign string in some unicode character set?

The most general thing (in 1.3-rc1) is exported but not documented:

(ccl:get-encoded-string encoding pointer noctets)

where

ENCODING is either a CHARACTER-ENCODING object or a keyword that names
  such an object
POINTER is a foreign pointer, presumed to point to a string encoded in
  that character encoding
NOCTETS is the number of octets (8-bit-bytes) in the encoded string,
  not including any #\nul octets that may be used as end-of-string markers.

This function returns a lisp (SIMPLE-)string.

There's also

(ccl::get-encoded-cstring encoding pointer)

which determines the number of octets (scans forward from POINTER looking
for a 0-valued 8/16/32-bit element depending on the encoding) and calls
GET-ENCODED-STRING for you.
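
As a minimal sketch of the two calls described above (not part of the
original message): it assumes a CCL that provides CCL:WITH-ENCODED-CSTRS,
which is used here only to fabricate a nul-terminated foreign UTF-8 buffer
for the demonstration. The names *SAMPLE-STRING*, DECODE-WITH-COUNT and
DECODE-NUL-TERMINATED are hypothetical. The double-colon CCL:: prefix is
used so the code reads the same whether or not a symbol is exported in a
given release.

```lisp
;; Two-character test string (U+4E2D U+6587). In UTF-8 it occupies
;; six octets: E4 B8 AD E6 96 87.
(defparameter *sample-string*
  (coerce (list (code-char #x4E2D) (code-char #x6587)) 'string))

(defun decode-with-count ()
  ;; The caller supplies the octet count (6), excluding the trailing nul.
  (ccl:with-encoded-cstrs :utf-8 ((ptr *sample-string*))
    (ccl::get-encoded-string :utf-8 ptr 6)))

(defun decode-nul-terminated ()
  ;; GET-ENCODED-CSTRING finds the terminating 0 octet itself.
  (ccl:with-encoded-cstrs :utf-8 ((ptr *sample-string*))
    (ccl::get-encoded-cstring :utf-8 ptr)))
```

Both functions should hand back a lisp string STRING= to *SAMPLE-STRING*;
the counted form is the one to prefer when the foreign library reports a
length, since it does not depend on a terminator being present.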

(CCL:%GET-CSTRING ptr) is functionally equivalent to
(CCL::GET-ENCODED-CSTRING :ISO-8859-1 ptr), and

(CCL::%GET-UTF-8-CSTRING ptr) is functionally equivalent to
(CCL::GET-ENCODED-CSTRING :UTF-8 ptr)
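
The difference between the two readers can be illustrated with a small
sketch (an illustration under assumptions, not code from the original
exchange). It assumes CCL:WITH-ENCODED-CSTRS is available to build a
UTF-8 encoded foreign string; DECODED-LENGTHS is a hypothetical name.

```lisp
(defun decoded-lengths ()
  "Return (latin-1-length utf-8-length) for the same foreign octets."
  (let ((two-chars (coerce (list (code-char #x4E2D) (code-char #x6587))
                           'string)))
    ;; PTR points at the six UTF-8 octets E4 B8 AD E6 96 87 plus a nul.
    (ccl:with-encoded-cstrs :utf-8 ((ptr two-chars))
      (list
       ;; One lisp character per octet: six characters of garbage.
       (length (ccl:%get-cstring ptr))
       ;; Octets decoded as UTF-8: the original two characters.
       (length (ccl::%get-utf-8-cstring ptr))))))
```

Reading the same octets one byte per character gives a six-character
string, while decoding them as UTF-8 gives two, which matches the
"two characters become six" symptom reported above.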

CCL:%GET-CSTRING and CCL::%GET-UTF-8-CSTRING exist for some combination
of reasons involving:
   - legacy issues
   - bootstrapping issues
   - brevity
   - performance, though I don't know how significant this is.

>
> Looking at the API docs, telling MySQL to use UTF8 seems
> straightforward, and I'm willing to hack UFFI/CLSQL to make this work.
> Before I start hacking, I thought I'd ask what my options are for
> interfacing into CCL's unicode support.
>

CCL::%GET-UTF-8-CSTRING needs to exist for bootstrapping reasons,
and it's more concise than the alternatives. It really should be
exported, and it's almost certainly the right thing to use in
situations like the one that you describe.
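
A hedged sketch of what a replacement for the UFFI fragment quoted
earlier might look like: FOREIGN-UTF-8-STRING is a hypothetical helper,
not part of CCL, UFFI or CLSQL, and it assumes the MySQL connection has
already been configured to deliver UTF-8.

```lisp
(defun foreign-utf-8-string (pointer &optional noctets)
  "Hypothetical helper: decode UTF-8 octets at POINTER into a lisp string.
If NOCTETS is supplied, decode exactly that many octets; otherwise
treat POINTER as a nul-terminated C string."
  (if noctets
      (ccl::get-encoded-string :utf-8 pointer noctets)
      (ccl::%get-utf-8-cstring pointer)))
```

This mirrors the two branches of the original code (with and without a
known length), swapping the one-octet-per-character readers for UTF-8
aware ones.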

> Thanks,
>
> J