[Openmcl-devel] Tracking Down CFFI Problem

Sun Aug 10 03:12:34 PDT 2008

I just glanced at this and may be missing something, but the code
didn't look like it'd ever been built on a 64-bit system.  Just trying
to build the C library on a 64-bit Linux system (where the C compiler/
toolchain assume a 64-bit world) led to

/usr/bin/ld: betabase.o: relocation R_X86_64_32 against `a local symbol' can not be used when making a shared object; recompile with -fPIC
betabase.o: could not read symbols: Bad value
collect2: ld returned 1 exit status
make: *** [liblispstat.so] Error 1

and there didn't seem to be anything in the Makefile that addressed this.
(Adding "-fPIC" to CFLAGS probably isn't the last change you'd have to
make; I don't know how sensitive the rest of the code is to word-size
issues, but the fact that the linker was complaining strongly suggests
that the C code has only ever been built or run in a 32-bit environment.)

If that assumption's correct, then it may also be the case that the
lisp code (and/or CFFI glue) have word-size issues.  Someone who is
interested in this should certainly look at the code carefully and
try to determine if it's making assumptions about the size or alignment
of foreign objects.
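
As a starting point for that kind of audit, here's a minimal sketch in C
(the helper names are hypothetical and don't come from CommonLispStat).  It
reports the sizes the toolchain actually uses, and compares the space needed
for a table of row pointers with what 32-bit-era code might have allocated
if it sized the table in ints.  That allocation pattern is only an
assumption about what could be wrong, not something I've found in the
library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helpers for illustration; not part of CommonLispStat. */

/* Bytes needed for a table of n row pointers on this platform. */
size_t row_table_bytes(size_t n)
{
    assert(n <= SIZE_MAX / sizeof(char *));   /* guard against overflow */
    return n * sizeof(char *);
}

/* Bytes that would be allocated if the table were sized in ints
   (an assumed 32-bit-era habit, where int and pointer were both 4 bytes). */
size_t row_table_bytes_as_int(size_t n)
{
    assert(n <= SIZE_MAX / sizeof(int));      /* guard against overflow */
    return n * sizeof(int);
}

/* Print the sizes that matter when passing data across a foreign interface. */
void print_word_sizes(void)
{
    printf("sizeof(int)    = %zu\n", sizeof(int));
    printf("sizeof(long)   = %zu\n", sizeof(long));
    printf("sizeof(char *) = %zu\n", sizeof(char *));
    printf("sizeof(double) = %zu\n", sizeof(double));
}
```

On a typical 64-bit Linux system I'd expect int to be 4 bytes and pointers
to be 8, so the two table sizes differ by a factor of two; on 32-bit x86
they'd be the same, which is how such a mistake could go unnoticed.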

There are lots of things to go wrong, and it's certainly possible that
foreign memory access primitives in CCL could have bugs in them (though
I don't know of any such bugs).  It seems likely that word-size issues
would be obscuring the issue enough that any such bugs would be very
difficult to isolate.
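
One detail in the trace quoted below looks consistent with a pointer-size
problem, though I haven't verified it against the C source.  The bogus third
row pointer, #x4000000000000000, is exactly the bit pattern of the double
2.0, and 2.0 is the value that had just been stored at element 0 of row 0
(at #x10BF70).  With 8-byte pointers, slot 2 of the row table at #x10BF60
also lives at #x10BF70, so the row table and row 0 may overlap, as they
would if the table had been allocated too small.  A small sketch of that
arithmetic (the function names are hypothetical, and an IEEE-754 64-bit
double is assumed):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers for illustration; not part of CommonLispStat. */

/* Reinterpret a 64-bit pattern as a double (assumes IEEE-754 binary64). */
double bits_to_double(uint64_t bits)
{
    double d;
    assert(sizeof d == sizeof bits);
    memcpy(&d, &bits, sizeof d);   /* well-defined way to reinterpret bits */
    return d;
}

/* Address of slot i in a table of 8-byte pointers starting at base,
   i.e. what ((PTR *) p) + i computes on a 64-bit system. */
uint64_t slot_address(uint64_t base, unsigned i)
{
    return base + (uint64_t)8 * i;
}
```

Here bits_to_double(0x4000000000000000) gives 2.0 and
slot_address(0x10BF60, 2) gives 0x10BF70, which matches the addresses in
the trace.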

On Sat, 9 Aug 2008, Brent Fulgham wrote:

> Hi Gary,
>
> I've been trying to get CommonLispStat working on Clozure CL 
> (http://repo.or.cz/w/CommonLispStat.git).  By and large it works fine with a 
> current copy of CFFI.  However, there is one call that is failing, LU-SOLVE:
>
> (defun lu-solve (lu lb)
> "Args: (lu b)
> LU is the result of (LU-DECOMP A) for a square matrix A, B is a sequence.
> Returns the solution to the equation Ax = B. Signals an error if A is
> singular."
> (let ((la (first lu))
> 	(lidx (second lu)))
>   (check-square-matrix la)
>   (check-sequence lidx)
>   (check-sequence lb)
>   (check-fixnum lidx)
>   (let* ((n (num-rows la))
> 	   (result (make-sequence (if (consp lb) 'list 'vector) n))
> 	   (a-mode (la-data-mode la))
> 	   (b-mode (la-data-mode lb)))
>     (if (/= n (length lidx)) (error "index sequence is wrong length"))
>     (if (/= n (length lb)) (error "right hand side is wrong length"))
>     (let* ((mode (max +mode-re+ a-mode b-mode))
> 	     (a (la-data-to-matrix la mode))
> 	     (indx (la-data-to-vector lidx +mode-in+))
> 	     (b (la-data-to-vector lb mode))
> 	     (singular 0))
> 	(unwind-protect
> 	    (progn
> 	      (setf singular (lu-solve-front a n indx b mode))
> 	      (la-vector-to-data b n mode result))
> 	  (la-free-matrix a n)
> 	  (la-free-vector indx)
> 	  (la-free-vector b))
> 	(if (/= 0.0 singular) (error "matrix is (numerically) singular"))
> 	result))))
>
> ? (defvar *ABC* (lu-decomp #2A((2 3 4) (1 2 4) (2 4 5))))
> *ABC*
>
> ? *ABC*
> (#2A((2.0 3.0 4.0) (1.0 1.0 1.0) (0.5 0.5 1.5)) #(0 2 2) -1.0 NIL)
>
> This produces a 3x3 matrix with some additional stuff cons'd onto the end. 
> If I call LU-SOLVE on this, I should get the following (from SBCL):
>
> * (lu-solve *ABC* #(2 3 4))
>
> #(-2.333333333333333 1.3333333333333335 0.6666666666666666)
>
> Unfortunately, in Clozure CL I get an error.  After turning on some tracing, 
> I get the following behavior:
>
> =============== Clozure CL x86_64 ====================================================
> ? (lu-solve *ABC* #(2 3 4))
> 0> Calling (LU-SOLVE (#2A((2.0 3.0 4.0) (1.0 1.0 1.0) (0.5 0.5 1.5)) #(0 2 2) 
> -1.0 NIL) #(2 3 4))
> 1> Calling (LISP-STAT-MATRIX:CHECK-MATRIX #2A((2.0 3.0 4.0) (1.0 1.0 1.0) 
> (0.5 0.5 1.5)))
> <1 LISP-STAT-MATRIX:CHECK-MATRIX returned T
> 1> Calling (NUM-ROWS #2A((2.0 3.0 4.0) (1.0 1.0 1.0) (0.5 0.5 1.5)))
> <1 NUM-ROWS returned 3
> 1> Calling (LISP-STAT-LINALG-DATA:LA-DATA-TO-MATRIX #2A((2.0 3.0 4.0) (1.0 
> 1.0 1.0) (0.5 0.5 1.5)) 1)
> 2> Calling (LISP-STAT-MATRIX:CHECK-MATRIX #2A((2.0 3.0 4.0) (1.0 1.0 1.0) 
> (0.5 0.5 1.5)))
> <2 LISP-STAT-MATRIX:CHECK-MATRIX returned T
> 2> Calling (NUM-ROWS #2A((2.0 3.0 4.0) (1.0 1.0 1.0) (0.5 0.5 1.5)))
> <2 NUM-ROWS returned 3
> 2> Calling (NUM-COLS #2A((2.0 3.0 4.0) (1.0 1.0 1.0) (0.5 0.5 1.5)))
> <2 NUM-COLS returned 3
> 2> Calling (LISP-STAT-LINALG-DATA:LA-MATRIX 3 3 1)
> <2 LISP-STAT-LINALG-DATA:LA-MATRIX returned #<A Foreign Pointer #x10BF60>
> 2> Calling (LISP-STAT-LINALG-DATA::LA-GET-POINTER #<A Foreign Pointer 
> #x10BF60> 0)
> <2 LISP-STAT-LINALG-DATA::LA-GET-POINTER returned #<A Foreign Pointer 
> #x10BF70>
> 2> Calling (LISP-STAT-LINALG-DATA:LA-PUT-DOUBLE #<A Foreign Pointer 
> #x10BF70> 0 2.0)
> <2 LISP-STAT-LINALG-DATA:LA-PUT-DOUBLE returned NIL
> 2> Calling (LISP-STAT-LINALG-DATA:LA-PUT-DOUBLE #<A Foreign Pointer 
> #x10BF70> 1 3.0)
> <2 LISP-STAT-LINALG-DATA:LA-PUT-DOUBLE returned NIL
> 2> Calling (LISP-STAT-LINALG-DATA:LA-PUT-DOUBLE #<A Foreign Pointer 
> #x10BF70> 2 4.0)
> <2 LISP-STAT-LINALG-DATA:LA-PUT-DOUBLE returned NIL
> 2> Calling (LISP-STAT-LINALG-DATA::LA-GET-POINTER #<A Foreign Pointer 
> #x10BF60> 1)
> <2 LISP-STAT-LINALG-DATA::LA-GET-POINTER returned #<A Foreign Pointer 
> #x104E70>
> 2> Calling (LISP-STAT-LINALG-DATA:LA-PUT-DOUBLE #<A Foreign Pointer 
> #x104E70> 0 1.0)
> <2 LISP-STAT-LINALG-DATA:LA-PUT-DOUBLE returned NIL
> 2> Calling (LISP-STAT-LINALG-DATA:LA-PUT-DOUBLE #<A Foreign Pointer 
> #x104E70> 1 1.0)
> <2 LISP-STAT-LINALG-DATA:LA-PUT-DOUBLE returned NIL
> 2> Calling (LISP-STAT-LINALG-DATA:LA-PUT-DOUBLE #<A Foreign Pointer 
> #x104E70> 2 1.0)
> <2 LISP-STAT-LINALG-DATA:LA-PUT-DOUBLE returned NIL
> 2> Calling (LISP-STAT-LINALG-DATA::LA-GET-POINTER #<A Foreign Pointer 
> #x10BF60> 2)
> <2 LISP-STAT-LINALG-DATA::LA-GET-POINTER returned #<A Foreign Pointer 
> #x4000000000000000>
> 2> Calling (LISP-STAT-LINALG-DATA:LA-PUT-DOUBLE #<A Foreign Pointer 
> #x4000000000000000> 0 0.5)
> Unhandled exception 11 at 0x78dd22, context->regs at #xb029b980
> Exception occurred while executing foreign code
> at la_put_double + 39
> ? for help
> ====================================================================================
>
> Based on this, it looks like the problem occurs on the third and final
> iteration over the matrix object, where LA-GET-POINTER returns a bogus
> pointer.  I'm not sure if the values returned from the LA-GET-POINTER calls
> are reasonable (there seems to be a large space between them).  Row 0 is 16
> bytes past the pointer to the matrix structure, which sounds reasonable.  But
> the 1st row is 28,928 bytes away from the 0th row (at a lower address).
>
> To compare, running this same code under SBCL produces seemingly more 
> expected behavior.  Row 0 is 128 bytes past the pointer to the matrix 
> structure, which I don't quite get.  But the 1st row is only 32 bytes past 
> the 0th row.  For 32-bit values, this would sound okay for three doubles (3 x 
> 16 ->
>
> =============== SBCL x86 32-bit ====================================================
>     2: (LISP-STAT-LINALG-DATA::LA-GET-POINTER #.(SB-SYS:INT-SAP #X00100DC0) 
> 1)
>     2: LISP-STAT-LINALG-DATA::LA-GET-POINTER returned
>          #.(SB-SYS:INT-SAP #X00100D40)
>     2: (LISP-STAT-LINALG-DATA:LA-PUT-DOUBLE #.(SB-SYS:INT-SAP #X00100D40) 0
>                                             1.0)
>     2: LISP-STAT-LINALG-DATA:LA-PUT-DOUBLE returned
>     2: (LISP-STAT-LINALG-DATA:LA-PUT-DOUBLE #.(SB-SYS:INT-SAP #X00100D40) 1
>                                             1.0)
>     2: LISP-STAT-LINALG-DATA:LA-PUT-DOUBLE returned
>     2: (LISP-STAT-LINALG-DATA:LA-PUT-DOUBLE #.(SB-SYS:INT-SAP #X00100D40) 2
>                                             1.0)
>     2: LISP-STAT-LINALG-DATA:LA-PUT-DOUBLE returned
>     2: (LISP-STAT-LINALG-DATA::LA-GET-POINTER #.(SB-SYS:INT-SAP #X00100DC0) 
> 2)
>     2: LISP-STAT-LINALG-DATA::LA-GET-POINTER returned
>          #.(SB-SYS:INT-SAP #X00100D60)
>     2: (LISP-STAT-LINALG-DATA:LA-PUT-DOUBLE #.(SB-SYS:INT-SAP #X00100D60) 0
>                                             0.5)
>     2: LISP-STAT-LINALG-DATA:LA-PUT-DOUBLE returned
>     2: (LISP-STAT-LINALG-DATA:LA-PUT-DOUBLE #.(SB-SYS:INT-SAP #X00100D60) 1
>                                             0.5)
>     2: LISP-STAT-LINALG-DATA:LA-PUT-DOUBLE returned
>     2: (LISP-STAT-LINALG-DATA:LA-PUT-DOUBLE #.(SB-SYS:INT-SAP #X00100D60) 2
>                                             1.5)
> ====================================================================================
>
> I suspect this is some kind of error with the use of the 64-bit runtime; the 
> "get-pointer" call accessed through the CFFI is fairly simple.  Could the 
> "int" I'm passing from CCL be a different size than expected in the 64-bit C 
> side?
>
> typedef char *PTR;
>
> [... stuff ...]
>
> PTR
> la_get_pointer(PTR p, int i)
> {
> return(*(((PTR *) p) + i));
> }
>
> Unfortunately, the kernel failure is happening due to an earlier error, so 
> I'm afraid the backtrace is probably useless.  Still, I'll include it in case 
> it helps.
>
> =============== Clozure CL x86_64 kernel backtrace ======================================
> [44938] OpenMCL kernel debugger: b
> current thread: tcr = 0x104ad0, native thread ID = 0xe703, interrupts enabled
>
>
> (#x000000000065A948) #x00003000411540FC : #<Function CCL-LA-PUT-DOUBLE 
> #x0000300041153FFF> + 253
> (#x000000000065A970) #x00003000412BCB2C : #<Function (TRACED LA-PUT-DOUBLE) 
> #x00003000412BC8CF> + 605
> (#x000000000065A9B8) #x00003000411518F4 : #<Function LA-DATA-TO-MATRIX 
> #x000030004115169F> + 597
> (#x000000000065AA08) #x000030004127101C : #<Function (TRACED 
> LA-DATA-TO-MATRIX) #x0000300041270DBF> + 605
> (#x000000000065AA50) #x0000300041172A0C : #<Function LU-SOLVE 
> #x000030004117277F> + 653
> (#x000000000065AAB0) #x00003000412AB43C : #<Function (TRACED LU-SOLVE) 
> #x00003000412AB1DF> + 605
> (#x000000000065AAF8) #x00003000404915D4 : #<Function CALL-CHECK-REGS 
> #x00003000404914EF> + 229
> (#x000000000065AB30) #x00003000404B121C : #<Function TOPLEVEL-EVAL 
> #x00003000404B0F3F> + 733
> (#x000000000065ABD0) #x00003000404B300C : #<Function READ-LOOP 
> #x00003000404B293F> + 1741
> (#x000000000065ADD8) #x00003000404910AC : #<Function TOPLEVEL-LOOP 
> #x000030004049102F> + 125
> (#x000000000065AE08) #x0000300040447094 : #<Function (:INTERNAL 
> (TOPLEVEL-FUNCTION (LISP-DEVELOPMENT-SYSTEM T))) #x000030004044702F> + 101
> (#x000000000065AE20) #x0000300040548A34 : #<Function (:INTERNAL 
> MAKE-MCL-LISTENER-PROCESS) #x00003000405487AF> + 645
> (#x000000000065AEB8) #x000030004045A21C : #<Function RUN-PROCESS-INITIAL-FORM 
> #x0000300040459F4F> + 717
> (#x000000000065AF48) #x0000300040438294 : #<Function (:INTERNAL 
> %PROCESS-PRESET-INTERNAL) #x000030004043810F> + 389
> (#x000000000065AF98) #x0000300040400E84 : #<Function (:INTERNAL 
> THREAD-MAKE-STARTUP-FUNCTION) #x0000300040400D5F> + 293
> [44938] OpenMCL kernel debugger:
> =================================================================================
>
> The first thing I tried to debug this was to run the CFFI test suite under 
> 64-bit CCL.  I don't see any errors (though there doesn't seem to be a 
> comparable test case).  The tests that do test getting and setting double 
> values all seem to work fine, but there are no 'matrix' or 'array of array' 
> type tests that I can see.
>
> Can you suggest anything as a place to start looking for the problem?
>
> Thanks,
>
> -Brent