[Openmcl-devel] Advice needed to debug shared library problem on Mac and Linux

Gary Byers gb at clozure.com
Tue Mar 29 02:15:57 PDT 2011


I'm confused. On the Mac, we don't try too hard to decode memory fault
info and so wouldn't say something like:

invalid address alignment

I think that we used to try to do that until a few versions ago.
Let's ignore that mystery for now; see below for obsessive fixation
on other mysteries.

On Tue, 29 Mar 2011, Paul Meurer wrote:

> Hi,
> I am trying to interface a third-party shared library (libcfsm, a finite
> state transducer implementation by Xerox Parc) to Clozure. The library works
> without problems in Allegro and SBCL, both on Intel Mac and Linux, both 32
> and 64 bit.
> 
> On Clozure, however, there are problems. I can load the library, initialize
> the necessary global structures and run certain functions, but others make
> Clozure crash like this (on the Mac):
>
>       Unhandled exception 10 at 0x491a20e, context->regs at #xb09cfb60
> Exception occurred while executing foreign code
> at cfsm_push + 65
> received signal 10; faulting address: 0x0
> invalid address alignment
> ? for help
> [899] Clozure CL kernel debugger: r
> %rax = 0x0000000000000000      %r8  = 0x00000000ffffffff
> %rcx = 0x00007fff870f1eda      %r9  = 0x0000000000000000
> %rdx = 0x0000000002012000      %r10 = 0x0000000000001002
> %rbx = 0x0000000001900790      %r11 = 0x0000000000000206
> %rsp = 0x00000000b09d0030      %r12 = 0x000030004006372e
> %rbp = 0x00000000b09d0060      %r13 = 0x0000302000b62bcf
> %rsi = 0x0000000006800000      %r14 = 0x0000000000000020
> %rdi = 0x0000000002012000      %r15 = 0x0000000004bcbc9d
> %rip = 0x000000000491a20e   %rflags = 0x00010206
> 
> 
> where %rax shouldn't be zero. The exception always occurs in the same
> function.
> 
> On Linux, merely trying to load the library results in a kernel memory
> allocation failure: it tries to allocate a pointer of size 140332065189888
> bytes. I doubt that much memory is really needed.

Are you saying that doing:

? (open-shared-library "/path/to/libcfsm.so")

causes CCL to think that a very large amount of memory needs to be allocated?
(If so, what exactly does the error message say?)

In that case, if you do:

? (with-cstrs ((name "/path/to/libcfsm.so"))
     (#_dlopen name (logior #$RTLD_GLOBAL #$RTLD_NOW)))

does that return a non-null pointer, and does the lisp seem generally stable
afterward?  (It'd count as generally stable if you could manually invoke the GC
and compile a few files/functions without getting weird memory errors.)
Or does it lead to a confused memory allocation failure?

OPEN-SHARED-LIBRARY calls #_dlopen and if that returns a non-null
pointer pokes around in some shared-library data structures to find
the canonical name ("soname") of the library and its base address,
then uses that info to create a little structure-like object of type
CCL::SHLIB.  That poking around is a little tricky: there are some
fields in the ELF file's linker section that can be treated as either
absolute addresses or relative offsets, and Linux is a little cavalier
about following the ELF spec with regard to at least one of them (and
isn't always consistent about this between architectures and
releases.) I've seen CCL guess wrong about this, but the symptom of
guessing wrong is almost always a memory fault while trying to find the
address of a NUL-terminated C string; it's a little hard to believe that
misinterpreting things here would cause it to find a NUL-terminated C string
that was apparently several terabytes long ...
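
(The soname that this poking around is supposed to find can also be read
from outside the lisp, which gives something to compare against.  A rough
sketch, assuming GNU binutils' readelf is installed; the two function names
are made up for illustration and aren't part of CCL.)

```shell
# Extract the soname from `readelf -d` output read on standard input.
# The relevant dynamic-section line looks like:
#   0x000000000000000e (SONAME)   Library soname: [libfoo.so.1]
soname_from_dynamic() {
    sed -n 's/.*(SONAME).*\[\(.*\)].*/\1/p'
}

# Print the soname recorded in the shared library named by $1.
# (Hypothetical helper; prints nothing if the library has no SONAME entry.)
library_soname() {
    readelf -d "$1" | soname_from_dynamic
}

# Usage:
#   library_soname /path/to/libcfsm.so
```

If this prints a sensible name, the library's dynamic section at least
contains what OPEN-SHARED-LIBRARY goes looking for.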

#_dlopen tries to map a shared library (and any dependent libraries) into
the caller's address space.  Shared libraries can have initialization functions
that're called (at least) when the library is first loaded and these calls
occur before #_dlopen returns.  I guess that it's at least theoretically possible
that a library init function somehow calls into the CCL kernel and this leads
to the error that you see; that's a little handwavy, but it's actually about
the best guess that I can come up with.

One way in which a library's init routine could "somehow call into the CCL
kernel" is if the kernel defines some symbol (C function name) that the library
considers to be public.  (In other words, that could happen because of a name
conflict.)  If #_dlopen fails, one way to see if this is why is to strip the
lisp kernel of all public symbols via:

$ strip /path/to/lx86cl64

and load the shared library with that (stripped) kernel.
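
(It may also be possible to look for such a name conflict directly.  A rough
sketch, assuming GNU nm on Linux and assuming that the names involved show
up in the dynamic symbol tables of both files; the function names below are
made up for illustration.)

```shell
# Print the names that the ELF file $1 defines in its dynamic symbol
# table, one per line, with any "@VERSION" suffix removed.
defined_dynamic_symbols() {
    nm -D --defined-only "$1" | awk '{ print $NF }' | sed 's/@.*//'
}

# Print every name (defined or undefined) in the dynamic symbol table
# of the ELF file $1, one per line, with any "@VERSION" suffix removed.
all_dynamic_symbols() {
    nm -D "$1" | awk '{ print $NF }' | sed 's/@.*//'
}

# Print the names that occur in both of the files $1 and $2, each of
# which holds a list of symbol names, one per line.
common_symbols() {
    { sort -u "$1"; sort -u "$2"; } | sort | uniq -d
}

# Usage:
#   defined_dynamic_symbols /path/to/lx86cl64   > kernel.syms
#   all_dynamic_symbols     /path/to/libcfsm.so > library.syms
#   common_symbols kernel.syms library.syms
```

Anything this prints is a name that the kernel exports and that the library
also defines or imports, so it's a candidate for the kind of conflict
described above; an empty result makes that explanation less likely.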

If I'm correct in interpreting what you said as "things fail in or soon after
the call to OPEN-SHARED-LIBRARY", then I'd generally want to try and find and
fix the cause of that before worrying about anything else; if it's dying because
an init routine is calling into the CCL kernel due to a name conflict, it's
believable that that could cause a random-looking error on Linux and "just"
incorrect initialization of the library on Darwin.

If I'm misinterpreting what you said, then please ignore the last several
paragraphs.


> 
> I can imagine that this is difficult to make sense of. But maybe you have
> some ideas how one could try to debug the problem. It seems like a bug in
> Clozure.

Yup.  Either that, or something else ...

> 
> Unfortunately, I don't have access to the source code. (I will eventually,
> in half a year or so, but I don't want to wait that long.)
> 
> --
> Paul
> 
> 
>


