[Openmcl-devel] Advice needed to debug shared library problem on Mac and Linux

Tue Mar 29 08:17:57 PDT 2011

So (hopefully without divulging any secrets) we can note that you make
2 foreign function calls into this library and die in the second.

The first call takes no arguments and returns a pointer to some data structure
that the library uses.  The second call takes two arguments: a C string pointer
to a filename (which presumably contains a textual definition of an FSM) and
the pointer returned by the first call.  It should return another pointer,
but it dies in foreign code with a memory fault.

There isn't anything unusual or complicated about either of those ff-calls,
and it'd be pretty hard to believe that any of cffi/uffi/the native FFI got
something wrong here.  That second ff-call was the last time that Lisp had
any direct influence on what was going on, and if we assume that the foreign
function got reasonable arguments then the reason that it would crash under
CCL and not under other implementations would be "environmental factors".  One
environmental factor that I'd be suspicious of in this case is the size of
the stack that the foreign code is running on.

CCL doesn't try to detect or recover from stack overflow in foreign code.
Recovery's a very hard problem in this case, but even if we punt on it we could
do a better job of identifying the cause of a memory fault as being "stack
overflow".  ("You know, you dropped into the kernel debugger with some sort
of memory fault, but now that I think about it the faulting address is very
near the stack pointer and the stack pointer's a ways past the lowest address
in the stack, so let's call that 'stack overflow'.")

If I call a little C function that does infinite recursion on a fairly
up-to-date 64-bit Linux system, I get:

? (external-call "recurse" :int 0 :int)
Unhandled exception 11 at 0x7f4f69e5356c, context->regs at #x7f4f6a4d13d8
Exception occurred while executing foreign code
  at recurse + 16
received signal 11; faulting address: 0x7f4f6a06dff8
invalid permissions for mapped object
? for help
[7699] Clozure CL kernel debugger: r
%rax = 0x0000000000000000      %r8  = 0x0000000000000000
%rcx = 0x00007f4f69e53300      %r9  = 0x00007f4f6a4d1b3d
%rdx = 0x0000000000000000      %r10 = 0x000030000010e47c
%rbx = 0x00007f4f6a4d1b10      %r11 = 0x00007f4f69e5355c
%rsp = 0x00007f4f6a06e000      %r12 = 0x00007f4f6a4d1acd
%rbp = 0x00007f4f6a06e010      %r13 = 0x000030000010d4df
%rsi = 0x0000000000000000      %r14 = 0x000000000001300b
%rdi = 0x0000000000000000      %r15 = 0x0000000000000000
%rip = 0x00007f4f69e5356c   %rflags = 0x00010206

Note that the stack pointer (%rsp) happens to be right on a page boundary
and the faulting address is 8 bytes below that (0x7f4f6a06e000 - 8 =
0x7f4f6a06dff8).  If we look at the current thread's stack's bounds:

[7699] Clozure CL kernel debugger: t
Current Thread Context Record (tcr) = 0x7f4f6a4d2570
Control (C) stack area:  low = 0x7f4f6a27f000, high = 0x7f4f6a4d3000
Value (lisp) stack area: low = 0x7f4f6a054000, high = 0x7f4f6a26e000
Exception stack pointer = 0x7f4f6a06e000

we can see that the stack pointer's a bit past (i.e., below) the low bound of
the C stack area.  (In other words, we've overflowed the stack.)
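
The "near the stack pointer, and the stack pointer's past the low bound" test
can be written down as a small predicate.  This is only a sketch of the
heuristic, not anything the CCL kernel debugger actually does; the function
name and the 4KB page size are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4KB pages.  Real code would ask the OS for the page size. */
#define ASSUMED_PAGE_SIZE 4096u

/* Hypothetical helper: returns true when a memory fault looks like a C
   stack overflow, i.e. the stack pointer has dropped below the low bound
   of the stack area and the faulting address is within a page of it. */
bool looks_like_stack_overflow(uint64_t fault_addr, uint64_t sp,
                               uint64_t stack_low, uint64_t stack_high)
{
    uint64_t distance;

    assert(stack_low < stack_high);
    /* Absolute distance between the faulting address and the stack pointer. */
    distance = fault_addr > sp ? fault_addr - sp : sp - fault_addr;
    return sp < stack_low && distance <= ASSUMED_PAGE_SIZE;
}
```

Fed the numbers from the dump above (fault at 0x7f4f6a06dff8, %rsp at
0x7f4f6a06e000, C stack low at 0x7f4f6a27f000), the predicate says "stack
overflow"; a fault at some unrelated address with %rsp inside the C stack
bounds does not qualify.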

A stack overflow will generally cause a fault like this when the stack pointer
eventually runs into an unmapped or write-protected page.  It can cause other
kinds of misbehavior if the stack pointer runs into some other data and starts
scribbling over it.  (Who knows ?  That other data might actually be important ...)

In your original message, the register dump showed the stack pointer to be very
near a page boundary; the faulting address wasn't reported correctly on Darwin.
If you can get it to die with a memory fault, you might want to see if the
value of the stack pointer, the faulting address, and the C stack bounds as
reported by the T command all seem to suggest a stack overflow.

This is a plausible environmental factor because threads' stacks are likely
somewhat smaller in CCL than in other implementations.  The initial thread's
stack's size is determined by the "stacksize" resource limit; that seems to
be about 8MB on Darwin and 10MB on Linux, but CCL will ordinarily only try
to use about 1-2MB of that; threads other than the initial one will generally
use the smaller size by default.
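
If you want to see what the "stacksize" resource limit is on a given machine,
something like the following should do it (the shell's "ulimit -s" reports the
same limit, in kilobytes).  The function names here are made up for the
example; getrlimit() and RLIMIT_STACK are the standard POSIX interface.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: fetch the soft "stacksize" limit in bytes.
   Returns 0 on success, -1 on failure.  *unlimited is set to 1 (and
   *bytes to 0) when the limit is RLIM_INFINITY. */
int stack_soft_limit(unsigned long long *bytes, int *unlimited)
{
    struct rlimit rl;

    assert(bytes != NULL && unlimited != NULL);
    if (getrlimit(RLIMIT_STACK, &rl) != 0)
        return -1;
    *unlimited = (rl.rlim_cur == RLIM_INFINITY);
    *bytes = *unlimited ? 0 : (unsigned long long)rl.rlim_cur;
    return 0;
}

/* Print the limit in a human-readable form. */
void print_stack_limit(void)
{
    unsigned long long bytes;
    int unlimited;

    if (stack_soft_limit(&bytes, &unlimited) != 0)
        printf("could not read the stacksize limit\n");
    else if (unlimited)
        printf("stacksize limit: unlimited\n");
    else
        printf("stacksize limit: %llu bytes\n", bytes);
}
```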

The -Z command line option affects the size of the listener thread's stack;
starting the lisp via

$ ccl64 -Z 10M

would cause the listener thread's C stack to be about as large as the initial
thread's would ordinarily be on Linux.  If your test case runs in that environment,
then I think that we've found the environmental factor that prevents it from
working by default.
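
The same effect can be seen outside of lisp.  The sketch below runs a
stack-hungry function on a pthread whose stack size the caller picks.  It is
not how CCL implements -Z, and burn_stack is just a hypothetical stand-in for
the library call; it only illustrates that whether deep recursion survives
depends on how big a stack the calling thread was given.  (Older toolchains
may need -pthread to build it.)

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a stack-hungry foreign function: each level
   keeps a 1KB buffer alive, so depth N needs roughly N KB of stack. */
static long burn_stack(long depth)
{
    volatile char pad[1024];

    pad[0] = (char)depth;
    if (depth == 0)
        return 0;
    return 1 + burn_stack(depth - 1);
}

struct job { long depth; long result; };

static void *job_main(void *arg)
{
    struct job *j = arg;

    j->result = burn_stack(j->depth);
    return NULL;
}

/* Run burn_stack(depth) on a new thread with a stack of stack_bytes.
   Returns the recursion result (equal to depth), or -1 if the thread
   could not be set up.  Too small a stack will simply crash, which is
   the point of the exercise. */
long run_with_stack(size_t stack_bytes, long depth)
{
    pthread_attr_t attr;
    pthread_t tid;
    struct job j = { depth, -1 };
    int ok;

    assert(depth >= 0);
    if (pthread_attr_init(&attr) != 0)
        return -1;
    ok = pthread_attr_setstacksize(&attr, stack_bytes) == 0
         && pthread_create(&tid, &attr, job_main, &j) == 0;
    pthread_attr_destroy(&attr);
    if (!ok)
        return -1;
    if (pthread_join(tid, NULL) != 0)
        return -1;
    return j.result;
}
```

With a 16MB stack, a depth of 4096 (a bit over 4MB of frames) completes
normally; shrink the stack well below what the recursion needs and the thread
dies with a memory fault, much like the one described above.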

On Tue, 29 Mar 2011, Paul Meurer wrote:

>       Is the library publicly available ?  Googling found an API reference
>       document, but I didn't see a link to the library/headers anywhere.
> 
> 
> It is not (yet). They plan to make it open source, but that's a lengthy
> process involving lawyers. I can ask if I am allowed to make it available to
> you if necessary. Lauri Karttunen, the author of the lib, should be
> interested in getting this debugged.
>
>       Whether the library's freely available or not, what does the C
>       prototype for the function that you're trying to call look like ?
>       What does the lisp code that you're using to make that call look
>       like ?  Are you just using the CCL FFI, or a portability package
>       like cffi or uffi ?
> 
> 
> I am using cffi. I also tested uffi, with the same result. I'll send you a
> test case and the header file in a separate mail.
> 
> --
> Paul
> 
> 
>