[Openmcl-devel] Need advice to debug segfault when running concurrent selects in clsql/postgresql

Gary Byers gb at clozure.com
Tue Oct 29 16:25:29 PDT 2013


The point where the crash occurs - set_n_bits() - is trying to set a number
of bits in a bit vector where that number is proportional to the number of
elements in a lisp vector (or vector-like object).  Such objects are preceded
in memory by a header word, and the typical cause of this crash is something
corrupting that header word (so that the number of elements appears to be ridiculously
large.)
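
To make that concrete, here is a toy model of a header word. The layout is an
assumption chosen for illustration (low 8 bits hold a type tag, the remaining bits
hold the element count), and the names MAKE-HEADER and HEADER-ELEMENT-COUNT are
hypothetical; none of this is taken from the CCL sources.

```lisp
;; Toy model only: assume the low 8 bits of a header word hold a type tag
;; and the remaining bits hold the element count.
(defconstant +tag-bits+ 8)

(defun make-header (element-count tag)
  "Pack ELEMENT-COUNT and TAG into one integer 'header word'."
  (logior (ash element-count +tag-bits+) tag))

(defun header-element-count (header)
  "Extract the element count by shifting the tag bits away."
  (ash header (- +tag-bits+)))

;; A 3-element vector with an arbitrary tag value.
(defparameter *good-header* (make-header 3 #xA6))

;; The same word after a stray store overwrote it with all ones.
(defparameter *bad-header* #xFFFFFFFFFFFFFFFF)

(header-element-count *good-header*)   ; => 3
(header-element-count *bad-header*)    ; => 72057594037927935, i.e. 2^56 - 1
```

Anything that trusts the second count and tries to set that many mark bits will run
off the end of the bit vector long before it finishes.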

That in turn is usually caused by (unsafe) code that stores outside of the bounds of
an object:

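(defvar *a* (make-array 3))   ; valid indices are 0, 1 and 2

(defun foo (a)
   (declare (optimize (speed 3) (safety 0)))   ; safety 0: no bounds check
   ;; Index 3 is one past the end of A.  LARGE-RANDOM-NUMBER stands in
   ;; for any function that returns a big integer.
   (setf (aref a 3) (large-random-number)))

(foo *a*)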


That code will step on whatever happens to be sitting in memory just beyond
the vector *a*.  The clobbering may make it look like the thing that got
clobbered has an impossibly large number of elements, and that leads to the
crash that you're seeing.
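
For contrast, a minimal sketch of a store that cannot run off the end of the vector.
It uses only standard Common Lisp (ARRAY-IN-BOUNDS-P, HANDLER-CASE); the names
CHECKED-STORE and *V* are made up for this example.

```lisp
(defun checked-store (vector index value)
  "Store VALUE at INDEX of VECTOR, refusing indices outside the vector."
  (declare (optimize (safety 3)))
  (unless (array-in-bounds-p vector index)
    (error "Index ~D is out of bounds for a vector of length ~D."
           index (length vector)))
  (setf (aref vector index) value))

(defparameter *v* (make-array 3 :initial-element 0))

(checked-store *v* 2 42)                  ; fine, 2 is the last valid index
(handler-case (checked-store *v* 3 42)    ; one past the end
  (error () :out-of-bounds))              ; => :OUT-OF-BOUNDS
```

When hunting this kind of bug, temporarily recompiling suspect code without
(safety 0) has a similar effect: the bad store is likely to signal an error at the
point where it happens instead of silently damaging a neighboring object.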

Whatever's actually causing the crash for you isn't likely to be as simple
as the code above.  There may be something that causes smaller inconsistencies,
and as things run (including the GC) and chew on memory, those small inconsistencies
become larger.

It can be very hard to find problems like this, and the problem could be anywhere
(in CCL itself, in your code, in any libraries that you use.)  The GC ordinarily
assumes that memory is in a consistent state, but it can be made to do some
additional consistency checks that can often find problems soon after they happen.
The global value of the variable CCL::*GC-EVENT-STATUS-BITS* is ordinarily a
fixnum, and setting bit 2 of that fixnum:

(setq ccl::*gc-event-status-bits* (logior 4 ccl::*gc-event-status-bits*))

causes the GC to run some extra integrity checks before and after it runs.
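
A rough sketch of how that might be wrapped around a suspect piece of code, so that
a check runs right after the code finishes rather than at whatever later time a GC
happens to be triggered. CALL-WITH-GC-INTEGRITY-CHECKS and +GC-INTEGRITY-CHECK-MASK+
are names invented for this example; the only CCL-specific things it relies on are
the variable shown above and CCL:GC, which requests a full GC.

```lisp
;; Bit 2 of the status word, i.e. the value 4.
(defconstant +gc-integrity-check-mask+ (ash 1 2))

#+ccl
(defun call-with-gc-integrity-checks (thunk)
  "Call THUNK with GC integrity checking enabled, then force a full GC.
The bit is cleared again afterwards unless it was already set on entry."
  (let ((was-set (logtest +gc-integrity-check-mask+
                          ccl::*gc-event-status-bits*)))
    (setq ccl::*gc-event-status-bits*
          (logior +gc-integrity-check-mask+ ccl::*gc-event-status-bits*))
    (unwind-protect
         (multiple-value-prog1 (funcall thunk)
           ;; GC now, while the checks are still enabled.
           (ccl:gc))
      (unless was-set
        (setq ccl::*gc-event-status-bits*
              (logandc2 ccl::*gc-event-status-bits*
                        +gc-integrity-check-mask+))))))
```

Calling it as (call-with-gc-integrity-checks (lambda () (run-one-select))), where
RUN-ONE-SELECT is a placeholder for one iteration of the failing test, narrows the
window between the corruption and its detection. The checks add overhead to every
GC, so this is for debugging sessions only.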

It can still be difficult to find the root cause of this kind of memory corruption,
but I don't know of anything else that makes it easier.


On Tue, 29 Oct 2013, Paul Meurer wrote:

> Hi,
>
> I need some advice on how to further debug the following.
>
> I am consistently observing crashes when I do run concurrent database selects using clsql against a PostgreSQL backend. I am running the newest ccl-1.9 64bit on CentOS, the PostgreSQL library advertises itself as being thread safe. Here is the code I am running:
>
> (dotimes (i 16)
>  (ccl:process-run-function
>   (format nil "test~d" i)
>   (lambda (i)
>     (with-database (*default-database* *connection-spec* :if-exists :new)
>       (select [text] :from [text-table] :limit 10000)
>       (print i)))
>   i))
>
> This form can be run several times without problems, but eventually I get a segfault. I tried to debug in gdb, where I see that the crash seems to be GC-related (see below). The crash always happens at the same place in bits.c.
>
> I am aware that this is a complex scenario, where either the db lib, or uffi/clsql, or clozure could be the culprit, and it does not seem to be trivial to boil this down to a minimal case. So I would be grateful if somebody could give me some advice as to what would be the most promising way of nailing down this bug.
>
> ----------
>
> ? Unhandled exception 11 at 0x412360, context->regs at #x7f3ea52ed538
> Exception occurred while executing foreign code
> received signal 11; faulting address: 0x307e3f94d000
> invalid permissions for mapped object
> ?
>
> and in gdb:
>
> (gdb) br *0x0000000000412360
> Breakpoint 2 at 0x412360: file ../bits.c, line 45.
> (gdb) continue
> Continuing.
> [Switching to Thread 0x7f3ea52ef700 (LWP 3974)]
>
> Breakpoint 2, set_n_bits (bits=<value optimized out>,
>    first=<value optimized out>, n=<value optimized out>) at ../bits.c:45
> 45	        *wstart++ = ALL_ONES;
> 1: x/i $pc
> => 0x412360 <set_n_bits+112>:	movq   $0xffffffffffffffff,(%rax)
> (gdb) bt
> #0  set_n_bits (bits=<value optimized out>, first=<value optimized out>,
>    n=<value optimized out>) at ../bits.c:45
> #1  0x000000000041111c in rmark (n=52914162892765) at ../x86-gc.c:770
> #2  0x00000000004116fd in mark_root (n=<value optimized out>) at ../x86-gc.c:516
> #3  0x0000000000411b05 in mark_ephemeral_root (n=<value optimized out>)
>    at ../x86-gc.c:650
> #4  0x000000000040bfa2 in mark_memoized_area (a=0x1e926e0,
>    num_memo_dnodes=10288289) at ../gc-common.c:1473
> #5  0x000000000040d9f0 in gc (tcr=<value optimized out>,
>    param=<value optimized out>) at ../gc-common.c:1688
> #6  0x0000000000412c9b in gc_from_tcr (tcr=<value optimized out>,
>    param=<value optimized out>) at ../x86-exceptions.c:2924
> #7  0x0000000000413358 in gc_like_from_xp (xp=<value optimized out>,
>    fun=0x412c70 <gc_from_tcr>, param=0) at ../x86-exceptions.c:2881
> #8  0x000000000041341e in gc_from_xp (xp=<value optimized out>,
>    param=<value optimized out>) at ../x86-exceptions.c:2936
> #9  0x0000000000414ad1 in allocate_object (xp=0x7f3ea52ee440, bytes_needed=32,
>    disp_from_allocptr=19, tcr=0x7f3ea52ef570,
>    crossed_threshold=<value optimized out>) at ../x86-exceptions.c:204
> #10 0x0000000000414b9d in handle_alloc_trap (xp=0x7f3ea52ee440,
>    tcr=0x7f3ea52ef570, notify=0x7f3ea52ee1cc) at ../x86-exceptions.c:644
> #11 0x0000000000415552 in handle_exception (signum=<value optimized out>,
>    info=<value optimized out>, context=0x7f3ea52ee440,
>    tcr=<value optimized out>, old_valence=<value optimized out>)
>    at ../x86-exceptions.c:1193
> #12 0x00000000004157fa in signal_handler (signum=11, info=0x7f3ea52ee7f0,
>    context=0x7f3ea52ee440) at ../x86-exceptions.c:1466
> #13 <signal handler called>
> #14 0x0000302000bdca65 in ?? ()
> #15 0x0000000000000052 in ?? ()
>
> -- 
> Best wishes,
> Paul


