[Openmcl-devel] Need advice to debug segfault when running concurrent selects in clsql/postgresql

Wed Oct 30 08:14:24 PDT 2013

> The point where the crash occurs - set_n_bits() - is trying to set a number
> of bits in a bit vector where that number is proportional to the number of
> elements in a lisp vector (or vector-like object).  Such objects are preceded
> in memory by a header word, and the typical cause of this crash is something
> corrupting that header word (so that number of elements appears to be ridiculously
> large.)
> 
> That in turn is usually caused by (unsafe) code that stores outside of the bounds of
> an object:
> 
> (defvar *a* (make-array 3))
> 
> (defun foo (a)
>  (declare (optimize (speed 3) (safety 0)))
>  (setf (aref a 3) (large-random-number)))
> 
> (foo *a*)
> 
> 
> That code will step on whatever happens to be sitting in memory just beyond
> the vector *a*, and the clobbering may make it look like the thing that got
> clobbbered has an impossibly large number of elements and that leads to the
> crash that you're seeing.
> 
> Whatever's actually causing the crash for you isn't likely to be as simple
> as the code above.  There may be something that causes smaller inconsistencies,
> and as things run (including the GC) and chew on memory, those small inconsitencies
> become larger.

Thanks for the explanations, Gary.

> It can be very hard to find problems like this, and the problem could be anywhere
> (in CCL itself, in your code, in an libraries that you use.)  The GC ordinarily
> assumes that memory is in a consistent state, but it can be made to do some additional consistency checks that can often find problems soon after they happen.
> The global value of the variable CCL::*LISP-EVENT-STATUS-BITS* is ordinarily a
> fixnum, and setting bit 2 of that fixnum:
> 
> (setq ccl::*gc-event-status-bits* (logior 4 ccl::*gc-event-status-bits*))
> 
> causes the GC to run some extra integrity checks before and after it runs.

I tried this, but didn't see anything new.

I did do several other tests:

The code runs without problems both in ACL 9.0-smp, and in SBCL (both 64 bit Linux, 16 cores).

Then I tried to build a minimal example that doesn't use any extra libraries, just plain CCL and a very basic C lib, and I think I somehow succeeded.

Here is the C library:

---
// threadtest.c
// Compilation:
// gcc -Bsymbolic -shared threadtest.c -fPIC -L/usr/local/lib -o threadtest.so;
#include <stdio.h>

char *getstring() { return "asdfasdfasdfasdfasdfasdfasdfasdf" ; }
---

and here the Lisp code.

---
(open-shared-library "threadtest.so")

(defun getstring () (%get-cstring (external-call "getstring" :address)))

(dotimes (i 16)
  (process-run-function
   (format nil "getstring~a" i)
   (lambda (i)
     (let ((list ())
	   (size 0))
       (dotimes (i 100000)
	 #-consing
	 (incf size (length (getstring)))
	 #+consing
	 (push (getstring) list))))
   i))
---

Again, the equivalent code (using cffi) runs fine in Allegro, and crashes in CCL. It crashes immediately with the consing variant, and only on the second or third run with the non- (or less)-consing variant. The crashes are of the same type as with the Postgres lib.

You might object that my simple-minded library is not threaded. If this is a valid objection, I will try to write a threaded lib with a dedicated thread/connection for each lisp process. (I still have to learn how this is done.) Yet I am wondering why I don't see similar behavior in ACL and SBCL.

- Paul

> It can still be difficult to find the root cause of this kind of memory corruption,
> but I don't know of anything else that makes it easier.
> 
> 
> On Tue, 29 Oct 2013, Paul Meurer wrote:
> 
>> Hi,
>> 
>> I need some advice on how to further debug the following.
>> 
>> I am consistently observing crashes when I do run concurrent database selects using clsql against a PostgreSQL backend. I am running the newest ccl-1.9 64bit on CentOS, the PostgreSQL library advertises itself as being thread safe. Here is the code I am running:
>> 
>> (dotimes (i 16)
>> (ccl:process-run-function
>>  (format nil "test~d" i)
>>  (lambda (i)
>>    (with-database (*default-database* *connection-spec* :if-exists :new)
>>      (select [text] :from [text-table] :limit 10000)
>>      (print i)))
>>  i))
>> 
>> This form can be run several times without problems, but eventually I get a segfault. I tried to debug in gdb, where I see that the crash seems to be GC-related (see below). The crash always happens at the same place in bits.c.
>> 
>> I am aware that this is a complex scenario, where either the db lib, or uffi/clsql, or clozure could be the culprit, and it does not seem to be trivial to boil this down to a minimal case. So I would be grateful if somebody could give me some advice as to what would be the most promising way of nailing down this bug.
>> 
>> ----------
>> 
>> ? Unhandled exception 11 at 0x412360, context->regs at #x7f3ea52ed538
>> Exception occurred while executing foreign code
>> received signal 11; faulting address: 0x307e3f94d000
>> invalid permissions for mapped object
>> ?
>> 
>> and in gdb:
>> 
>> (gdb) br *0x0000000000412360
>> Breakpoint 2 at 0x412360: file ../bits.c, line 45.
>> (gdb) continue
>> Continuing.
>> [Switching to Thread 0x7f3ea52ef700 (LWP 3974)]
>> 
>> Breakpoint 2, set_n_bits (bits=<value optimized out>,
>>   first=<value optimized out>, n=<value optimized out>) at ../bits.c:45
>> 45	        *wstart++ = ALL_ONES;
>> 1: x/i $pc
>> => 0x412360 <set_n_bits+112>:	movq   $0xffffffffffffffff,(%rax)
>> (gdb) bt
>> #0  set_n_bits (bits=<value optimized out>, first=<value optimized out>,
>>   n=<value optimized out>) at ../bits.c:45
>> #1  0x000000000041111c in rmark (n=52914162892765) at ../x86-gc.c:770
>> #2  0x00000000004116fd in mark_root (n=<value optimized out>) at ../x86-gc.c:516
>> #3  0x0000000000411b05 in mark_ephemeral_root (n=<value optimized out>)
>>   at ../x86-gc.c:650
>> #4  0x000000000040bfa2 in mark_memoized_area (a=0x1e926e0,
>>   num_memo_dnodes=10288289) at ../gc-common.c:1473
>> #5  0x000000000040d9f0 in gc (tcr=<value optimized out>,
>>   param=<value optimized out>) at ../gc-common.c:1688
>> #6  0x0000000000412c9b in gc_from_tcr (tcr=<value optimized out>,
>>   param=<value optimized out>) at ../x86-exceptions.c:2924
>> #7  0x0000000000413358 in gc_like_from_xp (xp=<value optimized out>,
>>   fun=0x412c70 <gc_from_tcr>, param=0) at ../x86-exceptions.c:2881
>> #8  0x000000000041341e in gc_from_xp (xp=<value optimized out>,
>>   param=<value optimized out>) at ../x86-exceptions.c:2936
>> #9  0x0000000000414ad1 in allocate_object (xp=0x7f3ea52ee440, bytes_needed=32,
>>   disp_from_allocptr=19, tcr=0x7f3ea52ef570,
>>   crossed_threshold=<value optimized out>) at ../x86-exceptions.c:204
>> #10 0x0000000000414b9d in handle_alloc_trap (xp=0x7f3ea52ee440,
>>   tcr=0x7f3ea52ef570, notify=0x7f3ea52ee1cc) at ../x86-exceptions.c:644
>> #11 0x0000000000415552 in handle_exception (signum=<value optimized out>,
>>   info=<value optimized out>, context=0x7f3ea52ee440,
>>   tcr=<value optimized out>, old_valence=<value optimized out>)
>>   at ../x86-exceptions.c:1193
>> #12 0x00000000004157fa in signal_handler (signum=11, info=0x7f3ea52ee7f0,
>>   context=0x7f3ea52ee440) at ../x86-exceptions.c:1466
>> #13 <signal handler called>
>> #14 0x0000302000bdca65 in ?? ()
>> #15 0x0000000000000052 in ?? ()
>> 
>> -- 
>> Best wishes,
>> Paul
>> 
>> _______________________________________________
>> Openmcl-devel mailing list
>> Openmcl-devel at clozure.com
>> http://clozure.com/mailman/listinfo/openmcl-devel
>> 
>> 

-- 
Paul