[Openmcl-devel] Need advice to debug segfault when running concurrent selects in clsql/postgresql

Paul Meurer paul.meurer at mac.com
Wed Oct 30 14:42:59 PDT 2013


On 30.10.2013 at 21:59, Gary Byers <gb at clozure.com> wrote:

> On Wed, 30 Oct 2013, Paul Meurer wrote:
> [...]
> 
> I couldn't get the test case below to fail.  I tried a fairly current version
> of the CCL trunk and 1.9, and tried with and without :consing on *FEATURES*.
> I was using a Core i7 laptop with 4 true cores and hyperthreading (so it
> looks and behaves mostly like an 8-core machine.)

I can confirm that: on a Core i7 MacBook with 4 cores and CCL 1.7 (which I happen to have on that machine), the test case does not fail for me either.

> Do things behave differently for you if you run without loading your init
> file ("$ ccl[64] -n" or "$ ccl[64] --no-init") ?  If you use SLIME, do
> they behave differently if you run CCL in the shell ?

I have now run it with --no-init and in the shell, and it makes no difference: immediate failure with :consing in *features*, and bogus objects etc. after several rounds without :consing.

My machine has 16 true cores and hyperthreading; I am running CentOS 6.0, and a recent CCL 1.9 (I did svn update + rebuild of everything yesterday).

I also observed that the problem goes away when I replace the constant string in the library by a freshly allocated string:

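#include <stdlib.h>

/* Returns a freshly allocated, NUL-terminated string of 100 'a' characters.
   The caller owns the buffer. */
char *getstring() {
  int index;
  char *buffer = (char *)calloc(100 + 1, sizeof(char));
  if (buffer == NULL)
    return NULL;
  for (index = 0; index < 100; index++) {
    buffer[index] = 'a';
  }
  buffer[100] = '\0';
  return buffer;
}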

One would expect the strings in the Postgres library to be freshly allocated, yet they behave like the constant-string example.
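
To make the difference between the two variants concrete, here is a minimal sketch that puts them side by side. The names getstring_constant, getstring_fresh, constant_address_is_stable and fresh_addresses_differ are hypothetical and only serve this illustration; they are not part of threadtest.c. The sketch only shows that the constant variant hands out the same static address on every call, while the allocating variant hands out a distinct heap buffer each time. It does not claim to explain the crash.

```c
#include <stdlib.h>
#include <string.h>

/* Variant 1: returns a pointer to a string literal, i.e. to an object
   with static storage duration inside the library image. */
const char *getstring_constant(void) {
    return "asdfasdfasdfasdfasdfasdfasdfasdf";
}

/* Variant 2: returns a fresh heap buffer holding n 'a' characters.
   The caller owns the buffer and must free() it. */
char *getstring_fresh(size_t n) {
    char *buffer = calloc(n + 1, sizeof(char));
    if (buffer == NULL)
        return NULL;
    memset(buffer, 'a', n);   /* calloc already wrote the trailing NUL */
    return buffer;
}

/* Hypothetical helper: 1 if two calls to the constant variant
   return the same address, 0 otherwise. */
int constant_address_is_stable(void) {
    return getstring_constant() == getstring_constant();
}

/* Hypothetical helper: 1 if two live buffers from the allocating
   variant have different addresses, 0 otherwise. */
int fresh_addresses_differ(void) {
    char *a = getstring_fresh(100);
    char *b = getstring_fresh(100);
    int differ = (a != NULL && b != NULL && a != b);
    free(a);
    free(b);
    return differ;
}
```

In the allocating variant every caller, including every Lisp thread, gets its own buffer, so no two callers ever read through the same foreign pointer. Note that the Lisp side would have to free those buffers to avoid leaking them.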


>> 
>> I did do several other tests:
>> 
>> The code runs without problems both in ACL 9.0-smp, and in SBCL (both 64 bit Linux, 16 cores).
>> 
>> Then I tried to build a minimal example that doesn't use any extra libraries, just plain CCL and a very basic C lib, and I think I somehow succeeded.
>> 
>> Here is the C library:
>> 
>> ---
>> // threadtest.c
>> // Compilation:
>> // gcc -Bsymbolic -shared threadtest.c -fPIC -L/usr/local/lib -o threadtest.so;
>> #include <stdio.h>
>> 
>> char *getstring() { return "asdfasdfasdfasdfasdfasdfasdfasdf" ; }
>> ---
>> 
>> and here is the Lisp code:
>> 
>> ---
>> (open-shared-library "threadtest.so")
>> 
>> (defun getstring () (%get-cstring (external-call "getstring" :address)))
>> 
>> (dotimes (i 16)
>>   (process-run-function
>>    (format nil "getstring~a" i)
>>    (lambda (i)
>>      (let ((list ())
>>            (size 0))
>>        (dotimes (i 100000)
>>          #-consing
>>          (incf size (length (getstring)))
>>          #+consing
>>          (push (getstring) list))))
>>    i))
>> ---
>> 
>> Again, the equivalent code (using cffi) runs fine in Allegro, and crashes in CCL. It crashes immediately with the consing variant, and only on the second or third run with the non- (or less)-consing variant. The crashes are of the same type as with the Postgres lib.
>> 
>> You might object that my simple-minded library is not threaded. If this is a valid objection, I will try to write a threaded lib with a dedicated thread/connection for each lisp process. (I still have to learn how this is done.) Yet I am wondering why I don't see similar behavior in ACL and SBCL.
>> 
>> - Paul
>> 
>>> It can still be difficult to find the root cause of this kind of memory corruption,
>>> but I don't know of anything else that makes it easier.
>>> 
>>> 
>>> On Tue, 29 Oct 2013, Paul Meurer wrote:
>>> 
>>>> Hi,
>>>> 
>>>> I need some advice on how to further debug the following.
>>>> 
>>>> I am consistently observing crashes when I run concurrent database selects using clsql against a PostgreSQL backend. I am running the newest 64-bit ccl-1.9 on CentOS, and the PostgreSQL library advertises itself as being thread-safe. Here is the code I am running:
>>>> 
>>>> (dotimes (i 16)
>>>>   (ccl:process-run-function
>>>>    (format nil "test~d" i)
>>>>    (lambda (i)
>>>>      (with-database (*default-database* *connection-spec* :if-exists :new)
>>>>        (select [text] :from [text-table] :limit 10000)
>>>>        (print i)))
>>>>    i))
>>>> 
>>>> This form can be run several times without problems, but eventually I get a segfault. I tried to debug in gdb, where I see that the crash seems to be GC-related (see below). The crash always happens at the same place in bits.c.
>>>> 
>>>> I am aware that this is a complex scenario, where either the db lib, or uffi/clsql, or clozure could be the culprit, and it does not seem to be trivial to boil this down to a minimal case. So I would be grateful if somebody could give me some advice as to what would be the most promising way of nailing down this bug.
>>>> 
>>>> ----------
>>>> 
>>>> ? Unhandled exception 11 at 0x412360, context->regs at #x7f3ea52ed538
>>>> Exception occurred while executing foreign code
>>>> received signal 11; faulting address: 0x307e3f94d000
>>>> invalid permissions for mapped object
>>>> ?
>>>> 
>>>> and in gdb:
>>>> 
>>>> (gdb) br *0x0000000000412360
>>>> Breakpoint 2 at 0x412360: file ../bits.c, line 45.
>>>> (gdb) continue
>>>> Continuing.
>>>> [Switching to Thread 0x7f3ea52ef700 (LWP 3974)]
>>>> 
>>>> Breakpoint 2, set_n_bits (bits=<value optimized out>,
>>>>  first=<value optimized out>, n=<value optimized out>) at ../bits.c:45
>>>> 45	        *wstart++ = ALL_ONES;
>>>> 1: x/i $pc
>>>> => 0x412360 <set_n_bits+112>:	movq   $0xffffffffffffffff,(%rax)
>>>> (gdb) bt
>>>> #0  set_n_bits (bits=<value optimized out>, first=<value optimized out>,
>>>>  n=<value optimized out>) at ../bits.c:45
>>>> #1  0x000000000041111c in rmark (n=52914162892765) at ../x86-gc.c:770
>>>> #2  0x00000000004116fd in mark_root (n=<value optimized out>) at ../x86-gc.c:516
>>>> #3  0x0000000000411b05 in mark_ephemeral_root (n=<value optimized out>)
>>>>  at ../x86-gc.c:650
>>>> #4  0x000000000040bfa2 in mark_memoized_area (a=0x1e926e0,
>>>>  num_memo_dnodes=10288289) at ../gc-common.c:1473
>>>> #5  0x000000000040d9f0 in gc (tcr=<value optimized out>,
>>>>  param=<value optimized out>) at ../gc-common.c:1688
>>>> #6  0x0000000000412c9b in gc_from_tcr (tcr=<value optimized out>,
>>>>  param=<value optimized out>) at ../x86-exceptions.c:2924
>>>> #7  0x0000000000413358 in gc_like_from_xp (xp=<value optimized out>,
>>>>  fun=0x412c70 <gc_from_tcr>, param=0) at ../x86-exceptions.c:2881
>>>> #8  0x000000000041341e in gc_from_xp (xp=<value optimized out>,
>>>>  param=<value optimized out>) at ../x86-exceptions.c:2936
>>>> #9  0x0000000000414ad1 in allocate_object (xp=0x7f3ea52ee440, bytes_needed=32,
>>>>  disp_from_allocptr=19, tcr=0x7f3ea52ef570,
>>>>  crossed_threshold=<value optimized out>) at ../x86-exceptions.c:204
>>>> #10 0x0000000000414b9d in handle_alloc_trap (xp=0x7f3ea52ee440,
>>>>  tcr=0x7f3ea52ef570, notify=0x7f3ea52ee1cc) at ../x86-exceptions.c:644
>>>> #11 0x0000000000415552 in handle_exception (signum=<value optimized out>,
>>>>  info=<value optimized out>, context=0x7f3ea52ee440,
>>>>  tcr=<value optimized out>, old_valence=<value optimized out>)
>>>>  at ../x86-exceptions.c:1193
>>>> #12 0x00000000004157fa in signal_handler (signum=11, info=0x7f3ea52ee7f0,
>>>>  context=0x7f3ea52ee440) at ../x86-exceptions.c:1466
>>>> #13 <signal handler called>
>>>> #14 0x0000302000bdca65 in ?? ()
>>>> #15 0x0000000000000052 in ?? ()
>>>> 
>>>> --
>>>> Best wishes,
>>>> Paul
>>>> 
>>>> _______________________________________________
>>>> Openmcl-devel mailing list
>>>> Openmcl-devel at clozure.com
>>>> http://clozure.com/mailman/listinfo/openmcl-devel
>>>> 
>>>> 
>> 
>> -- 
>> Paul
>> 
>> 

-- 
Paul
