[Openmcl-devel] Need advice to debug segfault when running concurrent selects in clsql/postgresql

Thu Nov 7 10:06:15 PST 2013

I didn't have any luck reproducing the problem on that Xeon (or on another
one that rme pointed out we had), but I think that I found a way to make
the problem occur much more reliably than it has been.

What I've been doing (based on your code) is something like:

(dotimes (i n)  ; n is some large number
   (dotimes (j m) ; m is proportional to the number of cores
    (process-run-function ...)))

where each thread does 100000 iterations of a simple foreign function call.

What generally happens here is that we create threads faster than they can
run to completion; I think that the number of active threads gets up into
the 100s in some cases, and the term "active" is a bit of a misnomer, since
most of them sit idle while the few that get scheduled do enough consing to
trigger the GC, where we spend most of our time.

I changed the code to:

(let* ((sem (make-semaphore)))
   (dotimes (i n)
     (dotimes (j m)
       (process-run-function whatever
         (lambda ()
           (dotimes (k 100000)
             ..)
           (signal-semaphore sem))))
     (dotimes (j m) (wait-on-semaphore sem))))

i.e., to create M threads on each iteration of the loop and wait for them to
run to completion before creating more.  Those M threads should spend most
of their time running (and entering or returning from foreign function calls
when the GC runs), and that seems to trigger the problem more reliably.
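
For concreteness, the loop above might be filled in roughly as follows. This
is only a sketch: the function name RUN-BATCHES, its arguments, the thread
names, and the use of getpid as the "simple foreign function call" are all
assumptions made for illustration, not the actual test code, and it assumes
CCL on Linux.

```lisp
;;; Sketch only.  RUN-BATCHES and the getpid call are illustrative
;;; assumptions; substitute whatever foreign function is under test.
(defun run-batches (n m &optional (iterations 100000))
  "Run N batches of M threads, waiting for each batch to finish before
starting the next.  Returns the number of threads that ran to completion."
  (let ((sem (ccl:make-semaphore))
        (completed 0))
    (dotimes (i n)
      ;; Start M threads that spend nearly all of their time entering
      ;; and returning from a trivial foreign function.
      (dotimes (j m)
        (ccl:process-run-function
         (format nil "ff-call-~d-~d" i j)
         (lambda ()
           (dotimes (k iterations)
             (ccl:external-call "getpid" :int))
           (ccl:signal-semaphore sem))))
      ;; Block until every thread in this batch has signalled.
      (dotimes (j m)
        (ccl:wait-on-semaphore sem)
        (incf completed)))
    completed))
```

Something like (run-batches 1000 8) would then keep about 8 threads busy at a
time; making M proportional to the number of cores is the idea, and the
numbers here are arbitrary.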

I'll try to look at this as time permits. I don't know how long it'll take
me to see the problem when I do, but I think that I can at least reproduce
it much more reliably than I've been able to.

On Mon, 4 Nov 2013, Gary Byers wrote:

> Thanks.
>
> Coincidentally, clozure.com crashed today; the hosting service moved it to
> a new machine (an 8-core Xeon).  Hmmm ...
>
>
> On Mon, 4 Nov 2013, Paul Meurer wrote:
>
>> I did do the experiment you proposed.
>> 
>> On the older Xeon 4-core machine, the crashes become somewhat less
>> frequent, but this might be insignificant because I didn't run enough
>> iterations. A crash occurs no more often than every 50th iteration on
>> average, perhaps not often enough for convenient debugging.
>> 
>> Here are the specs:
>> 
>> processor       : 0
>> vendor_id       : GenuineIntel
>> cpu family      : 6
>> model           : 15
>> model name      : Intel(R) Xeon(R) CPU            5140  @ 2.33GHz
>> stepping        : 6
>> cpu MHz         : 2327.528
>> cache size      : 4096 KB
>> physical id     : 0
>> siblings        : 2
>> core id         : 0
>> cpu cores       : 2
>> fpu             : yes
>> fpu_exception   : yes
>> cpuid level     : 10
>> wp              : yes
>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
>>                   cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
>>                   syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2
>>                   cx16 xtpr lahf_lm
>> bogomips        : 4659.22
>> clflush size    : 64
>> cache_alignment : 64
>> address sizes   : 36 bits physical, 48 bits virtual
>> power management:
>> 
>> 
>> On the 16-core machine (16 inc. hyperthreading), nothing seems to have
>> changed. The latter has two CPUs with these specs:
>> 
>> Intel Xeon E5 4-Core - E5-2643 3.30GHz 10MB LGA2011 8.0GT/
>> 
>> or from /proc/cpuinfo:
>> 
>> processor       : 0
>> vendor_id       : GenuineIntel
>> cpu family      : 6
>> model           : 45
>> model name      : Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz
>> stepping        : 7
>> cpu MHz         : 3301.000
>> cache size      : 10240 KB
>> physical id     : 0
>> siblings        : 8
>> core id         : 0
>> cpu cores       : 4
>> apicid          : 0
>> initial apicid  : 0
>> fpu             : yes
>> fpu_exception   : yes
>> cpuid level     : 13
>> wp              : yes
>> flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov
>>                   pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
>>                   syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon
>>                   pebs bts rep_good xtopology nonstop_tsc aperfmperf pni
>>                   pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16
>>                   xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt aes xsave avx
>>                   lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi
>>                   flexpriority ept vpid
>> bogomips        : 6583.99
>> clflush size    : 64
>> cache_alignment : 64
>> address sizes   : 46 bits physical, 48 bits virtual
>> power management:
>> 
>> I also ran the code on an AMD Opteron machine, but no crashes occurred
>> there, as far as I can see (after 50 iterations).
>>
>>       I tried running your example in an infinite loop on another Core-i7
>>       machine.  After about an hour, it crashed in the way that you
>>       describe.  I poked around a bit in GDB but wasn't sure what I was
>>       seeing; the C code in the CCL kernel (including the GC) is usually
>>       compiled with the -O2 option, which makes it run faster but makes
>>       debugging harder.
>>
>>       I figured that things would go faster if debugging was easier, so I
>>       rebuilt the kernel without -O2 and tried again.  It's been running
>>       for over 24 hours at this point without incident.
>>
>>       Aside from being yet another example of the famous Heisenbug
>>       Uncertainty Principle, this suggests that how the C code is compiled
>>       (by what version of what compiler and at what optimization settings)
>>       may have something to do with the problem (or at least the frequency
>>       at which it occurs.)
>>
>>       I'm curious as to whether building the kernel without -O2 causes
>>       things to behave differently for you.  To test this:
>>
>>       $ cd ccl/lisp-kernel/linuxx8664
>>       Edit the Makefile in that directory, changing the line:
>>
>>       COPT = -O2
>>
>>       to
>>
>>       COPT = #-O2
>>
>>       $ make clean
>>       $ make
>>
>>       If the problem still occurs for you with the same frequency that
>>       it's been occurring on your Xeons, that'd tell us something (that
>>       the differences between the Xeon and other x8664 machines have more
>>       to do with the frequency with which the problem occurs than compiler
>>       issues do.)  If that change masks or avoids the problem, that'd tell
>>       us a bit less.  In either case, if you can try this experiment it'd
>>       be good to know the results.
>>
>>       If the processor difference remains a likely candidate, it'd be
>>       helpful to know the exact model number of the (smaller, 4-core) Xeon
>>       machine where the problem occurs (frequently) for you.  Doing
>>
>>       $ cat /proc/cpuinfo
>>
>>       may list this info under "model name" for each core.
>>
>>       I've been able to reproduce the problem twice on Core i7 machines in
>>       a few days of trying, and it'd likely be easiest for me to understand
>>       and fix if it was easier for me to reproduce.
>>
>>       On Thu, 31 Oct 2013, Paul Meurer wrote:
>>
>>             On 31.10.2013 at 01:15, Gary Byers <gb at clozure.com> wrote:
>>
>>                  On Wed, 30 Oct 2013, Paul Meurer wrote:
>>                        I run it now with --no-init and in the shell, with no
>>                        difference. Immediate failure with :consing in
>>                        *features*, bogus objects etc. after several rounds
>>                        without :consing.
>>
>>                  So, I can't rant and rave about the sorry state of 3rd-party
>>                  CL libraries, and anyone reading this won't be subjected to
>>                  me doing so.
>>
>>                  Oh well.
>>
>>                  I was able to reproduce the problem by running your test 100
>>                  times,
>>             I am not able to provoke it at all on the MacBook, and I tried a
>>             lot.
>>
>>                  so apparently I won't be able to blame this on some aspect
>>                  of your machine.  (Also unfortunate, since my ability to
>>                  diagnose problems that only occur on 16-core machines
>>                  depends on my ability to borrow such machines for a few
>>                  months.)
>>             I think you can do without a 16-core machine. I am able to
>>             reproduce the failure quite reliably on an older 4-core machine
>>             with Xeon CPUs and SuSE, with slightly different code (perhaps to
>>             get the timing right):
>>             (dotimes (j 100)
>>               (print (ccl::all-processes))
>>               (dotimes (i 8)
>>                 (process-run-function
>>                  (format nil "getstring-~a-~a" j i)
>>                  (lambda (i)
>>                    (let ((list ()))
>>                      (dotimes (i 500000)
>>                        (push (getstring) list)))
>>                    (print i))
>>                  i))
>>               (print (list :done j))
>>               (sleep 1))
>>             If you really need a 16-core machine to debug this I can give you
>>             access to mine. :-)
>>
>>                        My machine has 16 true cores and hyperthreading; I am
>>                        running CentOS 6.0, and a recent CCL 1.9 (I did svn
>>                        update + rebuild of everything yesterday).
>>                        I also observed that the problem goes away when I
>>                        replace the constant string in the library by a
>>                        freshly allocated string:
>>                        char *getstring() {
>>                          int index;
>>                          char *buffer = (char *)calloc(100 + 1, sizeof(char));
>>                          for (index = 0; index < 100; index++) {
>>                            buffer[index] = 'a';
>>                          }
>>                          buffer[100] = '\0';
>>                          return buffer;
>>                        }
>>                        One should expect the strings in the Postgres library
>>                        to be freshly allocated, but nevertheless they behave
>>                        like the constant string example.
>>
>>                  It's unlikely that this change directly avoids the bug
>>                  (whatever it is); it's more likely that it affects timing
>>                  (exactly what happens when.)  I don't yet know what the bug
>>                  is, but I think that it's likely that it's fair to
>>                  characterize the bug as being "timing-sensitive".  (For
>>                  example: from the GC's point of view, whether a thread is
>>                  running Lisp or foreign code when that thread is suspended
>>                  by the GC.  The transition between Lisp and foreign code
>>                  takes a few instructions, and if a thread is suspended in
>>                  the middle of that instruction sequence and the GC
>>                  misinterprets its state, very bad things like what you're
>>                  seeing could occur.  That's not supposed to be possible, but
>>                  something broadly similar seems to be happening.)
>>             --
>>             Paul
>> 
>> 
>> --
>> Paul
>> 
>> 
>> 
> _______________________________________________
> Openmcl-devel mailing list
> Openmcl-devel at clozure.com
> http://clozure.com/mailman/listinfo/openmcl-devel
>
>