[Openmcl-devel] Need advice to debug segfault when running concurrent selects in clsql/postgresql

Gary Byers gb at clozure.com
Mon Dec 9 12:23:29 PST 2013


Rumors that I've become distracted and forgotten about this are at least
slightly exaggerated.

Could you (Paul) please try to reproduce the problem in the current 
trunk (r15975 or later) ?  I haven't been able to do so, but I often found
the problem harder to reproduce than you did.

On Thu, 7 Nov 2013, Gary Byers wrote:

> [Sorry if I'm bombarding this list with a discussion that may not be of
> wide interest.  If you aren't interested in this and don't read this message,
> rest assured that you aren't likely to be missing anything ...]
>
> The bug that Paul Meurer found seems to have two classes of symptoms:
>
> 1) a hard crash into the kernel debugger, usually due to a GC function
>    (set_n_bits()) being confused about the number of elements in a
>    vector-like object.  That's typically a sign of memory corruption
>    (unsafe lisp code storing outside the bounds of an object.)
>
> 2) a (small) variety of memory faults and type errors signaled as lisp
>    conditions.  (Sometimes, these are followed by case (1) soon after.)
>    These kinds of errors can also be symptoms of the same kind of
>    "memory corruption".
>
> When it was first reported, I suggested enabling some consistency checks in
> the GC that often find problems like this near the time they occur.  Paul
> said that he tried enabling those checks but they didn't seem to detect
> any problems but the crashes and faults continued to occur.
>
> I can reproduce the problems more reliably now than I'd been able to and can
> look at things in GDB when they occur.  One of my favorite class 2 symptoms
> so far is:
>
>> Error: The value #<A Foreign Pointer #x7FFFD6C15690> is not of the expected 
>> type MACPTR.
>> While executing: %GET-CSTRING, in process getstring1(71).
>
> which is sort of saying "that thing that should have been a foreign pointer
> (aka MACPTR) wasn't of that type when I looked at it a few cycles ago.  Never
> mind the fact that it is of that type now."  Of course that makes no sense,
> and of course it "can't happen."
>
> This is vaguely reminiscent of a situation that arises on machines that have
> separate I and D caches: we write some code to memory and execute it (without
> properly synchronizing the caches first), and what actually gets executed
> doesn't have anything to do with the code in memory.  If that happens often
> enough or has happened recently, it's possible to be suspicious of that
> issue; otherwise, it's a very confusing situation.
>
> So: some number of cores are running Paul's test case; they're doing
> speculative memory accesses and doing things out-of-order and doing all of
> the things that a modern CPU does (all the while maintaining the illusion -
> mostly - that things are happening in "program" or "architectural" order).
> Some thread needs to run the GC, so that thread suspends all other threads.
> When a thread is suspended, its "context" (architecturally-visible register
> values) is stored someplace where the GC can find it; the GC will use those
> register values to identify reachable objects, and if it moves such objects
> it updates the register values in the saved context info.  When the GC
> finishes, it causes the suspended thread to resume execution with the
> updated (architecturally visible) register values; the behavior that we're
> seeing is as if the core that the thread was running on resumes the
> speculative, out-of-order execution that it was doing (which was based on
> old register values), and we see things that can't happen happen.
>
> Some x86 instructions force things to synchronize with the "architectural"
> state.  When the GC suspends a thread, it atomically increments a counter
> associated with the thread and only actually does the OS-level suspension
> if that counter changed from 0 to 1.  Only one thread at a time can
> increment that counter (that didn't use to be the case), but it still does
> an "atomic" incf of the counter because the instructions involved are
> supposed to have this synchronizing effect.  They may do so, but the thread
> can resume its wacky out-of-order speculative execution as soon as the
> counter has been incremented, and it won't actually be suspended until some
> number of cycles later.
>
> I've tried to force things to synchronize after all other threads have
> suspended and haven't seen the problem(s) occur since that change.  That's
> not conclusive (since none of this is entirely deterministic), but it seems
> encouraging.
>
> If this does explain the problems, it's a little odd that we haven't seen
> the problem occur more often (assuming that we haven't.)  I assume that a
> number of things have to happen just right (or "just wrong") in order for
> the symptoms above to occur.
>
> I'll keep looking at this as time permits and if the results hold I'll try to
> check in the change in the next few days.
>
> On Thu, 7 Nov 2013, Gary Byers wrote:
>
>> I didn't have any luck reproducing the problem on that Xeon (or on another
>> one that rme pointed out we had), but I think that I found a way to make
>> the problem occur much more reliably than it has been.
>> 
>> What I've been doing (based on your code) is something like:
>> 
>> (dotimes (i n)  ; n is some large number
>>  (dotimes (j m) ; m is proportional to the number of cores
>>   (process-run-function ...)))
>> 
>> where each thread does 100000 iterations of a simple foreign function call.
>> 
>> What generally happens here is that we create threads faster than they can
>> run to completion; I think that the number of active threads gets up into
>> the 100s in some cases, and the term "active" is a bit of a misnomer, since
>> most of them sit idle while the few that get scheduled do enough consing to
>> trigger the GC, where we spend most of our time.
>> 
>> I changed the code to:
>> 
>> (let* ((sem (make-semaphore)))
>>  (dotimes (i n)
>>    (dotimes (j m)
>>      (process-run-function whatever
>>        (lambda ()
>>          (dotimes (k 100000)
>>            ..)
>>          (signal-semaphore sem))))
>>    (dotimes (j m) (wait-on-semaphore sem))))
>> 
>> i.e., to create M threads on each iteration of the loop and wait for them
>> to run to completion before creating more.  Those M threads should spend
>> most of their time running (and entering or returning from foreign
>> function calls when the GC runs), and that seems to trigger the problem
>> more reliably.
>> 
>> I'll try to look at this as time permits.  I don't know how long it'll
>> take me to see the problem when I do, but I think that I can at least
>> reproduce the problem much more reliably than I've been able to.
>> 
>> 
>> 
>> On Mon, 4 Nov 2013, Gary Byers wrote:
>> 
>>> Thanks.
>>> 
>>> Coincidentally, clozure.com crashed today; the hosting service moved it to
>>> a new machine (an 8-core Xeon).  Hmmm ...
>>> 
>>> 
>>> On Mon, 4 Nov 2013, Paul Meurer wrote:
>>> 
>>>> I did do the experiment you proposed.
>>>> 
>>>> On the older Xeon 4-core machine, the crashes get still somewhat less
>>>> frequent, but this might be insignificant because I didn't run enough
>>>> iterations.  A crash occurs not more than every 50th iteration on
>>>> average.  Perhaps not often enough for convenient debugging.
>>>> 
>>>> Here are the specs:
>>>> 
>>>> processor       : 0
>>>> vendor_id       : GenuineIntel
>>>> cpu family      : 6
>>>> model           : 15
>>>> model name      : Intel(R) Xeon(R) CPU            5140  @ 2.33GHz
>>>> stepping        : 6
>>>> cpu MHz         : 2327.528
>>>> cache size      : 4096 KB
>>>> physical id     : 0
>>>> siblings        : 2
>>>> core id         : 0
>>>> cpu cores       : 2
>>>> fpu             : yes
>>>> fpu_exception   : yes
>>>> cpuid level     : 10
>>>> wp              : yes
>>>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
>>>> bogomips        : 4659.22
>>>> clflush size    : 64
>>>> cache_alignment : 64
>>>> address sizes   : 36 bits physical, 48 bits virtual
>>>> power management:
>>>> 
>>>> 
>>>> On the 16-core machine (16 incl. hyperthreading), nothing seems to have
>>>> changed.  The latter has two CPUs with these specs:
>>>>
>>>> Intel Xeon E5 4-Core - E5-2643 3.30GHz 10MB LGA2011 8.0GT/
>>>> 
>>>> or from /proc/cpuinfo:
>>>> 
>>>> processor       : 0
>>>> vendor_id       : GenuineIntel
>>>> cpu family      : 6
>>>> model           : 45
>>>> model name      : Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz
>>>> stepping        : 7
>>>> cpu MHz         : 3301.000
>>>> cache size      : 10240 KB
>>>> physical id     : 0
>>>> siblings        : 8
>>>> core id         : 0
>>>> cpu cores       : 4
>>>> apicid          : 0
>>>> initial apicid  : 0
>>>> fpu             : yes
>>>> fpu_exception   : yes
>>>> cpuid level     : 13
>>>> wp              : yes
>>>> flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
>>>> bogomips        : 6583.99
>>>> clflush size    : 64
>>>> cache_alignment : 64
>>>> address sizes   : 46 bits physical, 48 bits virtual
>>>> power management:
>>>> 
>>>> I also ran the code on an AMD Opteron machine, but there, no crashes
>>>> occur, as far as I can see (after 50 iterations).
>>>>
>>>>       I tried running your example in an infinite loop on another
>>>>       Core-i7 machine.  After about an hour, it crashed in the way that
>>>>       you describe.  I poked around a bit in GDB but wasn't sure what I
>>>>       was seeing; the C code in the CCL kernel (including the GC) is
>>>>       usually compiled with the -O2 option, which makes it run faster
>>>>       but makes debugging harder.
>>>>
>>>>       I figured that things would go faster if debugging was easier, so
>>>>       I rebuilt the kernel without -O2 and tried again.  It's been
>>>>       running for over 24 hours at this point without incident.
>>>>
>>>>       Aside from being yet another example of the famous Heisenbug
>>>>       Uncertainty Principle, this suggests that how the C code is
>>>>       compiled (by what version of what compiler and at what
>>>>       optimization settings) may have something to do with the problem
>>>>       (or at least the frequency at which it occurs.)
>>>>
>>>>       I'm curious as to whether building the kernel without -O2 causes
>>>>       things to behave differently for you.  To test this:
>>>>
>>>>       $ cd ccl/lisp-kernel/linuxx8664
>>>>       Edit the Makefile in that directory, changing the line:
>>>>
>>>>       COPT = -O2
>>>>
>>>>       to
>>>>
>>>>       COPT = #-O2
>>>>
>>>>       $ make clean
>>>>       $ make
>>>>
>>>>       If the problem still occurs for you with the same frequency that
>>>>       it's been occurring on your Xeons, that'd tell us something (that
>>>>       the differences between the Xeon and other x8664 machines have
>>>>       more to do with the frequency with which the problem occurs than
>>>>       compiler issues do.)  If that change masks or avoids the problem,
>>>>       that'd tell us a bit less.  In either case, if you can try this
>>>>       experiment it'd be good to know the results.
>>>>
>>>>       If the processor difference remains a likely candidate, it'd be
>>>>       helpful to know the exact model number of the (smaller, 4-core)
>>>>       Xeon machine where the problem occurs (frequently) for you.
>>>>       Doing
>>>>
>>>>       $ cat /proc/cpuinfo
>>>>
>>>>       may list this info under "model name" for each core.
>>>>
>>>>       I've been able to reproduce the problem twice on Core i7 machines
>>>>       in a few days of trying, and it'd likely be easiest for me to
>>>>       understand and fix if it was easier for me to reproduce.
>>>>
>>>>       On Thu, 31 Oct 2013, Paul Meurer wrote:
>>>>
>>>>             On 31.10.2013 at 01:15, Gary Byers <gb at clozure.com> wrote:
>>>>
>>>>                   On Wed, 30 Oct 2013, Paul Meurer wrote:
>>>>
>>>>                         I run it now with --no-init and in the shell,
>>>>                         with no difference.  Immediate failure with
>>>>                         :consing in *features*, bogus objects etc.
>>>>                         after several rounds without :consing.
>>>>
>>>>                   So, I can't rant and rave about the sorry state of
>>>>                   3rd-party CL libraries, and anyone reading this won't
>>>>                   be subjected to me doing so.
>>>>
>>>>                   Oh well.
>>>>
>>>>                   I was able to reproduce the problem by running your
>>>>                   test 100 times,
>>>>             I am not able to provoke it at all on the MacBook, and I
>>>>             tried a lot.
>>>>
>>>>                   so apparently I won't be able to blame this on some
>>>>                   aspect of your machine.  (Also unfortunate, since my
>>>>                   ability to diagnose problems that only occur on
>>>>                   16-core machines depends on my ability to borrow such
>>>>                   machines for a few months.)
>>>>             I think you can do without a 16-core machine.  I am able to
>>>>             reproduce the failure quite reliably on an older 4-core
>>>>             machine with Xeon CPUs and SuSE, with slightly different
>>>>             code (perhaps to get the timing right):
>>>>             (dotimes (j 100)
>>>>               (print (ccl::all-processes))
>>>>               (dotimes (i 8)
>>>>                 (process-run-function
>>>>                  (format nil "getstring-~a-~a" j i)
>>>>                  (lambda (i)
>>>>                    (let ((list ()))
>>>>                      (dotimes (i 500000)
>>>>                        (push (getstring) list)))
>>>>                    (print i))
>>>>                  i))
>>>>               (print (list :done j))
>>>>               (sleep 1))
>>>>             If you really need a 16-core machine to debug this I can
>>>>             give you access to mine. :-)
>>>>
>>>>                         My machine has 16 true cores and
>>>>                         hyperthreading; I am running CentOS 6.0, and a
>>>>                         recent CCL 1.9 (I did svn update + rebuild of
>>>>                         everything yesterday).
>>>>                         I also observed that the problem goes away when
>>>>                         I replace the constant string in the library by
>>>>                         a freshly allocated string:
>>>>                         char *getstring() {
>>>>                           int index;
>>>>                           char *buffer = (char *)calloc(100 + 1, sizeof(char));
>>>>                           for (index = 0; index < 100; index++) {
>>>>                             buffer[index] = 'a';
>>>>                           }
>>>>                           buffer[100] = '\0';
>>>>                           return buffer;
>>>>                         }
>>>>                         One should expect the strings in the Postgres
>>>>                         library to be freshly allocated, but
>>>>                         nevertheless they behave like the constant
>>>>                         string example.
>>>>
>>>>                   It's unlikely that this change directly avoids the
>>>>                   bug (whatever it is); it's more likely that it
>>>>                   affects timing (exactly what happens when.)  I don't
>>>>                   yet know what the bug is, but I think it's fair to
>>>>                   characterize the bug as "timing-sensitive".  (For
>>>>                   example: from the GC's point of view, it matters
>>>>                   whether a thread is running Lisp or foreign code when
>>>>                   that thread is suspended by the GC.  The transition
>>>>                   between Lisp and foreign code takes a few
>>>>                   instructions, and if a thread is suspended in the
>>>>                   middle of that instruction sequence and the GC
>>>>                   misinterprets its state, very bad things like what
>>>>                   you're seeing could occur.  That's not supposed to be
>>>>                   possible, but something broadly similar seems to be
>>>>                   happening.)
>>>>             --
>>>>             Paul
>>>> 
>>>> 
>>>> --
>>>> Paul
>>>> 
>>>> 
>>>> 
>>> _______________________________________________
>>> Openmcl-devel mailing list
>>> Openmcl-devel at clozure.com
>>> http://clozure.com/mailman/listinfo/openmcl-devel
>>> 
>>> 
>> 
>> 
>
>


