[Openmcl-devel] Need advice to debug segfault when running concurrent selects in clsql/postgresql

Gary Byers gb at clozure.com
Thu Nov 7 21:50:48 PST 2013


[Sorry if I'm bombarding this list with a discussion that may not be of
wide interest.  If you aren't interested in this and don't read this message,
rest assured that you aren't likely to be missing anything ...]

The bug that Paul Meurer found seems to have two classes of symptoms:

1) a hard crash into the kernel debugger, usually due to a GC function
    (set_n_bits()) being confused about the number of elements in a vector-like
    object.  That's typically a sign of memory corruption (unsafe lisp code
    storing outside the bounds of an object.)

2) a (small) variety of memory faults and type errors signaled as lisp conditions.
    (Sometimes, these are followed by case (1) soon after.)  These kinds of errors
    can also be symptoms of the same kind of "memory corruption".

When it was first reported, I suggested enabling some consistency checks in
the GC that often find problems like this near the time they occur.  Paul
said that he tried enabling those checks but they didn't seem to detect
any problems, but the crashes and faults continued to occur.

I can reproduce the problems more reliably now than I'd been able to and can
look at things in GDB when they occur.  One of my favorite class 2 symptoms
so far is:

> Error: The value #<A Foreign Pointer #x7FFFD6C15690> is not of the expected type MACPTR.
> While executing: %GET-CSTRING, in process getstring1(71).

which is sort of saying "that thing that should have been a foreign pointer
(aka MACPTR) wasn't of that type when I looked at it a few cycles ago.  Never
mind the fact that it is of that type now."  Of course that makes no sense,
and of course it "can't happen."

This is vaguely reminiscent of a situation that arises on machines that have
separate I and D caches:  we write some code to memory and execute it (without
properly synchronizing the caches first), and what actually gets executed
doesn't have anything to do with the code in memory.  If that happens often
enough or has happened recently, it's possible to be suspicious of that issue;
otherwise, it's a very confusing situation.

So: some number of cores are running Paul's test case; they're doing speculative
memory accesses and doing things out-of-order and doing all of the things that
a modern CPU does (all the while maintaining the illusion - mostly - that things
are happening in "program" or "architectural" order).  Some thread needs to
run the GC, so that thread suspends all other threads.  When a thread is suspended,
its "context" (architecturally-visible register values) is stored someplace where
the GC can find it; the GC will use those register values to identify reachable
objects, and if it moves such objects it updates the register values in the saved
context info.  When the GC finishes, it causes the suspended thread to resume
execution with the updated (architecturally visible) register values.  The behavior
that we're seeing is as if the core that the thread was running on resumes the
speculative, out-of-order execution that it was doing (based on the old
register values), and we see things that can't happen happen.

Some x86 instructions force things to synchronize with the "architectural" state.
When the GC suspends a thread, it atomically increments a counter associated with
the thread and only actually does the OS-level suspension if that counter changed
from 0 to 1.  Only one thread at a time can increment that counter (that wasn't
always the case), but it still does an "atomic" incf of the counter because
the instructions involved are supposed to have this synchronizing effect.  They
may do so, but the thread can resume its wacky out-of-order speculative execution
as soon as the counter has incremented and it won't actually be suspended until
some number of cycles later.

I've tried to force things to synchronize after all other threads have suspended
and haven't seen the problem(s) occur since that change. That's not conclusive
(since none of this is entirely deterministic), but it seems encouraging.

If this does explain the problems, it's a little odd that we haven't seen the
problem occur more often (assuming that we haven't.)  I assume that a number
of things have to happen just right (or "just wrong") in order for the symptoms
above to occur.

I'll keep looking at this as time permits and if the results hold I'll try to
check in the change in the next few days.

On Thu, 7 Nov 2013, Gary Byers wrote:

> I didn't have any luck reproducing the problem on that Xeon (or on another
> one that rme pointed out we had), but I think that I found a way to make
> the problem occur much more reliably than it has been.
>
> What I've been doing (based on your code) is something like:
>
> (dotimes (i n)  ; n is some large number
>  (dotimes (j m) ; m is proportional to the number of cores
>   (process-run-function ...)))
>
> where each thread does 100000 iterations of a simple foreign function call.
>
> What generally happens here is that we create threads faster than they can
> run to completion; I think that the number of active threads gets up into
> the 100s in some cases, and the term "active" is a bit of a misnomer, since
> most of them sit idle while the few that get scheduled do enough consing to
> trigger the GC, where we spend most of our time.
>
> I changed the code to:
>
> (let* ((sem (make-semaphore)))
>  (dotimes (i n)
>    (dotimes (j m)
>      (process-run-function whatever
>        (lambda ()
>          (dotimes (k 100000)
>            ..)
>          (signal-semaphore sem))))
>    (dotimes (j m) (wait-on-semaphore sem))))
>
> i.e., to create M threads on each iteration of the loop and wait for them to
> run to completion before creating more.  Those M threads should spend most
> of their time running (and entering or returning from foreign function calls
> when the GC runs), and that seems to trigger the problem more reliably.
>
> I'll try to look at this as time permits.  I don't know how long it'll
> take me to see the problem when I do, but I think that I can at least
> reproduce the problem much more reliably than I've been able to.
>
>
>
> On Mon, 4 Nov 2013, Gary Byers wrote:
>
>> Thanks.
>> 
>> Coincidentally, clozure.com crashed today; the hosting service moved it to
>> a new machine (an 8-core Xeon).  Hmmm ...
>> 
>> 
>> On Mon, 4 Nov 2013, Paul Meurer wrote:
>> 
>>> I did do the experiment you proposed.
>>> 
>>> On the older Xeon 4-core machine, the crashes still get somewhat less
>>> frequent, but this might be insignificant because I didn't run enough
>>> iterations. A crash occurs not more than every 50th iteration on average.
>>> Perhaps not often enough for convenient debugging.
>>> 
>>> Here are the specs:
>>> 
>>> processor       : 0
>>> vendor_id       : GenuineIntel
>>> cpu family      : 6
>>> model           : 15
>>> model name      : Intel(R) Xeon(R) CPU            5140  @ 2.33GHz
>>> stepping        : 6
>>> cpu MHz         : 2327.528
>>> cache size      : 4096 KB
>>> physical id     : 0
>>> siblings        : 2
>>> core id         : 0
>>> cpu cores       : 2
>>> fpu             : yes
>>> fpu_exception   : yes
>>> cpuid level     : 10
>>> wp              : yes
>>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
>>> bogomips        : 4659.22
>>> clflush size    : 64
>>> cache_alignment : 64
>>> address sizes   : 36 bits physical, 48 bits virtual
>>> power management:
>>> 
>>> 
>>> On the 16-core machine (16 inc. hyperthreading), nothing seems to have
>>> changed. The latter has two CPUs with these specs:
>>> 
>>> Intel Xeon E5 4-Core - E5-2643 3.30GHz 10MB LGA2011 8.0GT/
>>> 
>>> or from /proc/cpuinfo:
>>> 
>>> processor       : 0
>>> vendor_id       : GenuineIntel
>>> cpu family      : 6
>>> model           : 45
>>> model name      : Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz
>>> stepping        : 7
>>> cpu MHz         : 3301.000
>>> cache size      : 10240 KB
>>> physical id     : 0
>>> siblings        : 8
>>> core id         : 0
>>> cpu cores       : 4
>>> apicid          : 0
>>> initial apicid  : 0
>>> fpu             : yes
>>> fpu_exception   : yes
>>> cpuid level     : 13
>>> wp              : yes
>>> flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
>>> bogomips        : 6583.99
>>> clflush size    : 64
>>> cache_alignment : 64
>>> address sizes   : 46 bits physical, 48 bits virtual
>>> power management:
>>> 
>>> I also ran the code on an AMD Opteron machine, but there no crashes
>>> occur, as far as I can see (after 50 iterations).
>>>
>>>       I tried running your example in an infinite loop on another Core-i7 machine.
>>>       After about an hour, it crashed in the way that you describe.  I poked around
>>>       a bit in GDB but wasn't sure what I was seeing; the C code in the CCL kernel
>>>       (including the GC) is usually compiled with the -O2 option, which makes it
>>>       run faster but makes debugging harder.
>>>
>>>       I figured that things would go faster if debugging was easier, so I rebuilt
>>>       the kernel without -O2 and tried again.  It's been running for over 24 hours
>>>       at this point without incident.
>>>
>>>       Aside from being yet another example of the famous Heisenbug Uncertainty
>>>       Principle, this suggests that how the C code is compiled (by what version
>>>       of what compiler and at what optimization settings) may have something to
>>>       do with the problem (or at least the frequency at which it occurs.)
>>>
>>>       I'm curious as to whether building the kernel without -O2 causes things to
>>>       behave differently for you.  To test this:
>>>
>>>       $ cd ccl/lisp-kernel/linuxx8664
>>>       Edit the Makefile in that directory, changing the line:
>>>
>>>       COPT = -O2
>>>
>>>       to
>>>
>>>       COPT = #-O2
>>>
>>>       $ make clean
>>>       $ make
>>>
>>>       If the problem still occurs for you with the same frequency that it's been
>>>       occurring on your Xeons, that'd tell us something (that the differences between
>>>       the Xeon and other x8664 machines have more to do with the frequency with which
>>>       the problem occurs than compiler issues do.)  If that change masks or avoids
>>>       the problem, that'd tell us a bit less.  In either case, if you can try this
>>>       experiment it'd be good to know the results.
>>>
>>>       If the processor difference remains a likely candidate, it'd be helpful to
>>>       know the exact model number of the (smaller, 4-core) Xeon machine where the
>>>       problem occurs (frequently) for you.  Doing
>>>
>>>       $ cat /proc/cpuinfo
>>>
>>>       may list this info under "model name" for each core.
>>>
>>>       I've been able to reproduce the problem twice on Core i7 machines in a few
>>>       days of trying, and it'd likely be easiest for me to understand and fix if
>>>       it were easier for me to reproduce.
>>>
>>>       On Thu, 31 Oct 2013, Paul Meurer wrote:
>>>
>>>             On 31.10.2013 at 01:15, Gary Byers <gb at clozure.com> wrote:
>>>
>>>             > On Wed, 30 Oct 2013, Paul Meurer wrote:
>>>             > > I ran it now with --no-init and in the shell, with no difference.
>>>             > > Immediate failure with :consing in *features*,
>>>             > > bogus objects etc. after several rounds without :consing.
>>>
>>>             > So, I can't rant and rave about the sorry state of 3rd-party CL
>>>             > libraries, and anyone reading this won't be subjected to me doing so ...
>>>
>>>             > Oh well.
>>>
>>>             > I was able to reproduce the problem by running your test 100 times,
>>>             I am not able to provoke it at all on the MacBook, and I tried a lot.
>>>
>>>             > so apparently
>>>             > I won't be able to blame this on some aspect of your machine.  (Also
>>>             > unfortunate, since my ability to diagnose problems that only occur on
>>>             > 16-core machines depends on my ability to borrow such machines for a
>>>             > few months.)
>>>             I think you can do without a 16-core machine. I am able to reproduce the
>>>             failure quite reliably on an older 4-core machine with Xeon CPUs and SuSE,
>>>             with slightly different code (perhaps to get the timing right):
>>>
>>>             (dotimes (j 100)
>>>               (print (ccl::all-processes))
>>>               (dotimes (i 8)
>>>                 (process-run-function
>>>                  (format nil "getstring-~a-~a" j i)
>>>                  (lambda (i)
>>>                    (let ((list ()))
>>>                      (dotimes (i 500000)
>>>                        (push (getstring) list)))
>>>                    (print i))
>>>                  i))
>>>               (print (list :done j))
>>>               (sleep 1))
>>>
>>>             If you really need a 16-core machine to debug this I can give you access
>>>             to mine. :-)
>>>
>>>             > > My machine has 16 true cores and hyperthreading; I am running
>>>             > > CentOS 6.0, and a recent CCL 1.9 (I did svn update +
>>>             > > rebuild of everything yesterday).
>>>             > > I also observed that the problem goes away when I replace the constant
>>>             > > string in the library by a freshly allocated string:
>>>
>>>             > > char *getstring() {
>>>             > >   int index;
>>>             > >   char *buffer = (char *)calloc(100 + 1, sizeof(char));
>>>             > >   for (index = 0; index < 100; index++) {
>>>             > >     buffer[index] = 'a';
>>>             > >   }
>>>             > >   buffer[100] = '\0';
>>>             > >   return buffer;
>>>             > > }
>>>
>>>             > > One should expect the strings in the Postgres library to be freshly
>>>             > > allocated, but nevertheless they behave like the constant string
>>>             > > example.
>>>
>>>             > It's unlikely that this change directly avoids the bug (whatever it is);
>>>             > it's more likely that it affects timing (exactly what happens when.)  I
>>>             > don't yet know what the bug is, but I think that it's fair to
>>>             > characterize the bug as being "timing-sensitive".  (For example: from
>>>             > the GC's point of view, it matters whether a thread is running Lisp or
>>>             > foreign code when that thread is suspended by the GC.  The transition
>>>             > between Lisp and foreign code takes a few instructions, and if a thread
>>>             > is suspended in the middle of that instruction sequence and the GC
>>>             > misinterprets its state, very bad things like what you're seeing could
>>>             > occur.  That's not supposed to be possible, but something broadly
>>>             > similar seems to be happening.)
>>>             --
>>>             Paul
>>> 
>>> 
>>> --
>>> Paul
>>> 
>>> 
>>> 
>> _______________________________________________
>> Openmcl-devel mailing list
>> Openmcl-devel at clozure.com
>> http://clozure.com/mailman/listinfo/openmcl-devel
>> 
>> 


